Slow speeds transferring data between my machines

Alright, I’m not sure what I’m missing, or where I’m not testing, so I’m appealing to the people who know more than me.

I’ve got a couple of machines. We’ll call them DB1 and DB2.

If I test the network connection between the two of them, it looks fine:

DB1 -> DB2
1024.75 Mbit/sec

DB2 -> DB1
895.13 Mbit/sec

When you convert those to Gb/s, I’m getting right around the theoretical max for the network (this was tested with ttcp, by the way). So at least my cables aren’t broken.
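A ttcp test of this sort looks roughly like the following (exact flags vary between ttcp versions, so treat this as a sketch; the awk line just converts the reported Mbit/sec into Gbit/sec):

```shell
# Receiver side (on DB2):
#   ttcp -r -s
# Transmitter side (on DB1):
#   ttcp -t -s db2

# Convert ttcp's Mbit/sec figure to Gbit/sec:
echo "1024.75" | awk '{printf "%.2f Gbit/sec\n", $1 / 1000}'
```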

Now, the next thing I thought of was that my disks were slow.

DB1 has an internal array. It’s fast enough:

[root@db1 ~]# dd if=/dev/zero of=/db/testout bs=1024 count=10000000
10000000+0 records in
10000000+0 records out
10240000000 bytes (10 GB) copied, 58.2554 seconds, 176 MB/s

DB2 is connected to the SAN, and is no slouch either:

[root@db2 ~]# dd if=/dev/zero of=/db/testout bs=1024 count=10000000
10000000+0 records in
10000000+0 records out
10240000000 bytes (10 GB) copied, 76.7791 seconds, 133 MB/s
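A side note on the dd runs themselves (a measurement caveat, not the root cause here): bs=1024 issues ten million 1 KB write() calls, so syscall overhead can understate what the disks really do, and without a sync, dd may partly be timing the page cache. A sketch with a bigger block size (path and sizes are just for illustration):

```shell
# Larger blocks mean far fewer write() syscalls, and conv=fdatasync
# (GNU dd) flushes the data to disk before the rate is reported, so
# the page cache can't flatter the number:
dd if=/dev/zero of=/tmp/ddtest bs=1M count=100 conv=fdatasync
rm -f /tmp/ddtest
```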

When I read from the big array on DB1 to the mirrored disks, I get very fast speeds. Because free space on the mirrored disks is so small (< 4 GB), I can’t write a file big enough for the measurement to mean much. It reports 1,489.63 Mb/s, which is baloney, but it does tell me that path is fast. Reading from the SAN to DB2’s local disks is, if not fast, at least passable:

10240000000 bytes (10 GB) copied, 169.405 seconds, 60.4 MB/s

That works out to 483.2 Mb/s.

Now, when I try to rsync from DB2 to DB1, I have issues. Big issues.

I tried to rsync across a 10GB file. Here were the results:

sent 10241250100 bytes received 42 bytes 10935664.86 bytes/sec
(10.93 MB/s or 87.49Mb/s)

Less than 100Mb/s.
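The unit math, for anyone checking it: rsync reports bytes/sec, so divide by 10^6 for MB/s and multiply by eight for Mb/s:

```shell
# rsync's bytes/sec figure converted to MB/s and Mb/s:
echo "10935664.86" | awk '{printf "%.2f MB/s or %.2f Mb/s\n", $1 / 1000000, $1 * 8 / 1000000}'
```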

I was alerted to this problem earlier, when it took all damned day to transfer my 330GB database image. Here’s the output from that ordeal:

sent 68605073418 bytes received 3998 bytes 2616367.39 bytes/sec
(2.62 MB/s or 20.93Mb/s)

It only says 68 GB because I used the -z flag on the rsync, so it’s counting compressed bytes rather than the full 330 GB.

To prove that it isn’t some sort of bizarre combination of the SAN causing some problem when being read from rsync, here’s a mirrored-disk transfer from DB2 to the root partition on DB1:

sent 1024125084 bytes received 42 bytes 11192624.33 bytes/sec
(11.19 MB/s or 89.54 Mb/s)

I’m willing to say that maybe the network was congested earlier, or maybe the SAN was under stress, but on an otherwise unused network, I should be getting a damned lot more than 89 Mb/s between two servers on the same Gb LAN.

Any ideas?

Figured it out. Stupid compression flag on rsync.
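For anyone who lands here with the same symptom, a rough sketch of the diagnosis and fix (paths and flags here are illustrative, not the exact commands from above):

```shell
# -z pushes every byte through the sending CPU's compressor; on a fast
# LAN the compressor, not the wire, becomes the bottleneck. A quick
# local feel for the cost: push 100 MB of zeros through gzip.
dd if=/dev/zero bs=1M count=100 2>/dev/null | gzip -6 | wc -c

# The fix itself is just dropping the flag:
#   rsync -avz /db/image db1:/db/   # compressed: CPU-bound
#   rsync -av  /db/image db1:/db/   # uncompressed: wire-speed
```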

  • Ian

    What type of network gear are these servers on? Same physical switch or different ones?

  • Matt


    Each machine has two interfaces, bonded into one virtual interface. The bonding mode is 5, balance-tlb (see this).

    There are two switches, called red and blue after the color of the cables. One interface from each machine goes to each switch. The switches are standard Netgear 24-port Gb. There are two VLANs, but they won’t enter into this, since they’re port-based and all four connections are in the same untagged VLAN.
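    (For context, mode 5 is balance-tlb, transmit load balancing. On a RHEL-era box the setup would look something like the following; interface names and paths are illustrative:)

```shell
# /etc/modprobe.conf:
#   alias bond0 bonding
#   options bond0 mode=balance-tlb miimon=100
# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise eth1):
#   MASTER=bond0
#   SLAVE=yes
# Runtime state, including per-slave link status:
#   cat /proc/net/bonding/bond0
```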

  • Jack

    Could it be the compression? I’ve heard that compression can work against you on a fast network. Did you check top during the transfer?

  • Michael Janke

    TTcp checks out the network, but it doesn’t write to disk.

    Native disk writes are OK, but not with rsync.

    It sounds like rsync is using a small block size to transfer data, or perhaps a small receive window. You’ll only get near a Gig if you are streaming TCP. If you’re ack’ing too often, I don’t think you can fill a gig.

    Netstat might show the receive window.

    A dump of the packet size counters might be interesting.
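    (On the receive-window point: on Linux the kernel’s TCP buffer limits are readable under /proc, and a small maximum can cap a single stream well below a gigabit. Standard Linux paths; the three numbers are min/default/max bytes:)

```shell
# Receive-buffer autotuning limits (min default max, in bytes):
cat /proc/sys/net/ipv4/tcp_rmem
# Send side:
cat /proc/sys/net/ipv4/tcp_wmem
```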

  • Kenny

    Shouldn't the speed of DB1 -> DB2 and DB2 -> DB1 be the same, or am I missing something?

  • Frank

    I’ve had similar situations on my network when the switch port is set to 1 Gb full duplex and the server NIC is set to auto-negotiate. Both ends have to have the same speed and duplex settings for the connection to work properly.

  • Ian

    Depending on the gear, as Frank mentioned, you can have duplex problems. I’m not familiar with Netgear, but Cisco equipment can get stupid with auto-negotiation, especially when you’re linking up to non-Cisco equipment.

    What’s the subnet mask of the servers? How many devices on that vlan/network segment? I’m wondering if you’re having broadcast traffic issues.

    If it’s not that, what if you break the port channel groups and go with single links? I wonder if you’d actually see performance improvement gains.

  • Matt


    I don’t think it’s compression, since the 2nd transfer I did came across at 10Mb/s and it wasn’t compressed at all. Just zeros.

    You may be on to something. I’ll do more investigating in that direction, thanks!

    Ideally yes, but it’s possible that the cable isn’t crimped exactly right, or something similar. They’re both fast “enough” for the moment, and I know the bottleneck isn’t the cables.

    I’ll verify the duplex on the links, but I’m fairly sure it’s coming across right.

    The subnet mask is /24. If it weren’t for the bonding, it would be a simple network arrangement. I’m wondering whether there’s an issue with the bonding mechanism, and maybe the effect is only apparent on longer-lasting streams. That would explain why my 16MB ttcp test (which lasted less than a second) showed high, my mid-range test (10GB) showed slow, and my long-range test (68GB) showed ultra slow.

    I’ll be doing some investigations. I’m also in the middle of shipping new switches up there. Instead of 24-port Netgear switches, I’m going to be using 3Com Baseline 2948+’s.

    Since my new switches have a lot more capabilities, I’m hoping that I can configure the aggregate ports between them to perform better and not have this issue. I may have to change the bonding mode too.

    I’m going to be on site next week, which will make things much easier to debug!

    Thanks everyone for your input! I really do appreciate the suggestions, and if you think of anything else, please let me know.

  • M

    Your best bet is to enable jumbo frames. It seems like every time I have slow file transfers over a 1 Gig or higher network, the problem disappears when I enable them. The best way to tell for sure is to look at a graph of network traffic: if it’s spiky, that’s a sure sign that the NIC is sending out all it can and waiting for ACKs back. Jumbo frames will fix that.
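    (If you go this route, every NIC and switch on the path has to support the larger MTU, or frames get silently dropped. Interface name here is illustrative:)

```shell
# Raise the MTU to a common jumbo-frame size:
#   ifconfig eth0 mtu 9000
# Confirm it took:
#   ifconfig eth0 | grep -i mtu
```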

  • steve

    Is rsync using ssh? If so, which cipher? You might try -c blowfish which is MUCH faster than the (normally default) 3des.

    Maybe don’t use rsync at all; how about netcat? That avoids the encryption altogether. If nothing else you could dd {data} | nc and measure the transfer rates to see if it’s rsync or the network.

    Or maybe mount the volumes with NFS or SMB then rsync the local mounts rather than the SSH transport.

    Is CPU maxing out during the slow transfer?
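    (A sketch of that netcat baseline; the port is made up, and nc’s listen syntax varies between versions. The local pipe at the end just shows how fast the sender can generate data at all:)

```shell
# Receiver (on DB1):
#   nc -l 9000 > /dev/null
# Sender (on DB2) -- time 1 GB of zeros across the wire, no encryption:
#   dd if=/dev/zero bs=1M count=1024 | nc db1 9000

# Local ceiling check -- how fast dd alone can produce data:
dd if=/dev/zero bs=1M count=100 2>/dev/null | wc -c
```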

  • Bob

    Matt, are you using ssh/scp? Try looking at:

  • Matt


    Yes, I am, and I didn’t know about that. Thanks a bunch. That looks really interesting!
