We put a 4-port Intel I340-T4 NIC in a FreeBSD 9.3 server [1] and configured it for link aggregation in LACP mode, in an attempt to decrease the time it takes to mirror 8 to 16 TiB of data from a master file server to 2-4 clones in parallel. We expected up to 4 Gbit/sec of aggregate bandwidth, but no matter what we've tried, it never comes out faster than 1 Gbit/sec aggregate. [2]
We're using iperf3 to test this on a quiescent LAN. [3] The first instance nearly hits a gigabit, as expected, but when we start a second one in parallel, the two clients both drop to roughly ½ Gbit/sec. Adding a third client drops all three to ~⅓ Gbit/sec, and so on.
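For concreteness, the tests look roughly like this (a sketch, not our exact invocations; the port numbers are arbitrary, and 10.0.0.2 is the lagg0 address shown further down). Since an iperf3 server only runs one test at a time, each client gets its own listener:

$ iperf3 -s -p 5201 &   # on the server: one listener per test client
$ iperf3 -s -p 5202 &
$ iperf3 -s -p 5203 &
$ iperf3 -s -p 5204 &
$ iperf3 -c 10.0.0.2 -p 5201 -t 30   # on client 1; clients 2-4 use 5202-5204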
We've taken care, in setting up the iperf3 tests, to ensure that traffic from all four test clients comes into the central switch on different ports.
We've verified that each test machine has an independent path back to the rack switch, and that the file server, its NIC, and the switch all have the bandwidth to pull this off, by breaking up the lagg0 group and assigning a separate IP address to each of the four interfaces on this Intel network card. In that configuration, we did achieve ~4 Gbit/sec of aggregate bandwidth.
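For reference, that test amounts to something like the following, run from the console (a sketch; only 10.0.0.2 appears in our real configuration, and the .3-.5 addresses here are invented for illustration):

$ sudo ifconfig lagg0 destroy
$ sudo ifconfig igb0 inet 10.0.0.2 netmask 255.255.255.0 up
$ sudo ifconfig igb1 inet 10.0.0.3 netmask 255.255.255.0 up
$ sudo ifconfig igb2 inet 10.0.0.4 netmask 255.255.255.0 up
$ sudo ifconfig igb3 inet 10.0.0.5 netmask 255.255.255.0 up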
When we started down this path, we were using an old SMC8024L2 managed switch. It wasn't the highest-end switch of its day, but it's supposed to be able to do this. We thought the switch might be at fault, due to its age, but upgrading to a much more capable HP 2530-24G did not change the symptom.
The HP 2530-24G switch claims the four ports in question are indeed configured as a dynamic LACP trunk:
# show trunks

Load Balancing Method:  L3-based (default)

  Port | Name                             Type      | Group Type
  ---- + -------------------------------- --------- + ----- --------
  1    | Bart trunk 1                     100/1000T | Dyn1  LACP
  3    | Bart trunk 2                     100/1000T | Dyn1  LACP
  5    | Bart trunk 3                     100/1000T | Dyn1  LACP
  7    | Bart trunk 4                     100/1000T | Dyn1  LACP
We've tried both passive and active LACP.
We've verified that all four NIC ports are getting traffic on the FreeBSD side with:
$ sudo tshark -n -i igb$n
Oddly, tshark shows that in the case of just one client, the switch splits the 1 Gbit/sec stream over two ports, apparently ping-ponging between them. (Both the SMC and HP switches showed this behavior.)
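A lighter-weight way to watch the same thing, for anyone wanting to reproduce this without a capture running, is the per-interface counters in base-system netstat (interface names as in the ifconfig output below):

$ netstat -w 1 -I igb0   # per-second packets/bytes on one member port; likewise igb1-igb3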
Since the clients' aggregate bandwidth only comes together in a single place — at the switch in the server's rack — only that switch is configured for LACP.
It doesn't matter which client we start first, or which order we start them in.
ifconfig lagg0 on the FreeBSD side says:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=401bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
        ether 90:e2:ba:7b:0b:38
        inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
        inet6 fe80::92e2:baff:fe7b:b38%lagg0 prefixlen 64 scopeid 0xa
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
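The /etc/rc.conf setup behind that is the standard handbook-style LACP configuration, more or less (a sketch of what produces the output above, not a verbatim copy of our file):

ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_igb2="up"
ifconfig_igb3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 laggport igb2 laggport igb3 10.0.0.2 netmask 255.255.255.0"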
We've applied as much of the advice in the FreeBSD network tuning guide as makes sense to our situation. (Much of it is irrelevant, such as the stuff about increasing max FDs.)
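The parts that do apply are mostly sysctl tunables for socket buffers and TCP window sizing, along these lines (illustrative values only, not a claim about our exact settings or about everything in that guide):

$ sudo sysctl kern.ipc.maxsockbuf=16777216
$ sudo sysctl net.inet.tcp.sendbuf_max=16777216
$ sudo sysctl net.inet.tcp.recvbuf_max=16777216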
We've tried turning off TCP segmentation offloading, with no change in the results.
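On FreeBSD, that amounts to clearing a per-interface flag (our exact invocations may have differed):

$ sudo ifconfig igb0 -tso   # and likewise for igb1, igb2, and igb3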
We do not have a second 4-port server NIC to set up a second test. Because of the successful test with 4 separate interfaces, we're going on the assumption that none of the hardware is damaged. [4]
We see these paths forward, none of them appealing:
1. Buy a bigger, badder switch, hoping that SMC's LACP implementation just sucks, and that the new switch will be better. (Upgrading to an HP 2530-24G didn't help.)
2. Stare at the FreeBSD lagg configuration some more, hoping that we missed something. [5]
3. Forget link aggregation and use round-robin DNS to effect the load balancing instead. (See the zone file sketch just after this list.)
4. Replace the server NIC and switch again, this time with 10 GigE stuff, at about 4× the hardware cost of this LACP experiment.
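For what it's worth, the round-robin DNS option would mean going back to four separately-addressed interfaces and publishing one A record per NIC, something like the following zone file fragment (hypothetical records; the host name comes from the switch config above, and the addresses past 10.0.0.2 are invented):

bart    IN  A   10.0.0.2
bart    IN  A   10.0.0.3
bart    IN  A   10.0.0.4
bart    IN  A   10.0.0.5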
Footnotes
1. Why not move to FreeBSD 10, you ask? Because FreeBSD 10.0-RELEASE still uses ZFS pool version 28, and this server's been upgraded to ZFS pool 5000, a new feature in FreeBSD 9.3. The 10.x line won't get that until FreeBSD 10.1 ships, about a month hence. And no, rebuilding from source to get onto the 10.0-STABLE bleeding edge isn't an option, since this is a production server.
2. Please don't jump to conclusions. Our test results later in the question tell you why this is not a duplicate of that question.
3. iperf3 is a pure network test. While the eventual goal is to try to fill that 4 Gbit/sec aggregate pipe from disk, we are not yet involving the disk subsystem.
4. Buggy or poorly designed, maybe, but no more broken than when it left the factory.
5. I've already gone cross-eyed from doing that.