In the Days of Wonder Paris office (where our graphic studio is located, and incidentally where I work), we use Bacula to perform the multi-terabyte backup of the laaaaarge graphic files the studio produces every day.
The setup is as follows:
- one Linux box as the Bacula director, connected to
  - an Overland Arcvault 24 LTO 4, and to
  - an HP Network Server RS/12 SCSI cabinet (with four 15k RPM disks)
- one Apple Xserve which acts as the studio filer, connected through two Fibre Channel links to
  - one Apple Xserve RAID fully loaded with 500GB disks
- and in the middle, a “small” Dell 5324 Gigabit switch, which acts as a collapsed core for the office.
Both servers are connected to the switch through two gigabit Ethernet copper links, each pair forming an 802.3ad link. The Apple Xserve and the Linux box use a layer-3 hash algorithm to spread the load between the slaves.
OK, that’s the fine print.
When it comes to network gear, I’m pretty much Cisco only (sorry, but I never found anything better than IOS). When we installed this setup back in 2006, management decided not to go the full Cisco route for the office network because of the price (a Dell 5324 is about 800 EUR, compared to a 2960G-24, which is closer to 2000 EUR).
So this switch was installed there and never received an update (“if it ain’t broken, don’t fix it” is my motto). Until last Saturday, when I noticed that the switch, with the 126.96.36.199 firmware, uses only layer-2 hashing to select the outgoing slave in an 802.3ad channel bond. As you might have understood, this ruins all the efforts of both servers: since each of them has a single, constant MAC address, the same slave is always selected to move data from the switch to a given server.
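To make the problem concrete, here is a minimal sketch of a layer-2 hash. It assumes the kind of formula the Linux bonding driver documents for its layer2 policy (source MAC XOR destination MAC, modulo the number of slaves); the Dell firmware's actual algorithm is not known to me, and the MAC addresses, IP addresses and ports below are made up for illustration.

```python
def mac_to_int(mac: str) -> int:
    """Convert a colon-separated MAC address to an integer."""
    return int(mac.replace(":", ""), 16)


def layer2_slave(src_mac: str, dst_mac: str, n_slaves: int) -> int:
    """Assumed layer-2 policy: XOR of both MAC addresses, modulo slave count."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % n_slaves


# Hypothetical MAC addresses for the two bonded servers.
LINUX_BOX = "00:11:22:33:44:55"
XSERVE = "00:aa:bb:cc:dd:ee"

# Eight different flows (varying IP address and port) between the same two hosts.
# A layer-2 hash never looks at these fields, so they cannot change the outcome.
flows = [("10.0.0.%d" % i, 9000 + i) for i in range(8)]
chosen = {layer2_slave(LINUX_BOX, XSERVE, 2) for _flow in flows}

print(chosen)  # a single slave index, whatever the flows are
```

Whatever the number of parallel streams between the two machines, the MAC pair never changes, so under this assumed policy every frame leaves on the same slave.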
Brave as I am, I downloaded the new firmware revision (which needs a new boot image) and installed it remotely. And that was the start of the nightmare…
The switch upgraded the configuration to the new version, but unfortunately neither 802.3ad channel group came up after the restart. After investigating, I couldn’t find any valid reason why the peers wouldn’t form such groups.
OK, so back to the previous firmware (so that at least the backup scheduled for that night would succeed). Unfortunately, something I hadn’t thought about was that the new boot image couldn’t boot the old firmware. And even if it could, I would still have been stuck, because the old firmware couldn’t have run the configuration, which had been internally converted to the newer format…
I have downgraded Cisco gear before, and I never had such a failure… Back to the topic.
So the switch was bricked, sitting in the cabinet without switching any packets. Since we don’t have any remote console server (and I was at home), I left the switch as is until early Monday…
On Monday, I connected my helpful eeePC (and a USB/serial converter), launched Minicom, and connected to the switch’s serial console. I rebooted the switch, erased the config, rebooted, reloaded the config from our TFTP server, and I was back on 188.8.131.52 with both 802.3ad channel groups working… but still no layer-3 hashing…
But since I’m someone who wants to understand why things fail, I tried the move to firmware 184.108.40.206 again to see where I went wrong. And still the same result: no more channel groups, so back to 220.127.116.11 (because some angry users wanted to actually work that day :-))
After exchanging a few posts with some people on the Dell Community forum (I don’t have any support for this switch), it was suggested that I erase the configuration before moving to the new firmware.
And that did it. It seems that the process of upgrading the configuration to the newest version is buggy and produced a somewhat invalid configuration from which the switch was unable to recover.
In fact, the switch seems to compile the configuration into a binary form/structure that it uses to talk to the hardware. And when it upgraded the previous binary version, some bits presumably got flipped somewhere, and the various ports, although still in the channel groups, were set up as INDIVIDUAL instead of AGGREGATABLE.
Now the switch is running with a layer-3 hash algorithm, but it doesn’t seem to work properly: if I run two parallel netcats on two IP addresses on the first server, connected to two other netcats on the second server, everything goes over only one path. I think this part needs more testing…
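One thing worth ruling out before blaming the switch: with only two slaves, a simple layer-3 hash can legitimately put two different IP pairs on the same link. The sketch below assumes a hash of the form (source IP XOR destination IP) modulo the number of links, which is similar in spirit to what the Linux bonding driver documents; the switch's real algorithm may well differ, and all addresses are hypothetical.

```python
import ipaddress


def layer3_link(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Assumed layer-3 policy: XOR of both IPv4 addresses, modulo link count."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % n_links


# Hypothetical test pairs: (address on the first server, address on the second).
colliding_pairs = [("192.168.1.10", "192.168.1.20"),
                   ("192.168.1.12", "192.168.1.22")]
spreading_pairs = [("192.168.1.10", "192.168.1.20"),
                   ("192.168.1.11", "192.168.1.20")]

# With 2 links, only the lowest bit of the XOR matters.
colliding = [layer3_link(s, d, 2) for s, d in colliding_pairs]
spreading = [layer3_link(s, d, 2) for s, d in spreading_pairs]

print("colliding pairs use links:", colliding)
print("spreading pairs use links:", spreading)
```

Under this assumption, both colliding pairs end up on the same link, while the second set is split across the two. So a fairer test would pick address pairs whose XOR differs in the low bits, then compare the per-port counters on the switch.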
How would you test 802.3ad hashing?