Masterzen’s Blog

Journey in a software world…

Archive for the ‘System Administration’ Category

At the Days of Wonder Paris office (where our graphic studio is located, and incidentally where I work), we use Bacula to perform the multi-terabyte backup of the laaaaarge graphic files the studio produces every day.

The setup is the following:

Both servers are connected to the switch through two gigabit Ethernet copper links, each pair forming an 802.3ad link. The Apple Xserve and the Linux box both use a layer-3 hash algorithm to spread the load between the slaves.

OK, that’s the fine print.

When it comes to network gear, I’m pretty much Cisco-only (sorry, but I never found anything better than IOS). When we installed this setup back in 2006, management decided not to go the full Cisco route for the office network because of the price (a Dell 5324 is about 800 EUR, compared to a 2960G-24, which is closer to 2000 EUR).

So this switch was installed there, and never received an update (“if it ain’t broken, don’t fix it” is my motto). Until last Saturday, when I noticed that the switch with the 1.0.0.47 firmware uses only layer-2 hashing to select the outgoing slave in an 802.3ad channel bonding. As you might have guessed, this ruins all the efforts of both servers: since each of them has a constant and unique MAC address, the same slave is always selected to move data from the switch to any server.
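To see why a MAC-only hash defeats the bonding here, consider a minimal sketch of such a hash. It assumes a simple XOR-and-modulo scheme and uses made-up MAC and IP addresses; the switch’s real algorithm is not documented in this post and may well differ.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src_mac: str
    dst_mac: str
    src_ip: str  # carried along only to show that a layer-2 hash ignores it
    dst_ip: str


def mac_to_int(mac: str) -> int:
    """Turn 'aa:bb:cc:dd:ee:ff' into an integer."""
    return int(mac.replace(":", ""), 16)


def layer2_link(flow: Flow, n_links: int) -> int:
    """Assumed layer-2 hash: XOR of both MAC addresses, modulo the link count."""
    return (mac_to_int(flow.src_mac) ^ mac_to_int(flow.dst_mac)) % n_links


# Hypothetical (locally administered) MAC addresses for the two servers.
XSERVE_MAC = "02:00:00:00:00:01"
LINUX_MAC = "02:00:00:00:00:02"

# Ten flows with different source IPs but the same pair of MAC addresses.
flows = [
    Flow(XSERVE_MAC, LINUX_MAC, f"192.168.1.{host}", "192.168.1.200")
    for host in range(10, 20)
]
links_used = {layer2_link(flow, 2) for flow in flows}
print(links_used)  # a set with a single link index
```

Whatever the IP addresses are, the MAC pair is the only input, so every flow between the two servers ends up on the same link.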

Brave as I am, I downloaded the new firmware revision (which needs a new boot image), and installed it remotely. And that was the start of the nightmare…

The switch upgraded the configuration to the new version, but unfortunately both 802.3ad channel groups were not up after the restart. After investigating, I couldn’t find any valid reason why the peers wouldn’t form such groups.

OK, so back to the previous firmware (so that at least the backup scheduled for the same night would succeed). Unfortunately, what I hadn’t thought about was that the new boot image couldn’t boot the old firmware. And even if it could, I would still have been stuck, because the old firmware couldn’t have run the configuration, which had been internally converted to the newer format… I have downgraded Cisco gear before, and never had such a failure… Back to the topic.

So the switch was bricked, sitting in the cabinet without switching any packets. Since we don’t have any remote console server (and I was at home), I left the switch as is until early Monday…

On Monday, I connected my helpful eeePC (and a USB/serial converter), launched Minicom, and connected to the switch serial console. I rebooted the switch, erased the config, rebooted again, reloaded the config from our TFTP server, and I was back on 1.0.0.47 with both 802.3ad channel groups working… but still no layer-3 hashing…

But since I’m someone who wants to understand why things fail, I also tried the move to firmware 2.0.1.3 again to see where I went wrong. And still the same result: no more channel groups, so back to 1.0.0.47 (because some angry users wanted to actually work that day :-) )

After exchanging a few posts with some people on the Dell Community forum (I don’t have any support contract for this switch), someone suggested that I erase the configuration before moving to the new firmware.

And that did it. It seems that the process of upgrading the configuration to the newest version is buggy and gave a somewhat invalid configuration from which the switch was unable to recover.

In fact, the switch seems to compile the configuration into a binary form/structure it uses to talk to the hardware. When it upgraded the previous binary version, some bits presumably flipped somewhere, and the various ports, although still in the channel groups, were set up as INDIVIDUAL instead of AGGREGATABLE.

Now the switch is running with a layer-3 hash algorithm, but it doesn’t seem to work properly: if I run two parallel netcats on two IP addresses on the first server, connected to two other netcats on the second server, everything goes over only one path. I think this part needs more testing…
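One possible explanation, offered only as a hypothesis: with a simple XOR-and-modulo hash over the two IP addresses, two different address pairs can easily land on the same link. The sketch below assumes such a hash (the switch’s actual algorithm may differ) and uses made-up addresses to show that the choice of test addresses matters.

```python
import ipaddress


def layer3_link(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Assumed layer-3 hash: XOR of both IPv4 addresses, modulo the link count."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_links


# Hypothetical test pairs: (address on the first server, address on the second).
colliding_pairs = [
    ("192.168.1.10", "192.168.1.20"),
    ("192.168.1.11", "192.168.1.21"),
]
spreading_pairs = [
    ("192.168.1.10", "192.168.1.20"),
    ("192.168.1.11", "192.168.1.20"),
]

for name, pairs in (("colliding", colliding_pairs), ("spreading", spreading_pairs)):
    links = [layer3_link(src, dst, 2) for src, dst in pairs]
    print(name, links)
```

Under this assumed hash, the first set of pairs uses a single link while the second uses both, so a test with only two flows says little unless the address pairs are chosen to hash differently.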

How would you test 802.3ad hashing?

February Puppet Dev Call

Yesterday we had the February Puppet Dev Call, unfortunately with poor audio and lots of Skype disconnections, which made the call difficult to follow for a non-native English speaker like me (strangely, the one I could hear best was Luke).


But that was an important meeting, as we now know how the development process will continue from now on. It was agreed (because it makes real sense) to have master as the current stable branch and to fork a ‘next’ branch for ongoing development of the next version.

The idea is that newcomers will just have to git clone the repository to produce a bug fix or stable feature, without having to wonder (or read the development process wiki page) where/how to get the code.

It was also decided that 0.25 was really imminent with a planned release date later this month. Arghhh, this doesn’t leave me lots of time to finish the Application Controller stuff I’m currently working on. The issue is that I procrastinated a little bit with the storeconfigs speed-up patch (which I hope will be merged for 0.25), and a few important 0.24.x bug fixes.

There was also a discussion about what should be part of the Puppet core and what shouldn’t (like the recent Zenoss patch). Digression: I’m considering writing an OpenNMS type/provider like the Zenoss or Nagios ones. Back to the real topic. It was proposed to have a repository of non-core features, but this essentially only creates more trouble, including but not limited to:

  • Versioning of interdependent modules
  • Module dependencies
  • Module distribution
  • Testing (how do you run exhaustive tests if everything is scattered?)
  • Responsibility

Someone suggested (sorry can’t remember who) that we need a packaging system to fill this hole, but I don’t think it is satisfactory. I understand the issue, but have no immediate answer to this question (that’s why I didn’t comment on this topic during the call).

Second digression: if you read this and want to contribute to Puppet (because it’s wonderful software, with a great developer team and a nice, well-done codebase), I can’t stress enough that you should read the following wiki pages:

Also come by #puppet and/or the puppet-dev Google group, we’re ready to help!

The curse of bad blocks (is no more)

If, like me, you are struggling with old disks (in my case SCSI 10k RPM Ultra Wide 2 HP disks) that exhibit bad blocks, here is a short survival howto. Those disks sit in a refurbished HP Network RS/12 that I use as a spool area for Bacula backups of our Apple Xserve RAID, which is used by the Days of Wonder graphic studio (and those guys know how to produce huge files, trust me).

For a couple of days now, one of the disks has been exhibiting read errors on some sectors (did I say they are old?), so while waiting for it to be replaced by other (old) disks, I had to find a way to keep it working.

Of course the SCSI utility in the Adaptec SCSI card has a remapping tool, but you have to reboot the server and keep it offline during the verify, which can take a long time, so that wasn’t an option.

I then learnt about sg3_utils (sg3-utils for the Debian package) thanks to the very good smartmontools page on bad block handling.

This set of tools addresses SCSI disks directly (through SCSI commands and mode pages) to instruct the disk to do things. What’s interesting is that it comes with two very useful commands (there might be more, of course):

  • sg_verify: to check for the health of a sector
  • sg_reassign: to remap a dead sector to one from the good sector list

Here is the use case:

backup:~# dd if=/dev/sda iflag=direct of=/dev/null skip=1915 bs=1M
dd: reading `/dev/sda': Input/output error
12+0 records in
12+0 records out
12582912 bytes (13 MB) copied, 1.41468 seconds, 8.9 MB/s

Something is wrong: we only read 13 MB instead of the rest of the disk.
Let’s have a look at the kernel log:

backup:~# dmesg | tail
[331709.192108] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[331709.192108] sd 0:0:0:0: [sda] Sense Key : Medium Error [current]
[331709.192108] Info fld=0x3c3bb1
[331709.192108] sd 0:0:0:0: [sda] Add. Sense: Read retries exhausted
[331709.192108] end_request: I/O error, dev sda, sector 3947441

Indeed, /dev/sda has a failed sector (at LBA 3947441).
Let’s confirm it:

backup:~# sg_verify --lba=3947441 /dev/sda
verify (10):  Fixed format, current;  Sense key: Medium Error
 Additional sense: Read retries exhausted
  Info fld=0x3c3bb1 [3947441]
  Actual retry count: 0x003f
medium or hardware error, reported lba=0x3c3bb1
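The numbers reported by dd, the kernel and sg_verify all point at the same place on the disk, which can be checked with a bit of arithmetic. The following is a small sketch assuming 512-byte sectors (the sector size used in the dd test with bs=512 in this post); the helper name is made up for illustration.

```python
SECTOR_SIZE = 512          # assumption: 512-byte sectors
DD_BLOCK_SIZE = 1024 ** 2  # dd was run with bs=1M
DD_SKIP = 1915             # dd was run with skip=1915


def failing_dd_block(lba: int) -> int:
    """Index of the 1 MiB dd block that contains the given sector."""
    return (lba * SECTOR_SIZE) // DD_BLOCK_SIZE


lba = int("0x3c3bb1", 16)           # "Info fld" value from the kernel log
block = failing_dd_block(lba)       # 1 MiB block holding the bad sector
blocks_read_ok = block - DD_SKIP    # full blocks dd could read before failing

print(lba, block, blocks_read_ok)   # 3947441 1927 12
```

The 12 blocks match the "12+0 records in" reported by dd before the input/output error.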

Check the defect list:

backup:~# sg_reassign --grown /dev/sda
>> Elements in grown defect list: 0

And tell the disk firmware to reassign the sector:

backup:~# sg_reassign --address=3947441 /dev/sda

Now verify that it was remapped:

backup:~# sg_reassign --grown /dev/sda
>> Elements in grown defect list: 1

Do we have a working sector?

backup:~# dd if=/dev/sda iflag=direct of=/dev/null bs=512 count=1 skip=3947441
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00780813 seconds, 65.6 kB/s

The sector could be read! The disk is now safe.

Of course, this tutorial might not work for every disk: PATA and SATA disks don’t respond to these SCSI commands. For those disks, you have to write to the failed sector with dd, and the disk firmware should automatically remap the sector. This can be verified by looking at the Reallocated_Sector_Ct attribute in the output of smartctl -a.

Good luck :-)

Puppet Memory Leaks… Or not…

From time to time we get some complaints about so-called Puppet memory leaks either on #puppet, on the puppet-user list or in the Puppet redmine.

I tried hard to reproduce the issue on the Days of Wonder servers (mostly up-to-date debian), but never could. Starting from there I tried to gather from the various people I talked to on various channels what could be the cause, if they solved it and how.

You can also be sure there are no memory leaks in the Puppet source code. All of the identified memory leaks are either not memory leaks per se or are caused by code outside of Puppet’s control (Ruby itself or a library).

Watch your Ruby

It is known that some Ruby versions (around 1.8.5 and 1.8.6) exhibit leaks of some sort. This is especially true of the versions shipped with RHEL 4 and 5 (and some Fedora releases too), as I found with the help of one Puppet user, or as others found.

Upgrading Ruby to 1.8.7-p72, either from source or from a repository, is usually enough to fix it.

Storeconfigs and MySQL

I also encountered some people who told me that storeconfigs with MySQL, but without the real ruby-mysql gem, leads to an ever-increasing memory footprint for their puppetmaster.

Storeconfigs and Rails < 2.1

It also seems to be common advice to use Rails 2.1 if you use storeconfigs. I don’t know if Puppet uses this, but it seems that nested includes leak in Rails 2.0.

Is it really a leak?

The items outlined above are real leaks. Some people (including myself) encountered a different issue: the puppetmaster consumes lots of memory while transferring files to the clients.

In fact, up to Puppet 0.25 (not yet released at this time), Puppet uses XMLRPC as its communication protocol. Unfortunately this is not a transfer protocol but a Remote Procedure Call protocol. This means that to transfer binary files, Puppet has to load the whole file in memory and then escape its content (same escaping as in URLs, which means every byte outside of 32-127 takes 3 bytes). Usually that means the master has to allocate roughly 2.5 times the size of the file currently being transferred. Puppet 0.25 will use REST (so native HTTP) to transfer files, which will bring speed and streaming to file serving.
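The size blow-up can be estimated with a few lines of code. This is only a rough model, assuming every byte outside 32-127 becomes a three-character %XX sequence and every other byte passes through unchanged; the escaping actually performed by Puppet and its XMLRPC library may differ in the details.

```python
def escaped_size(data: bytes) -> int:
    """Size of the escaped payload under the simplified model described above."""
    return sum(1 if 32 <= byte <= 127 else 3 for byte in data)


# Every byte value exactly once: a stand-in for evenly distributed binary content.
sample = bytes(range(256))
ratio = escaped_size(sample) / len(sample)
print(ratio)  # 2.25
```

Under this model, evenly distributed binary content grows by a factor of 2.25 once escaped, the same order of magnitude as the figure quoted above, while plain ASCII text would not grow at all.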

With luck, if the garbage collector has a chance to run (because your Ruby interpreter is not too heavily loaded), it will deallocate all the memory used for files. If you are not so lucky, the Ruby interpreter doesn’t have time to run a full garbage collection cycle, and the memory usage grows.

Some people running high-load puppetmasters have separated their file-serving puppetmaster from their config-serving puppetmaster to alleviate this issue.

Also, if like me you are using recursive file copy, you might encounter Bug #1469 File recursion with a remote source should not recurse locally.

I still have a leak you didn’t explain

Here is how you can find leaks in a ruby application:

I tried the three aforementioned techniques, and found that the GDB trick is the easiest one to use and set up.

Another Ruby?

There’s also something that I think hasn’t been tried yet: running Puppet under a different Ruby interpreter (we’d say Virtual Machine in this case). For instance JRuby is running on top of the Java Virtual Machine which has more than 14 years of Garbage Collection development behind it.

You can also be sure that a different Ruby interpreter won’t have the same bugs or memory leaks as the regular one (the so-called Matz Ruby Interpreter, from the name of its author).

There are some nice Ruby VMs under development right now, and I’m sure I’ll blog about using Puppet on some of them soon :-)

Have you ever wondered why net-snmp doesn’t report a correct interface speed on Linux?

I was wondering too, until this morning… I tried running net-snmp as root and, miracle, the right interface speed was detected for my interfaces.

In fact, net-snmp uses the SIOCETHTOOL ioctl to access this information. Unfortunately, the “get settings” variant of this ioctl requires the CAP_NET_ADMIN capability. Of course root has this capability, but when net-snmp drops its privileges to an unprivileged user, the capability is lost and the ioctl fails with EPERM.

That’s too bad, because getting this information is harmless and shouldn’t require special privileges. Someone even posted a Linux kernel patch to remove the CAP_NET_ADMIN check for SIOCETHTOOL, but it doesn’t seem to have been merged.
The fix could also be made on the snmpd side, by reading the speed before dropping privileges.

The workaround is to tell net-snmp what the interfaces look like:

interface eth0 6 100000000
interface eth1 6 1000000000

Here I defined eth0 as a 100 Mbit/s FastEthernet interface, and eth1 as a GigabitEthernet interface (the last field is the speed in bits per second).
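Since the speed field is expressed in bits per second, it is easy to get the number of zeros wrong when typing these lines by hand. As a small sketch, the lines can be generated from speeds given in Mbit/s; the helper name and the interface list are made up for illustration.

```python
ETHERNET_TYPE = 6  # interface type used in the lines above (Ethernet)


def interface_line(name: str, speed_mbit: int) -> str:
    """Build an 'interface NAME TYPE SPEED' line, with SPEED in bits per second."""
    return f"interface {name} {ETHERNET_TYPE} {speed_mbit * 1_000_000}"


# Hypothetical inventory: interface name -> nominal speed in Mbit/s.
for name, mbit in (("eth0", 100), ("eth1", 1000)):
    print(interface_line(name, mbit))
```

This prints one line per interface, ready to be pasted into the net-snmp configuration.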

For a few months now, we have been monitoring our infrastructure at Days of Wonder with OpenNMS. Until this afternoon we were running the beta/final candidate version 1.5.93.

We are monitoring a few things with the JDBC Stored Procedure Poller, which is really great to monitor complex business operations without writing remote or GP scripts.

Unfortunately the migration to OpenNMS 1.6.1 led me to discover that the JDBC Stored Procedure poller was not working anymore, crashing with a NullPointerException in the MySQL JDBC Driver while trying to fetch the output parameter.

In fact it turned out I was plain wrong. I was using a MySQL PROCEDURE:

DELIMITER //
CREATE PROCEDURE `check_for_something`()
READS SQL DATA
BEGIN
SELECT ... AS valid FROM ...;
END //
DELIMITER ;

But this OpenNMS poller uses the following JDBC procedure call:

{ ? = call check_for_something() }

After some struggling, wrestling, and various MySQL Connector/J driver upgrades, I finally figured out what the driver was doing:

The driver rewrites the call I gave above to something like this:

SELECT check_for_something();

This means that the procedure should in fact be a SQL FUNCTION.

Here is the same procedure rewritten as a FUNCTION:

DELIMITER //
CREATE FUNCTION `check_for_something`()
RETURNS int(11)
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE valid INTEGER;
SELECT ... INTO valid FROM ...;
RETURN valid;
END //
DELIMITER ;
It now works. I’m amazed it ever worked in the first place with 1.5.93 (and it did, for sure).
