Recent Posts

Help! Puppetd is eating my server!

4 minute read

This seems to have come up repeatedly over the last 3 or 4 days, with several #puppet, Redmine, or puppet-users requests asking why puppetd is consuming so much CPU and/or memory.

While I don’t have a definitive answer as to why it happens (hey, all software has bugs), I think it is important to at least know how to see what is going on. I also include some common issues I have observed myself.

Know your puppetd

I mean, know what puppetd is doing. That’s easy: disable puppetd on the host where you have the issue, and run it manually in debug mode. I’m really astonished that almost nobody tries a debug run before complaining that something doesn’t work :-)

% puppetd --disable
% puppetd --test --debug --trace
... full output on the console ...

At the same time, monitor the CPU usage and look at which debug entries are being printed when most of the CPU is consumed.
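For example, you can watch the process from another terminal while the debug run proceeds, with something along these lines (a quick sketch assuming procps’ top and pgrep are available, and that the pattern matches only your debug run):

% top -p $(pgrep -f 'puppetd --test')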

If nothing is printed at that moment and the process still uses CPU, CTRL-C it; it may print a useful stack trace that will help you (or us) understand what is happening.

With this you will certainly catch things you didn’t intend (see below about computing checksums when it isn’t necessary).

Inspect your ruby interpreter

I already mentioned this tip in my puppetmaster memory leak post a month ago. You can’t imagine how much useful information you can get with this tool.

Install the ruby gdb macro file into ~/.gdb/ruby as explained in the original article, then copy the following into your ~/.gdbinit:

define session-ruby
source ~/.gdb/ruby
end

Here I’m going to show how to do this with puppetmasterd, but it is exactly the same with puppetd.

Basically, the idea is to attach gdb to the puppet process, halt it, and look at the current stack trace:

% ps auxgww | grep puppetmasterd
puppet   28602  2.0  8.9 275508 184492 pts/3   Sl+  Feb19  65:25 ruby /usr/bin/puppetmasterd --debug
% gdb /usr/bin/ruby
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc....
(gdb) session-ruby
(gdb) attach 28602
Attaching to program: /usr/bin/ruby, process 28602...

Now our gdb is attached to our ruby interpreter.

Let’s see where we stopped:

(gdb) rb_backtrace
$3 = 34

Note: the output is displayed by default on the stdout/stderr of the attached process, so in our case on my puppetmasterd’s. Going to the terminal where it runs (actually a screen session):

...
        from /usr/lib/ruby/1.8/webrick/server.rb:91:in `select'
        from /usr/lib/ruby/1.8/webrick/server.rb:91:in `start'
        from /usr/lib/ruby/1.8/webrick/server.rb:23:in `start'
        from /usr/lib/ruby/1.8/webrick/server.rb:82:in `start'
        from /usr/lib/ruby/1.8/puppet.rb:293:in `start'
        from /usr/lib/ruby/1.8/puppet.rb:144:in `newthread'
        from /usr/lib/ruby/1.8/puppet.rb:143:in `initialize'
        from /usr/lib/ruby/1.8/puppet.rb:143:in `new'
        from /usr/lib/ruby/1.8/puppet.rb:143:in `newthread'
        from /usr/lib/ruby/1.8/puppet.rb:291:in `start'
        from /usr/lib/ruby/1.8/puppet.rb:290:in `each'
        from /usr/lib/ruby/1.8/puppet.rb:290:in `start'
        from /usr/sbin/puppetmasterd:285

It works! It is now easy to see what puppetd is doing:

  1. attach gdb to your running (and CPU-eating) puppetd
  2. stop it (issue CTRL-C in gdb)
  3. run rb_backtrace and copy the backtrace to a file
  4. issue ‘continue’ in gdb to let the process run again
  5. repeat from step 2 several times

Examining the stack traces should give you (or us) hints about what your puppetd is doing at that moment.
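If you prefer to script this sampling instead of typing it interactively, something like the following one-liner should do roughly the same job (a sketch assuming a gdb recent enough to support -p/-ex/-batch, the ruby macros installed in ~/.gdb/ruby as above, and 28602 as the hypothetical pid; remember that rb_backtrace prints on the attached process's own stdout/stderr, not on gdb's):

% gdb -batch -p 28602 -ex 'source ~/.gdb/ruby' -ex 'rb_backtrace' -ex 'detach'

Run it every few seconds and compare the backtraces you collect on the process's terminal.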

Possible causes of puppetd CPU consumption

A potential bug

You might have encountered a bug. Please report it in the Puppet Redmine, and attach all the useful information you gathered by following the two points above.

A recursive file resource with checksum on

That’s the usual suspect, and one I encountered myself.

Let’s say you have something like this in your manifest:

File { checksum => md5 }
...
file {  "/path/to/so/many/files":
    owner => myself, mode => 0644, recurse => true
}

What does that mean?

You’re telling Puppet that every file resource should compute a checksum, and you have a recursive file resource managing owner and mode. What puppetd will do is traverse the whole ‘/path/to/so/many/files’ hierarchy and happily manage each file, changing owner and mode where needed.

What you might have forgotten is that you requested the checksum to be MD5, so instead of only doing a bunch of stat(2) calls on your files, puppetd will also compute the MD5 sum of their content. If you have tons of files in this hierarchy, this can take quite some time. And since checksums are cached, it can also take quite some memory.

How to solve this issue:

File { checksum => md5 }
...
file { "/path/to/so/many/files":
  owner    => myself,
  mode     => 0644,
  recurse  => true,
  checksum => undef,
}

Sometimes it isn’t possible to solve the issue this way, for instance if your file {} resource retrieves files (i.e. there is a source parameter), because you then need a checksum to manage the files’ content. In this case, just bite the bullet and change the checksum to mtime, limit the recursion, or wait for my fix for Puppet bug #1469.
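For the sourced case, switching the checksum to mtime would look something like this (just a sketch, with a hypothetical fileserver URL):

file { "/path/to/so/many/files":
  source   => "puppet:///files/so/many/files",  # hypothetical mount/path
  owner    => myself,
  mode     => 0644,
  recurse  => true,
  checksum => mtime,
}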

Simply no reason

Actually, it is in your interest that puppetd takes 100% of the CPU while applying the configuration the puppetmaster has given it. That just means it will finish its job faster than if it were consuming only 10% of the CPU :-)

I mean, puppetd has a fixed amount of work to perform; some of it is CPU bound, some is I/O bound (actually most of it is I/O bound), so it is perfectly normal for it to take wall-clock time and consume resources while applying your manifests.

What is not normal is consuming CPU or memory between configuration runs. But you already know how to diagnose such issues if you read the start of this post :-)

Conclusion

Not all resource consumption is bad. We’re all dreaming of a faster puppetd.

And on that subject, I think it should be possible (provided Ruby supports native threads, which might make this a task for JRuby) to apply the catalog in a multi-threaded way. I have never really thought it through technically, but I don’t see why it couldn’t be done. It would allow puppetd to perform several I/O bound operations in parallel (like installing packages and managing files at the same time).

Failed upgrade, impossible to downgrade… Oh my…

3 minute read

In the Days of Wonder Paris office (where our graphic studio is located, and incidentally where I work), we use Bacula to perform the multi-terabyte backup of the laaaaarge graphic files the studio produces every day.

The setup is the following:

Both servers (an Apple Xserve and a Linux box) are connected to the switch through two gigabit ethernet copper links each, each pair forming an 802.3ad aggregate. The Apple Xserve and the Linux box both use a layer-3 hash algorithm to spread the load between the two slaves.
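On the Linux side, the bonding part of that setup boils down to something like this (a sketch assuming the in-kernel bonding driver and a Debian-style modprobe configuration; the exact hash policy value shown is an assumption, not a copy of our config):

# /etc/modprobe.d/bonding
alias bond0 bonding
options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4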

OK, that’s the fine print.

When it comes to network gear, I’m usually Cisco-only (sorry, but I have never found anything better than IOS). When we installed this setup back in 2006, management decided not to go the full Cisco route for the office network because of the price (a Dell 5324 is about 800 EUR, compared to around 2000 EUR for a 2960G-24).

So this switch was installed there and never received an update (“if it ain’t broken, don’t fix it” is my motto). Until last Saturday, when I noticed that with the 1.0.0.47 firmware the switch only uses layer-2 hashing to select the outgoing slave of an 802.3ad channel bond. As you might have understood, this ruins all the efforts of both servers: since each has a constant and unique MAC address, the same slave is always selected to move data from the switch to a given server.

Brave as I am, I downloaded the new firmware revision (which requires a new boot image) and remotely installed it. And that was the start of the nightmare…

The switch upgraded the configuration to the new version, but unfortunately neither 802.3ad channel group came up after the restart. After some digging, I couldn’t find any valid reason why the peers wouldn’t form the groups.

OK, so back to the previous firmware (so that at least the backup scheduled for that night would succeed). Unfortunately, something I hadn’t thought about was that the new boot image couldn’t boot the old firmware. And even if it could, I was still screwed, because it wouldn’t have been possible to run the configuration, since it had been internally converted to the newer format…

I have downgraded Cisco gear before, and I never had such a failure… Back to the topic.

So the switch was bricked, sitting in the cabinet without switching any packets. Since we don’t have any remote console server (and I was at home), I left the switch as is until early Monday…

On Monday, I connected my helpful eeePC (and a USB/serial converter), launched Minicom, and connected to the switch’s serial console. I rebooted the switch, erased the config, rebooted, reloaded the config from our TFTP server, and I was back to 1.0.0.47 with both 802.3ad channel groups working… but still no layer-3 hashing…

But since I’m someone who wants to understand why things fail, I tried the move to firmware 2.0.1.3 again to see where I had gone wrong. Same result: no more channel groups, so back to 1.0.0.47 (because some angry users actually wanted to work that day :-))

After exchanging a few posts with some people on the Dell Community forum (I don’t have any support for this switch), it was suggested that I erase the configuration before moving to the new firmware.

And that did it. It seems that the process of upgrading the configuration to the newest version is buggy and gave a somewhat invalid configuration from which the switch was unable to recover.

In fact, the switch seems to compile the configuration into a binary form/structure it uses to talk to the hardware. When it upgraded the previous binary version, some bits certainly flipped somewhere, and the various ports, although still in the channel groups, were set up as INDIVIDUAL instead of AGGREGATABLE.

Now the switch is running with a layer-3 hash algorithm, but it doesn’t seem to work correctly: if I run two parallel netcats bound to two different IP addresses on the first server, connected to two netcats on the second server, all the traffic still goes over a single path. I think this part needs more testing…
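Roughly, the test looks like this (with a traditional netcat, and hypothetical addresses, ports and interface names; the idea is to generate two flows that should hash differently and then watch the per-slave counters):

# on the second server, two listeners discarding whatever they receive
% nc -l -p 9001 > /dev/null &
% nc -l -p 9002 > /dev/null &

# on the first server, one stream towards each of the other box's addresses
% dd if=/dev/zero bs=1M count=1024 | nc 192.0.2.10 9001 &
% dd if=/dev/zero bs=1M count=1024 | nc 192.0.2.11 9002 &

# meanwhile, check whether both slaves actually carry traffic
% watch -n 1 'grep -E "eth0|eth1" /proc/net/dev'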

How would you test 802.3ad hashing?

February Puppet Dev Call

1 minute read

Yesterday we had the February Puppet Dev Call, with unfortunately poor audio and lots of Skype disconnections, which for a non-native English speaker like me made the call difficult to follow (strangely, the person I could hear best was Luke).

Puppet, brought to you by Reductive Labs

But it was an important meeting, as we now know how the development process will continue from here on. It was agreed (because it makes real sense) to keep master as the current stable branch and fork a ‘next’ branch for ongoing development of the next version.

The idea is that newcomers will just have to git clone the repository to produce a bug fix or stable feature, without having to wonder (or read the development process wiki page) where/how to get the code.

It was also decided that 0.25 was really imminent with a planned release date later this month.

Arghhh, this doesn’t leave me much time to finish the Application Controller stuff I’m currently working on. The issue is that I procrastinated a little with the storeconfigs speed-up patch (which I hope will be merged for 0.25) and a few important 0.24.x bug fixes.

There was also a discussion about what should be part of the Puppet core and what shouldn’t (like the recent zenoss patch). Digression: I’m considering doing an OpenNMS type/provider like the Zenoss or Nagios one.

Back to the real topic. It was proposed to have a repository of non-core features, but that essentially just creates more problems, including but not limited to:

  • Versioning of interdependent modules
  • Module dependencies
  • Module distribution
  • Testing (how do you run exhaustive tests if everything is scattered?)
  • Responsibility

Someone suggested (sorry, I can’t remember who) that we need a packaging system to fill this hole, but I don’t find that satisfactory. I understand the issue, but I have no immediate answer to the question (which is why I didn’t comment on this topic during the call).

Second digression: if you are reading this and want to contribute to Puppet (because it’s a wonderful piece of software, with a great developer team and a nicely done codebase), I can’t stress enough that you should read the following wiki pages:

Also come by #puppet and/or the puppet-dev Google group; we’re ready to help!