Recent Posts

PLD to the rescue!

2 minute read

There is something I used to hate to do. And I think all admins also hate to do that.

It’s when you need to reboot a server into a rescue environment to perform an administration task (e.g. fixing unbootable servers, fixing crashed root filesystems, and so on).

The commonly found problems with rescue environments are:

  • they’re not always remotely usable
  • they’re not always updated to your specific kernel version or tool
  • they can be difficult to use
  • some are CD or DVD only (no netboot, no usb keys…)
  • they don’t recognize your dumb azerty keyboard (argh, too much time spent looking for / or .)

OK, so a long time ago, I had a crashed server refusing to start on a reboot, and I had to choose a rescue environment for Linux servers, other than booting on the Debian CD once again.

That’s how I discovered the PLD Linux rescue CD, PLD Rescue, and GRML.

My heart still goes to PLD rescue (because it’s really light), but I must admit that GRML has a really good zsh configuration (I even used some of their configuration ideas for my day to day zsh).

On that subject, if you don’t use zsh, or don’t even know it, and still want to qualify as a knowledgeable Unix admin, then please try it (preferably with GRML, so that you’ll have an idea of what’s possible; they even have good documentation). Another solution is of course to buy this really good book: “From Bash to Z Shell: Conquering the Command Line”.

That makes me think I should do a whole blog post on zsh.

OK, so let’s go back to our sheep (yes, that’s a literally translated French expression, so I don’t expect anyone to grasp the funny part except the occasional French guys reading me :-)).

So what’s so good about PLD Rescue:

  • it supports serial console (and that’s invaluable if, like me, you use a console server, and you should)
  • it can be booted:
      • through PXE
      • with a USB key
      • with a CD/DVD
      • directly from an image on a hard drive
  • it’s fully packed with only sysadmin tools - that’s the perfect sysadmin Swiss Army knife
  • it always stays up to date (currently kernel 2.6.28)
  • it works on x86 and amd64 servers

So my basic usage is to have a PXE netboot environment in our remote colocation, plus a console server (a really damn good Opengear CM4116).

With this setup I can netboot remotely any server to a PLD Rescue image with serial support, and then rescue my servers without going to the datacenter (it’s not that it is far from home or the office, but at 3AM, you don’t usually want to go out).

If you have a preferred rescue setup, please share it!

OMG!! storeconfigs killed my database!

7 minute read

When I wrote my previous post titled all about storedconfigs, I was pretty confident I explained everything I could about storedconfigs… I was wrong of course :-)

A couple of days ago, I was helping some USG admins who were facing an interesting issue. Interesting for me, but I don’t think they’d share my views on this, as their servers were melting down under the database load. But first let me explain the issue.

The issue

The thing is that when a client checks in to get its configuration, the puppetmaster compiles its configuration to a digestible format and returns it. This operation is the process of transforming the AST built by parsing the manifests into what Puppet calls the catalog. It is this catalog (which is in fact a graph of resources) that is later applied by the client.

When the compilation process is over, and if storedconfigs is enabled on the master, the master connects to the RDBMS and retrieves all the resources, parameters, tags and facts. Those, if any, are compared to what has just been compiled, and if some resources differ (by value/content, or if there are some missing or new ones), they get written to the database.
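To make the comparison step concrete, here is a minimal sketch in Python. It is not Puppet's actual code: the function name `plan_writes` and the dictionary layout (resource title mapped to its parameters) are assumptions made for illustration only.

```python
def plan_writes(stored, compiled):
    """Return the resource titles to insert, update and delete.

    Both arguments map a resource title to a dict of its parameters.
    This layout is hypothetical; it only illustrates the comparison.
    """
    to_insert = [t for t in compiled if t not in stored]
    to_update = [t for t in compiled if t in stored and stored[t] != compiled[t]]
    to_delete = [t for t in stored if t not in compiled]
    return to_insert, to_update, to_delete


compiled = {
    "File[/etc/motd]": {"mode": "0644", "owner": "root"},
    "Package[ntp]": {"ensure": "installed"},
}

# Empty database: every compiled resource has to be written.
first_run = plan_writes({}, compiled)

# Database already populated: only the resource that changed is written.
stored = {
    "File[/etc/motd]": {"mode": "0600", "owner": "root"},
    "Package[ntp]": {"ensure": "installed"},
}
later_run = plan_writes(stored, compiled)

print(first_run)
print(later_run)
```

The point of the sketch is the asymmetry: with nothing stored, the write set is the whole catalog, while a populated database only receives the differences.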

Pretty straightforward, isn’t it?

As you can see, this process is synchronous and while the master processes the storedconfigs operations, it doesn’t serve anybody else.

Now, imagine you have a large site (i.e. hundreds of puppetd clients), and you decide to turn on storedconfigs. All the clients checking in will see their current configuration stored in the database.

Unfortunately, on the first storedconfigs run for a client, the database is empty, so the puppetmaster has to send all the information to the RDBMS, which in turn has to write it to the disks. Of course, on subsequent runs only what has been modified needs to reach the RDBMS, which is much less than the first time (provided you are running 0.24.8 or applied my patch).

But if your RDBMS is not correctly set up or not sized for so much concurrent write load, the storedconfigs process will take time. During this time the master is pinned to the database and can’t serve clients. So the immediate effect is that new clients checking in will see timeouts, load will rise, and so on.

The database

If you are in the aforementioned scenario you must be sure your RDBMS hardware is properly sized for this peak load, and that your database is properly tuned. I’ll soon give some generic MySQL tuning advice to let MySQL handle the load, but remember it is generic, so YMMV.

Size the I/O subsystem

What people usually forget is that disks (i.e. those with rotating platters, not SSDs) have a maximum number of I/O operations per second. For professional high-end disks this value is about 250 IOPS.

Now, to simplify, let’s say your average puppet client has 500 resources with an average of 4 parameters each. That means the master will have to perform at least 500 * 4 + 500 = 2500 writes to the database (that’s naive since there are indices to modify, transactions can be grouped, etc., but you see the point).

Add to this the tags, hmm, let’s say an average of 4 tags per resource, and we have 500 * 4 + 500 + 500 * 4 = 4500 writes to perform to store the configuration of a given host.

Now remember our 250 IOPS: how many seconds does the disk need to perform 4500 writes? The answer is 18s!! Which is a high value. During this time you can’t do anything else. Now add concurrency to the mix, and imagine what that means.

Of course this supposes we have to wait for the disk to have finished (i.e. synchronous writing), but in fact that’s pretty much how an RDBMS works if you really want to trust your data. So the result is that if you want a fast RDBMS you must be ready to pay for an expensive I/O subsystem.
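The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the same naive model (one write per resource, parameter and tag, fully synchronous disk); the function names are made up for the example, and the 10-client figure is an illustrative assumption, not a measurement.

```python
def storedconfigs_writes(resources, params_per_resource, tags_per_resource):
    # One write per resource, plus one per parameter and one per tag.
    return (resources
            + resources * params_per_resource
            + resources * tags_per_resource)


def seconds_on_disk(writes, iops):
    # Naive model: every write costs one synchronous I/O operation.
    return writes / iops


writes = storedconfigs_writes(resources=500, params_per_resource=4, tags_per_resource=4)
seconds = seconds_on_disk(writes, iops=250)
print(writes, seconds)  # 4500 18.0

# Hypothetical: 10 clients storing their first catalog one after the other.
print(seconds_on_disk(writes * 10, iops=250))  # 180.0
```

Changing the `iops` argument is a quick way to see what a larger array or a write cache would buy you under this simplified model.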

Size the I/O subsystem

That’s certainly the most important part of your server.

You need:

  • fast disks (15k RPM, because there is a real latency benefit compared to 10k)
  • as many spindles as possible, grouped in a sane RAID array like RAID10. Please forget RAID5 if you want your data to be safe (and your writes to be fast). I saw too many horror stories with RAID5. I should really join the BAARF.
  • a Battery Backed RAID Cache unit (that will absorb the fsyncs gracefully).
  • a RAID tuned for the largest stripe size. Disable the RAID read cache if possible (InnoDB will take care of the read cache with the InnoDB buffer pool).

If you don’t have this, do not even think of turning on storedconfigs for a large site.

Size the RDBMS server

Of course other things matter. If the database can fit in RAM (the best option if you don’t want to be I/O bound), then you obviously need RAM. Preferably ECC Registered RAM. Use 64-bit hardware with a 64-bit OS.

Then you need some CPU. Nowadays they’re cheap, but beware of InnoDB scaling issues on multi-core/multi-CPU systems (see below).

Tune the database configuration

Here is a checklist on how to tune MySQL for a mostly write load:

InnoDB of course

For concurrency, stability and durability reasons InnoDB is mandatory. MyISAM is at best usable for read workloads, but it suffers from concurrency issues, so it is a no-no for our topic.

Tuned InnoDB

The default InnoDB settings are tailored to very small 10-year-old servers…

Things to look at:

  • innodb_buffer_pool_size. Usual advice says 70% to 80% of the physical RAM of the server if MySQL is the only running application. I’d say that it depends on the size of the database. If you know you’ll store only a few MiB, no need to allocate 2 GiB :-). More information is in this useful and interesting blog post from the Percona guys.
  • innodb_log_file_size. We want the logs to be as large as we can, to ease the mostly-write load we have. Once all the clients are stored in the database we’ll reduce this to something lower. The trade-off with large logs is the recovery time in case of a crash. It isn’t uncommon to see several hundreds of MiB, or even GiB.
  • innodb_flush_method = O_DIRECT on Linux. This is to prevent the OS from caching the innodb_buffer_pool content (thus ending up with a double cache).
  • innodb_flush_log_at_trx_commit=2, if your MySQL server doesn’t have any other use than storedconfigs or you don’t care about the D in ACID. Otherwise use 1 (the fully durable default). It is also possible to temporarily change it to 2, and then move back to 1 when all clients have their configs stored.
  • transaction-isolation=READ-COMMITTED. This one can help too, although I never tested it myself.
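Gathered in one place, the checklist could look like the `[mysqld]` section rendered by this small Python sketch. The option names come from the list above; the two sizes (2G and 512M) are placeholders chosen for the example, not recommendations, and the helper `render_my_cnf` is hypothetical.

```python
import configparser
import io

# Values marked "placeholder" are assumptions: size them for your own
# database and hardware.
settings = {
    "innodb_buffer_pool_size": "2G",        # placeholder
    "innodb_log_file_size": "512M",         # placeholder
    "innodb_flush_method": "O_DIRECT",      # avoid double caching on Linux
    "innodb_flush_log_at_trx_commit": "2",  # relaxed durability for the initial load
    "transaction-isolation": "READ-COMMITTED",
}


def render_my_cnf(options):
    """Render the options as a [mysqld] section of a my.cnf file."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser["mysqld"] = options
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


my_cnf = render_my_cnf(settings)
print(my_cnf)
```

Keeping the settings in a single dictionary makes it easy to produce one variant for the initial bulk load and another, more conservative one for steady state.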

Patch MySQL

The fine people at Percona and OurDelta produce patched builds of MySQL that remove some of the MySQL InnoDB scalability issues. This matters most for high-concurrency workloads on multi-core/multi-CPU systems.

It can also be good to run MySQL with TCMalloc from Google’s perftools. TCMalloc is a memory allocator which scales way better than the glibc one.

On the Puppet side

The immediate and most straightforward idea is to limit the number of clients that can check in at the same time. This can be done by disabling puppetd on each client (puppetd --disable), blocking network access, or any other creative means…

When all the active hosts have checked in, you can then enable the other ones. This can be done hundreds of hosts at a time, until all hosts have a configuration stored.
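Planning those waves is simple enough to script. The sketch below only splits a host list into groups of a given size; the host names and the wave size of 200 are invented for the example, and how you actually enable puppetd on each group (ssh loop, orchestration tool, firewall rule) is left out on purpose.

```python
def batches(hosts, size):
    """Split the host list into consecutive groups of at most `size` hosts."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# Hypothetical inventory of 450 hosts.
hosts = [f"node{n:03d}.example.com" for n in range(1, 451)]
waves = batches(hosts, 200)

for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {len(wave)} hosts, {wave[0]} .. {wave[-1]}")
```

Each wave would be enabled only once the previous one has finished storing its configurations.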

Another solution is to direct some hosts to a special puppetmaster with storeconfigs on (the regular one still has storeconfigs disabled), by playing with DNS or by configuration, whatever is simplest in your environment. Once those hosts have their config stored, move them back to their regular puppetmaster and move newer hosts there. Since that’s completely manual, it might be impractical for you, but that’s the simplest method.

And after that?

As long as your manifests are only slightly changing, subsequent runs will see only really limited database activity (if you run a puppetmaster >= 0.24.8). That means the tuning we did earlier can be undone (for instance, you can lower innodb_log_file_size and adjust innodb_buffer_pool_size to the size of the hot set).

But still storeconfigs can double your compilation time. If you are already at the limit compared to the number of hosts, you might see some client timeouts.

The Future

Today Luke announced on the puppet-dev list that they were working on a queuing system to defer storeconfigs and smooth out the load by spreading it over a longer time. But still, tuning the database is important.

The idea is to offload the storeconfigs work to another daemon which sits behind a queuing system. After the compilation the puppetmaster queues the catalog, which is then dequeued by the puppet queue daemon, which in turn executes the storedconfigs process.
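The shape of that design can be illustrated with Python's standard `queue` and `threading` modules. This is a toy model under stated assumptions, not the announced implementation: the function names, the in-memory dictionary standing in for the database, and the `None` shutdown marker are all invented for the sketch.

```python
import queue
import threading


def compile_catalog(host):
    # Stand-in for the real compilation step.
    return {"host": host, "resources": ["File[/etc/motd]"]}


def run_master(hosts, catalog_queue):
    """Compile and answer each client; storing is only queued, not done here."""
    served = []
    for host in hosts:
        catalog = compile_catalog(host)
        catalog_queue.put(catalog)  # cheap: no database work on this path
        served.append(host)
    return served


def run_queue_daemon(catalog_queue, database):
    """Take catalogs off the queue and store them at the daemon's own pace."""
    while True:
        catalog = catalog_queue.get()
        if catalog is None:  # shutdown marker used by this sketch
            break
        database[catalog["host"]] = catalog["resources"]


catalog_queue = queue.Queue()
database = {}  # stand-in for the RDBMS
daemon = threading.Thread(target=run_queue_daemon, args=(catalog_queue, database))
daemon.start()

served = run_master(["web1", "web2", "db1"], catalog_queue)
catalog_queue.put(None)
daemon.join()

print(served)
print(sorted(database))
```

The master's loop never touches the database, so a slow store delays only the daemon, not the clients waiting for their catalog.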

I don’t know the ETA for this interesting feature, but meanwhile I hope the tips I provided here can be of some help :-)

Stay tuned for more puppet stories!

All about Puppet storeconfigs

5 minute read

For a long time, people (including me) have complained that storeconfigs is a real resource hog. Unfortunately for us, this option is so cool and useful.

What’s storeconfigs

Storeconfigs is a puppetmasterd option that stores the nodes’ actual configuration to a database. It does this by comparing the result of the last compilation against what is actually in the database, resource per resource, then parameter per parameter, and so on.

The actual implementation is based on Rails’ Active Record, which is a great way to abstract the gory details of the database, and prototype code easily and quickly (but has a few shortcomings).

Storeconfigs uses

The immediate use of storeconfigs is exported resources. Exported resources are resources which are prefixed by @@. Those resources are marked specially so that they can be collected on several other nodes.

A small, completely dumb example speaks for itself:

class exporter {  
  @@file {    
    "/var/lib/puppet/nodes/$fqdn": content => "$ipaddress\n", tag => "ip"  
  }
}

node "export1.daysofwonder.com" {  
  include exporter
}

node "export2.daysofwonder.com" {  
  include exporter
}

node "collector.daysofwonder.com" {  
  File <<| tag == "ip" |>>
}

What does this example do?

That’s simple: each of the exporter nodes creates a file in /var/lib/puppet/nodes whose name is the node name and whose content is its primary IP address.

What is interesting is that the node “collector.daysofwonder.com” collects all files tagged with “ip”, that is, all the exported files. In the end, after export1, export2 and collector have each run a compilation, the collector host will have /var/lib/puppet/nodes/export1.daysofwonder.com and /var/lib/puppet/nodes/export2.daysofwonder.com with their respective content.

Got it?
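If it helps, here is a toy Python model of what happens behind the scenes: exporting appends a tagged resource to the shared store, and collecting is a query by type and tag. The list standing in for the storeconfigs database, the function names and the IP addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions for illustration.

```python
exported = []  # stand-in for the storeconfigs database


def export_file(node, ipaddress):
    # Mirrors: @@file { "/var/lib/puppet/nodes/$fqdn": content => ..., tag => "ip" }
    exported.append({
        "type": "File",
        "title": f"/var/lib/puppet/nodes/{node}",
        "content": f"{ipaddress}\n",
        "tags": {"ip"},
    })


def collect(resource_type, tag):
    # Mirrors: File <<| tag == "ip" |>>
    return [r for r in exported if r["type"] == resource_type and tag in r["tags"]]


# Each exporting node is compiled first...
export_file("export1.daysofwonder.com", "192.0.2.11")
export_file("export2.daysofwonder.com", "192.0.2.12")

# ...then the collector's compilation picks up everything tagged "ip".
titles = [resource["title"] for resource in collect("File", "ip")]
print(titles)
```

The model also shows why ordering matters: a collector compiled before the exporters have checked in simply finds nothing to collect yet.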

That’s the perfect tool for instance to automatically:

  • share/distribute public keys (ssh or openssl or other types)
  • build list of hosts running some services (for monitoring)
  • build configuration files which require multiple hosts (for instance /etc/resolv.conf can be the concatenation of files exported by your DNS cache hosts)
  • and certainly other creative uses

There is still another use: since the whole configuration of your nodes is in an RDBMS, you can use it to perform some data mining on your hosts’ configuration. That’s what puppetshow does.

Shortcomings

The issue with storeconfigs in its current incarnation (i.e. 0.24.7) is that it is a slow feature (it usually doubles the compilation time), and it imposes a higher load on the puppetmaster and the database engine.

For large installations it might not be possible to run with this feature on. There were also some reports of high memory usage or leaks with this feature on (see my recommendation about this in my puppetmaster memory leak post).

Recommendations

Here are my usual puppet and storeconfigs recommendations:

  • use a fairly new ruby interpreter (at least one that is known to be memory leak free)
  • use a fairly new Rails (I’m currently using rails 2.1.0 on my master without any issues)
  • use the mysql ruby connector if you use mysql (otherwise rails will use a pure ruby implementation which is reported not to be stable)
  • use a powerful database engine (not sqlite), and for large deployments use a dedicated server (or cluster of servers). If you are using mysql and you want to trust your data, use InnoDB of course.
  • properly tune your database engine for a mix of writes and reads (for InnoDB a properly sized buffer pool and logs are mandatory).
  • make sure your manifests are deterministic

I think the last point deserves a little bit more explanation:

I had the following schematized pattern in some of my manifests, which I took from David Schmitt’s excellent modules:

In one class:

if defined(File["/var/lib/puppet/modules/djbdns.d/"]) {
  warning("already defined")
} else {  
  file {
    "/var/lib/puppet/modules/djbdns.d/": ...  
  }
}

and in another class the exact same code:

if defined(File["/var/lib/puppet/modules/djbdns.d/"]) {  
  warning("already defined")
} else {  
  file {    
    "/var/lib/puppet/modules/djbdns.d/": ...  
  }
}

What happens is that from run to run the evaluation order could change, and the defined resource could be the one in the first class and another time it could be the one in the second class, which meant the storeconfigs code had to remove the resources from the database and re-create them again. Clearly not the best way to have less database workload :-)

What’s cooking

For 0.24.8, I contributed a partial rewrite of some parts of the storeconfigs feature to increase its performance.

My analysis is that what was slow in the feature is threefold:

  1. creating tons of Active Record objects is slow (one object per resource parameter)
  2. although the code was clearly Rails-optimized code (i.e. using association prefetching and so on), there was still a large number of read operations for all the tags and parameters
  3. there was still a large number of writes to the database on successive runs because the order of tag evaluation is not guaranteed.

I fixed the first two points by querying the database directly to fetch the parameters and tags, keeping them in hashes instead of objects. This saves a large number of database requests and at the same time prevents a large number of ruby objects from being created (it should even save some memory).

The last point was fixed by imposing a strict order (not completely correct, but still better than how it was) on the way the tags are assigned to resources.
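Why ordering matters is easy to show in isolation. In this Python sketch the same three tags arrive in a different order on two runs; `sorted` is used here purely as a stand-in for "some strict order" and is an assumption, not a description of what the actual patch does.

```python
stored_tags = ["ip", "exporter", "file"]
compiled_tags = ["file", "ip", "exporter"]  # same tags, different evaluation order

# Comparing the lists as-is reports a change, so the rows get rewritten.
naive_needs_write = stored_tags != compiled_tags

# Imposing one fixed order on both sides makes the spurious difference vanish.
ordered_needs_write = sorted(stored_tags) != sorted(compiled_tags)

print(naive_needs_write, ordered_needs_write)  # True False
```

Multiply that spurious rewrite by every resource of every host and the extra database load on otherwise unchanged catalogs becomes significant.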

Both patches have been merged for 0.24.8, and some people reported some performance improvements.

On the Days of Wonder infrastructure, here is what I found with a 562-resource node on a tuned mysql database:

  • 0.24.7:
    info: Stored catalog for corp2.daysofwonder.com in 4.05 seconds
    notice: Compiled catalog for corp2.daysofwonder.com in 6.31 seconds
    
  • 0.24.7 with the patch:
    info: Stored catalog for corp2.daysofwonder.com in 1.39 seconds
    notice: Compiled catalog for corp2.daysofwonder.com in 3.80 seconds
    

That’s a nice improvement, isn’t it :-)
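Expressed as ratios, using only the four timings quoted above (the dictionary layout is just for the example):

```python
# (seconds on 0.24.7, seconds on 0.24.7 with the patch)
timings = {
    "Stored catalog": (4.05, 1.39),
    "Compiled catalog": (6.31, 3.80),
}

speedups = {
    step: round(before / after, 1)
    for step, (before, after) in timings.items()
}
print(speedups)  # {'Stored catalog': 2.9, 'Compiled catalog': 1.7}
```

This is a single node on a single run, so treat the ratios as an indication rather than a benchmark.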

The future?

Luke and I discussed this, and it was also discussed on the puppet-dev list a few times. I think that an RDBMS might not be the right storage choice for this feature, because clearly there is almost no random keyed access to the individual parameters of a resource (so having a table dedicated to parameters is of almost no use).

I know Luke’s plan is to abstract the storeconfigs feature from the current implementation (certainly through the indirector), so that we can use different storeconfigs engines.

I also know that someone is working on a promising CouchDB implementation. I myself can see a memcached implementation (which I’d really like to start working on). Maybe even the filesystem would be enough.

Of course, I’m open to any other improvements or storage engine ideas :-)