Masterzen's Blog

Masterzen

You’ll find here my current and past thoughts about my work and hobbies, including but not limited to system administration, programming, photography, boardgames or mechanical keyboards.

The definitive recipe for Wordpress Gengo to WPML conversion

2 minute read

The Days of Wonder News Center runs on Wordpress, which until a couple of days ago used Gengo for multilingual content. Back when we started using Wordpress for our news, we wanted to be able to publish it in three (and maybe more) languages.

At that time (in 2007, with Wordpress 2.3), only Gengo was available. Unfortunately, Gengo has not been maintained for the last few years, which made it difficult to upgrade Wordpress to new versions.

Recently we took the decision to upgrade our Wordpress installation, and at the same time ditch Gengo and start over using WPML, which is actively maintained (and looks superior to Gengo).

So, I started thinking about the conversion, then looked on the web and found how to convert posts, with the help of those two blog posts:

Those two posts were invaluable for the conversion of posts, but unfortunately nobody solved the conversion of translated categories… until I did :)

So here is the most complete recipe to convert from Gengo 2.5 to WPML 1.8, with updated and working SQL requests.

Pre-requisites

You might want to stop the traffic to your blog during this whole procedure. One way to do that is to return an HTTP 503 error code by modifying your Apache/Nginx/Whatever configuration.
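With Nginx, for instance, this can be a short maintenance block. This is only a sketch: the listen port and the server_name below are placeholders, not values taken from the actual setup.

```nginx
server {
    listen      80;
    # placeholder name, replace with the blog's real host name
    server_name blog.example.com;

    location / {
        # answer every request with "503 Service Unavailable"
        # while the conversion is running
        return 503;
    }
}
```

Restore the regular configuration and reload Nginx once the conversion is finished.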

Log-in as an administrator in the Wordpress back-end, and deactivate Gengo.
Install WPML 1.8, and activate it to create the necessary tables. I had to massage WPML a little bit to let it create the tables, YMMV.
In the WPML settings, define the same languages as in Gengo (in my case English (primary), French and German)
Finish the WPML configuration.
If you had a define('WPLANG', …) in your wordpress config, get rid of it.

Converting Posts

Connect to your MySQL server and issue the following revised SQL requests (thanks to the above blog posts for them):

Converting Pages

This is the same procedure, except we track ‘post_page’ instead of ‘post_post’:

Category conversion

This part is a little bit tricky. In Gengo, we translated the categories without creating new categories, but in WPML we have to create new categories that would be translations of a primary category. To do this, I created the following SQL procedure that simplifies the creation of a translated category:

Then we need to create translated categories with this procedure (this can be done with the Wordpress admin interface, but if you have many categories it is simpler to do this with a bunch of SQL statements):

Bind translated categories to translated posts

This is the last step: we need to make sure the translations of our posts have the correct translated categories (for the moment they use the English primary categories).

To do this, I created the following SQL request:

The request is in two parts. The first one lists the IDs of all the French translated posts, which we then carry over into the second request to update the category links.

More Puppet Offloading

5 minute read

Puppet really shines at configuration management, but there are some things it is not good at, for instance file sourcing of large files, or managing deep hierarchies.

Fortunately, most of these efficiency issues will be addressed in a subsequent major version (thanks to some of my patches and other refactorings).

Meanwhile it is interesting to work around those bugs. Since most of us are running our masters as part of a more complete stack and not in isolation, we can leverage the power of this stack to address some of the issues.

In this article, I’ll present two techniques to help your overloaded masters serve more and more clients.

Offloading file sourcing

I already talked about offloading file sourcing in a previous blog post about puppet memory consumption. Here the idea is to prevent our puppetmasters from reading the whole content of files into memory at once to serve them. Most puppetmasterd installations out there sit behind an http reverse proxy of some sort (e.g. Apache or Nginx).

The idea is that file serving is an activity that a small static server is better placed to do than puppet itself (that might change when #3373 is fully addressed). Note: I produced an experimental patch, pending review, to stream puppet file sourcing on the client side, which this tip doesn’t address.

So I implemented this in Nginx (which is my favorite http server of course, but it can be ported to any other webserver quite easily, which is an exercise left to the reader):

And if you use multiple module paths (for instance to separate common modules from other modules), it is still possible to use this trick with the help of the nginx try_files directive.

The try_files directive lets nginx try several physical paths (the first matching one is served), and if none matches you can fall back to the generic location that proxies to the master, which certainly will know what to do.
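A minimal sketch of this try_files fallback could look like the following. It rests on several assumptions: module files are requested as /production/file_content/modules/<module>/<path>, the second module path /etc/puppet/common-modules is hypothetical, and an upstream named puppet-production is defined elsewhere. Adapt the paths and the regular expression to your own layout.

```nginx
# Sketch only, see the assumptions stated above.
location ~ ^/production/file_content/modules/([^/]+)/(.+)$ {
    # it is advisable to have some access rules here
    allow   172.16.0.0/16;
    deny    all;

    # serve everything as raw
    types { }
    default_type application/x-raw;

    root /etc/puppet;

    # the first existing file is served; if none exists,
    # hand the request over to the named location below
    try_files /modules/$1/files/$2 /common-modules/$1/files/$2 @puppetmaster;
}

# fallback: let the puppetmaster answer
location @puppetmaster {
    proxy_pass          http://puppet-production;
    proxy_set_header    Host             $host;
    proxy_set_header    X-Client-Verify  $ssl_client_verify;
    proxy_set_header    X-SSL-Subject    $ssl_client_s_dn;
}
```

The captures $1 (module name) and $2 (file path) come from the location regular expression, so one block covers every module in both directories.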

Something that can be useful would be to create a small script to generate the nginx config from your fileserver.conf and puppet.conf. Since mine is pretty easy, I did it manually.

Optimize Catalog Compilation

The normal process of puppet is to contact the puppetmaster at some time interval asking for a catalog. The catalog is a byproduct of the compilation of the parsed manifests, into which the node facts are injected. This operation takes some time, depending on the manifest complexity and the server capacity or current load.

Most of the time a host requests a catalog while the manifests didn’t change at all. In my own infrastructure I rarely change my manifests once a kind of host becomes stable (I might do a change every week at most when in production).

Since 0.25, puppet is fully RESTful, which means that to get a catalog puppetd contacts the master over its SSL-protected link and asks for this url:

In return the puppetmaster responds with a JSON-encoded catalog. The actual compilation of a catalog for one of my largest hosts takes about 4s (excluding storeconfigs). During these 4s one ruby thread inside the master is using the CPU. And this is done once every 30 minutes, even if the manifests don’t change.

What if we could compile only when something changes? This would really free our masters!

Since puppet uses HTTP, it is easy to add an HTTP cache in front of our master to cache the catalog the first time it is compiled and serve this copy on subsequent requests.

Although we can do it with any HTTP cache (e.g. Varnish), it is really easy to add with Nginx (which is already running in my own stack):

Puppet currently doesn’t return any http caching headers (i.e. Cache-Control or Expires), so we use nginx’s ability to cache despite that (see proxy_cache_valid). Of course I have a custom puppet branch that introduces a new parameter called --catalog_ttl which allows puppet to set those cache headers.
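As an illustration, a catalog cache along these lines might be sketched as follows. The zone name puppetcache, its size, the one hour lifetime and the /production/catalog/ URL prefix are assumptions to adapt, and the puppet-production upstream is assumed to be defined elsewhere.

```nginx
# http {} context: where cached responses are stored on disk.
# The zone name "puppetcache" and its 10m size are arbitrary choices.
proxy_cache_path  /var/cache/nginx/cache  levels=1:2  keys_zone=puppetcache:10m;

# server {} context: cache catalog requests only
location ~ ^/production/catalog/ {
    proxy_pass          http://puppet-production;
    proxy_set_header    Host             $host;
    proxy_set_header    X-Client-Verify  $ssl_client_verify;
    proxy_set_header    X-SSL-Subject    $ssl_client_s_dn;

    proxy_cache         puppetcache;
    # the master sends no caching headers, so the lifetime is set here
    proxy_cache_valid   200 1h;
    # key on the request path only; the query string is ignored
    proxy_cache_key     $uri;
}
```

Keying on $uri alone means one cached catalog per requested path, whatever else the client sends along with the request.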

One thing to note is that the cache expiration won’t coincide with the moment you change your manifests. So we need a way to purge the cache when you deploy new manifests.

With Nginx this can be done with:

removing the nginx cache directory: rm -rf /var/cache/nginx/cache && killall -HUP nginx
selectively purging with the Nginx proxy cache purge module.
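For the selective variant, a purge location might be sketched like this. It assumes nginx was built with the third-party ngx_cache_purge module, that the cache zone is named puppetcache (a hypothetical name), and that cached entries are keyed on $uri.

```nginx
# Requires the third-party ngx_cache_purge module.
# The purge key must match the proxy_cache_key used when caching
# (assumed here to be $uri).
location ~ ^/purge(/.*) {
    # only allow purges from the local machine
    allow               127.0.0.1;
    deny                all;
    proxy_cache_purge   puppetcache $1;
}
```

Under these assumptions, requesting /purge followed by the path of a cached catalog drops that single entry and leaves the rest of the cache untouched.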

It’s easy to add one of those methods to any svn hook or git post-receive hook so that deploying manifests purges the cache.

Note: I think that ReductiveLabs has plans to add catalog compilation caching directly to Puppet (which would make sense). This method is the way to go until this feature gets added to Puppet. I have no doubt that caching inside Puppet will be much better than outside caching, mainly because Puppet would be able to expire the cache when the manifests change.

There are a few caveats to note:

any host with a valid certificate can request another host’s cached catalog, unlike with the normal puppetmaster which makes sure to serve catalogs only to the correct host. This can be a problem for some configurations.
if your manifests rely on “dynamic” facts (like uptime or free memory), obviously you shouldn’t cache the catalog at all.
the above nginx configuration doesn’t include the facts as part of the cache key. That means the catalog won’t be re-generated when any facts change and the cached catalog will always be served. If that’s an issue, you need to purge the cache when the host itself changes.

I should also mention that caching is certainly not a panacea for reducing the master load.

Some other people are using clever methods to smooth out master load. One notable example is the MCollective puppet scheduler that R.I. Pienaar has written. In essence he wrote a puppet run scheduler running on top of MCollective that schedules puppet runs (triggered through MCollective) when the master load is appropriate. This allows for the best use of the host running the master.

If you also have some tricks or tips for running puppet, do not hesitate to contact me (I’m masterzen on freenode’s #puppet or @masterzen on twitter).

Puppet Memory Usage - not a fatality

7 minute read

As every reader of this blog certainly knows, I’m a big fan of Puppet, using it in production on Days of Wonder servers, up to the point I used to regularly contribute bug fixes and new features (not that I stopped, it’s just that my spare time is a scarce resource nowadays).

Still, I think there are some issues in terms of scalability or resource consumption (CPU or memory), for which we can find some workarounds or even fixes. Those issues are not a symptom of bad programming or bad design. No, most of the issues come either from ruby itself or from some random library issues.

Let’s review the things I have been thinking about lately.

Memory consumption

This is by far one of the most frequently seen issues, both on the client side and the server side. I’ve mainly seen this problem on the client side, up to the point that most people recommend running puppetd from a cronjob instead of as a long-lived process.

Ruby allocator

It all boils down to the ruby allocator (at least in MRI 1.8.x). This is the part of the ruby interpreter that deals with memory allocations. Like in many dynamic languages, the allocator manages a memory pool that is called a heap. And like in some other languages (among them Java), this heap can never shrink and always grows when more memory is needed. It is done this way because it is simpler and way faster. Usually applications end up using their nominal amount of memory and no more memory has to be allocated by the kernel to the process, which gives faster applications.

The problem is that if the application transiently needs a large amount of memory that will be trashed a couple of milliseconds later, the process will pay this penalty all its life, even though, say, 80% of the memory used by the process is free but not reclaimed by the OS.

And it’s even worse. When the ruby interpreter grows the heap, instead of allocating byte by byte (which would be really slow) it does so in chunks. The whole question is: what is the proper size of a chunk?

In the default implementation of MRI 1.8.x, a chunk is the size of the previous heap times 1.8. That means at worst a ruby process might end up allocating 1.8 times more than what it really needs at a given time. (This is a gross simplification, read the code if you want to know more).

Yes but what happens in Puppet?

So how does it apply to puppetd?

It’s easy: puppetd uses memory for two things (besides maintaining some core data to be able to run):

the catalog (which contains all resources, along with all templates) as shipped by the puppetmaster (i.e. serialized) and live as ruby objects.
the content of the sourced files (one at a time, so it’s the biggest transmitted file that imposes its high watermark on puppetd). Of course this is still better than in 0.24 where the content was transmitted encoded in XMLRPC, adding the penalty of escaping everything…

Hopefully, nobody distributes large files with Puppet :-) If you’re tempted to do so, see below…

But again there’s more, as Peter Meier (known as duritong in the community) discovered a couple of months ago: when puppetd gets its catalog (which by the way is transmitted in json nowadays), it also stores it as a local cache to be able to run if it can’t contact the master for a subsequent run. This operation is done by unserializing the catalog from json to live ruby objects, and then serializing the latter to YAML. Besides the obvious time lost doing that on a large catalog, YAML is a real memory hog. Peter’s experience showed that about 200MB of the live memory his puppetd process was using came from this final serialization!

So I had the following idea: why not store the serialized version of the catalog (the json one), since we already have it in a serialized form when we receive it from the master (it’s a little bit more complex than that of course)? This way there is no need to serialize it again in YAML. This is what ticket #2892 is all about. Luke is committed to having this enhancement in Rowlf, so there’s good hope!

Some puppet solutions?

So what can we do to help puppet not consume that much memory?

In theory we could play on several factors:

Transmit smaller catalogs. For instance get rid of all those templates you love (ok that’s not a solution)
Stream the serialization/deserialization with something like Yajl-Ruby
Use another ruby interpreter with a better allocator (like for instance JRuby)
Use a different constant for resizing the heap (i.e. replace this 1.8 by 1.0 or less on line 410 of gc.c). This can be done easily when using the Rails machine GC patches or Ruby Enterprise Edition, in which case setting the environment variable RUBY_HEAP_SLOTS_GROWTH_FACTOR is enough. Check the documentation for more information.
Stream the sourced file on the server and the client (this way only a small buffer is used, and the total size of the file is never allocated). This one is hard.

Note that the same issues apply to the master too (especially for the file serving part). But it’s usually easier to run a different ruby interpreter (like REE) on the master than on all your clients.

Streaming HTTP requests is promising but unfortunately would require large changes to how Puppet deals with HTTP. Maybe it can be done only for file content requests… This is something I’ll definitely explore.

This file serving thing made me think about the following, which I already discussed several times with Peter…

File serving offloading

One of the missions of the puppetmaster is to serve sourced files to its clients. We saw in the previous section that to do that the master has to read the file into memory. That’s one reason it is recommended to use a dedicated puppetmaster server to act as a pure fileserver.

But there’s a better way, provided you run puppet behind nginx or apache. Those two proxies are also static file servers: why not leverage what they do best to serve the sourced files and thus offload our puppetmaster?

This has some advantages:

it frees lots of resources on the puppetmasters, so that they can serve more catalogs per unit of time
the job will be done faster and by using less resources. Those static servers have been created to spoon-feed our puppet clients…

In fact it was impossible in 0.24.x, but now that file content serving is RESTful it becomes trivial.

Of course offloading gives its best if your clients require lots of sourced files that change often, or if you provision lots of new hosts at the same time, because we’re offloading only content, not file metadata. File content is served only if the client doesn’t have the file or if the file checksum on the client is different.

An example is better than a thousand words

Imagine we have a standard manifest layout with:

some globally sourced files under /etc/puppet/files and
some modules files under /etc/puppet/modules//files.

Here is what the nginx configuration for such a scheme would look like:

server {
    listen 8140;

    ssl                     on;
    ssl_session_timeout     5m;
    ssl_certificate         /var/lib/puppet/ssl/certs/master.pem;
    ssl_certificate_key     /var/lib/puppet/ssl/private_keys/master.pem;
    ssl_client_certificate  /var/lib/puppet/ssl/ca/ca_crt.pem;
    ssl_crl                 /var/lib/puppet/ssl/ca/ca_crl.pem;
    ssl_verify_client       optional;

    root                    /etc/puppet;

    # those locations are for the "production" environment
    # update according to your configuration

    # serve static file for the [files] mountpoint
    location /production/file_content/files/ {
        # it is advisable to have some access rules here
        allow   172.16.0.0/16;
        deny    all;

        # make sure we serve everything
        # as raw
        types { }
        default_type application/x-raw;

        alias /etc/puppet/files/;
    }

    # serve modules files sections
    location ~ /production/file_content/[^/]+/files/ {
        # it is advisable to have some access rules here
        allow   172.16.0.0/16;
        deny    all;

        # make sure we serve everything
        # as raw
        types { }
        default_type application/x-raw;

        root /etc/puppet/modules;
        # rewrite /production/file_content/module/files/file.txt
        # to /module/file.txt (the leading slash is required so that
        # the path is correctly appended to the root above)
        rewrite ^/production/file_content/([^/]+)/files/(.+)$  /$1/$2 break;
    }

    # ask the puppetmaster for everything else
    location / {
        proxy_pass          http://puppet-production;
        proxy_redirect      off;
        proxy_set_header    Host             $host;
        proxy_set_header    X-Real-IP        $remote_addr;
        proxy_set_header    X-Forwarded-For  $proxy_add_x_forwarded_for;
        proxy_set_header    X-Client-Verify  $ssl_client_verify;
        proxy_set_header    X-SSL-Subject    $ssl_client_s_dn;
        proxy_set_header    X-SSL-Issuer     $ssl_client_i_dn;
        proxy_buffer_size   16k;
        proxy_buffers       8 32k;
        proxy_busy_buffers_size    64k;
        proxy_temp_file_write_size 64k;
        proxy_read_timeout  65;
    }
}

EDIT: the above configuration was missing the only content-type that nginx can return for Puppet to be able to actually receive the file content (that is raw).

I leave the Apache configuration as an exercise to the reader.

It would also be possible to write some ruby/sh/whatever to generate the nginx configuration from the puppet fileserver.conf file.

And that’s all folks, stay tuned for more Puppet (or even different) content.
