Journey in a software world…
15 Oct
At Days of Wonder we are huge fans of MySQL (and since about a year of the various Open Query, Percona, Google or other community patches), up to the point we’re using MySQL for about everything in production.
But since we moved to 5.0, back 3 years ago our production databases which hold our website and online game systems has a unique issue: the mysqld process uses more and more RAM, up to the point where the kernel OOM decide to kill the process.
You’d certainly think we are complete morons because we didn’t do anything in the last 3 years to fix the issue
Unfortunately, I never couldn’t replicate the issue in the lab, mainly because it is difficult to replicate the exact same load the production server sees (mainly because of the online games activity).
During those 3 years, I tried everything I could, from using other allocators, valgrind, debug builds and so on, without any success.
What is nice, is that we moved to an OurDelta build about a year ago, where InnoDB is able to print more memory statistics than the default MySQL version.
For instance it shows
Internal hash tables (constant factor + variable factor) Adaptive hash index 1455381240 (118999688 + 1336381552) Page hash 7438328 Dictionary cache 281544240 (89251896 + 192292344) File system 254712 (82672 + 172040) Lock system 18597112 (18594536 + 2576) Recovery system 0 (0 + 0) Threads 408056 (406936 + 1120) innodb_io_pattern 0 (0 + 0)
Back several month ago, I analyzed this output just to see what figures were growing, and found that the Dictionary Cache variable part was increasing (slowly but definitely).
Sure fine MySQL experts would have been able to tell me exactly what, when and where the problem was, but since I’m not familiar with the code-base, I looked up what this number was and where it was increased (all in dict0dict.c) and added some logs each time it was increased.
I then installed this version for a quite long time (just to check it wouldn’t crash on production) on a slave server. But this server didn’t print anything interesting because it doesn’t see the exact same load the production masters.
A couple of months after that, I moved this code to one of the master and bingo! I found the operation and the tables exhibiting an increase:
mysqld[8131]: InnoDB: dict_table_rename_in_cache production/rank_tmp2 193330680 + 8112 mysqld[8131]: InnoDB: dict_table_rename_in_cache production/rank 193338792 + 8112
As soon as I saw the operation and table (ie rank), I found what the culprit is. We have a daemon that every 10s computes the player ranks for our online games.
To do this, we’re using the following pattern:
-- compute the ranks SELECT NULL, playerID FROM game_score as g ORDER BY g.rankscore DESC INTO OUTFILE "/tmp/rank_tmp.tmp" -- load back the scores LOAD DATA INFILE "/tmp/rank_tmp.tmp" INTO TABLE rank_tmp -- swap tables so that clients see new ranks atomatically RENAME TABLE rank TO rank_tmp2 , rank_tmp TO rank, rank_tmp2 TO rank_tmp -- truncate the old ranks for a new pass TRUNCATE TABLE rank_tmp -- go back to the select above
You might ask why I’m doing a so much convoluted system, especially the SELECT INTO OUTFILE and the LOAD DATA. It’s just because INSERT … SELECT with innodb and binlog enabled can produce transactions abort (which we were getting tons of).
Back to the original issue, apparently the issue lies in the RENAME part of the daemon.
Looking at the dict0dict.c dict_table_rename_in_cache function we see:
ibool dict_table_rename_in_cache(...) ... old_name = mem_heap_strdup(table->heap, table->name); table->name = mem_heap_strdup(table->heap, new_name); ... }
Looking to mem_heap stuff, I discovered that each table has a heap associated in which InnoDB allocates various things. This heap can only grow (by block of 8112 bytes it seems), since the allocator is not a real one. This is done for performance reasons.
So each time we rename a table, the old name (why? since it is already allocated) is duplicated, along with the new name. Each time.
This heap is freed when the table is dropped, so there is a possibility to reclaim the used memory. That means this issue is not a memory leak per-se.
By the way, I’ve filed this bug on mysql bug system.
One work-around, beside fixing the code itself, would be to drop the rank table instead of truncating it. The issue with dropping/creating InnoDB table on a fast pace is that the dictionary cache itself will grow, because it can only grow as there is no way to purge it from old tables (except running one of the Percona patches). So the more tables we create the more we’ll use memory - back to square 0, but worst.
So right now, I don’t really have any idea on how to really fix the issue. Anyone having an idea, please do not hesitate to comment on this blog post
And please, don’t tell me to move to MyISAM…
12 Apr
There is something I used to hate to do. And I think all admins also hate to do that.
It’s when you need to reboot a server on a rescue environment to perform an administration task (i.e. fixing unbootable servers, fixing crashed root filesystems, and so on).
The commonly found problems with rescue environment are:
OK, so a long time ago, I had a crashed server refusing to start on a reboot, and I had to chose a rescue environment for linux servers, other than booting on the Debian CD once again.
That’s how I discovered PLD Linux rescue CD:
and GRML:
My heart still goes to PLD rescue (because it’s really light), but I must admit that GRML has a really good zsh configuration (I even used some of their configuration ideas for my day to day zsh).
On that subject, if you don’t use zsh or don’t even know it and still want to qualify as a knowledgeable Unix admin, then please try it (preferably with GRML so that you’ll have an idea of what’s possible, and they even have a good documentation), another solution is to buy of course this really good book: “From Bash to Z Shell: Conquering the Command Line”
That makes me think I should do a whole blog post on zsh.
OK, so let’s go back to our sheep (yes that’s a literally French translated expression, so I don’t expect anyone to grasp the funny part except the occasional French guys reading me :-)).
So what’s so good about PLD Rescue:
So my basic usage is to have a PXE netboot environment in our remote colocation, a console server (it is a real damn good Opengear CM4116).
With this setup I can netboot remotely any server to a PLD Rescue image with serial support, and then rescue my servers without going to the datacenter (it’s not that it is far from home or the office, but at 3AM, you don’t usually want to go out).
If you have a preferred rescue setup, please share it!
15 Feb
In the Days of Wonder Paris Office (where is located our graphic studio, and incidentally where I work), we are using Bacula to perform the multi-terabyte backup of the laaaaarge graphic files the studio produces every day.
The setup is the following:
Both servers are connected to the switch through two gigabit ethernet copper links, each one forming a 802.3ad link. The Apple Xserve and the linux box uses a layer3 hash algorithm to spread the load between each slave.
OK, that’s the fine print.
Usually about network gears, I’m pretty Cisco only (sorry, but I never found anything better than IOS). When we installed this setup back in 2006, the management decided to not go the full cisco route for the office network because of the price (a Dell 5324 is about 800 EUR, compared to a 2960G-24 which is more around 2000 EUR).
So, this switch was installed there, and never received an update (if it ain’t broken don’t fix it is my motto). Until last saturday, when I noticed that in fact the switch with the 1.0.0.47 firmware uses only layer-2 hashing to select the outgoing slave in a 802.3ad channel bonding. As you might have understood, it ruins all the efforts of both servers, since they have a constant and unique MAC address, so always the same slave is selected to move data from the switch to any server.
Brave as I am, I download the new firmware revision (which needs a new boot image), and I remotely installs it. And that was the start of the nightmare…
The switch upgraded the configuration to the new version, but unfortunately both 802.3ad channel groups were not up after the restart. After enquiring I couldn’t find any valid reason why the peers wouldn’t form such group.
OK, so back to the previous firmware (so that at least the backup scheduled for the same night would succeed). Unfortunately, something I didn’t think about, was that the new boot image couldn’t boot the old firmware. And if it did, I was still screwed up because it wouldn’t have been possible to run the configuration since it had been internally converted to the newer format… I already downgraded cisco gear, and I never had such failure… Back to the topic.
So the switch was bricked, sitting in the cabinet without switching any packets. Since we don’t have any remote console server (and I was at home), I left the switch as is until early Monday…
On Monday, I connected my helpful eeePC (and an USB/Serial converter), launched Minicom, and connected to the switch serial console. I rebooted the switch, erased the config, rebooted, reloaded the config from our tftp server and I was back to 1.0.0.47 with both 802.3ad channel groups working… but still no layer-3 hashing…
But since I’m someone that wants to understand why things are failing, I also tried again the move to firmware 2.0.1.3 to see where I was wrong. And still the same result: no more channel groups, so back to 1.0.0.47 (because some angry users wanted to actually work that day :-))
After exchanging a few forum posts with some people on the Dell Community forum (I don’t have any support for this switch), I was suggested to actually erase the configuration before moving to the new firmware.
And that did it. It seems that the process of upgrading the configuration to the newest version is buggy and gave a somewhat invalid configuration from which the switch was unable to recover.
In fact, the switch seems to compile the configuration in a binary form/structure it uses to talk to the hardware. And when it upgraded the previous binary version, certainly some bits flipped somewhere and the various ports although still in the channel groups were setup as INDIVIDUAL instead of AGGREGATABLE.
Now the switch is running with a layer-3 hash algorithm, but it doesn’t seem to work fine, as if I run two parallel netcats on 2 IP addresses on the first server, connected to two other netcats on the second server, everything goes on only one path. I think this part needs more testing…
How would you test 802.3ad hashing?
1 Feb
If you like me are struggling with old disks (in my case SCSI 10k RPM Ultra Wide 2 HP disks) that exhibits bad blocks, here is a short survival howto. Those disks are placed in a refurbished HP Network RS/12 I use as a spool area for Bacula backups of our Apple XServe RAID which is used by Days of Wonder graphic Studio (and those guys knows how to produce huge files, trust me).
Since a couple of days, one of the disk exhibits read errors on some sectors (did I say they are old), so waiting to get replaced by other (old) disks, I had to find a way to have it working.
Of course the SCSI utility in the Adaptec SCSI card has a remapping tool, but you have reboot the server and have it offline during the verify, which can take a long time, so that wasn’t an option.
I then learnt about sg3_utils (sg3-utils for the debian package) thanks to the very good page of smartmontools bad blocks handling.
This set of tools directly address SCSI disks through mode page, to instruct the disk to do some things. What’s interesting is that it comes with two commands of great use (there might be more of course):
Here is the use case:
backup:~# dd if=/dev/sda iflag=direct of=/dev/zero skip=1915 bs=1M dd: reading `/dev/sda': Input/output error 12+0 records in 12+0 records out 12582912 bytes (13 MB) copied, 1.41468 seconds, 8.9 MB/s
Something is wrong, we only read 13MB instead of the whole disk.
Let’s have look to the kernel log:
backup:~# dmesg | tail [331709.192108] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK [331709.192108] sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [331709.192108] Info fld=0x3c3bb1 [331709.192108] sd 0:0:0:0: [sda] Add. Sense: Read retries exhausted [331709.192108] end_request: I/O error, dev sda, sector 3947441
Indeed /dev/sda has a failed sector (at lba 3947441).
Let’s confirm it:
backup:~# sg_verify --lba=3947441 /dev/sda verify (10): Fixed format, current; Sense key: Medium Error Additional sense: Read retries exhausted Info fld=0x3c3bb1 [3947441] Actual retry count: 0x003f medium or hardware error, reported lba=0x3c3bb1
Check the defect list:
sg_reassign --grown /dev/sda >> Elements in grown defect list: 0
And tell the disk firmware to reassign the sector
backup:~# sg_reassign --address=3947441 /dev/sda
Now verify that it was remapped:
backup:~# sg_reassign --grown /dev/sda >> Elements in grown defect list: 1
Do we have a working sector?
backup:~# dd if=/dev/sda iflag=direct of=/dev/null bs=512 count=1 skip=3947441 1+0 records in 1+0 records out 512 bytes (512 B) copied, 0.00780813 seconds, 65.6 kB/s
The sector could be read! The disk is now safe.
Of course, this tutorial might not work for every disks: PATA and SATA disks don’t respond to SCSI commands. For those disks, you have to write on the failed sector with dd and the disk firmware should automatically remap the sector. This can be proved by looking at the Reallocated_Sector_Ct output of smartctl -a.
Good luck
Recent Comments