This category contains posts about system administration. There are a lot of things that people discover but don't really share with others. This is where I try and share my knowledge.

This also serves as my personal memo for all the stuff I should remember, but have probably forgotten over time.

I've been working on setting up OpenERP for my needs and today I decided it was time to work on backing up the beast. Since I've been running bacula at home to back up my environment, it was time to tweak it so that it made reasonable backups of OpenERP too.

In the end I was able to build a really elegant solution for backing it all up. I decided to go for the bpipe plugin, which lets you pipe the output of a program directly to the bacula file daemon. This allowed me to do a live dump of the database with pg_dump and store it directly in the backup set without ever writing it to disk.
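
The relevant part of the FileSet ends up looking roughly like the sketch below. The bpipe argument is "pseudo path:backup command:restore command", and the file daemon needs its Plugin Directory configured so bpipe actually loads. The database name, user and pseudo path here are only placeholders for whatever your OpenERP installation uses.

FileSet {
  Name = "OpenERP"
  Include {
    Options {
      signature = MD5
      compression = GZIP
    }
    # pseudo path in the backup : command whose output gets backed up : command that restores from stdin
    Plugin = "bpipe:/POSTGRESQL/openerp.sql:pg_dump -U openerp openerp:psql -U openerp openerp"
  }
}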

Since the other examples in the bacula wiki define methods that either use files or a FIFO to do the backup, I documented my setup there too.

The only thing that was left was to add the directories specific for OpenERP to the backup and I was all set.

Posted Mon Jun 14 21:13:05 2010 Tags: sysadmin

Since I enabled comments in this blog, I finally needed to configure a split DNS for my network.

There are various reasons why one needs a split DNS and, as is usually pointed out, the reasons are usually non-technical. In my case the reasons are technical: I have a NAT in my local network that allows me to host this website locally. What causes problems is that the domain name ressukka.net points to the external IP address, and that doesn't work from the inside. So split DNS it is.

There are various ways of building a split DNS: you can use the views feature in bind9, or you can set up two separate DNS servers that provide different information (and point your local resolver at the internal server). The latter is more secure if the internal zone is sensitive.

I decided to use a hybrid solution. I already knew that PowerDNS Recursor was capable of serving authoritative zones (think pre-cached), so I decided to leverage that. Setting this up turned out to be simpler than I expected.

First I made a copy of the existing zone and edited it to fit my needs. I changed the IP address of ressukka.net to point to the IP address on the local network. I also adjusted some other entries so that they pointed to the local network.
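
The interesting change is really just the address records; something along these lines, with the addresses being the same placeholder values used below rather than my real ones:

; internal copy of the zone -- point the names at the LAN address instead of the public one
ressukka.net.       IN      A       10.0.0.1
www.ressukka.net.   IN      A       10.0.0.1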

Next I modified bind to listen on the external IP address only. This can be accomplished by adding listen-on { 1.2.3.4; }; to the options section of the configuration. I also disabled the resolver by adding recursion no;, which forces bind to act as an authoritative-only server.
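
In other words, something like this in the options block, with 1.2.3.4 standing in for the real external address:

options {
        // serve the public zone on the external address only...
        listen-on { 1.2.3.4; };
        // ...and never answer recursive queries
        recursion no;
};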

Then I installed PowerDNS Recursor (the pdns-recursor package in Debian), configured it to listen on the internal address only (local-address=10.0.0.1) and added the pre-cached zone to the configuration with auth-zones=ressukka.net=/path/to/internal-zone.
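
On Debian the whole thing boils down to a couple of lines in /etc/powerdns/recursor.conf; the zone path is whatever you picked for the internal copy above:

# listen on the LAN side only
local-address=10.0.0.1
# serve the internal copy of the zone authoritatively, recurse for everything else
auth-zones=ressukka.net=/path/to/internal-zone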

Now, after restarting both daemons, I had a working split DNS with minimal configuration. I was also able to change the external DNS server to authoritative-only mode, which is a good idea in any case.

Posted Mon Mar 23 22:38:37 2009 Tags: sysadmin

For some time I've suffered from the infamous clocksource problem with all Linux hosts that aren't running the Citrix provided kernels. I'm a bit old fashioned and I want to run the Debian provided kernels instead of the Citrix ones, mostly because the Debian kernel receives security updates.

During the fight with my own server last night, it finally dawned on me.

The clocksource problem appears after you suspend a Linux host: the kernel in the virtual machine starts spewing this:

Mar  5 09:24:17 co kernel: [461562.007153] clocksource/0: Time went backwards: ret=f03d318c7db9 delta=-200458290723043 shadow=f03d1d566f4a offset=143675d9

I've been trying to figure out what is different with Citrix and Debian kernels, because the problem doesn't occur with the Citrix provided kernel.

The final hint to solving this problem came from the Debian wiki. The same issue is mentioned there, but the workaround is not something I like. I prefer making sure that the host server has the correct time and the virtual machine just follows that time.

But the real clue was the clocksource line. It turns out that the Citrix kernel uses jiffies as the clocksource by default, while Debian uses the xen clocksource. It would make sense that the xen clocksource is more accurate, since it's native to the hypervisor.

So just running this on the domU fixes the problem:

echo "jiffies"> /sys/devices/system/clocksource/clocksource0/current_clocksource

There is no need to decouple the clock from the host, which is exactly what I needed. To make this change permanent, you need to add clocksource=jiffies to the boot parameters of your domU kernel.

You can do this by modifying the grub configuration: add clocksource=jiffies to the kopt line and run update-grub. Or you can use XenCenter to modify the virtual machine parameters and add clocksource=jiffies to the boot parameters there.
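
For the grub route, the kopt line is the commented-out line in the domU's menu.lst that update-grub reads; roughly like this, with the root device being whatever your domU already uses:

# /boot/grub/menu.lst inside the domU
# kopt=root=/dev/sda1 ro clocksource=jiffies

After editing it, run update-grub so the change gets copied into the actual kernel entries.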

It's also worth noting that this problem does apply to plain vanilla Debian installations as well, so reading that whole wiki page is a good idea.

Posted Thu Mar 5 17:26:28 2009 Tags: sysadmin

I finally decided that it's time for me to upgrade my Xen installation. It used to run etch with backported Xen, because the etch version was increasingly difficult to work with.

I also acknowledge that some of the issues I've been having are simply caused by yours truly, but even so, the Debian Xen installation is way too fragile for my taste. I've already considered installing XenServer Express locally and running the hosts on it. The big drawback has been that XenCenter (the tool used to manage XenServer) is Windows-only and doesn't work with wine.

So you can imagine my desperation...

Anyway, the latest upgrade from etch to lenny was painful as usual. The first part went smoothly: a bit of sed magic on sources.list and a few upgrade commands (carefully picking the Xen packages out of the upgrade set). So in the end I had a working lenny installation with backported Xen.
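
The sed magic is nothing more exciting than this sketch; the careful part was answering apt's questions so that the Xen packages stayed at their backported versions for now:

sed -i 's/etch/lenny/g' /etc/apt/sources.list
apt-get update && apt-get upgrade   # picking the Xen packages out of the set by hand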

Next I made sure that there was nothing major going on in my network (one of the virtual machines acts as my local firewall) and took a deep breath before upgrading the rest of the packages. I knew to be careful with the xendomains script, which has reliably restored my virtual machines in a broken state after a host reboot, so I had always ended up restarting my virtual machines after a reboot anyway.

I carefully cleared XENDOMAINS_AUTO and set XENDOMAINS_RESTORE to false in /etc/default/xendomains so that the virtual machines would be saved but not restored or restarted on reboot.
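
That boils down to these two lines in /etc/default/xendomains; the save directory setting can stay at its default so the domains still get saved on shutdown:

XENDOMAINS_AUTO=""
XENDOMAINS_RESTORE=false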

After the normal pre-boot checks I went for it.

Oddly enough everything worked normally and the system came up after a bit of waiting. I checked the bridges and everything appeared normal, so it was time to try and restore a single domain to see that everything actually did work as planned.

Hydrogen:~# xm restore /var/lib/xen/save/Aluminium
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.

Oof. Googling for the issue revealed that others had suffered from the same problem on various different platforms, and the problems were caused by different things. One would assume that the problem is in the vif-bridge script that is mentioned in the xend-config.sxp file as the script that brings up the vif, but after many hours of trial and error and pointless googling (over a GPRS connection), I couldn't find any solution to the problem. It was time to call it a day (it was almost 3 am already...)

During the night I had a new idea about the possible cause: what if the problem isn't in xend, but somewhere else? I fired up udevadm monitor to see what udev saw, and it wasn't much. I'm not an expert with udev, but from previous encounters I had a vague feeling that there were supposed to be more events flying around.

I wasn't able to pinpoint what was wrong, so I decided to purge xen-utils, of which I had two versions installed: 3.2-1 and 3.0.2. I also removed everything related to xenstore. After reinstalling the current versions and restoring my configuration files, the first host came up just fine.
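
Roughly what I ran was along these lines, although the exact package names depend on what dpkg -l shows on your system, so treat this as a sketch rather than a recipe:

dpkg -l | grep -E 'xen-utils|xenstore'       # see which versions are actually installed
apt-get remove --purge xen-utils-3.0.2-1 xen-utils-3.2-1 xenstore-utils
apt-get install xen-utils-3.2-1 xenstore-utils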

I still had problems resuming the virtual machines and I ended up rebooting them again, which was nothing new, but at least they were running again.

In the end I don't know what the actual cause was for udev not handling the devices properly, but I'm happy to have them all running again. And I learned a valuable lesson from all this: udev is an important part of Xen, so make sure it works properly.

Posted Thu Mar 5 17:11:39 2009 Tags: sysadmin

I've been bitten by grub upgrades and installations on Debian family domU servers. Apparently there are others out there who have been bitten too.

The bug itself is caused by a missing device entry, probably because of udev. Anyway, grub-probe tries to discover the root device so that update-grub can properly generate a menu.lst. In certain scenarios the root device itself doesn't exist. Here is an example from a configuration generated with xen-tools:

Hydrogen:/etc/xen# grep phy Neon.cfg 
disk    = [ 'phy:Local1/Neon-disk,sda1,w', 'phy:Local1/Neon-swap,sda2,w' ]

While this is a valid configuration, the device sda doesn't exist within the virtual machine. As a workaround the above blog entry suggests manually adding the sda device and the corresponding entry in device.map.
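
That workaround boils down to something like this inside the domU (a sketch; 8,0 is the usual major/minor pair for sda):

mknod /dev/sda b 8 0
echo "(hd0) /dev/sda" >> /boot/grub/device.map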

This solution does work, but it will fail again with the next upgrade. The proper solution is to adjust the Xen configuration so that the root device actually exists. And since Xen uses a different naming scheme for its devices, we can upgrade to that too. So the above example becomes:

Hydrogen:/etc/xen# grep phy Neon.cfg 
disk    = [ 'phy:Local1/Neon-disk,xvda,w', 'phy:Local1/Neon-swap,xvdb,w' ]

You also need to adjust the existing grub configuration and fstab within the domU. It's a bit more work and requires an additional reboot, but it gives you peace of mind that the next upgrade will work without a hitch.
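
For the example above, the adjustment inside the domU is roughly this; double-check the result before rebooting, since a typo here leaves the machine unbootable:

sed -i 's#/dev/sda1#/dev/xvda#g; s#/dev/sda2#/dev/xvdb#g' /etc/fstab /boot/grub/menu.lst
update-grub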

Posted Tue Feb 17 07:57:53 2009 Tags: sysadmin

As an obligatory note, Debian Lenny was released earlier today, which means that sysadmins all over the world are starting to upgrade their servers.

There is an oddly little-known tool called apt-listchanges that each and every sysadmin should install on at least one server they maintain. It lists the changes made to packages since the currently installed version. Sure, that information will be overwhelming on major upgrades, but what is useful even then is its ability to show NEWS files in the same way.
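
Installing it is just apt-get install apt-listchanges away. If I recall correctly, the knob that controls what gets shown lives in /etc/apt/listchanges.conf:

[apt]
frontend=pager
# "both" shows changelogs and NEWS entries, "news" limits it to the NEWS files
which=both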

NEWS files contain important information about the package in question. For example, a maintainer can list known upgrade problems there, as is done in the lighttpd package, or document changes in package-specific default behaviour, as is done in the Vim package.

Sure, you will notice these in time, but it's nice to get a heads up before a problem bites you.

Posted Sun Feb 15 22:35:57 2009 Tags: sysadmin

Since I keep ending up in situations where I need to clean the postfix queue of mails sent by a single host and always forget the command, I'm posting it here. Maybe someone else will find it useful as well.

To begin with, you need to determine the IP address of the culprit you want to eliminate. How you do this is up to you; grepping logs or examining the files in the queue both work. But for some reason there doesn't appear to be a good tool for getting statistics on the sending IP addresses, only on the origin and destination domains.
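
One rough way that has worked for me is to count the client addresses straight from the mail log; the exact log format varies a bit between versions, so adjust as needed:

grep 'postfix/smtpd' /var/log/mail.log | grep -o 'client=[^ ]*' | \
    sed 's/.*\[\([^]]*\)\].*/\1/' | sort | uniq -c | sort -rn | head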

Once you have determined the IP address you want to purge, you can use the following spell. You might have to repeat the same line for the active and incoming queues as well, but deferred is usually the queue where I have the most mail.

grep -lrE '[^0-9]10\.20\.30\.4[^0-9]' /var/spool/postfix/deferred | xargs -r -n1 basename | postsuper -d -

It's important that the dots in the IP address are escaped, because an unescaped dot matches any character; in the worst case it will end up matching a lot of wrong IP addresses. The other important bits are the '[^0-9]' groups at both ends of the pattern. Those make the pattern match only that particular IP address. Without that extra limitation, 1.1.1.1 would match any address that merely contains it as a substring; for example, 211.1.1.154 would be a valid match.

The other important, yet oddly unknown, bit is the postsuper command. postsuper modifies the queue, and the -d flag makes it delete messages from the queue by queue ID. For some reason I keep on seeing all sorts of find -exec rm {} spells all over, which isn't really that nice for the daemon itself.
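
And if you only need to get rid of a single message, the same tool does that by queue ID; the address and queue ID below are obviously made up:

mailq | grep -B2 'user@example.com'    # find the queue ID for the message
postsuper -d 9C1A527C2E                # delete just that one message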

So here it is, one more tidbit I've been meaning to write up for quite some time now. Enjoy!

Posted Sun Jan 25 11:23:36 2009 Tags: sysadmin