This category contains posts about system administration. There are a lot of things that people discover but don't really share with others. This is where I try and share my knowledge.
This also serves as my personal memo for all the stuff I should remember, but probably forgot over time.
I've been working on setting up OpenERP for my needs and today I decided it was time to work on backing up the beast. Since I've been running bacula at home to back up my environment, it was time to tweak it so that it made reasonable backups of OpenERP too.
In the end I was able to build a really elegant solution for backing it all up. I decided to go for the bpipe plugin that allows one to pipe programs directly to the bacula file daemon. This allowed me to do a live dump of the database with pg_dump and store it directly to the backup set without writing it to the disk.
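For reference, a bpipe entry in a FileSet has the form bpipe:<pseudo-path>:<backup command>:<restore command>. The helper below is only a sketch that prints such a directive. The database name openerp, the pseudo-path and the psql restore command are assumptions for illustration, not necessarily the setup described here. Since the fields are separated by colons, the commands themselves should not contain any.

```shell
# Print a Bacula FileSet "Plugin" directive for the bpipe plugin.
# Arguments: pseudo-path in the backup set, backup command, restore command.
bpipe_directive() {
  printf 'Plugin = "bpipe:%s:%s:%s"\n' "$1" "$2" "$3"
}

# Hypothetical values: a database called "openerp", dumped and restored
# as the postgres user.
bpipe_directive /POSTGRESQL/openerp.sql \
  "pg_dump -U postgres openerp" \
  "psql -U postgres openerp"
```

The printed line would go inside the Include section of the FileSet, next to the usual File entries.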
Since the other examples in the bacula wiki describe methods that use either files or a FIFO to do the backup, I documented my setup there too.
The only thing that was left was to add the directories specific for OpenERP to the backup and I was all set.
Since I enabled comments in this blog, I finally needed to configure a split DNS for my network.
There are various reasons why one needs a split DNS and as it's
usually pointed out, the reasons are usually non-technical. In my
case the reasons are technical: I have a NAT in my local network
that allows me to host this website locally. What causes problems
is that the domain name ressukka.net points to the
external IP address and that doesn't work from the inside. So split
DNS it is.
There are various ways of building a split DNS: you can use the views feature in bind9, or you can set up two separate DNS servers that provide different information (and point your local resolver at the internal server). The latter is more secure if the internal zone is sensitive.
I decided to use a hybrid solution. I already knew that PowerDNS Recursor was capable of serving authoritative zones (think pre-cached), so I decided to leverage that. Setting this up turned out to be simpler than I expected.
First I made a copy of the existing zone and edited it to fit my
needs. I changed the IP address of ressukka.net to
point to the IP address on the local network. I also adjusted some
other entries that pointed to the local network.
Next I modified bind to listen on the external IP address. This can be accomplished by adding listen-on { 1.2.3.4; }; to the options in the configuration. I also disabled the resolver by adding recursion no;, which forces bind to work as authoritative only.
Then I installed the PowerDNS Recursor
(pdns-recursor package in debian) and configured it to
listen on the internal address only
(local-address=10.0.0.1) and added the pre-cached zone
to the configuration with
auth-zones=ressukka.net=/path/to/internal-zone
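To see the two halves side by side, here is a small sketch that prints both configuration fragments. The addresses, zone name and zone file path are the placeholders used above. The function names are made up for this example, and the bind fragment would have to be merged into an existing options block rather than pasted as a second one.

```shell
# Print the options that make bind authoritative-only on one address.
bind_options_fragment() {
  cat <<EOF
options {
    listen-on { $1; };
    recursion no;
};
EOF
}

# Print the matching PowerDNS Recursor settings: listen on the internal
# address and answer one zone from a local zone file.
recursor_fragment() {
  cat <<EOF
local-address=$1
auth-zones=$2=$3
EOF
}

bind_options_fragment 1.2.3.4
recursor_fragment 10.0.0.1 ressukka.net /path/to/internal-zone
```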
Now, after restarting both daemons, I had a working split DNS with minimal configuration. I was also able to change the external DNS to authoritative only mode, which is a good idea in any case.
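A quick way to confirm that the internal resolver really hands out the internal address is to ask it directly. The sketch below assumes that the internal network uses RFC 1918 addresses (as 10.0.0.1 suggests) and that dig from the dnsutils package is installed. Both function names are made up for this example.

```shell
# Succeed if the argument is an RFC 1918 (private) IPv4 address.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Ask one resolver for the A record of a name and check that the first
# answer is a private address. Arguments: name, resolver address.
answers_internally() {
  answer=$(dig +short "$1" A "@$2" | head -n1)
  is_private_ipv4 "$answer"
}
```

Running answers_internally ressukka.net 10.0.0.1 from a machine on the local network should then succeed, while the same query against the external server should not.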
For some time I've suffered from the infamous clocksource problem with all Linux hosts that aren't running the Citrix provided kernels. I'm a bit old-fashioned and I want to run Debian provided kernels instead of the Citrix ones, mostly because the Debian kernel receives security updates.
During the fight with my own server last night, it finally dawned on me.
The clocksource problem appears after you suspend a Linux host and the kernel in the virtual machine starts spewing this:
Mar 5 09:24:17 co kernel: [461562.007153] clocksource/0: Time went backwards: ret=f03d318c7db9 delta=-200458290723043 shadow=f03d1d566f4a offset=143675d9
I've been trying to figure out what is different with Citrix and Debian kernels, because the problem doesn't occur with the Citrix provided kernel.
The final hint to solving this problem came from the Debian wiki. The same issue is mentioned there, but the workaround is not something I like. I prefer making sure that the host server has the correct time and the virtual machine just follows that time.
But the real clue was the clocksource line. It turns out that the Citrix kernel uses jiffies as the clocksource by default, while Debian uses the xen clocksource. It would make sense that the xen clocksource is more accurate since it's native to the hypervisor.
Simply running this on the domU fixes the problem:
echo "jiffies" > /sys/devices/system/clocksource/clocksource0/current_clocksource
There is no need to decouple the clock from the host, which is exactly what I needed. To make this change permanent, you need to add clocksource=jiffies to the boot parameters of your domU kernel.
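Before making the change permanent, it may be worth checking that the running kernel actually accepts the new clocksource. The sketch below wraps the echo from above and reads the file back, which shows what the kernel is really using afterwards. The function name and the optional second argument (an alternative sysfs directory, handy for trying the function out) are assumptions of this example.

```shell
# Switch the running kernel to the given clocksource and verify the result.
# Arguments: clocksource name, optional sysfs directory.
switch_clocksource() {
  want=$1
  file=${2:-/sys/devices/system/clocksource/clocksource0}/current_clocksource
  before=$(cat "$file")
  echo "$want" > "$file"
  after=$(cat "$file")
  echo "clocksource: $before -> $after"
  # Fail if the kernel did not switch to the requested clocksource.
  [ "$after" = "$want" ]
}
```

Run as root on the domU, for example switch_clocksource jiffies.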
You can do this by modifying the grub configuration, adding clocksource=jiffies to the kopt line and running update-grub. Or you can use XenCenter to modify the virtual machine parameters and add clocksource=jiffies to the boot parameters.
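For the grub route, the edit can be scripted. This is a sketch that assumes a grub legacy menu.lst with the usual commented "# kopt=" line and GNU sed. The function name is made up, and the second argument is only there so the function can be tried on a copy first.

```shell
# Append a kernel parameter to the "# kopt=" line of menu.lst, unless the
# line already contains it. Arguments: parameter, optional path to menu.lst.
add_kopt_param() {
  param=$1
  menu=${2:-/boot/grub/menu.lst}
  # Nothing to do if the kopt line already carries the parameter.
  if grep -q "^# kopt=.*$param" "$menu"; then
    return 0
  fi
  sed -i "s|^# kopt=.*|& $param|" "$menu"
}
```

After add_kopt_param clocksource=jiffies, running update-grub regenerates the kernel entries with the new parameter.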
It's also worth noting that this problem does apply to plain vanilla Debian installations as well, so reading that whole wiki page is a good idea.
I finally decided that it's time for me to upgrade my Xen installation. It used to run etch with backported Xen, because the etch version was increasingly difficult to work with.
I also acknowledge that some of the issues I've been having are simply caused by yours truly, but even so the Debian Xen installation is way too fragile for my taste. I've already considered installing XenServer Express locally and running the hosts on it. The big drawback has been that XenCenter (the tool that is used to manage XenServer) is Windows-only and it doesn't work with Wine.
So you can imagine my desperation...
Anyway, the latest upgrade from etch to lenny was painful as usual. The first part went smoothly: a bit of sed magic on sources.list and a few upgrade commands (carefully picking the Xen packages out of the upgrade set). So in the end I had a working lenny installation with backported Xen.
Next I made sure that there was nothing major going on in my network (one of the virtual machines acts as my local firewall) and took a deep breath before upgrading the rest of the packages. I knew to be careful with the xendomains script, which has reliably restored my virtual machines onto a broken host after a reboot, so I had always ended up restarting them anyway.
I carefully cleared XENDOMAINS_AUTO and set
XENDOMAINS_RESTORE to false in
/etc/default/xendomains so that the virtual machines
would be saved but not restored or restarted on reboot.
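For clarity, the relevant part of the file would look roughly like this. The file is plain shell that the init script sources. The save path shown is what I believe to be the default and should be treated as an assumption.

```shell
# /etc/default/xendomains (excerpt)

# Save running domains here on shutdown (assumed default path).
XENDOMAINS_SAVE=/var/lib/xen/save

# Do not restore the saved domains automatically at boot.
XENDOMAINS_RESTORE=false

# Do not auto-start any domain configurations at boot.
XENDOMAINS_AUTO=""
```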
After the normal pre-boot checks I went for it.
Oddly enough everything worked normally and the system came up after a bit of waiting. I checked the bridges and everything appeared normal, so it was time to try and restore a single domain to see that everything actually did work as planned.
Hydrogen:~# xm restore /var/lib/xen/save/Aluminium
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
Oof. Googling for the issue revealed that others had suffered from the same problem on various platforms, and that the problems were caused by different things. One would assume that the problem is in the vif-bridge script, which is mentioned in the xend-config.sxp file as the script that brings up the vif, but after many hours of trial and error and pointless googling (over a GPRS connection), I couldn't find any solution to the problem. It was time to call it a day (it was almost 3 am already...)
During the night I had a new idea about the possible cause. What if the problem isn't in xend, but somewhere else? I fired up udevadm monitor to see what udev saw, and it wasn't much. I'm not an expert with udev, but from previous encounters I had a vague feeling that there were supposed to be more events flying around.
I wasn't able to pinpoint what was wrong so I decided to purge xen-utils, of which I had 2 versions installed: 3.2-1 and 3.0.2. I also removed everything related to xenstore. After reinstalling the current versions and restoring my configuration files the first host came up just fine.
I still had problems resuming the virtual machines and I ended up rebooting them again, which was nothing new, but at least they were running again.
In the end I don't know what the actual cause was for udev not handling the devices properly, but I'm happy to have them all running again. And I learned a valuable lesson from all this: udev is an important part of Xen, so make sure it works properly.
I've been bitten by grub upgrades and installations on Debian family domU servers. Apparently there are others out there who have been bitten too.
The bug itself is caused by a missing device entry, probably because of udev. Anyway, grub-probe tries to discover the root device so that update-grub can properly generate a menu.lst. In certain scenarios the root device itself doesn't exist. Here is an example from a configuration generated with xen-tools:
Hydrogen:/etc/xen# grep phy Neon.cfg
disk = [ 'phy:Local1/Neon-disk,sda1,w', 'phy:Local1/Neon-swap,sda2,w' ]
While this is a valid configuration, the device sda doesn't exist within the virtual machine. As a workaround the above blog entry suggests manually adding the sda device and the device entry in device.map.
This solution does work, but it will fail with the next upgrade. The proper solution is to adjust the Xen configuration so that the root device is created. And since Xen uses a different naming scheme for devices, we can switch to that too. So the above example becomes:
Hydrogen:/etc/xen# grep phy Neon.cfg
disk = [ 'phy:Local1/Neon-disk,xvda,w', 'phy:Local1/Neon-swap,xvdb,w' ]
You also need to adjust the existing grub configuration and fstab within the domU. It's a bit more work and requires an additional reboot, but it gives you peace of mind that the next upgrade will work without a hitch.
As an obligatory note, Debian Lenny was released earlier today, which means that sysadmins all over the world are starting to upgrade their servers.
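The adjustment inside the domU can be sketched with sed. The mapping follows the example above (sda1 becomes xvda, sda2 becomes xvdb) and would need to be adapted to other layouts. The function name is made up, GNU sed is assumed, and keeping a backup copy of both files first is a good idea.

```shell
# Rewrite references to the old device names in a file inside the domU.
# /dev/sda1 becomes /dev/xvda and /dev/sda2 becomes /dev/xvdb; the trailing
# group keeps longer names such as /dev/sda10 untouched.
rename_xen_disks() {
  sed -E -i \
    -e 's#/dev/sda1([^0-9]|$)#/dev/xvda\1#g' \
    -e 's#/dev/sda2([^0-9]|$)#/dev/xvdb\1#g' \
    "$1"
}
```

It would be run once for /etc/fstab and once for /boot/grub/menu.lst, followed by update-grub, before rebooting the domU with the new Xen configuration.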
There is an oddly little known tool that each and every sysadmin should install on at least one server they maintain, called apt-listchanges. It lists changes made to packages since the currently installed version. Sure that information will be overwhelming on major upgrades, but what is useful even on major upgrades is the capability to parse News files in the same way.
News files contain important information about the package in question. For example a maintainer could list known upgrade problems there, like is done in the lighttpd package. Or list changes in package specific default behaviour, like is done in Vim package.
Sure, you will notice these in time, but it's nice to get a heads up before a problem bites you.
Since I keep ending up in situations where I need to clean up postfix queue from mails sent by a single host and always forget the command, I'm posting it here. Maybe someone else will find it useful as well.
To begin with, you need to determine the IP address of the culprit you want to eliminate. How you do this is up to you. Grepping logs or examining the files in the queue both work. But for some reason there doesn't appear to be a good tool to get statistics on the sending IP addresses, only the origin and destination domains.
Once you have determined the IP address which you want to purge, you can use the following spell. You might have to repeat the same line for the active and incoming queues as well, but usually deferred is the queue where I have the most mail.
grep -lrE '[^0-9]10\.20\.30\.4[^0-9]' /var/spool/postfix/deferred | xargs -r -n1 basename | postsuper -d -
It's important that the dots in the IP address are escaped, because an unescaped dot matches any character. In the worst case the pattern will end up matching a lot of wrong IP addresses. Another important bit is the '[^0-9]' group at each end of the pattern. Those make the pattern match only that particular IP address. Without that extra limitation 1.1.1.1 would match anything that has 1 as the last digit of the first octet and 1 as the first digit of the last octet. For example, 211.1.1.154 would be a valid match.
The other important bit, yet oddly unknown, is the postsuper command. Postsuper modifies the queue, and the -d flag makes it delete files in the queue by queue ID. For some reason I keep on seeing all sorts of find -exec rm {} spells all over, which isn't really that nice for the daemon itself.
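Since the escaping is the part that is easy to get wrong, here is a sketch that builds the pattern from a plain IP address and runs the same pipeline over all three queues. The function names are made up for this example, and the spool path is the one used above.

```shell
# Turn a plain IPv4 address into the grep -E pattern described above:
# dots escaped, and a non-digit required on both sides.
ip_regex() {
  escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
  printf '[^0-9]%s[^0-9]' "$escaped"
}

# Delete all queued mail that mentions the given IP address from the
# deferred, active and incoming queues. Must be run as root.
purge_queue_by_ip() {
  pattern=$(ip_regex "$1")
  for queue in deferred active incoming; do
    grep -lrE "$pattern" "/var/spool/postfix/$queue" | xargs -r -n1 basename
  done | postsuper -d -
}
```

With this in place, purge_queue_by_ip 10.20.30.4 does the same as the one-liner, but for every queue in one go.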
So here it is, one more tidbit I've been meaning to write up for quite some time now. Enjoy!