I finally decided that it's time for me to upgrade my Xen installation. It used to run etch with backported Xen, because the etch version was increasingly difficult to work with.
I also acknowledge that some of the issues I've been having are simply caused by yours truly, but even so the Debian Xen installation is way too fragile for my taste. I've already considered installing XenServer Express locally and running the hosts on it. The big drawback has been that XenCenter (the tool used to manage XenServer) is Windows-only and doesn't work with Wine.
So you can imagine my desperation...
Anyway, the latest upgrade from etch to lenny was painful as usual. The first part went smoothly: a bit of sed magic on sources.list and a few upgrade commands (carefully picking the Xen packages out of the upgrade set). So in the end I had a working lenny installation with backported Xen.
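For the curious, the sed part looks roughly like the sketch below. It is an illustration rather than the exact command I ran: the mirror lines are made-up examples, and it works on a scratch copy instead of the real /etc/apt/sources.list so that nothing gets overwritten by accident.

```shell
# Sketch only: build a scratch file with two typical (assumed) etch entries.
src="$(mktemp)"
cat > "$src" <<'EOF'
deb http://ftp.debian.org/debian etch main
deb http://security.debian.org/ etch/updates main
EOF

# Swap the release name. Matching the leading space keeps the pattern
# from touching host names or paths that merely contain "etch".
updated="$(sed 's/ etch/ lenny/g' "$src")"

# Review the result before copying it over the real sources.list.
printf '%s\n' "$updated"
rm -f "$src"
```

Once the output looks right, the same sed expression can be applied to the real file, followed by the usual package list update.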
Next I made sure that there was nothing major going on in my network (one of the virtual machines acts as my local firewall) and took a deep breath before upgrading the rest of the packages. I knew to be careful about the xendomains script, which after a host reboot has reliably restored my virtual machines into a broken state, so I had always ended up restarting them anyway.
I carefully cleared XENDOMAINS_AUTO and set XENDOMAINS_RESTORE to false in /etc/default/xendomains so that the virtual machines would be saved but not restored or restarted on reboot.
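The relevant part of the file ends up looking something like the fragment below. This is a sketch of the three settings, not the whole file. XENDOMAINS_SAVE is not something I changed; I am assuming it keeps its stock value, which matches the save path used in the restore command further down.

```shell
# Sketch of the relevant lines in /etc/default/xendomains (not the full file).

# Directory where running domains are saved when the host shuts down.
# Assumption: left at the stock value.
XENDOMAINS_SAVE=/var/lib/xen/save

# Do not restore the saved domains automatically at boot.
XENDOMAINS_RESTORE=false

# Empty value: do not auto-start any domains either.
XENDOMAINS_AUTO=""
```

With these settings the domains are still saved to disk on shutdown, but bringing them back is a manual step.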
After the normal pre-boot checks I went for it.
Oddly enough everything worked normally and the system came up after a bit of waiting. I checked the bridges and everything appeared normal, so it was time to try and restore a single domain to see that everything actually did work as planned.
Hydrogen:~# xm restore /var/lib/xen/save/Aluminium
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
Oof. Googling for the issue revealed that others had suffered from the same problem on various platforms, but their problems were caused by different things. One would assume that the problem is in the vif-bridge script, which is named in the xend-config.sxp file as the script that brings up the vif, but after many hours of trial and error and pointless googling (over a GPRS connection), I couldn't find any solution to the problem. It was time to call it a day (it was almost 3 am already...).
During the night I had a new idea about the possible cause. What if the problem isn't in xend, but somewhere else? I fired up udevadm monitor to see what udev saw, and it wasn't much. I'm not an expert with udev, but from previous encounters I had a vague feeling that there were supposed to be more events.
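A small wrapper like the one below makes this kind of check repeatable. It is a sketch with a hypothetical function name (watch_udev), and it assumes a system where the coreutils timeout command is available; the idea is to start the monitor in one terminal, try to bring up a domain in another, and compare what the kernel emits with what udev actually processes.

```shell
# Hypothetical helper: watch kernel uevents and udev events side by side
# for a limited time (default 30 seconds).
watch_udev() {
    seconds="${1:-30}"

    if ! command -v udevadm >/dev/null 2>&1; then
        echo "udevadm not found; is udev installed?" >&2
        return 1
    fi

    # --kernel prints the raw kernel uevents, --udev prints the events
    # after udev has run its rules on them.
    timeout "$seconds" udevadm monitor --kernel --udev
    status=$?

    # timeout exits with 124 when it had to stop the monitor,
    # which is the expected outcome here.
    [ "$status" -eq 124 ] && return 0
    return "$status"
}
```

Calling watch_udev 60 and then starting or restoring a domain should show events for the new devices. If kernel events appear without matching udev events, or nothing appears at all, the hotplug chain is broken before xend's scripts ever get a chance to run.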
I wasn't able to pinpoint what was wrong so I decided to purge xen-utils, of which I had 2 versions installed: 3.2-1 and 3.0.2. I also removed everything related to xenstore. After reinstalling the current versions and restoring my configuration files the first host came up just fine.
I still had problems resuming the virtual machines and I ended up rebooting them again, which was nothing new, but at least they were running again.
In the end I don't know what actually caused udev to not handle the devices properly, but I'm happy to have them all running again. And I learned a valuable lesson from all this: udev is an important part of Xen, so make sure it works properly.