This is an account of how not to spend a Sunday!
This morning my automatic update (cron-apt) installed a new kernel on all my hosts and virtual machines. A new kernel, of course, requires a reboot, so I sat down to get it over with. Usually this is not a big deal. Since I had sufficient time, I started recording all the unnecessary little things I had to do manually after the reboots. As “usual”, my corosync cluster ran into trouble when the first node rebooted. The real fun started, however, when I rebooted the second cluster node: when it came back up, it refused to re-join the cluster, taking my website and mail server off the air.
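For context, cron-apt decides what to do via the action files under /etc/cron-apt/action.d/. This is a rough sketch of the relevant bit on a stock Debian setup (paths and options may differ on your system) — by default cron-apt only downloads upgrades (`-d`), and dropping that flag is what makes it actually install things like new kernels:

```
# /etc/cron-apt/action.d/3-download  (sketch of a stock Debian file)
autoclean -y
# The stock line is "dist-upgrade -d -y ...", i.e. download only.
# Removing -d turns it into an unattended install:
dist-upgrade -y -o APT::Get::Show-Upgraded=true
```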
After some investigation it turned out that the GFS2 filesystem on a DRBD device caused permanent kernel soft lockups. Both machines were pretty much unusable. The situation improved slightly when I kept one node powered off. Nonetheless, the cluster remained in a hung state, with the powered-off node marked UNCLEAN and the services not starting.
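If you ever need to debug a similar state, these are the kind of read-only checks I mean (assuming a Pacemaker/Corosync stack with DRBD 8.x; adjust to your versions — these commands need root on a cluster node):

```shell
# One-shot view of the cluster: look for nodes marked UNCLEAN
# and resources stuck Stopped or FAILED.
crm_mon -1

# DRBD 8.x connection and disk state (DRBD 9 uses "drbdadm status").
cat /proc/drbd

# Kernel soft-lockup reports land in the kernel ring buffer.
dmesg | grep -i "soft lockup"
```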
After a few unsuccessful attempts to bring the DRBD filesystem back up, I decided to stop the cluster altogether. I brought up an old copy of the website and email server, as kworker was still consuming one core completely.
Finally, I was back online, in non-redundant mode! (Was it ever really redundant after all?)
Time to fire up the second node and disable the cluster services. Both machines kept running (almost) fine. After some trial and error, I managed to gain access to the DRBD filesystem on the second node in read-only mode, and was therefore able to copy the files off it. See my How-To for details.
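The How-To has the full procedure; the core of it is roughly the following. The resource name `r0`, `/dev/drbd0`, and the mount points are example names, not my actual configuration — and note that `lockproto=lock_nolock` bypasses the cluster lock manager, so it is only safe while no other node has the filesystem mounted:

```shell
# Bring the DRBD resource up standalone and promote it on this node.
drbdadm up r0
drbdadm primary r0   # may need "primary --force r0" while the peer is down

# Mount the GFS2 filesystem read-only without the DLM. lock_nolock
# replaces the cluster lock manager, so this node must be the ONLY
# one touching the device.
mount -t gfs2 -o ro,lockproto=lock_nolock /dev/drbd0 /mnt/recovery

# Copy the data off, then unmount and demote again.
cp -a /mnt/recovery/. /srv/restore/
umount /mnt/recovery
drbdadm secondary r0
```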
Conclusion: Even if you believe you’re running a redundant system, you must always
- Make automatic backups
- Execute fail-over tests regularly