This is an account of how not to spend a Sunday!
This morning, my automatic updates (cron-apt) installed a new kernel on all my hosts and virtual machines. Of course, a new kernel requires a reboot, so I sat down to get it over with. Usually this is not a big deal. As I had plenty of time, I started recording all the unnecessary little things I had to do manually after the reboots. As “usual”, my corosync cluster ran into trouble when the first node rebooted. The real fun started, however, when I rebooted the second cluster node. When it came back up, it decided not to rejoin the cluster, taking my website and my mail server off the air.
After some investigation, it turned out that the gfs2 filesystem on a drbd device was causing permanent kernel soft lockups. Both machines were pretty much unusable. The situation improved slightly when I kept one node turned off. Nonetheless, the cluster remained in a hung state, with the shut-down node marked UNCLEAN and the services not starting.
After a few unsuccessful attempts to get the drbd filesystem back up, I decided to stop the cluster altogether and bring an old copy of the website and email server back up, while kworker was still consuming one core completely.
Finally, I was back on-line in a non-redundant mode! (Has it ever been redundant after all?)
Time to fire up the second node and disable the cluster services. Both machines kept running (almost) fine. After some trial and error, I managed to gain read-only access to the drbd filesystem on the second node and was therefore able to copy the files from it. See my How-To for details.
Conclusion: Even if you believe you’re running a redundant system, you must always:
- Make automatic backups
- Execute fail-over tests regularly
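
Automatic backups only help if they actually keep running, so it can be worth checking their age regularly. The following is a minimal sketch of such a check, not the setup described above: it assumes the backups land as plain files in a single directory, and the function names and the 26-hour default threshold are made up for illustration.

```python
import time
from pathlib import Path


def newest_backup_age(backup_dir, now=None):
    """Return the age in seconds of the most recently modified file
    in backup_dir, or None if the directory is missing or has no files."""
    path = Path(backup_dir)
    if not path.is_dir():
        return None
    now = time.time() if now is None else now
    # Only regular files directly inside the directory are considered.
    mtimes = [p.stat().st_mtime for p in path.iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def backup_is_fresh(backup_dir, max_age_hours=26.0, now=None):
    """True if the newest backup file is at most max_age_hours old.
    The 26-hour default is an assumption: a daily job plus some slack."""
    age = newest_backup_age(backup_dir, now)
    return age is not None and age <= max_age_hours * 3600
```

Run from cron, a call such as `backup_is_fresh("/var/backups/website")` (a hypothetical path) could trigger a warning mail whenever it returns `False`. Note that this only verifies that a recent file exists; whether the backup can really be restored still has to be tested separately.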