Answer By James T. Dennis
The Linux kernel supports a class of devices called "watchdogs." These are programmable timers which are wired to a system's reset or power lines. They are common on non-PC servers and workstations and in embedded devices, and are increasingly included in PC PCI chipsets. There are also PC adapter cards that can function as watchdog timers. Some of them are included in adapters with other functions (such as the PC Weasel 2000, or some high precision real-time clocks?) and some of them have electronics to monitor CPU or case temperature, power supply voltages, etc.
These all have one function in common: they can be set to some time interval (60 seconds by default, under Linux) and will count down towards zero. If they ever reach zero they'll strobe the reset line and force the hardware to reboot. Thus they require periodic "petting" or they'll "bite" you.
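To make the countdown behavior concrete, here is a toy simulation in Python. It touches no hardware; the class name SimulatedWatchdog and its methods are made up for illustration and do not correspond to any kernel interface.

```python
class SimulatedWatchdog:
    """Toy model of a watchdog timer; no real hardware is touched."""

    def __init__(self, timeout=60):
        self.timeout = timeout      # seconds allowed between pets
        self.remaining = timeout    # current countdown value
        self.bitten = False         # True once the timer has expired

    def pet(self):
        """Reload the countdown, as a healthy system does periodically."""
        if not self.bitten:
            self.remaining = self.timeout

    def tick(self, seconds=1):
        """Advance simulated time; "bite" if the count reaches zero."""
        if self.bitten:
            return
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            # Real hardware would strobe the reset line at this point.
            self.bitten = True


dog = SimulatedWatchdog(timeout=60)
dog.tick(45)
dog.pet()            # petted in time, so the countdown restarts at 60
dog.tick(45)
print(dog.bitten)    # False
dog.tick(15)         # a full 60 seconds have now passed without a pet
print(dog.bitten)    # True
```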
The Linux kernel supports a variety of watchdog hardware, and also includes a driver (softdog) which is a software emulation of what a watchdog timer does. (This is a bit less robust, since some forms of kernel panic or failure might leave the system wedged and unable to execute the softdog code.) (The Linux kernel can also be set to reboot after a time delay in case of panic. The default is to dump a message and registers to the console and wait for a human to read them and reboot. Read the bootparam(7) man page and search for panic= for details on how to override that.)
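As a small illustration of the panic= setting, the sketch below pulls the value out of a kernel command line string such as the one found in /proc/cmdline. The function name panic_timeout is hypothetical, and taking the last occurrence when the option is repeated is an assumption. A missing panic= only means the option was not given at boot; the value may still have been set some other way.

```python
def panic_timeout(cmdline):
    """Return the integer given to panic= on a kernel command line, or None."""
    value = None
    for word in cmdline.split():
        if word.startswith("panic="):
            try:
                # Assumption: if the option is repeated, the last one counts.
                value = int(word[len("panic="):])
            except ValueError:
                value = None
    return value


print(panic_timeout("root=/dev/sda1 ro panic=30"))  # 30
print(panic_timeout("root=/dev/sda1 ro"))           # None
```

On a running system one could pass it the contents of /proc/cmdline, for example panic_timeout(open("/proc/cmdline").read()).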
All of this is of no use unless you also have a daemon or utility that can set the watchdog, monitor the system, and periodically "pet the dog." (Some texts on this topic use the more abusive "kicking" analogy --- but I find that distasteful).
Of course one can write one's own daemon, or even a cron job (if one overrode the default 60 second value to be a bit longer, to account for possible cron delays). However, it's best to start with one that's already written and reasonably well proven. The Debian project has one that's simply called "watchdog." Although it is a Debian package it can be adapted for use on any Linux distribution.
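For a sense of how little a bare-bones "petting" loop involves, here is a minimal sketch in Python. It is not the Debian daemon's code. The function name pet_loop and its parameters are invented for this example; the rounds parameter exists only so the loop can be tried against an ordinary file. On a real system, opening /dev/watchdog arms the timer, each write resets the countdown, and many drivers treat a final "V" written before close as a request to disarm (whether that is honored depends on the driver and kernel configuration).

```python
import os
import time


def pet_loop(device="/dev/watchdog", interval=10, rounds=None):
    """Write to the watchdog device every `interval` seconds.

    `rounds` limits the number of writes (None means loop forever).
    Returns the number of times the dog was petted.
    """
    # On a real system, opening the device is what starts the countdown.
    fd = os.open(device, os.O_WRONLY)
    count = 0
    try:
        while rounds is None or count < rounds:
            os.write(fd, b"\0")    # any write resets the countdown
            count += 1
            time.sleep(interval)
        # Clean exit: ask the driver to disarm instead of rebooting.
        os.write(fd, b"V")
    finally:
        # If the loop dies with an exception, no "V" is written, so a
        # real timer would keep running and eventually reset the machine.
        os.close(fd)
    return count
```

Trying it on a scratch file, for example pet_loop("/tmp/fakedog", interval=0, rounds=3) after creating that file, shows the bytes that would be written without putting a real machine at risk of a reboot.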
This particular daemon performs up to 10 internal system tests (most are optional) and it can be configured to execute a custom suite of tests --- your own script or binary which must return a zero exit value on success (and should run in under some liberal time limit). In other words, it's extensible. On failure it can attempt to execute a custom "repair" script or binary, then it can try a soft reboot (with statically compiled code -- NOT by calling the normal 'shutdown' or 'reboot' binaries). Failing all of that, it will simply stop writing to /dev/watchdog. With a hardware watchdog that means the dog no longer gets petted and the timer resets the machine; with softdog the kernel itself reboots.
In (almost) any event a system failure should result in a reboot instead of a hang. That can be good for systems that are remotely located and hard to reach. Of course Linux is pretty robust and reliable, so it's rare that the watchdog will be needed; and of course the watchdog may cause some spurious reboots, especially when initially configuring and tuning it. But there are cases where it's worth the risk and effort.
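The test, repair, reboot sequence described above can be sketched roughly as follows. This is only an illustration of the decision logic, not the watchdog daemon's actual implementation; the names passes and check_cycle, the 30 second limit, and the choice to re-run the test after a repair are all assumptions made for the example.

```python
import subprocess
import sys


def passes(cmd, timeout=30):
    """Return True if `cmd` exits with status 0 within `timeout` seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a test that hangs counts as a failure.
        return False


def check_cycle(test_cmd, repair_cmd=None, timeout=30):
    """Decide what to do for one monitoring cycle.

    Returns "pet" if the test passes (possibly after a successful repair),
    or "reboot" if the system still looks unhealthy.
    """
    if passes(test_cmd, timeout):
        return "pet"
    if repair_cmd is not None and passes(repair_cmd, timeout):
        # Assumption: re-run the test to confirm the repair worked.
        if passes(test_cmd, timeout):
            return "pet"
    return "reboot"


# Stand-in commands: tiny Python one-liners that exit 0 or 1.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
print(check_cycle(ok))                    # pet
print(check_cycle(bad, repair_cmd=ok))    # reboot
```

In the second call the repair "succeeds" but the test still fails afterwards, so the sketch gives up and asks for a reboot.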