written at 23:01 by faberman
Watchdog

I needed a watchdog with variable timeouts during the lifecycle of the supervised process - longer timeouts during startup/initialisation, shorter during interactive operation. Since I could not find one, I wrote my own:

  • no library dependencies (except libc)
  • no config files, single binary
  • ideal for embedded systems
  • support for kernel watchdogs
  • permission check via UID and program name
  • millisecond precision

I find it so easy to use that I also run it on my web servers to monitor my daemons.

PROCESS WATCHDOG

A process that wants to be monitored creates a file with a unique filename (e.g. command name + PID) in /run/watchdog. The file contains the following three lines:

<command name>
<pid>
<timeout>
where
  • command name is the name of the running process as in /proc/<PID>/comm,
  • pid is the PID of the process, and
  • timeout is the timeout in milliseconds.
If the file is not updated within <timeout> milliseconds of its last modification time, the process is sent a KILL signal, but ONLY IF
  1. the program name given in the file in /run/watchdog matches the name in /proc/<PID>/comm, and
  2. the EUID of the process (as in /proc/<PID>/status) is the same as the owner UID of the file in /run/watchdog.
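
The decision described above can be sketched in C as follows. This is an illustration only, not the watchdogd source: struct wd_entry and the functions wd_expired() and wd_may_kill() are hypothetical names. The sketch assumes the daemon has already parsed the file and read the process's comm name and EUID from /proc.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical in-memory form of one file in /run/watchdog. */
struct wd_entry {
    char      name[64];   /* command name stored in the file */
    pid_t     pid;        /* PID stored in the file */
    long long timeout_ms; /* timeout stored in the file, in ms */
    long long mtime_ms;   /* last modification time of the file, in ms */
    uid_t     file_uid;   /* owner (UID) of the file */
};

/* True once "last modification time + timeout" lies in the past. */
static bool wd_expired(const struct wd_entry *e, long long now_ms)
{
    assert(e != NULL);
    return now_ms > e->mtime_ms + e->timeout_ms;
}

/* True only if both conditions hold: the name in the file matches
 * /proc/<PID>/comm, and the process EUID matches the file owner. */
static bool wd_may_kill(const struct wd_entry *e,
                        const char *proc_comm, uid_t proc_euid)
{
    assert(e != NULL && proc_comm != NULL);
    return strcmp(e->name, proc_comm) == 0 && proc_euid == e->file_uid;
}
```

A process would only be killed when wd_expired() and wd_may_kill() are both true. Whether the deadline itself counts as expired (">" versus ">=") is an assumption of this sketch.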

SYSTEM WATCHDOG

All watchdogs in /dev/watchdogX are pinged every WATCHDOGTIMEOUT ms and will receive the "Magic Close" upon exit of watchdogd.
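
For readers unfamiliar with the Linux watchdog device interface, here is a minimal sketch of a ping and a "Magic Close" on an already opened device. The function names wd_ping() and wd_magic_close() are hypothetical. Whether watchdogd pings with write() or with the WDIOC_KEEPALIVE ioctl is not stated here, so the use of write() is an assumption.

```c
#include <assert.h>
#include <unistd.h>

/* Ping: on Linux, a write to the watchdog device resets its timer. */
static int wd_ping(int fd)
{
    assert(fd >= 0);
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Magic Close: writing 'V' immediately before close() tells drivers
 * that support the feature to disarm the watchdog instead of rebooting. */
static int wd_magic_close(int fd)
{
    assert(fd >= 0);
    int rc = write(fd, "V", 1) == 1 ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

In real use, fd would come from open("/dev/watchdog0", O_WRONLY). Note that opening the device normally arms the hardware timer, so this should not be tried casually on a production machine.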

INSTALLING

KERNEL REQUIREMENTS

  • CONFIG_INOTIFY_USER
  • CONFIG_PROC_FS

BUILDING

make; make install

RUNNING

Simply run 'watchdogd' at startup. If /var/run/watchdog does not exist, it will be created.

STOPPING

Send the process a SIGINT or SIGTERM to make it exit. Both signals are caught: watchdogd writes the magic close byte to all system watchdogs and exits gracefully. Any files in /var/run/watchdog are kept in place, so watchdogd resumes operation immediately after a restart.

LOGGING

All starts, shutdowns, process killings and errors are logged via syslog and stdout/stderr.

INTEGRATION

To use watchdogd in your program, include watchdogd.h and call watchdog_update() before the current timeout runs out. To turn off the watchdog, call watchdog_disable(). For example:

#include "watchdogd.h"
...
watchdog_update("myapp", getpid(), 5000); // 5s timeout during startup
...
main_loop {
    ...
    watchdog_update("myapp", getpid(), 1000); // 1s during normal operation
}

Make sure that "myapp" is the EXACT command name of your application as it appears in /proc/<PID>/comm.
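
One way to avoid a mismatch is to read the name from /proc at runtime instead of hard-coding it. The helper below is a sketch under that assumption. wd_own_comm() is a hypothetical name and not part of watchdogd.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the calling process's command name from /proc/self/comm into buf,
 * without the trailing newline. Returns 0 on success, -1 on failure. */
static int wd_own_comm(char *buf, size_t len)
{
    assert(buf != NULL && len > 1);
    FILE *f = fopen("/proc/self/comm", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';  /* strip the newline the kernel appends */
    return 0;
}
```

The result could then be passed as the first argument to watchdog_update(). Keep in mind that the kernel truncates the comm name to 15 characters, so a long executable name will not appear in full.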

RECOMMENDATIONS

  • /var/run should be tmpfs so you do not wear out your flash
  • if you cache your PID, make sure to reset it in the child process after fork()ing
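
The PID recommendation can be illustrated with a short sketch. The names cached_pid, my_pid() and fork_and_check() are hypothetical and exist only for this example. Without the reset, the child would keep reporting the parent's PID.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t cached_pid = 0;  /* 0 means "not cached yet" */

/* Returns the PID, calling getpid() only on first use. */
static pid_t my_pid(void)
{
    if (cached_pid == 0)
        cached_pid = getpid();
    return cached_pid;
}

/* Forks once. Returns 1 in the parent if the child saw its own PID
 * after resetting the cache, 0 if it did not, and -1 on error. */
static int fork_and_check(void)
{
    (void)my_pid();                /* parent fills the cache */
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        cached_pid = 0;            /* inherited value is the parent's PID */
        _exit(my_pid() == getpid() ? 0 : 1);
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```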

SOURCE

watchdogd_v1.tgz