Recovery
Issue 1 May 2002
3-5
555-233-143
Watchdog and Applications
The Watchdog monitors the sanity of the various applications it initially created. It
does this using two mechanisms. The Watchdog receives:
■ A SIGCHLD signal if any process it created dies
■ Periodic heartbeat messages from those processes
If heartbeat messages stop arriving for the period specified in the Watchdog’s
configuration file, the Watchdog kills the application. When an application terminates
(whether it died unexpectedly or was killed intentionally), the Watchdog runs an
application-specific “recovery” script that:
■ May try to kill every process in the application
■ Checks for and corrects any resource problems
■ Tries to recreate the application
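The heartbeat timeout described above can be illustrated with a minimal sketch. It is not the Watchdog's actual implementation: the class name, the method names, and the single shared timeout value are assumptions made for illustration, and the real timeout comes from the Watchdog's configuration file.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: remembers when each application last sent a heartbeat."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # timeout_seconds stands in for the value read from the configuration file.
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def register(self, app_name):
        # Start timing an application from the moment it is created.
        self._last_seen[app_name] = self._clock()

    def heartbeat(self, app_name):
        # Record the arrival of a heartbeat message.
        self._last_seen[app_name] = self._clock()

    def expired(self):
        # Names of applications whose heartbeats have been absent too long.
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A supervising loop would call expired() periodically and kill each application it returns, which in turn triggers that application's recovery script.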
The Watchdog tries to recreate the application a specified number of times. If
unsuccessful after that number of tries within the specified retry interval, the
Watchdog runs the application’s “total failure” script.
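One way to picture the retry limit is as a sliding window over recent recovery attempts. The sketch below makes that assumption explicit; the manual does not say exactly how the Watchdog counts tries against the retry interval, and the names RestartPolicy, max_restarts, and retry_interval are hypothetical.

```python
import time
from collections import deque


class RestartPolicy:
    """Hypothetical policy: allow max_restarts recoveries per retry_interval seconds."""

    def __init__(self, max_restarts, retry_interval, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.retry_interval = retry_interval
        self._clock = clock
        self._attempts = deque()  # timestamps of recent recovery attempts

    def next_action(self):
        """Return "recover" or "total_failure" for an application that just terminated."""
        now = self._clock()
        # Forget attempts that fall outside the retry interval.
        while self._attempts and now - self._attempts[0] > self.retry_interval:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_restarts:
            return "total_failure"
        self._attempts.append(now)
        return "recover"
```

With max_restarts=3 and retry_interval=60, three terminations in quick succession each yield "recover" and a fourth within the same minute yields "total_failure". Under this sliding-window assumption, a termination that occurs after the earlier attempts have aged out is treated as a fresh first try.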
For MultiVantage, the recovery script kills every MultiVantage process. Its total
failure script kills off the MultiVantage processes and causes a Linux reboot.
Watchdog and Linux
The Watchdog also monitors several Linux services and daemons. Because the Linux
“init” process originally started these processes, the Watchdog cannot use the
SIGCHLD signal to monitor them. Instead, the Watchdog uses a thread to
periodically check the validity of the process identifier for each monitored
process. If a process identifier is invalid, the Watchdog calls a Linux script to
stop and then restart the particular service. The Linux services monitored by the
Watchdog are:
■ atd – at daemon (runs programs at specific times)
■ crond – cron daemon (runs programs periodically)
■ dbgserv – provides debugging services
■ httpd – Apache hypertext transfer protocol server
■ inetd – Internet server daemon (provides telnet/rlogin/etc. connectivity)
■ klogd – Linux kernel log daemon (manages logging from Linux kernel/drivers)
■ prune – monitors and cleans up partitions
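The process-identifier check used for these services can be sketched with the common POSIX idiom of sending signal 0, which performs the existence and permission checks without delivering a signal. This is an illustration only: the manual does not say how the Watchdog obtains or validates each identifier, and the function names and the name-to-PID mapping below are assumptions.

```python
import os


def pid_is_valid(pid):
    """Return True if a process with this identifier currently exists."""
    if pid <= 0:
        # Zero and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def services_needing_restart(service_pids):
    """Given a hypothetical {service name: pid} mapping, list services whose pid is invalid."""
    return sorted(
        name for name, pid in service_pids.items() if not pid_is_valid(pid)
    )
```

A monitoring thread could call services_needing_restart() at a fixed period and run the stop-and-restart script for each name returned. A recycled identifier would pass this check even though the original daemon is gone, so a production monitor might also need to confirm the process identity.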