Server initialization, recovery, and resets
S8100 Initialization
Maintenance Procedures
December 2003
The Watchdog tries to recreate the application a specified number of times. If unsuccessful after that
number of tries within the specified retry interval, the Watchdog runs the application’s “total failure”
script.
For Communication Manager, the recovery script kills every Communication Manager process. Its
total-failure script kills the Communication Manager processes and then reboots Linux.
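The retry rule above can be sketched as a small counter. This is only an illustration, not the Watchdog's actual code: the class name RestartPolicy, its parameters, and the sliding-window interpretation of "within the specified retry interval" are all assumptions made for the example.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical model of a retry limit inside a retry interval."""

    def __init__(self, max_retries, interval_seconds):
        self.max_retries = max_retries
        self.interval_seconds = interval_seconds
        self._attempts = deque()  # timestamps of recent restart attempts

    def on_failure(self, now):
        """Return "restart" or "total_failure" for a failure seen at time `now`."""
        # Forget attempts that have aged out of the retry interval
        # (assumed here to be a sliding window).
        while self._attempts and now - self._attempts[0] >= self.interval_seconds:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_retries:
            # Retry budget exhausted: the caller would run the
            # application's "total failure" script.
            return "total_failure"
        self._attempts.append(now)
        return "restart"
```

With a limit of three retries in 60 seconds, failures at 0, 10, and 20 seconds each yield a restart, and a fourth failure at 30 seconds yields the total-failure action.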
Watchdog and Linux
The Watchdog monitors several Linux services/daemons. Because the Linux init process originally started
these processes, Watchdog cannot use the SIGCHLD signal to monitor them. Instead, Watchdog uses a
thread to periodically check the validity of the process identifier for each monitored process. If a
process identifier is invalid, the Watchdog calls a Linux script to stop and then restart that service.
The Linux services monitored by Watchdog are:
• atd – at daemon (runs programs at specific times)
• crond – cron daemon (runs programs periodically)
• dbgserv – provides debugging services
• httpd – Apache hypertext transfer protocol server (provides Web service)
• inetd – Internet server daemon (provides telnet/rlogin/etc. connectivity)
• klogd – Linux kernel log daemon (manages logging from Linux kernel/drivers)
• prune – monitors and cleans up partitions
• syslogd – Linux system log daemon (manages logging from Linux services and applications)
• xntpd – network time protocol daemon (manages clock synchronizations across the network)
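One common way to check whether a process identifier is still valid on Linux is to send signal 0, which performs the existence and permission checks without delivering a signal. The sketch below shows that technique. It is not taken from the Watchdog itself; the function names and the dictionary of service PIDs are hypothetical, and the code assumes a POSIX system.

```python
import os


def pid_is_valid(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: check only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def find_dead_services(pids_by_service, is_valid=pid_is_valid):
    """Return the names of services whose recorded PID is no longer valid."""
    return [name for name, pid in pids_by_service.items() if not is_valid(pid)]
```

A polling thread could call find_dead_services at a fixed period and hand each returned name to the stop-and-restart script. A PID can also be reused by an unrelated process, so a check like this cannot prove that the original service is the one still running.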
Watchdog’s HiMonitor
The Watchdog’s HiMonitor checks for runaway processes and terminates them. HiMonitor deals with an
infinitely looping process that prevents lower-priority processes from running. More specifically, the
high-priority HiMonitor process periodically looks for responses from the low-priority LoMonitor
process. If a response is present, HiMonitor resets Watchdog’s timer. If not, HiMonitor issues and logs a
top command to determine which processes are taking up CPU resources. HiMonitor then takes one of
three recovery actions in this order:
1 If a process within Watchdog’s or the Process Manager’s Linux process group is consuming too
high a percentage of CPU occupancy (the percentage is set in watchd.conf), HiMonitor kills the
process.
2 If no process is using too high a percentage, but more than 100 instances of the same monitored
process are running, HiMonitor reboots Linux.
3 Otherwise, HiMonitor does nothing and waits for the system to recover on its own.
If LoMonitor does not respond to a preset threshold of HiMonitor checks, then, as a final recovery action,
HiMonitor reboots Linux.
CAUTION:
Escalate to an Avaya engineer for guidance with this recovery, because it is potentially
disruptive. A process can legitimately occupy abnormally high amounts of processor time
due to server load, and killing it could make the server totally unavailable.
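The ordered recovery actions described in this section can be sketched as a single decision function. This is a simplified model under stated assumptions, not HiMonitor's implementation. The function name choose_recovery, the snapshot format (a list of dictionaries), and the field names are hypothetical. Every entry in the snapshot is treated as a monitored process, and when several processes exceed the limit the example picks the one with the highest CPU figure.

```python
from collections import Counter


def choose_recovery(processes, cpu_limit_percent, max_instances=100):
    """Pick a recovery action from a process snapshot (hypothetical format).

    Each entry is a dict with keys "name", "pid", "cpu_percent", and
    "in_group" (True if the process belongs to the Watchdog's or the
    Process Manager's process group).
    """
    # Step 1: a process in the relevant group is over the CPU limit.
    offenders = [
        p for p in processes
        if p["in_group"] and p["cpu_percent"] > cpu_limit_percent
    ]
    if offenders:
        worst = max(offenders, key=lambda p: p["cpu_percent"])
        return ("kill", worst["pid"])

    # Step 2: too many instances of the same process are running.
    counts = Counter(p["name"] for p in processes)
    if any(count > max_instances for count in counts.values()):
        return ("reboot", None)

    # Step 3: do nothing and let the system recover on its own.
    return ("wait", None)
```

The function only returns a decision and leaves acting on it to the caller, which keeps the sketch free of side effects. The final escalation, a reboot after LoMonitor misses a preset number of checks, is outside the scope of this example.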