Good standard practice in handling a high availability system
includes careful fault monitoring so as to prevent failures if possible
or at least to react to them swiftly when they occur. The following
should be monitored for errors or warnings of all kinds:
Some monitoring can be done through simple physical inspection,
but for the most comprehensive monitoring, you should examine the
system log file (/var/adm/syslog/syslog.log) periodically for reports
on all configured HA devices. The presence of errors relating to
a device will show the need for maintenance.
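As a rough illustration of the periodic log check described above, the following Python sketch counts log lines that mention a device together with an error or warning keyword. The keyword pattern, the function names, and the device names passed in are assumptions made for this example; actual syslog entries vary by driver and subsystem, so treat this as a starting point rather than a complete monitor.

```python
import re
from collections import Counter

# Assumed keywords; real syslog entries vary by driver and subsystem.
PROBLEM_PATTERN = re.compile(r"\b(error|warning|fail(?:ed|ure)?)\b", re.IGNORECASE)


def count_problem_lines(lines, devices):
    """Return a Counter mapping each device name to the number of log
    lines that mention it together with an error or warning keyword.

    Device matching is a plain substring test, so choose names that are
    not prefixes of one another.
    """
    counts = Counter()
    for line in lines:
        if not PROBLEM_PATTERN.search(line):
            continue
        for device in devices:
            if device in line:
                counts[device] += 1
    return counts


def scan_syslog(path="/var/adm/syslog/syslog.log", devices=()):
    """Scan the system log file for problem lines on the given devices."""
    with open(path, errors="replace") as log:
        return count_problem_lines(log, devices)
```

A call such as scan_syslog(devices=["c1t2d0", "lan1"]) could be run from a scheduled job, with any non-zero count treated as a prompt to inspect the device.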
When the proper redundancy has been configured, failures can
occur with no external symptoms, so proper monitoring is important. For example,
if a Fibre Channel switch in a redundant mass storage configuration
fails, LVM will automatically fail over to the alternate path through
another Fibre Channel switch. Without monitoring, however, you may
not know that the failure has occurred, since the applications are
still running normally. But at this point, there is no redundant
path if another failure occurs, so the mass storage configuration
is vulnerable.
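The risk described here, a configuration that still works but has silently lost its redundancy, can be expressed as a simple check. The sketch below is only an illustration: the input structure (a mapping from a device name to the health of each of its paths) and the function name are hypothetical, and gathering the path status from the system is not shown.

```python
def devices_without_redundancy(paths_by_device):
    """Return the devices that have fewer than two healthy paths.

    paths_by_device maps a device name to a dict of {path: is_healthy}.
    This structure is hypothetical and used only for this example.
    """
    at_risk = []
    for device, paths in paths_by_device.items():
        healthy = [path for path, ok in paths.items() if ok]
        if len(healthy) < 2:
            at_risk.append(device)
    return sorted(at_risk)
```

A device that appears in the result is still usable if it has one healthy path, but one more failure would make it unavailable.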
Using Event Monitoring Service
Event Monitoring Service (EMS) allows you to configure monitors
of specific devices and system resources. You can direct alerts
to an administrative workstation, where operators are notified and
can take further action if a problem occurs. For example, you could
configure a disk monitor to report when a mirror is lost from a
mirrored volume group that is used in the cluster.
Refer to the manual Using High Availability Monitors (http://docs.hp.com -> High Availability -> Event Monitoring Service and HA Monitors -> Installation and User’s Guide) for additional information.
Using EMS (Event Monitoring Service) Hardware Monitors
A set of hardware monitors is available for monitoring and
reporting on memory, CPU, and many other system values. Some of
these monitors are supplied with specific hardware products.
Hardware Monitors and Persistence Requests
When hardware monitors are disabled using the monconfig tool, associated hardware monitor persistent requests
are removed from the persistence files. When hardware monitoring
is re-enabled, the monitor requests that were initialized using
the monconfig tool are re-created.
However, hardware monitor requests created using Serviceguard Manager,
or established when Serviceguard is started, are not re-created.
These requests are related to the psmmon hardware monitor.
To re-create these persistence monitor requests, halt Serviceguard
on the node, and then restart it.
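The halt-and-restart sequence can be scripted. The following Python sketch assumes the standard Serviceguard commands cmhaltnode and cmrunnode are on the PATH and that it runs with sufficient privileges; the function names and the dry-run default are choices made for this example. Any options needed when packages are running on the node are not shown and should be checked against the man pages for your release.

```python
import subprocess


def restart_commands(node):
    """Build the halt-then-start command sequence for one node."""
    return [["cmhaltnode", node], ["cmrunnode", node]]


def restart_serviceguard(node, dry_run=True):
    """Halt and restart Serviceguard on a node.

    With dry_run=True (the default in this sketch) the commands are only
    printed, so the sequence can be reviewed before it is executed.
    """
    for command in restart_commands(node):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the sequence if the halt does not succeed.
            subprocess.run(command, check=True)
```

Running restart_serviceguard("node1") prints the two commands without executing them; pass dry_run=False only after confirming that halting the node is acceptable.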
Using HP ISEE (HP Instant Support Enterprise Edition)
In addition to messages reporting actual device failure, the
logs may accumulate messages of lesser severity which, over time,
can indicate that a failure may happen soon. One product that provides
a degree of automation in monitoring is called HP ISEE, which gathers
information from the status queues of a monitored system to see
what errors are accumulating. This tool will report failures and
will also predict failures based on statistics for devices that
are experiencing specific non-fatal errors over time. In a Serviceguard
cluster, HP ISEE should be run on all nodes.
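The general idea of predicting a failure from accumulating non-fatal errors can be illustrated with a sliding-window counter. This sketch is not a description of how HP ISEE works internally; the class name, the window length, and the threshold are all assumptions chosen for the example.

```python
from collections import deque


class NonFatalErrorTrend:
    """Flag a device when too many non-fatal errors fall within a time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one non-fatal error at the given time (in seconds).

        Returns True if the number of errors inside the window has
        reached the threshold, which suggests the device is at risk.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard errors that are older than the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With a one-hour window and a threshold of three, three errors within a few minutes would flag the device, while the same three errors spread over several hours would not.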
HP ISEE also reports error conditions directly to an HP Response Center,
alerting support personnel to the potential problem. HP ISEE is available
through various support contracts. For more information, contact
your HP representative.