System availability monitoring

Pat Moran, Pat Gaffney, John Melody, Maria Condon, Margaret Hayden

1990

This paper describes a process set up by Digital to monitor and quantify the availability of its systems. The reliability data are collected automatically and stored in a database. The breadth of data gathered provides a unique opportunity to correlate hardware and software failures. In addition, several hypotheses have been tested, including the relationship between crash rate and system load, the interdependence of crashes, the cause of crashes, and the effect of new operating system releases. It is concluded that the process, in operation since 1988, has yielded worthwhile information on the products monitored. The usual availability metrics are calculated regularly for the machines monitored. Trends in system fault occurrence have been identified, leading to suggestions for both software and hardware improvements. The monitoring process and analysis methodology are revised on an ongoing basis to improve the quality of the information obtained and to extend the analysis to Digital's new systems, two of which are the recently announced VAX 9000 mainframe and the fault-tolerant VAXft 3000.
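The abstract does not say which availability metrics are computed. As a rough illustration only, the sketch below assumes the conventional definitions (availability as the fraction of time up, mean time between failures, mean time to repair) and a hypothetical `Outage` record; neither the record layout nor the function name comes from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Hypothetical outage record; times are hours since the window began."""
    start_hour: float
    end_hour: float


def availability_metrics(outages, window_hours):
    """Compute textbook availability metrics over one observation window.

    Assumes outages do not overlap and fall entirely inside the window.
    """
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = window_hours - downtime
    count = len(outages)
    return {
        # Fraction of the window during which the system was up.
        "availability": uptime / window_hours,
        # Mean time between failures: uptime per failure.
        "mtbf_hours": uptime / count if count else float("inf"),
        # Mean time to repair: downtime per failure.
        "mttr_hours": downtime / count if count else 0.0,
    }


# Illustrative data: two outages in a 30-day (720-hour) window.
metrics = availability_metrics(
    [Outage(100.0, 102.0), Outage(400.0, 401.0)], window_hours=720.0
)
```

With these made-up figures, three hours of downtime over 720 hours gives an availability of 717/720, an MTBF of 358.5 hours and an MTTR of 1.5 hours.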
