Fault-Detection Managers: More May Not Be the Merrier.

2021 
A fault management system contains managers that detect faults and initiate recovery actions. Such systems are often distributed and decoupled from the applications they manage. Although this arrangement promotes scalability, it makes the recovery of applications dependent on the fault management system itself. This work introduces two novel equations to help meet the performance objectives of applications. First, we create an equation that estimates the maximum number of jobs an application instance can handle while meeting a given performance objective. An admission control mechanism then uses this equation to restrict the number of jobs (targeted at operational application instances) allowed to enter the system. Next, we create a second equation that computes the response time distribution of an application. We then develop a simulation model that predicts the impact of failures in four sample fault management architectures on an application's performance. Exploiting our equations, we compare the architectures under three distinct ways of handling affected jobs when application instances fail: allowing job loss; retrying jobs, which results in overload; and employing admission control to mitigate the overload. Our simulation results show that increasing the number of managers is not always beneficial; rather, the interconnection topology of the management architecture (i.e., the layout of interconnects linking the architectural components), together with the model parameter values, may sometimes play a bigger role in the application's performance.
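The abstract does not state the two equations, so the following is only an illustrative sketch of the general idea of capping jobs per instance to meet a response-time objective. It is not the paper's formula. It assumes a single-server, first-come-first-served instance with exponentially distributed service times, in which a job that finds n jobs ahead of it has an Erlang(n + 1) response time. All function and parameter names are hypothetical.

```python
import math


def prob_response_within(jobs_ahead, service_rate, deadline):
    """P(response time <= deadline) for a job that finds `jobs_ahead` jobs
    in an FCFS single-server queue with exponential service times
    (an assumed model, not the paper's)."""
    x = service_rate * deadline
    # Erlang(jobs_ahead + 1, service_rate) CDF:
    # 1 - sum_{k=0}^{jobs_ahead} e^{-x} x^k / k!
    tail = sum(math.exp(-x) * x**k / math.factorial(k)
               for k in range(jobs_ahead + 1))
    return 1.0 - tail


def max_jobs_per_instance(service_rate, deadline, target_prob):
    """Largest number of jobs (including the newly admitted one) an instance
    may hold so that the new job meets `deadline` with probability of at
    least `target_prob`. Returns 0 if even an idle instance cannot meet it."""
    if not 0.0 < target_prob <= 1.0:
        raise ValueError("target_prob must be in (0, 1]")
    n = 0
    # The probability falls as more jobs are ahead, so the loop terminates.
    while prob_response_within(n, service_rate, deadline) >= target_prob:
        n += 1
    return n


def admit(jobs_in_instance, limit):
    """Admission control: accept a job only if the instance is below its cap."""
    return jobs_in_instance < limit


if __name__ == "__main__":
    # Hypothetical parameters: 10 jobs/s service rate, 1 s deadline, 90% target.
    limit = max_jobs_per_instance(service_rate=10.0, deadline=1.0, target_prob=0.9)
    print("per-instance job limit:", limit)
    print("admit when 3 jobs present:", admit(3, limit))
```

Under these assumptions, the cap would be recomputed whenever the service rate or the objective changes. When an application instance fails, the same check decides whether retried jobs may enter the surviving instances or must be rejected to avoid overload.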