Geographical failover for the EGEE-WLCG grid collaboration tools

2008 
Worldwide grid projects such as EGEE and WLCG need highly available services, not only for grid usage but also for the associated operations. In particular, the tools used for daily activities and operational procedures are considered critical. The operations activity of EGEE relies on many tools developed by teams from different countries. Each tool was originally deployed as a single instance, making each one a single point of failure. In this context, the EGEE failover problem was solved by replicating tools at different sites and using specific DNS features to fail over automatically to a given service. A new domain for grid operations (gridops.org) was registered and deployed following DNS testing in a virtual machine (VM) environment using nsupdate, NS/zone configuration and fast TTLs. In addition, replication of databases, web servers and web services has been tested and configured. In this paper, we describe the technical mechanism used in our approach to replication and failover. We also describe the procedure implemented for the EGEE/WLCG CIC Operations Portal use case. Furthermore, we discuss the relevance of failover procedures to other grid projects and grid services. Future plans for improving the procedures are also described.
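
As an illustration of the DNS-based mechanism mentioned above, the following minimal sketch builds an nsupdate batch that repoints a service alias at the first healthy replica, using a short TTL so that clients pick up the change quickly. It is not the procedure used by the authors: the alias `portal.gridops.org`, the replica host names, the 60-second TTL and both function names are assumptions made for the example, and health checking is reduced to a boolean supplied by the caller.

```python
from typing import Iterable, Optional, Tuple


def pick_active(instances: Iterable[Tuple[str, bool]]) -> Optional[str]:
    """Return the first host reported healthy, in priority order, or None."""
    for host, healthy in instances:
        if healthy:
            return host
    return None


def build_nsupdate_script(zone: str, alias: str, target: str,
                          ttl: int = 60, server: Optional[str] = None) -> str:
    """Build nsupdate input that replaces the CNAME of `alias` with `target`.

    All names passed in are hypothetical; the TTL is kept short so that
    resolvers re-query soon after a failover.
    """
    lines = []
    if server:
        lines.append(f"server {server}")  # optional: master name server to update
    lines.append(f"zone {zone}")
    lines.append(f"update delete {alias}. CNAME")  # drop the old alias target
    lines.append(f"update add {alias}. {ttl} CNAME {target}.")
    lines.append("send")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical replicas: the primary is down, the backup is up.
    replicas = [("primary.example.org", False), ("backup.example.org", True)]
    active = pick_active(replicas)
    if active is not None:
        print(build_nsupdate_script("gridops.org", "portal.gridops.org", active))
```

The printed text could be piped to nsupdate, typically with a TSIG key authorised for dynamic updates on the zone. Whether a CNAME or an A record is switched, and how replica health is determined, are design choices left open here.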