Centralized configuration system for a large scale farm of network booted computers

2012 
The ATLAS trigger and data acquisition online farm is composed of nearly 3,000 computing nodes with various configurations, functions and requirements. Maintaining such a cluster is a major challenge from the system administration point of view, so the System Administration team has adopted various tools to help manage the farm efficiently. In particular, a custom central configuration system, ConfDBv2, was developed for overall farm management. The majority of the systems are network booted and run an operating system image provided by a Local File Server (LFS) via the local area network (LAN). This method guarantees the uniformity of the systems and, in case of issues, allows very fast recovery of the local disks, which can be used as scratch area. It also provides greater flexibility, as the nodes can be reconfigured and restarted with a different operating system in a very timely manner. A user-friendly web interface offers a quick overview of the current farm configuration and status, allowing changes to be applied to selected subsets or to the whole farm in an efficient and consistent manner. In addition, various actions that would otherwise be time-consuming and error-prone can be executed quickly and safely. We describe the design, functionality and performance of this system and its web-based interface, including its integration with other CERN and ATLAS databases and with the monitoring infrastructure.