DAP-enabled server-side data reduction and analysis

2007 
3B.2 DAP-ENABLED SERVER-SIDE DATA REDUCTION AND ANALYSIS

Daniel L. Wang*, Charles S. Zender, and Stephen F. Jenks
University of California, Irvine

1. INTRODUCTION

Despite the frenetic pace of technology advancement towards faster, better, and cheaper hardware, terascale data reduction and analysis remain elusive for most. Disk technology advances now enable scientists to store such data volumes locally, but long-haul network bandwidth considerations all but prohibit frequent terascale transfers. Bell et al. (2006) have noted that downloading data for computation is worthwhile only if the analysis involves more than 100,000 CPU cycles per byte of data, meaning that a 1 GB dataset is only worth downloading if analysis requires 100 teracycles, or nearly 14 hours on a 2 GHz CPU. In data-intensive science, data volume rather than CPU speed drives analysis, pointing to a need for a system that colocates computation with data.

Our system provides a facility for colocating computation with data sources, leveraging shell-script-based analysis methods to specify details through an interface piggybacked on the Data Access Protocol (DAP), implemented through a custom OPeNDAP data handler (Cornillon, 2003). Scripts of netCDF Operator (NCO) (Zender, 2006b) commands are sent through an interface extended from DAP's subsetting facility and processed by a server-side execution engine. Resultant datasets may be retrieved in the same DAP request, or deferred for later retrieval. Additional processing efficiency is available through script-based parallelism. Our execution engine optionally parses scripts for data dependencies and exploits parallelism opportunities from the extracted dataflow. With this capability, existing analyses can better utilize the available parallelism of high-capacity datacenter hardware.

2. RELATED WORK

Other systems for remote computation exist in many areas and can be characterized by their generality and computation size. Grid computation engines remain the most general, allowing the widest variety of heavy computational tasks to be run on a heterogeneous set of remote systems (Foster and Kesselman, 1998). Unfortunately, their application independence means that, in practice, data dependencies cannot be specified automatically, and that data locality is generally ignored in favor of greater scheduling flexibility for higher throughput. The Globus toolkit for grid systems allows users to define input and output files to be staged to and from compute nodes (Foster and Kesselman, 1997), but this capability may not be desired for data-intensive computation, where data volume is more significant than computational complexity. Systems for remote data access that are specific to geoscience data processing are also in wide use. While such systems succeed in providing simple, lightweight access to large, remote datasets, their interfaces remain optimized for operations on small volumes of data, prompting scientists to utilize other tools such as NCO for larger-volume data analysis and reduction.

The Open-source Project for a Network Data Access Protocol (OPeNDAP) server serves a significant fraction of available geoscience data (Cornillon, 2003). It provides metadata querying and subsetting capabilities as well as access to raw data. The Earth System Grid II (ESG) project aims to address concerns similar to those of this project, and is developing filtering servers that permit data to be processed and reduced closer to its point of residence (Foster et al., 2002). The NCO tools supported by our system operate on data stored in network Common Data Form (netCDF) (Rew and Davis, 1990). netCDF is a self-describing, machine-independent format for representing scientific data, and is the most popular format for exchanging ocean-atmosphere model output.

3. OVERVIEW

Our system, currently called SSDAP (Server-Side DAP), is implemented as a wrapper to the netCDF data handler in an OPeNDAP instance. Just as in generic DAP protocol transactions, SSDAP requests are sent to an httpd, where they are shunted to the OPeNDAP CGI handler, which parses the request and forwards the result to a data handler specific to the dataset file format. Our customized netCDF data handler processes the requested script, if available, forwarding the request

* Corresponding author address: Daniel L. Wang, Department of EECS, 425 Engineering Tower, Irvine, CA; e-mail: wangd@uci.edu
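The break-even rule of thumb from Bell et al. (2006) cited in the introduction can be checked numerically; the figures below simply restate the paper's own numbers (100,000 cycles/byte, 1 GB, 2 GHz):

```python
# Bell et al. (2006): downloading pays off only when analysis needs
# more than ~100,000 CPU cycles per byte of data.
CYCLES_PER_BYTE = 100_000
DATASET_BYTES = 1e9            # 1 GB dataset
CPU_HZ = 2e9                   # 2 GHz CPU

total_cycles = CYCLES_PER_BYTE * DATASET_BYTES   # 1e14 cycles = 100 teracycles
hours = total_cycles / CPU_HZ / 3600             # wall-clock time on one core

print(f"{total_cycles / 1e12:.0f} teracycles, {hours:.1f} hours")
# 100 teracycles, 13.9 hours -- matching the "nearly 14 hours" figure
```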
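The paper does not detail how the execution engine extracts a dataflow from an NCO script, but the idea can be sketched under two assumptions: each script line is one NCO command, and (per common NCO usage) the last filename argument is the output while earlier filename arguments are inputs. The function names here are illustrative only, not the SSDAP implementation:

```python
import shlex

def parse_deps(script):
    """Extract (command, inputs, output) per NCO command line.
    Assumes the final .nc argument is the output file (common NCO usage)."""
    tasks = []
    for line in script.strip().splitlines():
        files = [a for a in shlex.split(line)[1:] if a.endswith(".nc")]
        tasks.append((line, set(files[:-1]), files[-1]))
    return tasks

def schedule_levels(tasks):
    """Group commands into levels; commands within a level have no
    unmet data dependencies and could be dispatched concurrently."""
    produced = {out for _, _, out in tasks}   # files some command creates
    done, levels, pending = set(), [], list(tasks)
    while pending:
        ready = [t for t in pending
                 if all(i not in produced or i in done for i in t[1])]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        levels.append([t[0] for t in ready])
        done.update(t[2] for t in ready)
        pending = [t for t in pending if t not in ready]
    return levels

script = """
ncra jan.nc feb.nc winter_mean.nc
ncra jun.nc jul.nc summer_mean.nc
ncbo winter_mean.nc summer_mean.nc seasonal_diff.nc
"""
levels = schedule_levels(parse_deps(script))
# The two ncra averages share no dependencies and form level 0;
# ncbo consumes both outputs and must wait in level 1.
```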
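The exact request syntax by which a script is piggybacked on a DAP request is not given in this excerpt. Purely as an illustration, a script could ride along on an ordinary OPeNDAP dataset URL as a percent-encoded query parameter; the `ssdap_script` name and URL layout below are assumptions, not the actual SSDAP interface:

```python
from urllib.parse import quote

# Hypothetical request shape: an NCO script attached to a DAP URL.
# Parameter name "ssdap_script" is invented for illustration.
script = "ncra -O t2m_jan.nc t2m_feb.nc t2m_mean.nc"
base = "http://example.edu/opendap/model/t2m.nc.dods"
url = f"{base}?ssdap_script={quote(script, safe='')}"
print(url)
```

Percent-encoding keeps the shell-style script intact inside the URL, so the server-side handler can recover it verbatim before execution.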