Clowder is an open-source data management system that supports the curation of long-tail data and metadata across multiple research domains and diverse data types. Institutions and labs can install and customize their own instance of the framework on local hardware or on remote cloud computing resources to provide a shared service to distributed communities of researchers. Data can be ingested directly from instruments or uploaded manually by users and then shared with remote collaborators using a web front end. We discuss some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas, including digital preservation, geoscience, materials science, medicine, social science, cultural heritage, and the arts. These challenges include support for large amounts of data, horizontal scaling of domain-specific preprocessing algorithms, the ability to provide new data visualizations in the web browser, a comprehensive web service API for automatic data ingestion and curation, a suite of social annotation and metadata management features to support data annotation by communities of users and algorithms, and a web-based front end for interacting with code running on heterogeneous clusters, including HPC resources.
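To make the web service API mentioned above concrete, the following is a minimal sketch of how a script might push a file into a Clowder instance for indexing and previewing. The host, API key, and exact endpoint and parameter names here are illustrative assumptions and should be checked against the API documentation of the target deployment.

```python
# Minimal sketch: pushing a file into a Clowder instance over its web service API.
# The host, API key, and exact endpoint/parameter names below are assumptions for
# illustration; consult the API documentation of your Clowder deployment.
import requests

CLOWDER_URL = "https://clowder.example.org"   # hypothetical instance
API_KEY = "replace-with-your-key"             # per-user key issued by the instance

def upload_file(path):
    """Upload a local file so extractors can index and preview it."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{CLOWDER_URL}/api/files",
            params={"key": API_KEY},
            files={"File": f},
        )
    resp.raise_for_status()
    return resp.json()  # in this sketch, assumed to contain the id of the new file

if __name__ == "__main__":
    print(upload_file("samples/site_survey.tif"))
```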
We present Brown Dog, two highly extensible services that aim to leverage any existing piece of code, library, service, or standalone software (past or present) to provide users with a simple-to-use, programmable means of automated aid in the curation and indexing of distributed collections of uncurated and/or unstructured data. Data collections such as these, encompassing large varieties of data in addition to large amounts of data, pose a significant challenge within modern-day "Big Data" efforts. The two services, the Data Access Proxy (DAP) and the Data Tilling Service (DTS), focus on format conversions and content-based analysis/extraction, respectively; they wrap relevant conversion and extraction operations within arbitrary software, manage their deployment in an elastic manner, and manage job execution from behind a deliberately compact REST API. We describe the motivation and scientific drivers for such services, the constituent components that allow arbitrary software/code to be used and managed, and, lastly, an evaluation of the systems' capabilities and scalability.
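As an illustration of the deliberately compact REST API described above, the sketch below shows how a client might request a format conversion from the DAP and a content-based extraction from the DTS. The host names, endpoint paths, and response shapes are assumptions made for illustration only.

```python
# Sketch of how a client might drive the two services' REST APIs.
# Host names, endpoint shapes, and response formats below are assumptions;
# the actual compact API is documented with the Brown Dog services themselves.
import requests

DAP_URL = "https://dap.example.org"  # hypothetical Data Access Proxy host
DTS_URL = "https://dts.example.org"  # hypothetical Data Tilling Service host

def convert(file_url, target_format):
    """Ask the DAP to convert a remote file to another format (e.g. 'pdf')."""
    # In practice the source URL may need to be percent-encoded before being
    # embedded in the request path.
    resp = requests.get(f"{DAP_URL}/convert/{target_format}/{file_url}")
    resp.raise_for_status()
    return resp.text  # assumed here to be the URL of the converted result

def extract(path):
    """Submit a local file to the DTS and get back derived metadata/tags."""
    with open(path, "rb") as f:
        resp = requests.post(f"{DTS_URL}/extract", files={"File": f})
    resp.raise_for_status()
    return resp.json()
```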
This paper discusses why research software is important and what sustainability means in this context. It then discusses how research software sustainability can be achieved, describes our experiences at NCSA through specific examples, and reflects on what we have learned from them and how we think these lessons can help others.
Renal biopsies form the gold standard of diagnostic and prognostic assessments of renal transplants. With the addition of new quantitative strategies that supplement renal biopsy interpretation, such as gene arrays and metabolomics, incorporating all quantitative measures into clinical interpretation will require multi-dimensional analyses. Currently, renal biopsies are analyzed manually; the quantitative features of pathology observed on the biopsies are limited to hand counts. Standardized, automated detection of pathology observed in a kidney transplant biopsy will enable the input of these digital images alongside other quantitative measures from new technologies, with potential gains in precision in patient care. We investigate a learning framework for detecting pathological changes in biopsy images that addresses two main issues: inadequate training sets and the significant diversity of color and tissue shape in whole slide images. Two case studies, automatic detection of interstitial inflammation and of tubular casts, are presented in this work. We then propose a fully automated glomerulus extraction framework for micrographs of entire renal tissue, focusing on extracting Bowman's capsule, the supportive structure of glomeruli. Statistical approaches are also introduced to further improve performance. Human expert annotations of interstitial inflammation and tubular casts in 10 H&E-stained renal tissues of nonhuman primates, along with more than 100 annotated glomeruli, are used to demonstrate the superior performance of the proposed algorithms over existing solutions.
Digitizing large collections of Cultural Heritage (CH) resources and providing tools for their management, analysis and visualization is critical to CH research. A key element in achieving this goal is to provide user-friendly software offering an abstract interface for interaction with a variety of digital content types. To address these needs, the Medici content management system is being developed in a collaborative effort between the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, Bibliotheca Alexandrina (BA) in Egypt, and the Cyprus Institute (CyI). The project is pursued in the framework of the European project “Linking Scientific Computing in Europe and Eastern Mediterranean 2” (LinkSCEEM2) and supported by work funded through the U.S. National Science Foundation (NSF), the U.S. National Archives and Records Administration (NARA), the U.S. National Institutes of Health (NIH), the U.S. National Endowment for the Humanities (NEH), the U.S. Office of Naval Research (ONR), the U.S. Environmental Protection Agency (EPA), as well as other private sector efforts. Medici is a Web 2.0 environment integrating analysis tools for the auto-curation of uncurated digital data, allowing automatic processing of input (CH) datasets and visualization of both data and collections. It offers a simple user interface for dataset preprocessing, previewing, automatic metadata extraction, user input of metadata and provenance support, storage, archiving and management, representation and reproduction. Building on previous experience (Medici 1), NCSA and CyI are working towards improving the technical, performance and functionality aspects of the system. The current version of Medici (Medici 2) is the result of these efforts. It is a scalable, flexible, robust distributed framework with wide data format support (including 3D models and Reflectance Transformation Imaging, RTI) and metadata functionality. We provide an overview of Medici 2’s current features, supported by representative use cases, as well as a discussion of future development directions.
Brown Dog is a data transformation service for the auto-curation of long-tail data. In this digital age, we have more data available for analysis than ever, and this trend will only increase. According to most estimates, 70-80% of this data is unstructured; combined with unsupported data formats and inaccessible software tools, this means the data is, in essence, neither easily accessible nor usable to its owners in a meaningful way. Brown Dog aims to make this data more accessible and usable through auto-curation and indexing, leveraging existing and novel data transformation tools. In this paper, we discuss recent major component improvements to Brown Dog, including transformation tools called extractors and converters; desktop, web, and terminal-based clients that perform data transformations; libraries written in multiple programming languages that integrate with existing software and extend its data curation capabilities; an online tool store where users can contribute, manage, and share data transformation tools and receive credit for developing them; cyberinfrastructure for deploying the system on diverse computing platforms, with scalability via Docker swarm; a workflow management service for creatively integrating existing transformations into custom, reproducible workflows that meet research needs; and Brown Dog's data management capabilities. This paper also discusses data transformation tools developed to support a number of scientific and allied use cases, thereby benefiting researchers in diverse domains. Finally, we briefly discuss our future directions with regard to production deployments as well as how users can access Brown Dog to manage their uncurated, unstructured data.
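To illustrate what an extractor is, the following is a minimal sketch of a content-based extractor of the kind Brown Dog deploys, modeled on the pyclowder 2.x Extractor pattern; exact module and method names may differ across library versions, so treat the specifics as assumptions.

```python
# Sketch of a minimal extractor: a small Python service that receives a file,
# derives some content-based metadata (here, a line count), and posts it back.
# Assumes the pyclowder 2.x library and its Extractor base class; exact module
# and method names may differ across versions.
import logging

import pyclowder.files
from pyclowder.extractors import Extractor


class LineCountExtractor(Extractor):
    """Example extractor that records the number of lines in a text file."""

    def __init__(self):
        Extractor.__init__(self)
        self.setup()  # parse command-line/environment configuration
        logging.getLogger("pyclowder").setLevel(logging.INFO)

    def process_message(self, connector, host, secret_key, resource, parameters):
        local_path = resource["local_paths"][0]   # temporary local copy of the file
        file_id = resource["id"]

        with open(local_path, "rb") as f:
            line_count = sum(1 for _ in f)

        # Wrap the result in the expected metadata envelope and attach it to the file.
        metadata = self.get_metadata({"lines": line_count}, "file", file_id, host)
        pyclowder.files.upload_metadata(connector, host, secret_key, file_id, metadata)


if __name__ == "__main__":
    LineCountExtractor().start()
```

In this pattern the extractor subscribes to messages about newly uploaded files, runs its analysis on each one, and writes the derived metadata back so it becomes searchable alongside the original data.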
Institutions worldwide, whether economic, social, or political, are relying increasingly on communication technology to perform a variety of functions: holding remote business meetings, discussing design issues in product development, enabling consumers to remain connected with their families and children, and so on. In this environment, where geographic and temporal boundaries are shrinking rapidly, electronic communication media play an important role. With recent advances in 3D sensing, computing on new hardware platforms, high-bandwidth communication connectivity, and 3D display technology, the vision of 3D video-teleconferencing and of the tele-immersive experience has become very attractive. These advances have led to tele-immersive communication systems that enable 3D interactive experiences in a virtual space consisting of objects born in physical and virtual environments. This experience is achieved by fusing real-time color-plus-depth video of physical scenes from multiple stereo cameras located at different geographic sites, displaying 3D reconstructions of physical and virtual objects, and performing computations to facilitate interactions between objects. While tele-immersive (TI) systems have been attracting considerable attention, the advantages of the interactions they enable and of the delivered 3D content for viewing, as opposed to current 2D high-definition video, have not been evaluated. In this paper, we study the effectiveness of three different types of communication media for remote collaboration in order to document the pros and cons of new technologies such as TI. The three communication media, 3D tele-immersive video, 2D Skype video, and face-to-face interaction, were used in a collaborative environment built around a remote product development scenario. Through a study conducted with 90 subjects, we discuss the strengths and weaknesses of the different media and identify scope for improvement in each of them.