Community detection is a common problem in graph data analytics that consists of finding groups of densely connected nodes with few connections to nodes outside the group. In particular, identifying communities in large-scale networks is an important task in many scientific domains. In this review, we evaluated eight state-of-the-art and five traditional algorithms for overlapping and disjoint community detection on large-scale real-world networks with known ground-truth communities. These 13 algorithms were empirically compared using goodness metrics that measure the structural properties of the identified communities, as well as performance metrics that evaluate these communities against the ground truth. Our results show that these two types of metrics are not equivalent: an algorithm may perform well in terms of goodness metrics but poorly in terms of performance metrics, or vice versa. WIREs Comput Stat 2014, 6:426–439. doi: 10.1002/wics.1319
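To make the goodness-versus-performance distinction concrete, the sketch below (not taken from the review) contrasts the two metric families on a toy graph: modularity as a goodness metric computed from structure alone, and normalized mutual information (NMI) as a performance metric computed against ground-truth labels. It assumes networkx and scikit-learn are available and uses networkx's greedy modularity heuristic as a stand-in detector.

```python
# A minimal sketch (not from the review) contrasting a goodness metric
# (modularity) with a ground-truth performance metric (NMI), using
# networkx's greedy modularity heuristic as a stand-in detector.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity
from sklearn.metrics import normalized_mutual_info_score

G = nx.karate_club_graph()  # toy network with known ground-truth groups
detected = greedy_modularity_communities(G)

# Goodness metric: structural quality of the detected communities alone.
print("modularity:", modularity(G, detected))

# Performance metric: agreement with the ground-truth labels.
truth = [G.nodes[v]["club"] for v in G]                       # ground truth
labels = {v: i for i, c in enumerate(detected) for v in c}    # detected labels
pred = [labels[v] for v in G]
print("NMI vs ground truth:", normalized_mutual_info_score(truth, pred))
```

A detector tuned to maximize modularity can score well on the first number while still splitting or merging the ground-truth groups, which is exactly the divergence between the two metric families that the review measures.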
The ability to efficiently handle massive amounts of data is necessary for the continuing development toward exascale scientific data-mining applications and database systems. Unfortunately, recent years have shown a growing gap between the size and complexity of data produced by scientific applications and the limited I/O bandwidth available on modern high-performance computing systems. Using data compression to lower the degree of I/O activity offers a promising means of addressing this problem. However, the standard compression algorithms previously explored for such use offer limited gains on both the end-to-end throughput and storage fronts. In this paper, we introduce an in-situ compression scheme aimed at improving end-to-end I/O throughput as well as reducing dataset size. Our technique, PRIMACY (Preconditioning Id-MApper for Compressing incompressibility), acts as a preconditioner for standard compression libraries by modifying the representation of the original floating-point scientific data to increase byte-level repeatability, allowing standard lossless compressors to take advantage of their entropy-based byte-level encoding schemes. We additionally present a theoretical model for compression efficiency in high-performance computing environments and evaluate the efficiency of our approach via comparative analysis. In our evaluations on 20 real-world scientific datasets, PRIMACY achieved up to 38% and 22% improvements over standard end-to-end write and read throughputs, respectively, in addition to a 25% increase in compression ratios paired with a 3- to 4-fold improvement in both compression and decompression throughput over general-purpose compressors.
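PRIMACY's actual id-mapping is not reproduced here; the sketch below only illustrates the general preconditioning idea the abstract describes, using a simpler, well-known transform (byte shuffling) to raise byte-level repeatability before a standard lossless compressor. Only zlib and NumPy are assumed; the synthetic cumulative-sum array is a stand-in for smooth floating-point simulation output.

```python
# A sketch of the general preconditioning idea behind PRIMACY (the paper's
# actual id-mapper is more involved): rearrange the bytes of floating-point
# data so that bytes of equal significance are adjacent, which raises
# byte-level repeatability and helps entropy-based lossless compressors.
import zlib
import numpy as np

def shuffle_bytes(a: np.ndarray) -> bytes:
    """Transpose an array of N k-byte floats into k planes of N bytes each."""
    raw = a.view(np.uint8).reshape(a.size, a.itemsize)
    return raw.T.tobytes()  # byte plane 0, then plane 1, ...

rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=1_000_000))  # smooth float64 "simulation" data

plain = zlib.compress(data.tobytes(), 6)
shuffled = zlib.compress(shuffle_bytes(data), 6)
print("ratio w/o preconditioning:", data.nbytes / len(plain))
print("ratio with preconditioning:", data.nbytes / len(shuffled))
```

On smooth data the high-order exponent bytes of neighboring values repeat, so grouping them into contiguous planes gives the entropy coder long runs to exploit, while the transform itself is lossless and cheap enough to run in situ.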
Background: The latent behavior of a biological cell is complex. Deriving the underlying simplicity, or the fundamental rules governing this behavior, has been the Holy Grail of systems biology. Data-driven prediction of the system components, and of the interplays among them, that are responsible for the target system's phenotype is a key and challenging step in this endeavor.

Results: The proposed approach, which we call System Phenotype-related Interplaying Components Enumerator (Spice), iteratively enumerates statistically significant system components that are hypothesized (1) to play an important role in defining the specificity of the target system's phenotype(s); (2) to exhibit a functionally coherent behavior, namely, to act in a coordinated manner to perform the phenotype-specific function; and (3) to improve the predictive skill of the system's phenotype(s) when used collectively in an ensemble of predictive models. Spice can be applied to both instance-based data and network-based data. When validated, Spice effectively identified system components related to three target phenotypes: biohydrogen production, motility, and cancer. Manual curation of the results agreed with the known phenotype-related system components reported in the literature. Additionally, using the identified system components as discriminatory features improved prediction accuracy by 10% on the phenotype-classification task when compared to a number of state-of-the-art methods applied to eight benchmark microarray data sets.

Conclusion: We formulate a problem, the enumeration of phenotype-determining system component interplays, and propose an effective methodology (Spice) to address it. Spice improved the identification of cancer-related groups of genes from various microarray data sets and detected groups of genes associated with microbial biohydrogen production and motility, many of which were reported in the literature. Spice also improved the predictive skill of the system's phenotype determination compared to individual classifiers and/or other ensemble methods, such as bagging, boosting, random forest, the nearest shrunken centroid, and random forest variable selection.
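Spice's statistical criteria are not spelled out in the abstract; the sketch below only illustrates the overall ensemble loop it describes: iteratively enumerate a small, discriminative feature subset, train a model on it, and keep it only if the ensemble's phenotype prediction improves. scikit-learn is assumed, the synthetic data stands in for a microarray set, and the ANOVA F-score subset picker is an illustrative choice, not Spice's actual test.

```python
# A minimal sketch (not Spice's actual algorithm) of the ensemble idea:
# enumerate disjoint discriminative feature subsets and retain each one
# only if it improves the ensemble's accuracy on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)  # stand-in for microarray data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

remaining = set(range(X.shape[1]))
ensemble, best = [], 0.0
for _ in range(10):  # each round enumerates one candidate component set
    idx = sorted(remaining)
    scores = f_classif(X_tr[:, idx], y_tr)[0]
    subset = [idx[i] for i in np.argsort(scores)[-5:]]   # top discriminative genes
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, subset], y_tr)
    trial = ensemble + [(subset, clf)]
    votes = (np.mean([c.predict(X_val[:, s]) for s, c in trial], axis=0) > 0.5)
    acc = accuracy_score(y_val, votes.astype(int))
    if acc >= best:            # keep the subset only if the ensemble improves
        ensemble, best = trial, acc
    remaining -= set(subset)   # enumerate a disjoint subset next round
print(f"ensemble of {len(ensemble)} component sets, accuracy {best:.2f}")
```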
Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often cast as a combinatorial graph problem in which an enumeration of all feasible models is sought. The data-intensive and NP-hard nature of such problems, however, challenges existing methods to meet the required scale of data size and uncertainty, even on modern supercomputers. Maximal clique enumeration (MCE) in a graph derived from such biological data is often a rate-limiting step in detecting protein complexes in protein interaction data, finding clusters of co-expressed genes in microarray data, or identifying clusters of orthologous genes in protein sequence data. We report two key advances that address this challenge. We designed and implemented the first (to the best of our knowledge) parallel MCE algorithm that scales linearly on thousands of processors, running MCE on real-world biological networks with thousands to hundreds of thousands of vertices. In addition, we proposed and developed Graph Perturbation Theory (GPT), which establishes a foundation for efficiently solving the MCE problem in perturbed graphs, which model the uncertainty in the data. GPT formulates necessary and sufficient conditions for detecting the differences between the sets of maximal cliques in the original and perturbed graphs, and it reduces enumeration time by more than 80% compared to complete recomputation.
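For reference, the classic serial baseline for MCE is the Bron-Kerbosch algorithm with pivoting, sketched below; the paper's contributions (a parallel MCE algorithm and GPT for perturbed graphs) build on this problem, not on this exact code.

```python
# A compact serial Bron-Kerbosch enumerator with pivoting, the standard
# sequential baseline for maximal clique enumeration (MCE).
def maximal_cliques(adj):
    """Yield every maximal clique of a graph given as {v: set(neighbors)}."""
    def expand(R, P, X):
        # R: current clique; P: candidates to extend R; X: already explored.
        if not P and not X:
            yield R                       # R cannot be extended: maximal
            return
        pivot = max(P | X, key=lambda u: len(adj[u] & P))
        for v in P - adj[pivot]:          # pivoting prunes redundant branches
            yield from expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    yield from expand(set(), set(adj), set())

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(sorted(map(sorted, maximal_cliques(adj))))  # [[0, 1, 2], [1, 2, 3]]
```

The recursion tree of `expand` is what a parallel MCE implementation distributes across processors, and GPT's contribution is to avoid re-running this enumeration from scratch when a few edges of the graph are perturbed.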
Runtime data sharing across applications is important for avoiding high I/O overhead in scientific data analytics. Sharing data in a staging space running on a set of dedicated compute nodes is faster than writing data to a slow disk-based parallel file system (PFS) and then reading it back for post-processing. Originally, the staging space was based purely on main memory (DRAM) and was thus several orders of magnitude faster than the PFS approach. However, storing all the data produced by large-scale simulations in DRAM is impractical. Moving data from memory to SSD-based burst buffers is a potential approach to this issue, but SSDs are about one order of magnitude slower than DRAM. To optimize data access performance over the staging space, methods such as prefetching data from SSDs according to detected spatial access patterns and distributing data across the network topology have been explored. Although these methods work well for the uniform mesh data they were designed for, they are not well suited to adaptive mesh refinement (AMR) data. Two major issues must be addressed before constructing such a memory-hierarchy- and topology-aware runtime AMR data sharing framework: (1) spatial access pattern detection and prefetching for AMR data, and (2) AMR data distribution across the network topology at runtime. We propose a framework that addresses these challenges and demonstrate its effectiveness with extensive experiments on AMR data. Our results show that the framework's spatial access pattern detection and prefetching methods yield about a 26% performance improvement for client analytical processes, and that its topology-aware data placement can improve overall data access performance by up to 18%.
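The framework itself is not reproduced here; the sketch below only illustrates the kind of mechanism the abstract describes, namely stride-based spatial access pattern detection with prefetch into a DRAM-resident cache. The `fetch_block` function is a hypothetical stand-in for an SSD/burst-buffer read, and the integer block IDs stand in for linearized AMR block coordinates.

```python
# A minimal sketch (not the paper's framework) of stride-based spatial
# access pattern detection with prefetching; fetch_block() is a
# hypothetical stand-in for an SSD/burst-buffer read.
from collections import OrderedDict

def fetch_block(block_id):
    return f"data-{block_id}"            # placeholder for a slow SSD read

class PrefetchingCache:
    def __init__(self, capacity=64):
        self.cache = OrderedDict()       # LRU over DRAM-resident blocks
        self.capacity = capacity
        self.last = self.stride = None

    def _put(self, bid):
        self.cache[bid] = self.cache.pop(bid) if bid in self.cache else fetch_block(bid)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def read(self, bid):
        self._put(bid)
        if self.last is not None:
            stride = bid - self.last
            if stride == self.stride:    # repeated stride => predictable pattern
                self._put(bid + stride)  # prefetch the predicted next block
            self.stride = stride
        self.last = bid
        return self.cache[bid]

cache = PrefetchingCache()
for b in range(0, 40, 4):                # strided sweep over AMR blocks
    cache.read(b)
print(40 in cache.cache)                 # True: the next block was prefetched
```

Real AMR data complicates this picture because refinement levels make block sizes and neighbor relations irregular, which is precisely why uniform-mesh pattern detectors fall short and a dedicated AMR-aware scheme is needed.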