Streaming Hierarchical Clustering for Concept Mining

2007 
We are concerned with the general problem of concept mining - discovering useful associations, relationships, and groupings in large collections of data. Mathematical transformation algorithms have proven effective at reducing the content of multilingual, unstructured data into a vector that describes the content. Such methods are particularly desirable in fields undergoing information explosions, such as network traffic analysis, bioinformatics, and the intelligence community. In response, concept mining methodology is being extended to improve performance and permit hardware implementation -traditional methods are not sufficiently scalable. Hardware-accelerated systems have proven effective at automatically classifying such content when topics are known in advance. Our complete system builds on our past work in this area, presented in the Aerospace 2005 and 2006 conferences, where we described a novel algorithmic approach for extracting semantic content from unstructured text document streams. However, there is an additional need within the intelligence community to cluster related sets of content without advance training. To allow this function to happen at high speed, we have implemented a system that hierarchically clusters streaming content. The method, streaming hierarchical partitioning, is designed to be implemented in hardware and handle extremely high ingestion rates. As new documents are ingested, they are dynamically organized into a hierarchy, which has a fixed maximal size. Once this limit is reached, documents must consequently be excreted at a rate equaling their ingestion. The choice of documents to excrete is a point of interest -we present several autonomous heuristics for doing so intelligently, as well as a proposal for incorporating user interaction to focus attention on concepts of interest. A related desideratum is robust accommodation of concept drift -gradual change in the distribution and content of the document stream over time. Accordingly, we present and analyze experimental results for document streams evolving over time under several regimes. Current and proposed methods for concisely and informatively presenting derived content from streaming hierarchical clustering to the user for analysis are presented in this content. To support our claims of eventual hardware implementation and real-time performance with a high ingestion rate, we provide a detailed hardware-ready design, with asymptotic analysis and performance predictions. The system has been prototyped and tested on a Xeon processor as well as on a PowerPC embedded within a Xilinx Virtex2 FPGA. In summary, we describe a system designed to satisfy three primary goals: (1) real-time concept mining of high-volume data streams; (2) dynamic organization of concepts into a relational hierarchy; (3) adaptive reorganization of the concept hierarchy in response to evolving circumstances and user feedback.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    16
    References
    15
    Citations
    NaN
    KQI
    []