    Studying and analyzing on data streams mining technique based on clustering method
    Citations: 0 | References: 0 | Related papers: 20
    Abstract:
    With the development of data gathering and communication technologies, it has become increasingly feasible to monitor large amounts of information from diverse sources in real time. A data stream is a massive, unbounded sequence of data elements generated continuously at a rapid rate. For this reason, traditional data mining approaches are being replaced by systems that can mine continuous, high-volume, open-ended data streams as they arrive. This paper introduces a new algorithm that uses a clustering method to improve data stream mining. We study data stream clustering using the K-Means algorithm, a statistical grid-based algorithm, and regression analysis, and we compare these techniques.
    Keywords:
    Data stream clustering
    Clustering is one of the important branches of data stream mining. Because data streams are dynamic and massive, traditional data mining algorithms cannot satisfy the requirements of online analysis. Data stream techniques focus on designing one-pass algorithms over the data set and on incrementally maintaining an in-memory synopsis data structure (digest) that is far smaller than the whole data set. This paper presents a novel algorithm for clustering data streams. In this algorithm, the means method is used for subset division, a sliding window model handles data changes and updates, and a DFT digest, which can be maintained incrementally, is used for data reduction. The algorithm saves main memory and run time, which makes it suitable for online clustering. Experiments on clustering real electricity consumption data verify the effectiveness of the presented algorithm.
    Data stream clustering
    Sliding window protocol
    Data set
    Citations (0)
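The abstract above mentions a DFT digest that is maintained incrementally over a sliding window. A minimal sketch of one way to do that is the standard sliding-DFT recurrence, shown below. The class name `SlidingDFTDigest`, its parameters, and the helper `direct_dft` are hypothetical names chosen for illustration, and nothing here is claimed to be the paper's actual scheme. Note that this sketch still keeps the raw window so that it knows which value leaves; only the coefficient list plays the role of the reduced digest.

```python
import cmath
from collections import deque


class SlidingDFTDigest:
    """Keeps the first `num_coeffs` DFT coefficients of the last `window` values.

    Hypothetical illustration of an incrementally maintained DFT digest.
    """

    def __init__(self, window, num_coeffs):
        self.window = window
        # The window starts zero-filled, so every coefficient starts at 0.
        self.values = deque([0.0] * window, maxlen=window)
        self.coeffs = [0j] * num_coeffs
        # Rotation applied to coefficient k each time the window slides by one.
        self.twiddles = [
            cmath.exp(2j * cmath.pi * k / window) for k in range(num_coeffs)
        ]

    def push(self, x):
        """Slide the window by one value and update each coefficient in O(1)."""
        oldest = self.values[0]
        self.values.append(x)  # maxlen makes the deque drop the oldest value
        self.coeffs = [
            (c - oldest + x) * w for c, w in zip(self.coeffs, self.twiddles)
        ]


def direct_dft(values, num_coeffs):
    """Reference DFT computed from scratch, used only to check the digest."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * j * k / n) for j, v in enumerate(values))
        for k in range(num_coeffs)
    ]
```

Each `push` costs time proportional to the number of retained coefficients rather than to the window length, which is the property that makes this kind of digest attractive for one-pass processing.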
    In a variety of modern mining applications, data are commonly viewed as infinite, time-ordered data streams rather than as finite data sets stored on disk. This view challenges fundamental assumptions commonly made in the context of several data mining algorithms. In this paper, we study the problem of identifying correlations between multiple data streams. In particular, we propose algorithms capable of capturing correlations between multiple continuous data streams in a highly efficient and accurate manner. Our algorithms and techniques are applicable in both synchronous and asynchronous data streaming environments. We capture correlations between multiple streams using the well-known technique of Singular Value Decomposition (SVD). Correlations between data items, and the SVD technique in particular, have been repeatedly utilized in off-line (non-stream) data mining problems, for example forecasting, approximate query answering, and data reduction. We propose a methodology based on a combination of dimensionality reduction and sampling to make the SVD technique suitable for a data stream context. Our techniques are approximate, trading accuracy for performance, and we analytically quantify this tradeoff. We present a thorough experimental evaluation, using both real and synthetic data sets, from a prototype implementation of our technique, investigating the impact of various parameters on the accuracy of the overall computation. Our results indicate that correlations between multiple data streams can be identified very efficiently and accurately. The algorithms proposed herein are presented as generic tools with a multitude of applications to data stream mining problems.
    Citations (58)
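To illustrate why SVD is a natural tool for spotting correlated streams, as the abstract above describes, the sketch below runs an exact SVD on one window of synchronized streams and reports how much of the total variance the leading singular direction explains. It assumes NumPy is available. The function name `dominant_energy` is hypothetical, and the sketch deliberately omits the sampling and dimensionality reduction that the paper uses to make SVD affordable on streams.

```python
import numpy as np


def dominant_energy(window):
    """Fraction of variance captured by the leading singular direction.

    `window` is a 2-D array: rows are time steps, columns are streams.
    A value near 1.0 means the streams move together almost linearly.
    """
    centered = window - window.mean(axis=0)  # remove each stream's mean
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    return energy[0] / energy.sum()
```

For example, a window holding a stream and a scaled, shifted copy of it yields a value very close to 1, while two weakly related streams yield a value much closer to 0.5.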
    To handle overlapping and missing data in distributed data streams while keeping communication costs low, DAM-Distream, a clustering algorithm that combines a density method and a model method, is proposed. The algorithm uses a Gaussian mixture model (GMM) to describe the data streams flowing into the local distribution sites. However, the Gaussian mixture model parameters are obtained by the EM algorithm, which is sensitive to initial values. DAM-Distream therefore first applies a density-based algorithm to cluster the data streams, that is, to find suitable initial parameters for the Gaussian mixture model. Second, the EM algorithm is used for iterative clustering, and then the algorithm determines. Finally, the models are uploaded to the central site for integrated treatment. Experimental results show that DAM-Distream effectively overcomes the shortcomings of the EM algorithm and obtains better GMM parameters. Experiments also show that it improves the clustering quality of data streams in distributed systems and reduces the communication cost of the system.
    Data stream clustering
    Gaussian network model
    Citations (0)
    With the emergence of big data and cloud computing, data streams arrive rapidly, continuously, and at large scale, and real-time data stream clustering has become a hot topic in current data stream mining research. Some existing data stream clustering algorithms cannot deal effectively with high-dimensional data streams, cannot find clusters of arbitrary shape in real time, and cannot remove noise points in a timely manner. To address these issues, this paper proposes PGDC-Stream, an algorithm based on grid and density for clustering data streams in a parallel distributed environment [4]. The algorithm adopts a density threshold function to handle noise points, inspecting and removing them periodically. It can also find clusters of arbitrary shape in large-scale data flows in real time. The Map-Reduce framework is used for parallel cluster analysis of data streams.
    Data stream clustering
    Citations (6)
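The grid-and-density idea in the abstract above can be sketched on a single machine as follows: points are mapped to grid cells, cells whose count reaches a threshold are treated as dense, sparse cells are discarded as noise, and adjacent dense cells are joined into clusters of arbitrary shape. This is only an illustrative sketch. The function names, the constant `min_density` threshold (standing in for the paper's density threshold function), and the absence of any Map-Reduce parallelism are all assumptions, not PGDC-Stream itself.

```python
from collections import defaultdict
from itertools import product


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(int(c // cell_size) for c in point)


def grid_density_clusters(points, cell_size, min_density):
    """Group adjacent dense grid cells into clusters; sparse cells are noise."""
    density = defaultdict(int)
    for p in points:  # single pass over the data
        density[cell_of(p, cell_size)] += 1

    # Assumption: a constant threshold decides which cells are dense.
    dense = {cell for cell, count in density.items() if count >= min_density}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:  # flood fill over neighbouring dense cells
            cell = stack.pop()
            members.add(cell)
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(members)
    return clusters
```

Because clusters are built from connected cells rather than from distances to a centroid, their shape is limited only by the grid resolution, which is the property the abstract highlights.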
    Data streams are massive, dynamic, and unbounded, which makes data stream clustering a challenging problem. Data streams are observed in network monitoring, critical scientific applications, weather monitoring, astronomical applications, electronic business, stock trading, and so on. Data stream clustering puts additional constraints on clustering algorithms: streams must be processed in a single pass with limited memory and little processing time, yet they can be highly dynamic. Most existing clustering algorithms are distance based and cannot handle interwoven clusters; moreover, the streams cannot be stored in full because they are unbounded. The proposed work focuses on density-based clustering algorithms that use micro-clusters. The process is divided into two phases, online and offline: micro-clusters are created in the online phase, and final clusters are generated in the offline phase.
    Data stream clustering
    Online and offline
    Citations (1)
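The two-phase process described in the abstract above can be sketched as follows. In the online phase each arriving point is either absorbed by the nearest micro-cluster or starts a new one; in the offline phase micro-clusters whose centers lie close together are linked into final clusters. The names `MicroCluster`, `online_update`, and `offline_clusters`, and the parameters `max_radius` and `link_distance`, are hypothetical. The simple distance-based linking stands in for whatever density criterion the proposed work actually uses. The sketch relies on `math.dist`, which requires Python 3.8 or later.

```python
import math


class MicroCluster:
    """Summary of nearby points: count, per-dimension sum, and sum of squares."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        # Kept so a radius or variance could be derived without the raw points.
        self.square_sum = [x * x for x in point]

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x


def online_update(micro_clusters, point, max_radius):
    """Online phase: one pass, constant work per existing micro-cluster."""
    best = min(
        micro_clusters,
        key=lambda m: math.dist(m.center(), point),
        default=None,
    )
    if best is not None and math.dist(best.center(), point) <= max_radius:
        best.absorb(point)
    else:
        micro_clusters.append(MicroCluster(point))


def offline_clusters(micro_clusters, link_distance):
    """Offline phase: link micro-clusters whose centers are close together."""
    groups, unvisited = [], list(range(len(micro_clusters)))
    while unvisited:
        frontier, group = [unvisited.pop(0)], []
        while frontier:
            i = frontier.pop()
            group.append(i)
            near = [
                j for j in unvisited
                if math.dist(micro_clusters[i].center(),
                             micro_clusters[j].center()) <= link_distance
            ]
            for j in near:
                unvisited.remove(j)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups
```

Only the micro-cluster summaries are retained, never the raw points, which is how the approach copes with a stream that cannot be stored.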