    Cleaning Big Data Streams: A Systematic Literature Review
    Citations (7) · References (82) · Related papers (10)
    Abstract:
    In today’s big data era, cleaning big data streams is challenging because the data arrive in many different formats and in massive volumes. Many studies have proposed techniques to overcome these challenges, such as cleaning big data in real time. This systematic literature review presents recently developed techniques used in the cleaning process and for each data cleaning issue. Following the PRISMA framework, we search four databases (IEEE Xplore, ACM Library, Scopus, and Science Direct) to select relevant studies. From the selected studies, we identify the techniques used to clean big data streams and the evaluation methods used to assess their efficiency. We also define the cleaning issues that may arise during the cleaning process, namely missing values, duplicated data, outliers, and irrelevant data. Based on our study, we identify future directions for cleaning big data streams.
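The four cleaning issues named in the abstract can be illustrated with a minimal one-pass sketch. It is not taken from the review or from any study it surveys: the record layout (`id`, `value`), the window size, the z-score rule for outliers, and mean imputation for missing values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def clean_stream(records, required=("id", "value"), window=50, z_max=3.0):
    """Yield cleaned records from an iterable of dicts in one pass.

    Hypothetical layout: each record should carry an "id" and a numeric "value".
    """
    seen_ids = deque(maxlen=window)  # recent ids, used to spot duplicates
    recent = deque(maxlen=window)    # recently accepted values, used for outlier checks
    for rec in records:
        # Irrelevant data: keep only the fields the consumer needs.
        rec = {key: rec.get(key) for key in required}
        # Missing values: drop records without an id; impute a missing value
        # with the mean of the recent window when some history exists.
        if rec["id"] is None:
            continue
        if rec["value"] is None:
            if not recent:
                continue
            rec["value"] = mean(recent)
        # Duplicated data: skip ids already seen within the window.
        if rec["id"] in seen_ids:
            continue
        seen_ids.append(rec["id"])
        # Outliers: reject values far from the recent mean (z-score rule).
        if len(recent) >= 5:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(rec["value"] - mu) / sigma > z_max:
                continue
        recent.append(rec["value"])
        yield rec
```

Because both buffers are bounded deques, memory stays constant regardless of stream length. The trade-off is that a duplicate arriving more than `window` records after the original is not detected.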
    Online data stream mining is one of the most important issues in data mining. Identifying recent knowledge can provide valuable information for data stream analysis. In this paper, we propose a one-pass algorithm that mines recent frequent itemsets in data streams using a transaction-based sliding window. To reduce the time and memory needed to slide the window, each item is represented as a bit sequence. Combined with the Apriori property, this representation finds frequent items in data streams efficiently. We name this method the MRFI-SW (mining recent frequent itemsets by sliding window) algorithm. Experimental results show that the proposed algorithm not only attains highly accurate mining results but also consumes less memory than existing algorithms for mining frequent itemsets over recent data streams.
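The bit-sequence idea can be sketched as follows. This is an illustrative reading of the abstract, not the MRFI-SW algorithm itself: the class name `BitWindow`, the use of Python integers as bit sequences, and the level-wise candidate generation are assumptions.

```python
from itertools import combinations


class BitWindow:
    """Sliding window over transactions; each item maps to one bit sequence."""

    def __init__(self, size):
        self.size = size
        self.mask = (1 << size) - 1  # keeps only the last `size` transactions
        self.bits = {}               # item -> int; bit 0 is the newest transaction

    def add(self, transaction):
        # Slide: shift every sequence left by one; the oldest bit falls off the mask.
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if shifted:
                self.bits[item] = shifted
            else:
                del self.bits[item]  # item no longer occurs in the window
        for item in set(transaction):
            self.bits[item] = self.bits.get(item, 0) | 1

    def support(self, itemset):
        # Support count = number of set bits in the AND of the item sequences.
        acc = self.mask
        for item in itemset:
            acc &= self.bits.get(item, 0)
        return bin(acc).count("1")

    def frequent_itemsets(self, min_count):
        """Level-wise search; the Apriori property prunes candidates."""
        result = {}
        level = [frozenset([i]) for i in self.bits if self.support([i]) >= min_count]
        k = 1
        while level:
            for itemset in level:
                result[itemset] = self.support(itemset)
            prev = set(level)
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            # Keep a candidate only if every k-subset is frequent and it is frequent itself.
            level = [c for c in candidates
                     if all(frozenset(s) in prev for s in combinations(c, k))
                     and self.support(c) >= min_count]
            k += 1
        return result
```

Sliding the window costs one shift per tracked item, and counting an itemset needs only bitwise ANDs, which is the kind of saving in time and memory that the abstract attributes to the bit-sequence representation.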
    Sliding window protocol
    Online algorithm
    Sequence (biology)
    Sequential pattern mining (SPAM) is one of the most interesting research issues in data mining. In this paper, a new research problem of mining data streams for sequential patterns is defined. A data stream is an unbounded sequence of data elements arriving at a rapid rate. Given the characteristics of data streams, mining them for sequential patterns is more difficult than mining sequential patterns from large static databases. Therefore, mining sequential patterns from data streams is a challenging research issue in data mining and knowledge discovery. Hence, an efficient single-pass algorithm, called
    Sequential Pattern Mining
    Sequence (biology)
    Citations (5)
    Mining maximal frequent itemsets has attracted wide attention. However, mining data streams is more difficult than mining static databases because streaming data are huge, high-speed, and continuous. This paper presents an algorithm called IDSM-MFI. The algorithm uses a synopsis data structure to store the items of the transactions embedded in the data stream so far. It adopts a combined top-down and bottom-up method to mine the set of all maximal frequent itemsets in landmark windows over the data stream, and the results can be output in real time based on user-specified thresholds. Theoretical analysis and experimental results show that our algorithm is efficient and scalable for mining the set of all maximal frequent itemsets over the entire history of the data stream.
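To make the landmark-window setting concrete, here is a small sketch under stated assumptions. It is not IDSM-MFI: the synopsis is simply a counter of distinct transactions, the search is bottom-up only, and the class name `LandmarkMiner` and the absolute `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


class LandmarkMiner:
    """Counts every transaction since the landmark (the start of the stream)."""

    def __init__(self):
        self.synopsis = Counter()  # distinct transaction -> occurrences so far

    def add(self, transaction):
        t = frozenset(transaction)
        if t:
            self.synopsis[t] += 1

    def support(self, itemset):
        return sum(n for t, n in self.synopsis.items() if itemset <= t)

    def maximal_frequent(self, min_count):
        """Return frequent itemsets that have no frequent proper superset."""
        items = {i for t in self.synopsis for i in t}
        level = [frozenset([i]) for i in items
                 if self.support(frozenset([i])) >= min_count]
        frequent = []
        k = 1
        while level:
            frequent.extend(level)
            prev = set(level)
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            level = [c for c in candidates
                     if all(frozenset(s) in prev for s in combinations(c, k))
                     and self.support(c) >= min_count]
            k += 1
        return {f for f in frequent if not any(f < g for g in frequent)}
```

Nothing is ever evicted in a landmark window, so the synopsis grows with the number of distinct transactions. Bounding that growth is the job of a more compact synopsis structure such as the one the abstract describes.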
    Data set
    Streaming Data
    Citations (13)
    In this paper, we present FBMiner, a closed labeled tree mining algorithm based on the newly introduced add-remove principle of closed sets. We also propose a time-decay module for stream data mining that gives more weight to the latest data. Compared with traditional data stream mining algorithms, FBMiner performs well even when the data are highly complex. Experiments show that FBMiner mines data streams efficiently, dramatically reducing its resource consumption.
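One common way to give more weight to the latest data is exponential decay of pattern counts. The sketch below illustrates that general idea only. It is not FBMiner's time-decay module, and the class name `DecayedCounter` and the decay factor are assumptions.

```python
class DecayedCounter:
    """Pattern counts in which older observations fade by a constant factor per tick."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.now = 0
        self.counts = {}  # pattern -> (weight at last update, tick of last update)

    def observe(self, patterns):
        """Advance time by one tick and record the patterns seen at this tick."""
        self.now += 1
        for p in set(patterns):
            weight, tick = self.counts.get(p, (0.0, self.now))
            # Decay lazily: apply all missed ticks at once, then add the new hit.
            self.counts[p] = (weight * self.decay ** (self.now - tick) + 1.0, self.now)

    def weight(self, pattern):
        weight, tick = self.counts.get(pattern, (0.0, self.now))
        return weight * self.decay ** (self.now - tick)
```

With lazy decay, each arrival touches only the patterns it contains rather than rescaling every stored count.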
    Tree (set theory)
    Citations (4)
    A data stream is a continuous and high-speed flow of data items. High speed means that the data rate is high relative to the available computational power. The growing number of applications that generate and receive data streams stimulates the need for online data stream analysis tools. Mining data streams is a real-time process of extracting interesting patterns from high-speed data streams. It raises new problems for the data mining community, chiefly how to mine continuous high-speed data items that can be examined only once. In this paper, we propose algorithm output granularity as a solution for mining data streams. Algorithm output granularity is the amount of mining results that fits in main memory before any incremental integration. We show how the proposed strategy can be applied to build efficient clustering, frequent-item, and classification techniques. The empirical results for our clustering algorithm are presented and discussed; they demonstrate acceptable accuracy coupled with efficient running time.
    Granularity
    Data stream clustering
    Citations (55)
    Mining frequent weighted patterns (FWPs), which takes into account the different semantic significance (weight) of items, is better suited to practice than mining frequent patterns. It therefore plays a vital role in real-world scenarios. However, methods for mining FWPs that were designed for static data have several limitations when applied to growing datasets, especially data streams. Hence, this study proposes an algorithm for mining FWPs over data streams. First, we introduce the concept of mining FWPs over data streams via a sliding window model. Then, we introduce a modification of the weighted node tree (WN-tree), named SWN-tree, that can maintain information over data streams. Next, this study develops a method for mining FWPs over data streams that employs a sliding window model based on the SWN-tree. This method is called the FWPODS (Frequent Weighted Patterns Over Data Stream) algorithm. Finally, we conduct empirical experiments to compare the performance of our approach with that of the state-of-the-art algorithm (NFWI) for mining FWPs over data streams. The experimental results indicate that our approach outperforms the NFWI algorithm when running in batch mode in a sliding window.
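The quantity being mined can be sketched without the SWN-tree. The definition used here is an assumption, since the abstract does not state it: a transaction's weight is the mean weight of its items, and the weighted support of an itemset is the share of total transaction weight in the window held by transactions that contain it. The class name `WeightedWindow` is hypothetical.

```python
from collections import deque


def transaction_weight(transaction, weights):
    """Mean item weight of a non-empty transaction (assumed definition)."""
    return sum(weights[i] for i in transaction) / len(transaction)


class WeightedWindow:
    """Sliding window that answers weighted-support queries by scanning."""

    def __init__(self, size, weights):
        self.window = deque(maxlen=size)  # the oldest transaction is evicted automatically
        self.weights = weights            # item -> weight

    def add(self, transaction):
        t = frozenset(transaction)
        if t:
            self.window.append((t, transaction_weight(t, self.weights)))

    def weighted_support(self, itemset):
        itemset = frozenset(itemset)
        total = sum(tw for _, tw in self.window)
        if total == 0:
            return 0.0
        hit = sum(tw for t, tw in self.window if itemset <= t)
        return hit / total
```

A pattern is then reported as frequent when its weighted support reaches a user-chosen threshold. A tree-based index would avoid rescanning the window on every query, which this sketch does not attempt.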
    Sliding window protocol
    Tree (set theory)
    Citations (13)
    With the development of data gathering and communication technologies, it has become increasingly possible to support real-time monitoring of large amounts of information from diverse sources. A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. For this reason, traditional data mining approaches are being replaced by systems that can mine continuous, high-volume, open-ended data streams as they arrive. This paper introduces a new algorithm that uses clustering to improve data stream mining. We study data stream clustering using the K-Means algorithm, a statistical grid-based algorithm, and regression analysis, and compare these techniques.
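For reference, K-Means can be adapted to a stream by updating one centroid per arriving point (the sequential, MacQueen-style variant). The sketch below shows that standard variant, not the algorithm proposed in the paper. The class name `OnlineKMeans` and the choice to seed centroids with the first k points are assumptions.

```python
class OnlineKMeans:
    """Sequential K-Means: each point is seen once and then discarded."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # list of coordinate lists
        self.counts = []     # points assigned to each centroid so far

    def update(self, point):
        """Assign the point to its nearest centroid, move that centroid, return its index."""
        point = [float(x) for x in point]
        if len(self.centroids) < self.k:
            # Seed the centroids with the first k points of the stream.
            self.centroids.append(point)
            self.counts.append(1)
            return len(self.centroids) - 1
        nearest = min(
            range(self.k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(self.centroids[i], point)),
        )
        self.counts[nearest] += 1
        step = 1.0 / self.counts[nearest]  # running-mean update
        self.centroids[nearest] = [
            c + step * (x - c) for c, x in zip(self.centroids[nearest], point)
        ]
        return nearest
```

Memory use is proportional to k rather than to the stream length. The result depends on arrival order and on the seed points, which is a known limitation of this variant.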
    Data stream clustering
    Citations (0)