Cleaning Big Data Streams: A Systematic Literature Review

Technologies (2023)

Abstract:

In today’s big data era, cleaning big data streams has become a challenging task because of the different formats of big data and the massive amount of big data which is being generated. Many studies have proposed different techniques to overcome these challenges, such as cleaning big data in real time. This systematic literature review presents recently developed techniques that have been used for the cleaning process and for each data cleaning issue. Following the PRISMA framework, four databases are searched, namely IEEE Xplore, ACM Library, Scopus, and Science Direct, to select relevant studies. After selecting the relevant studies, we identify the techniques that have been utilized to clean big data streams and the evaluation methods that have been used to examine their efficiency. Also, we define the cleaning issues that may appear during the cleaning process, namely missing values, duplicated data, outliers, and irrelevant data. Based on our study, the future directions of cleaning big data streams are identified.
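The four cleaning issues named in the abstract (missing values, duplicated data, outliers, and irrelevant data) can be illustrated with a minimal one-pass sketch. It is not taken from any of the reviewed studies. The record layout (`id`, `value`), the window size, the z-score threshold, and the function name `clean_stream` are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def clean_stream(records, required=("id", "value"), window=5, z_max=3.0):
    """Yield cleaned records from an iterable of dicts in a single pass.

    Hypothetical sketch: field names and thresholds are assumptions.
    """
    seen_ids = set()               # duplicate detection (a real stream would expire old ids)
    recent = deque(maxlen=window)  # recently accepted values, used for imputation and outlier checks
    for rec in records:
        # Irrelevant data: keep only the fields the downstream task needs.
        rec = {key: rec.get(key) for key in required}
        # Duplicated data: skip records whose id was already seen (or is absent).
        if rec["id"] is None or rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])
        if rec["value"] is None:
            # Missing value: impute with the mean of the recent window, or drop if no history.
            if not recent:
                continue
            rec["value"] = mean(recent)
        elif len(recent) >= 2:
            # Outlier: drop values more than z_max standard deviations from the window mean.
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(rec["value"] - mu) / sigma > z_max:
                continue
        recent.append(rec["value"])
        yield rec


raw = [
    {"id": 1, "value": 10.0, "debug": "x"},  # extra field is dropped
    {"id": 2, "value": 12.0},
    {"id": 2, "value": 12.0},                # duplicate id, skipped
    {"id": 3, "value": None},                # imputed from the window mean
    {"id": 4, "value": 500.0},               # far from the window mean, skipped
    {"id": 5, "value": 11.0},
]
cleaned = list(clean_stream(raw))
```

Imputed values are fed back into the window here, which keeps the sketch short but lets imputation influence later outlier checks; a stricter design would track observed values only.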

Topics:

Privacy-Preserving Technologies in Data

Data Quality and Management

Data Stream Mining Techniques

10.3390/technologies11040101

WSFI-Mine: Mining Frequent Patterns in Data Streams

Lecture notes in computer science (2009)

Young-Hee Kim Young‐Gab Kim

10.1007/978-3-642-01510-6_95

Citations (1)

Online data stream Mining of Recent Frequent Itemsets based on Sliding Window model

International Conference on Machine Learning and Cybernetics (2008)

Jiadong Ren Ke Li

Online data stream mining is one of the most important issues in data mining. Identifying recent knowledge can provide valuable information for the analysis of the data stream. In this paper, we propose a one-pass data stream mining algorithm to mine the recent frequent itemsets in data streams with a transaction-based sliding window. To reduce the time and memory needed to slide the window, each item is denoted by a bit-sequence representation. Based on the Apriori property, this kind of representation can find frequent items in data streams efficiently. We name this method the MRFI-SW (mining recent frequent itemsets by sliding window) algorithm. Experimental results show that the proposed algorithm not only attains highly accurate mining results, but also consumes less memory than existing algorithms for mining frequent itemsets over recent data streams.
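The bit-sequence idea in this abstract can be sketched as follows. This is not the MRFI-SW algorithm itself, whose details are not given here. It is a minimal illustration assuming one bit per transaction in the window, with sliding done by a shift and a mask, and Apriori-style level-wise candidate generation. The class name `BitSlidingWindow` and its methods are hypothetical.

```python
from itertools import combinations


class BitSlidingWindow:
    """Transaction-based sliding window with one bit sequence per item (illustrative sketch)."""

    def __init__(self, size):
        self.size = size
        self.mask = (1 << size) - 1  # keeps only the newest `size` bits
        self.bits = {}               # item -> int; bit 0 is the newest transaction

    def add(self, transaction):
        # Slide: shift every sequence left by one; the oldest bit falls off via the mask.
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if shifted:
                self.bits[item] = shifted
            else:
                del self.bits[item]  # item no longer occurs in the window
        # Record the new transaction in bit 0.
        for item in set(transaction):
            self.bits[item] = self.bits.get(item, 0) | 1

    def support(self, itemset):
        # Support = number of window positions where every item's bit is set.
        acc = self.mask
        for item in itemset:
            acc &= self.bits.get(item, 0)
        return bin(acc).count("1")

    def frequent_itemsets(self, min_support):
        # Level-wise search: a candidate is kept only if all its subsets were frequent.
        level = [frozenset([i]) for i in sorted(self.bits)
                 if self.support([i]) >= min_support]
        result = {}
        while level:
            for itemset in level:
                result[itemset] = self.support(itemset)
            prev = set(level)
            candidates = {a | b for a, b in combinations(level, 2)
                          if len(a | b) == len(a) + 1}
            level = [c for c in candidates
                     if all(c - {x} in prev for x in c)
                     and self.support(c) >= min_support]
        return result


w = BitSlidingWindow(size=3)
for t in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]):
    w.add(t)
frequent = w.frequent_itemsets(min_support=2)
```

After the fourth transaction the window holds only the last three, so the first transaction no longer contributes to any support count.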

Sliding window protocol

Online algorithm

Sequence (biology)

10.1109/icmlc.2008.4620420

Citations (8)

A SINGLE-SCAN ALGORITHM FOR MINING SEQUENTIAL PATTERNS FROM DATA STREAMS

International journal of innovative computing, information & control (2012)

Hua-Fu Li Chin-Chuan Ho Hsuan-Sheng Chen Suh-Yin Lee

Sequential pattern mining (SPAM) is one of the most interesting research issues of data mining. In this paper, a new research problem of mining data streams for sequential patterns is defined. A data stream is an unbound sequence of data elements arriving at a rapid rate. Based on the characteristics of data streams, the problem complexity of mining data streams for sequential patterns is more difficult than that of mining sequential patterns from large static databases. Therefore, mining sequential patterns from data streams is a challenging research issue of data mining and knowledge discovery. Hence, an efficient single-pass algorithm, called

Sequential Pattern Mining

Sequence (biology)

Citations (5)

A Mining Maximal Frequent Itemsets over the Entire History of Data Streams

Yinmin Mao Hong Li Lumin Yang Zhigang Chen Lixin Liu

Mining maximal frequent itemsets has received wide attention. However, mining data streams is more difficult than mining static databases because streaming data is huge, high-speed, and continuous. This paper presents an algorithm called IDSM-MFI. The algorithm uses a synopsis data structure to store the items of the transactions embedded in the data stream so far. It adopts a combined top-down and bottom-up method to mine the set of all maximal frequent itemsets in landmark windows over the data stream, which can be output in real time based on user-specified thresholds. Theoretical analysis and experimental results show that our algorithm is efficient and scalable for mining the set of all maximal frequent itemsets over the entire history of the data stream.

Data set

Streaming Data

10.1109/dbta.2009.125

Citations (13)

A new method of mining frequent closed trees in data streams

2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (2010)

Bo Feng Yajing Xu Na Zhao Huimin Xu

In this paper, we present a closed labeled tree mining algorithm, FBMiner, based on the newly introduced add-remove principle of closed sets. We also propose a time-decay module for stream data mining that pays more attention to the latest data. Compared to traditional mining algorithms for data streams, FBMiner performs well even when the data is highly complex. The experiment shows that FBMiner is efficient in data stream mining by dramatically reducing consumption.
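The abstract does not describe how FBMiner's time-decay module works. As a general illustration of time decay in stream mining, the sketch below assumes a simple exponential decay: each pattern's weight is multiplied by a factor `decay` per batch and incremented by one when the pattern occurs. The class name `DecayedCounter` and the parameter values are hypothetical, and the patterns here are plain hashable labels rather than trees.

```python
class DecayedCounter:
    """Exponentially decayed pattern counts over batches (illustrative sketch)."""

    def __init__(self, decay=0.5):
        self.decay = decay  # weight multiplier applied per elapsed batch, 0 < decay <= 1
        self.tick = 0       # index of the latest batch
        self.counts = {}    # pattern -> (weight at last update, tick of last update)

    def add_batch(self, patterns):
        self.tick += 1
        for p in set(patterns):
            w, last = self.counts.get(p, (0.0, self.tick))
            # Decay lazily: apply all the batches elapsed since the last update at once.
            self.counts[p] = (w * self.decay ** (self.tick - last) + 1.0, self.tick)

    def weight(self, pattern):
        w, last = self.counts.get(pattern, (0.0, self.tick))
        return w * self.decay ** (self.tick - last)


counter = DecayedCounter(decay=0.5)
for batch in (["x", "y"], ["x"], ["x"]):
    counter.add_batch(batch)
```

With this scheme a pattern seen in every batch converges toward a weight of `1 / (1 - decay)`, while a pattern that stops occurring fades geometrically, so recent data dominates any threshold test.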

Tree (set theory)

10.1109/fskd.2010.5569534

Citations (4)

Challenges in Mining Big Data Streams

Advances in intelligent systems and computing (2018)

Veena Tayal Ritesh Srivastava

Streaming Data

10.1007/978-981-13-2254-9_15

Citations (5)

Cost-efficient mining techniques for data streams

Mohamed Medhat Gaber Shonali Krishnaswamy Arkady Zaslavsky

A data stream is a continuous and high-speed flow of data items. High speed refers to the phenomenon that the data rate is high relative to the computational power. The increasing focus of applications that generate and receive data streams stimulates the need for online data stream analysis tools. Mining data streams is a real time process of extracting interesting patterns from high-speed data streams. Mining data streams raises new problems for the data mining community in terms of how to mine continuous high-speed data items that you can only have one look at. In this paper, we propose algorithm output granularity as a solution for mining data streams. Algorithm output granularity is the amount of mining results that fits in main memory before any incremental integration. We show the application of the proposed strategy to build efficient clustering, frequent items and classification techniques. The empirical results for our clustering algorithm are presented and discussed which demonstrate acceptable accuracy coupled with efficiency in running time.

Granularity

Data stream clustering

Citations (55)

A Sliding Window-Based Approach for Mining Frequent Weighted Patterns Over Data Streams

IEEE Access (2021)

Huong Bui Tu-Anh Nguyen-Hoang Bay Vo Ham Nguyen Tuong Le

The mining of frequent weighted patterns (FWPs), which considers the different semantic significance (weight) of items, is more suitable for practice than the mining of frequent patterns. Therefore, it plays a vital role in real-world scenarios. However, there are several limitations when methods for mining FWPs designed for static data are applied to growing datasets, especially data streams. Hence, this study proposes an algorithm for mining FWPs over data streams. First, we introduce the concept of mining FWPs over data streams via a sliding window model. Then, we introduce a modification of the weighted node tree (WN-tree), named SWN-tree, that can maintain the information over data streams. Next, this study develops a method for mining FWPs over data streams employing a sliding window model based on SWN-tree. This method is called the FWPODS (Frequent Weighted Patterns Over Data Stream) algorithm. Finally, we conduct empirical experiments to compare the performance of our approach and the state-of-the-art algorithm (NFWI) for mining FWPs over data streams. The experimental results indicate that our approach outperforms the NFWI algorithm when running in batch mode in a sliding window.
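To make the notion of weighted support over a sliding window concrete, here is a brute-force sketch. It does not implement the SWN-tree or FWPODS. It assumes one common definition: a transaction's weight is the mean weight of its items, and an itemset's weighted support is the total weight of window transactions containing it divided by the total weight of all window transactions. The class name `WeightedWindow` and the item weights are hypothetical.

```python
from collections import deque


def transaction_weight(transaction, weights):
    # Assumed definition: mean of the item weights (transaction must be non-empty).
    return sum(weights[i] for i in transaction) / len(transaction)


class WeightedWindow:
    """Sliding window of weighted transactions (brute-force illustrative sketch)."""

    def __init__(self, size, weights):
        self.window = deque(maxlen=size)  # oldest transaction is evicted automatically
        self.weights = weights            # item -> weight

    def add(self, transaction):
        t = frozenset(transaction)
        self.window.append((t, transaction_weight(t, self.weights)))

    def weighted_support(self, itemset):
        itemset = frozenset(itemset)
        total = sum(tw for _, tw in self.window)
        if total == 0:
            return 0.0
        hit = sum(tw for t, tw in self.window if itemset <= t)
        return hit / total


ww = WeightedWindow(size=2, weights={"a": 1.0, "b": 0.5})
for t in (["a", "b"], ["a"], ["b"]):
    ww.add(t)
ws_a = ww.weighted_support(["a"])
ws_ab = ww.weighted_support(["a", "b"])
```

Scanning the whole window per query is what a tree-based summary is meant to avoid; the sketch only shows what quantity is being maintained as the window slides.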

Sliding window protocol

Tree (set theory)

10.1109/access.2021.3070132

Citations (13)

Studying and analyzing on data streams mining technique based on clustering method

Journal of Zhejiang University of Technology (2007)

LU Yi-hong

With the development of data gathering and communication technologies, it becomes increasingly possible to support real-time monitoring of large amounts of information from diverse information sources. A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. For this reason, the traditional data mining approach is replaced by systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. This paper introduces a new algorithm using a clustering method to improve the data stream mining technique. We have studied clustering data streams using the K-Means algorithm, a statistical grid-based algorithm, and regression analysis, and compared these techniques.

Data stream clustering

Citations (0)