Machine learning (ML) has employed various discretization methods to partition numerical attributes into intervals. However, an effective discretization technique remains elusive in many ML applications, such as association rule mining. Moreover, existing discretization techniques do not best reflect the impact of an independent numerical factor on the dependent numerical target factor. This research aims to establish a benchmark approach for numerical attribute partitioning. We conduct an extensive analysis of human perceptions of partitioning a numerical attribute and compare these perceptions with the results obtained from our two proposed measures. We also examine the perceptions of experts in data science, statistics, and engineering by employing numerical data visualization techniques. The analysis of the collected responses reveals that $68.7\%$ of human responses closely align with the values generated by our proposed measures. Based on these findings, our proposed measures may serve as one of the methods for discretizing numerical attributes.
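For context, the two classical baselines against which any new partitioning measure is usually compared can be sketched in a few lines. This is a minimal illustration of equal-width and equal-frequency binning only, not the proposed measures; the function names and sample values are ours.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # degenerate attribute: single bin
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (roughly) equal counts."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

attribute = [1, 2, 3, 10, 11, 12, 30, 31, 32]
print(equal_width_bins(attribute, 3))      # [0, 0, 0, 0, 0, 1, 2, 2, 2]
print(equal_frequency_bins(attribute, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Note how the two baselines already disagree on the same attribute (equal-width places 12 with the low cluster's neighbors, equal-frequency does not), which is precisely why a perception-aligned measure is of interest.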
In today's world, artificial intelligence-based smart applications and smart medical devices are developed using models trained on big data. However, what if a training dataset used to train a machine learning module is incorrect or harbors a statistical paradox? Statistical paradoxes are difficult to observe in data but are very important to analyze in every training dataset. This article discusses Simpson's paradox and its effects on various datasets. We show that Simpson's paradox is common across a variety of data and that it can lead to wrong conclusions, potentially with harmful consequences. We provide a mathematical analysis of Simpson's paradox and analyze its effects on continuous data. Experiments on real-world and synthetic datasets clearly show that the paradox severely impacts big data.
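The reversal at the heart of Simpson's paradox can be reproduced in a few lines. The sketch below uses the well-known kidney-stone treatment counts as illustrative data; the variable names are ours.

```python
# (successes, trials) per stratum and per treatment arm.
data = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Treatment A has the higher success rate within every subgroup ...
for group, arms in data.items():
    assert rate(*arms["A"]) > rate(*arms["B"])

# ... yet pooling the subgroups reverses the ordering in favor of B.
pooled = {arm: [sum(x) for x in zip(*(g[arm] for g in data.values()))]
          for arm in ("A", "B")}
assert rate(*pooled["A"]) < rate(*pooled["B"])
```

The reversal arises because the stratum (stone size) is associated with both the choice of treatment and the outcome — exactly the confounding structure the article analyzes.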
Content-based image retrieval (CBIR) from large databases has become an area of wide interest in many applications. CBIR techniques use image content to search and retrieve digital images, making CBIR an important research area for managing large image databases. In this paper, we analyze spatial features, collect them into a single frame to view all the spatial features, and examine the scope of incorporating these features into image retrieval. Commercial image search engines available to date include QBIC, VisualSeek, Virage, Netra, PicSOM, FIRE, and AltaVista. Region-Based Image Retrieval (RBIR) is a promising extension of CBIR. Shape and spatial features are simple to derive, effective, and can be extracted in real time. Our analysis leads to a proposed system with the advantage of increasing retrieval accuracy while decreasing retrieval time.
Numerical association rule mining (NARM) is an extended version of association rule mining that determines association rules in numerical data items, primarily via distribution, discretization, and optimization techniques. Under the umbrella of optimization techniques, several evolutionary and swarm intelligence-based algorithms have been proposed to extract association rules from numeric datasets. However, a sufficient understanding of the performance of swarm intelligence-based algorithms, especially for NARM, is still missing. In the state of the art, various swarm intelligence-based optimization algorithms are claimed to be superior based on arbitrary comparisons with algorithms from different classes; e.g., swarm intelligence-based algorithms are compared with genetic algorithms. Unfortunately, they are not compared with algorithms of their own class. Therefore, it is challenging to select an appropriate swarm intelligence-based algorithm for NARM. This article aims at filling this gap by conducting an exhaustive multi-aspect analysis of four popular swarm intelligence-based optimization algorithms (MOPAR, MOCANAR, ACO-R, and MOB-ARM) with four real-world datasets and six major metrics and objectives: performance time, number of rules, support, confidence, comprehensibility, and interestingness. In our analysis, the MOPAR algorithm produces a small number of rules and shows high values of confidence, comprehensibility, and interestingness. The MOCANAR algorithm provides satisfactory results with respect to all six metrics across all the datasets. The ACO-R algorithm produces high-quality rules but needs parameter tuning for datasets with a large number of attributes, and the MOB-ARM algorithm is considerably slower than the other three algorithms.
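For readers unfamiliar with the rule-quality metrics listed above, the two standard ones — support and confidence of a rule $X \rightarrow Y$ — can be sketched as follows. The toy transaction set is ours; comprehensibility and interestingness are composite measures defined in the respective algorithm papers and are omitted here.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # ≈ 0.667
```

NARM algorithms compute the same quantities, except that itemset membership is replaced by a value falling inside a numeric interval — which is why the quality of the interval boundaries dominates the quality of the mined rules.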
In the past two decades, there have been tremendous advancements in artificial intelligence (AI), machine learning (ML), and deep learning (DL) across various fields, including healthcare, autonomous driving, personal assistant technology, business, education, and justice. However, despite many success stories and advantages, AI-based systems are often considered biased, unfair, and untrustworthy. In this paper, we argue that statistical paradoxes are one of the well-known challenges for inducing bias in AI systems. Unfortunately, they have not been adequately addressed in the AI application development scenario. To support our claim, we investigate instances of Simpson's paradox, an extreme case of confounding, in various benchmark datasets. In doing so, we demonstrate the severe consequences of statistical paradoxes on AI systems. Thus, to handle confounding effects and deal with the severe impacts of statistical paradoxes in AI systems, the contribution of this paper is threefold. First, we introduce a framework to mitigate bias in training datasets. Second, we present a set of three algorithms capable of identifying and adjusting the impact of potential statistical confounders in both categorical and continuous datasets. Third, on top of the proposed framework and algorithms, we develop a web-based tool which identifies confounding effects, deals with instances of Simpson's paradox, and provides adjusted observations to reduce the impacts of confounders. A series of experiments conducted on multiple real-world and benchmark datasets validates the efficacy and usefulness of the proposed framework and algorithms. Additionally, the web application serves as a valuable tool for data scientists and researchers by automatically detecting and addressing confounding effects. This paper significantly contributes towards fostering fair and trustworthy AI and holds immense potential for further extensions beyond its current use.
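One simple way to adjust for a categorical confounder — stratification followed by direct standardization — can be sketched as follows. The counts are hypothetical and of our own making; this illustrates the general idea of confounder adjustment, not the paper's three algorithms.

```python
strata = {  # stratum -> arm -> (successes, trials); hypothetical counts
    "X": {"A": (9, 10),   "B": (80, 100)},
    "Y": {"A": (30, 100), "B": (2, 10)},
}

def crude_rate(arm):
    """Success rate after pooling all strata (confounded comparison)."""
    wins = sum(arms[arm][0] for arms in strata.values())
    trials = sum(arms[arm][1] for arms in strata.values())
    return wins / trials

def adjusted_rate(arm):
    """Direct standardization: average the stratum-specific rates of an
    arm using one shared set of stratum weights (the stratum sizes)."""
    weights = {g: sum(t for _, t in arms.values())
               for g, arms in strata.items()}
    total = sum(weights.values())
    return sum(arms[arm][0] / arms[arm][1] * weights[g] / total
               for g, arms in strata.items())

# Pooled, arm B looks better; adjusted for the stratum, arm A is better.
assert crude_rate("A") < crude_rate("B")
assert adjusted_rate("A") > adjusted_rate("B")
```

Because both arms are averaged over the same stratum mix, the stratum variable can no longer drive the comparison — the same principle the framework applies at dataset scale.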
Numerical association rule mining is a widely used variant of the association rule mining technique, and it has been extensively used in discovering patterns and relationships in numerical data. Initially, researchers and scientists integrated numerical attributes into association rule mining using various discretization approaches; however, over time, a plethora of alternative methods has emerged in this field. Unfortunately, the proliferation of alternative methods has resulted in a significant knowledge gap in understanding the diverse techniques employed in numerical association rule mining -- this paper attempts to bridge this gap by conducting a comprehensive systematic literature review. We provide an in-depth study of diverse methods, algorithms, metrics, and datasets derived from 1,140 scholarly articles published from the inception of numerical association rule mining in 1996 through 2022. In compliance with the inclusion, exclusion, and quality-evaluation criteria, 68 papers were selected for extensive evaluation. To the best of our knowledge, this systematic literature review is the first of its kind to provide an exhaustive analysis of the current literature and previous surveys on numerical association rule mining. The paper discusses important research issues, the current status, and future possibilities of numerical association rule mining. On the basis of this systematic review, the article also presents a novel discretization measure that contributes a partitioning of numerical data aligned well with human perception of partitions.