Could we detect anomalies at run time by learning from the analysis of previous traces of normally completed executions of a program? In this paper, we build a feature data set from program traces collected at run time, either during the program's regular operation or during its testing phase. This data set represents execution traces of relevant variables, including inputs, outputs, intermediate variables, and invariant checks. In a learning (mining) step, we start from exhaustive random training input sets and map the program traces to a minimal set of conceptual patterns. We employ formal concept analysis to do this incrementally and without losing dependencies between the data set features. This set of patterns becomes a reference for checking the normality of future program executions, as it captures invariant functional dependencies between the variables that must be preserved during execution. During the learning step, we cover enough input classes corresponding to the different patterns by using random input selection until the set of patterns stabilizes (i.e., the set almost stops changing, and only a negligible number of new patterns are not reducible to it). Experimental results show that the generated patterns are significant in representing normal program executions. They also enable the detection of various contaminations of the executable code at an early stage. The proposed method is general and modular; if applied systematically, it enhances software resilience against abnormal and unpredictable events.
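As an illustration only, and not the exact construction used in the paper, the following minimal Python sketch learns singleton-premise implications (a simple special case of the invariant functional dependencies mentioned above) from boolean trace features of normal runs, and then flags a later execution that violates one of them. The feature names (inv_ok, out_positive, ...) are hypothetical.

```python
from functools import reduce

def singleton_implications(traces):
    """Learn implications a -> closure({a}) from boolean trace features.
    traces: list of sets of feature names observed in normal executions.
    Returns, for each feature, the set of features that always co-occur
    with it in the training traces."""
    all_features = set().union(*traces)
    rules = {}
    for a in all_features:
        rows_with_a = [t for t in traces if a in t]
        # closure({a}) = intersection of all normal traces containing a
        rules[a] = reduce(set.intersection, rows_with_a, set(all_features))
    return rules

def is_consistent(trace, rules):
    """A new trace is anomalous if some learned dependency is violated."""
    return all(rules[a] <= trace for a in trace if a in rules)

# Toy usage: in every normal run, inv_ok co-occurs with out_positive.
normal = [{'in_small', 'inv_ok', 'out_positive'},
          {'in_large', 'inv_ok', 'out_positive'}]
rules = singleton_implications(normal)
print(is_consistent({'in_small', 'inv_ok', 'out_positive'}, rules))  # True
print(is_consistent({'in_small', 'inv_ok'}, rules))                  # False
```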
Nowadays, knowledge discovery from data is a challenging problem, due to its importance in different fields such as biology, economics and the social sciences. One way of extracting knowledge from data is to discover functional dependencies (FDs). An FD captures the relation between attributes whereby the values of one or more attributes are determined by another attribute set [1]. FD discovery helps in many applications, such as query optimization, data normalization, interface restructuring, and data cleaning. A plethora of functional dependency discovery algorithms has been proposed; some of the most widely used are TANE [2], FD_MINE [3], FUN [4], DFD [5], DEP-MINER [6], FASTFDS [7] and FDEP [8]. These algorithms extract FDs using different techniques, such as: (1) building a search space of all attribute combinations in an ordered manner, then searching it for candidate attributes that are assumed to have a functional dependency between them; (2) generating agreeing and difference sets, where the agreeing sets are obtained by applying a cross product over all tuples, the difference sets are their complement, and both sets are used to infer the dependencies; (3) generating one generic set of functional dependencies, in which each attribute determines all other attributes, and then updating this set by removing some dependencies in favor of more specialized ones through pairwise comparisons of records. Considerable effort has been devoted to comparing the most widely used algorithms in terms of runtime and memory consumption, but little attention has been paid to the accuracy of the resulting set of functional dependencies. Functional dependency accuracy is defined by two main factors: being complete and being minimal. In this paper, we propose a conceptual functional dependency detection framework. The proposed method is mainly based on Formal Concept Analysis (FCA), a mathematical framework rooted in lattice theory and used for conceptual data analysis, where data is represented as a binary relation called a formal context [9]. From this formal context, a set of implications is extracted; these implications have the same form as FDs and are provably semantically equivalent to the set of all functional dependencies holding in the given database [10]. This set of implications should be the smallest set representing the formal context, which is termed the Duquenne–Guigues, or canonical, basis of implications [11]. Moreover, completeness of the implications is achieved by applying the Armstrong rules discussed in [12]. The proposed framework is composed of three main components: a data transformation component, which converts the input data into a binary formal context; a reduction component, which applies data reduction to tuples or attributes; and an implication extraction component, which is responsible for producing a minimal and complete set of implications. The key benefits of the proposed framework are: (1) it works on any kind of input data (qualitative or quantitative), which is automatically transformed into a formal context of a binary relation; (2) a crisp Lukasiewicz data reduction technique is implemented to remove redundant data, which helps reduce the total runtime; (3) the set of implications produced is guaranteed to be minimal, due to the use of the Duquenne–Guigues algorithm for extraction; and (4) the set of implications produced is guaranteed to be complete, due to the use of the Armstrong rules.
The proposed framework is compared to the seven most commonly used algorithms listed above and evaluated in terms of runtime, memory consumption and accuracy on benchmark datasets. Acknowledgement: This contribution was made possible by grant NPRP-07-794-1-145 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.
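As a minimal illustration of the FD-to-implication correspondence mentioned above (and not the framework's actual implementation), the sketch below builds a pair-wise agreement context from a toy relation and checks a candidate dependency as an implication of that context; the relation and attribute names are hypothetical.

```python
from itertools import combinations

def agreement_context(rows, attrs):
    """Formal context: objects are tuple pairs, attributes are column names,
    and a pair is related to a column iff the two tuples agree on it."""
    ctx = []
    for r1, r2 in combinations(rows, 2):
        ctx.append({a for a in attrs if r1[a] == r2[a]})
    return ctx

def fd_holds(lhs, rhs, ctx):
    """lhs -> rhs holds iff every tuple pair that agrees on all of lhs also
    agrees on all of rhs (i.e. it is an implication of the context)."""
    return all(rhs <= g for g in ctx if lhs <= g)

# Toy relation: city determines country, but country does not determine zip.
attrs = ['city', 'country', 'zip']
rows = [
    {'city': 'Doha',  'country': 'Qatar',  'zip': '00000'},
    {'city': 'Doha',  'country': 'Qatar',  'zip': '11111'},
    {'city': 'Paris', 'country': 'France', 'zip': '75001'},
]
ctx = agreement_context(rows, attrs)
print(fd_holds({'city'}, {'country'}, ctx))   # True
print(fd_holds({'country'}, {'zip'}, ctx))    # False
```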
Knowledge discovery from data is a challenging problem of significant importance in many different fields, such as biology, economics and the social sciences. Real-world data is incomplete and ambiguous; moreover, its rapid growth in size complicates the analysis process. Therefore, data reduction techniques that take data uncertainty into account are highly needed. In this paper, our objective is to conceptually reduce uncertain data without losing information. Two reduction methods are proposed, both rooted in formal concept analysis theory. The first method targets approximate data reduction; it uses the result of Baixeries et al. for detecting functional dependencies by transforming an instance of a database into an approximate formal context. The second method is based on fuzzy data reduction and employs the algorithm of Elloumi et al., which uses Lukasiewicz logic. These reduction methods have been compared to three other machine-learning-based reduction algorithms through a classification case study on breast cancer data. Classification accuracy, root mean square error and reduced data size are reported to show that training sets reduced by our methods yield very accurate classifiers with minimal data size. Moreover, the reduced data has the advantage of decreasing communication time and memory space.
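The sketch below is a generic crisp stand-in for the reduction step, assuming an alpha-cut of the fuzzy data followed by classical FCA context clarification and object reduction; it is not the specific algorithm of Baixeries et al. or Elloumi et al., and the attribute names, degrees and threshold are hypothetical.

```python
def alpha_cut(fuzzy_rows, alpha=0.5):
    """Binarize a fuzzy context: keep an attribute iff its degree >= alpha."""
    return [{a for a, deg in row.items() if deg >= alpha} for row in fuzzy_rows]

def clarify_and_reduce(crisp_rows):
    """Classical FCA clarification (merge duplicate rows) followed by object
    reduction: drop a row whose attribute set equals the intersection of the
    other remaining rows that contain it (it adds no new concept)."""
    rows = [set(r) for r in dict.fromkeys(map(frozenset, crisp_rows))]
    kept = []
    for i, row in enumerate(rows):
        supersets = [r for j, r in enumerate(rows) if j != i and row <= r]
        if supersets and set.intersection(*supersets) == row:
            continue  # reducible object
        kept.append(row)
    return kept

# Hypothetical fuzzy records with membership degrees per attribute.
fuzzy = [
    {'size_large': 0.9, 'irregular_shape': 0.8, 'smooth_texture': 0.2},
    {'size_large': 0.9, 'irregular_shape': 0.7, 'smooth_texture': 0.6},
    {'size_large': 0.8, 'irregular_shape': 0.9, 'calcified': 0.7},
]
crisp = alpha_cut(fuzzy)
# The first crisp row is dropped: it is the intersection of the other two.
print(clarify_and_reduce(crisp))
```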
The increase in biomedical data has given rise to the need for data sampling techniques. With the emergence of big data and the rising popularity of data science, sampling or reduction techniques have become essential to significantly speed up the data analytics process. Intuitively, without sampling techniques it would be difficult to efficiently extract useful patterns from a large dataset. With sampling, data analysis can be performed effectively on huge datasets by producing a relatively small portion of the data that contains the most representative objects of the original dataset. However, to reach sound conclusions and predictions, the samples must preserve the behavior of the data. In this paper, we propose a unique data sampling technique that exploits the notion of formal concept analysis. Machine learning experiments are performed on the resulting sample to evaluate its quality, and the performance of our method is compared with another sampling technique proposed in the literature. The results demonstrate the effectiveness and competitiveness of the proposed approach in terms of sample size and quality, as measured by accuracy and the F1-measure.
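As an illustration of the idea only (not the paper's actual sampling algorithm), the sketch below groups objects by their attribute pattern, which in FCA terms is their object intent and is always a closed set, and draws at least one representative per pattern so that every observed combination survives in the sample; the sampling fraction and toy data are hypothetical.

```python
import math
import random

def stratified_concept_sample(rows, fraction=0.2, seed=0):
    """Group objects by their attribute pattern (object intent) and draw at
    least one object per group, proportionally to the group size, so the
    sample preserves every pattern present in the original data."""
    rng = random.Random(seed)
    groups = {}
    for idx, row in enumerate(rows):
        groups.setdefault(frozenset(row), []).append(idx)
    sample = []
    for members in groups.values():
        k = max(1, math.ceil(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sorted(sample)

# Toy data: the indices of the selected rows are returned.
rows = [{'benign', 'small'}, {'benign', 'small'}, {'benign', 'small'},
        {'malignant', 'large'}, {'malignant', 'irregular'}]
print(stratified_concept_sample(rows))
```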
Online marketplaces are e-commerce websites where thousands of products are offered by multiple third parties. There are dozens of these differently structured marketplaces, and end users have to visit them one by one to reach their targets. This search process consumes a lot of time and effort; moreover, it negatively affects the user experience. In this paper, an extensive analysis and evaluation of existing e-marketplaces is performed in order to improve the end-user experience through a mobile app. The main goal of this study is to find a solution capable of integrating multiple heterogeneous hidden data sources and unifying the received responses into one single, structured and homogeneous source. Furthermore, the user can easily choose the desired product or reformulate the query through the interface. The proposed Android mobile app is based on the multi-level conceptual analysis and modeling discipline, in which data are analyzed in a way that helps discover the main concepts of any unknown domain captured from the hidden web. The concepts discovered through information extraction are then structured into a tree-based interface for easy navigation and query reformulation. The application has been evaluated through substantial experiments and compared to other existing mobile applications. The results show that analyzing the query results and restructuring the output in a conceptual multilevel manner before displaying it to the end user is reasonably effective in terms of the number of clicks, the time taken and the number of navigation screens. With the proposed application, the interface is reduced to only two navigation screens, and the time needed to browse products from multiple marketplaces and reach the target product remains reasonable.
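A rough sketch of the integration idea, assuming hypothetical marketplace response formats and field names ('title'/'name', 'cost'/'price'), is given below: heterogeneous responses are mapped onto one schema and then arranged into a simple two-level tree for navigation. The app's actual multi-level conceptual analysis is richer than this.

```python
from collections import defaultdict

def unify(responses):
    """Map differently structured marketplace responses onto one schema.
    The field names handled here are hypothetical examples."""
    unified = []
    for source, items in responses.items():
        for item in items:
            unified.append({
                'source': source,
                'name': item.get('title') or item.get('name'),
                'price': item.get('price') or item.get('cost'),
                'category': (item.get('category') or 'other').lower(),
            })
    return unified

def concept_tree(products):
    """Two-level tree (category -> products) for navigation and reformulation."""
    tree = defaultdict(list)
    for p in products:
        tree[p['category']].append(p)
    return dict(tree)

responses = {
    'shopA': [{'title': 'Phone X', 'cost': 300, 'category': 'Phones'}],
    'shopB': [{'name': 'Phone X', 'price': 290, 'category': 'phones'}],
}
print(concept_tree(unify(responses)))
```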
Most modern search engines feature keyword-based search interfaces. These interfaces are usually found on websites belonging to enterprises or governments, or on sites related to news articles, blogs and social media, all of which contain a large corpus of documents. These collections of documents are not easily indexed by web search engines and are considered hidden web databases. Through their keyword search interfaces, these databases provide opportunities for data analysis by many third parties. A significant amount of research has already been carried out on analyzing and extracting aggregate information about these hidden document corpora, but most of it focuses on high-level, big-picture information about the database. Not enough attention has been paid to extracting analytical information that is specific to individual queries. This paper addresses that gap and builds on ideas from existing research to formulate a query cardinality estimation technique, i.e., an estimate of the number of documents matching a query in the document corpus of a search engine. We experimentally assess the effectiveness of our method by building a search engine on the Reuters-21578 document corpus. For a given keyword, the corresponding document count is estimated solely by sending search queries through the interface.
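Since the abstract does not spell out the estimator, the sketch below shows only a generic sampling-based approach under strong assumptions: a (near-)uniform sample of documents has already been obtained through the keyword interface, and a rough corpus size estimate is available; the in-sample match rate is then scaled up to the whole corpus. The documents and numbers are hypothetical (21578 simply mirrors the Reuters-21578 corpus size).

```python
def estimate_cardinality(sample_docs, corpus_size_estimate, keyword):
    """Generic scaling estimator (not the paper's specific method): count how
    many sampled documents contain the keyword and scale the match rate up
    to an estimated corpus size."""
    matches = sum(keyword.lower() in doc.lower().split() for doc in sample_docs)
    return corpus_size_estimate * matches / len(sample_docs)

# Hypothetical usage with a toy sample standing in for documents retrieved
# through the keyword search interface.
sample = ["oil prices rise", "grain exports fall",
          "oil embargo news", "sports results"]
print(estimate_cardinality(sample, corpus_size_estimate=21578, keyword="oil"))
# -> 10789.0 (2 of 4 sampled documents match)
```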
This paper focuses on detecting inconsistencies within text corpora, an interesting area with many applications. Most existing methods deal with this problem using complicated textual analysis, which is known to be insufficiently accurate. We propose a new methodology that consists of two steps: the first is a machine learning step that performs multilevel text categorization; the second applies conceptual reasoning to the predicted categories in order to detect inconsistencies. The approach has been validated on a set of Islamic advisory opinions (also known as fatwas), a domain that is attracting considerable interest, with users continuously checking the authenticity and relevance of such content. The results show that our method is very accurate and can complement existing methods based on linguistic analysis.
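As an illustration of the second step only, the sketch below encodes conceptual reasoning as a small table of mutually exclusive category labels and flags two documents on the same topic whose predicted categories conflict; the category names and rules are hypothetical and not taken from the paper.

```python
# Hypothetical conflict rules between predicted category labels; the real
# system's conceptual reasoning and taxonomy are not described here.
CONFLICTS = {frozenset({'permitted', 'forbidden'}),
             frozenset({'obligatory', 'forbidden'})}

def inconsistent(categories_a, categories_b):
    """Flag two documents on the same topic whose predicted category sets
    contain a conflicting pair of labels."""
    return any(frozenset({a, b}) in CONFLICTS
               for a in categories_a for b in categories_b)

print(inconsistent({'finance', 'permitted'}, {'finance', 'forbidden'}))  # True
print(inconsistent({'finance', 'permitted'}, {'finance', 'permitted'}))  # False
```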