We describe a machine learning method for predicting the value of a real-valued function, given the values of multiple input variables. From training samples, the method induces solutions in the form of ordered disjunctive normal form (DNF) decision rules. A central objective of the method and representation is the induction of compact, easily interpretable solutions. This rule-based decision model can be extended to search efficiently for similar cases prior to approximating function values. Experimental results on real-world data demonstrate that the new techniques are competitive with existing machine learning and statistical methods and can sometimes yield superior regression performance.
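To make the representation concrete, the following is a minimal sketch of how an ordered rule list of this kind can be applied at prediction time. The `predict` function, the rules, thresholds, and predicted values are hypothetical stand-ins for what an induction procedure would learn, not output of the actual method.

```python
# A hypothetical ordered rule list: each rule is a conjunction of
# conditions over the input variables paired with a predicted value.
# The first rule whose conditions all hold determines the prediction;
# together the ordered rules act like an ordered DNF expression.

def predict(x, rules, default):
    for conditions, value in rules:
        if all(cond(x) for cond in conditions):
            return value
    return default  # fall-through prediction when no rule fires

# Toy rules for a function of two inputs x = (x1, x2); the thresholds
# and values are illustrative, not learned from real data.
rules = [
    ([lambda x: x[0] > 5.0, lambda x: x[1] <= 2.0], 10.3),
    ([lambda x: x[0] > 5.0], 7.1),
    ([lambda x: x[1] > 4.5], 3.8),
]

print(predict((6.2, 1.4), rules, default=5.0))  # first rule fires -> 10.3
```

Because only the first matching rule fires, later rules can be read as "otherwise" clauses, which is what keeps such solutions compact and easy to interpret.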
The Handbook of Natural Language Processing is a revised edition of an earlier handbook (Dale, Moisl, and Somers 2000). This second edition was prepared by Nitin Indurkhya, a researcher at the University of New South Wales, and the late text processing pioneer Fred J. Damerau of the IBM T. J. Watson Research Center (d. 27 January 2009), whose 1964 paper introduced a version of what is now known as the Damerau-Levenshtein distance, a metric of the similarity between two strings, along with a dynamic programming algorithm to compute it efficiently (Damerau 1964). Damerau also invented automatic hyphenation (Damerau 1970) and worked on early question-answering systems. Indurkhya, who is also affiliated with a consulting company, Data-Miner Pty Ltd., maintains a companion wiki for the book.

The book has three parts, totaling 26 chapters. The first part, Classical Approaches, essentially covers techniques that were known prior to the statistical revolution, that is, before mainstream natural language processing researchers embraced techniques that speech engineers had already been using successfully for a while. The second part, Empirical and Statistical Approaches, covers state-of-the-art data-driven models. The third part, Applications, presents techniques that sit closer to applications. To a computational linguist, information extraction is an application; to business people, it is a general technology area from which many application products and services can be built.

The handbook “aims to cater to the needs of NLP practitioners and language-engineering professionals in academia as well as in industry. . . . The prototypical reader is interested in the practical aspects of building NLP systems and may also be interested in working with languages other than English” (p. xxii). Hence it would have been nice to include descriptions of actual revenue-generating products (even if this meant that this particular part of the handbook would become outdated more quickly) in order to demonstrate how the NLP components are embedded in non-NLP technology, and how these products are embedded in the businesses that use them. For example, the application chapter “Information Retrieval” does not describe how the topics in other parts were applied in Web search engines or enterprise search products, as one might have expected. Instead, it is essentially another technical chapter, and its probabilistic IR material could just as well have been presented in part two (statistical techniques).
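For readers unfamiliar with the metric named after Damerau, here is a minimal sketch of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and transpositions of adjacent characters. It illustrates the dynamic programming idea only; it is not a reproduction of Damerau's original 1964 formulation.

```python
def dl_distance(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein distance:
    minimum number of insertions, deletions, substitutions, and adjacent
    transpositions needed to turn a into b."""
    m, n = len(a), len(b)
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(dl_distance("ca", "ac"))  # 1: a single adjacent transposition
```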
We describe a lightweight learning method that induces an ensemble of decision-rule solutions for regression problems. Instead of predicting a continuous output variable directly, the method discretizes the variable by k-means clustering and solves the resulting classification problem. Predictions on new examples are made by averaging the mean values of the classes whose vote counts are close to that of the most likely class. We provide experimental evidence that this indirect approach often yields strong results across applications, generally outperforming direct approaches such as regression trees and rivaling bagged regression trees.
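As an illustration of the indirect pipeline only (not of the authors' rule-induction ensemble), the sketch below discretizes the target with one-dimensional k-means, trains a stand-in classifier, and averages the mean target values of all classes whose predicted vote shares come close to the winner's. The functions `fit` and `predict`, the number of clusters `k`, the `margin` parameter, and the use of a decision tree with `predict_proba` as a proxy for rule votes are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def fit(X, y, k=8):
    # Discretize the continuous target into k pseudo-classes by 1-D k-means.
    km = KMeans(n_clusters=k, n_init=10).fit(y.reshape(-1, 1))
    labels = km.labels_
    # Mean target value of each pseudo-class, used later for prediction.
    class_means = np.array([y[labels == c].mean() for c in range(k)])
    clf = DecisionTreeClassifier().fit(X, labels)  # stand-in for the rule ensemble
    return clf, class_means

def predict(clf, class_means, X, margin=0.1):
    # Average the class means of every class whose vote share lies within
    # `margin` of the winning class (one reading of "close in number").
    proba = clf.predict_proba(X)
    preds = []
    for p in proba:
        close = p >= p.max() - margin
        preds.append(class_means[clf.classes_[close]].mean())
    return np.array(preds)

# Tiny synthetic demonstration.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
clf, means = fit(X, y)
print(predict(clf, means, X[:3]))
```

Averaging over near-winning classes, rather than taking only the single most likely class mean, smooths the quantization error introduced by the discretization step.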
Item descriptions on an e-commerce site such as eBay consist of item-specific information along with generic information such as shipping and return policies, requests for feedback, and contact information. Extracting these textual segments from the item descriptions is non-trivial, as the descriptions contain HTML markup, advertisements, templates, and navigational elements. Since sellers have considerable editorial freedom in how they describe their items, many of the descriptions lack homogeneity and compactness. Very often, the relevant information has to be extracted from incomplete, ill-formed discourse units, adding to the challenge of finding coherent segments. In this paper we describe an approach that identifies item-specific text segments in eBay descriptions. The approach uses a bootstrapping technique to learn high-quality semantic lexicons for item-agnostic text segments. We first extract useful text by removing HTML markup with a boilerplate-removal technique that preserves markup information and captures visual segmentation. Each segment is further processed to extract discourse units that play the same role as sentences. This is followed by a clustering technique that identifies thematic breaks in order to extract coherent segments. We evaluate our approach on a diverse set of descriptions and show that it outperforms a commonly used approach that relies only on the title keywords.
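The abstract does not spell out the clustering step, so as a rough illustration of detecting thematic breaks between discourse units, the TextTiling-style sketch below marks a boundary wherever lexical similarity between adjacent units drops below a threshold. The function `thematic_breaks`, the `threshold` value, and the toy description text are hypothetical, and TF-IDF cosine similarity is used here only as a simple stand-in for the paper's technique.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def thematic_breaks(units, threshold=0.15):
    """Mark a break between adjacent discourse units whose lexical
    similarity falls below `threshold` (TextTiling-style heuristic)."""
    vec = TfidfVectorizer().fit_transform(units)
    sims = [cosine_similarity(vec[i], vec[i + 1])[0, 0]
            for i in range(len(units) - 1)]
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Toy discourse units: item-specific text followed by generic policy text.
units = [
    "Brand new 8GB mp3 player, plays video and FM radio.",
    "Battery lasts up to 20 hours of continuous playback.",
    "We ship within 2 business days via USPS.",
    "Returns accepted within 30 days, buyer pays return shipping.",
]
print(thematic_breaks(units))  # indices of units that start a new segment
```

A boundary between the product-description units and the shipping/returns units is the kind of break that separates item-specific from item-agnostic segments in this setting.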