Many species of plants produce leaves with distinct teeth around their margins. The presence and nature of these teeth can often help botanists to identify species. Moreover, it has long been known that a greater proportion of species native to colder regions have toothed leaves than of species native to warmer regions. It has therefore been suggested that fossilized leaf remains can be used as a proxy for reconstructing ancient climates. Similar studies of living plants can help refine our understanding of this relationship. The required analysis of leaves typically involves considerable manual effort, which in practice limits the number of leaves analyzed, potentially reducing the power of the results. In this work, we describe a novel algorithm to automate the marginal tooth analysis of leaves found in digital images. We demonstrate our methods on a large set of images of whole herbarium specimens collected from Tilia trees (also known as lime, linden or basswood). We chose the genus Tilia because its constituent species have toothed leaves of varied size and shape. In a previous study we extracted leaves automatically from a set of such images. Our new algorithm locates teeth on the margins of these leaves and, as well as counting them, extracts features such as each tooth's area, perimeter and internal angles. We evaluate the performance of an implementation of our algorithm against a manually analyzed subset of the images, finding that it achieves an accuracy of 85% for counting teeth and 75% for estimating tooth area. We also demonstrate that the automatically extracted features are sufficient to distinguish different species of Tilia using a simple linear discriminant analysis, and that the features relating to teeth are the most useful.
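To make the final step concrete, the sketch below shows how per-leaf tooth features could feed a simple linear discriminant analysis. The feature set and the two synthetic "species" are invented for illustration (the abstract does not specify the exact features or data), and a basic two-class Fisher discriminant stands in for whatever LDA implementation the study used.

```python
import numpy as np

def fisher_lda(X0, X1):
    """Two-class Fisher LDA: return projection direction and threshold."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter, lightly regularised for stability
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)
    Sw += 1e-6 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)       # discriminant direction
    threshold = w @ (m0 + m1) / 2.0        # midpoint between projected means
    return w, threshold

def classify(X, w, threshold):
    return (X @ w > threshold).astype(int)  # 0 = first class, 1 = second

# Hypothetical per-leaf features: [tooth count, mean tooth area, mean internal angle]
rng = np.random.default_rng(0)
species_a = rng.normal([40, 1.2, 55], [5, 0.25, 4], size=(30, 3))
species_b = rng.normal([25, 2.0, 70], [5, 0.25, 4], size=(30, 3))

w, t = fisher_lda(species_a, species_b)
preds = np.concatenate([classify(species_a, w, t), classify(species_b, w, t)])
truth = np.concatenate([np.zeros(30), np.ones(30)])
accuracy = (preds == truth).mean()
```

Because the discriminant direction weights each feature by how well it separates the classes, inspecting `w` (on standardised features) is one simple way to see that tooth-related features carry most of the discriminative power.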
Newsworthy stories are increasingly being shared through social networking platforms such as Twitter and Reddit, and journalists now use them to rapidly discover stories and eye-witness accounts. We present a technique, designed for a real-time topic-detection system, that detects “bursts” of phrases on Twitter. We describe a time-dependent variant of the classic tf-idf approach and group together bursty phrases that often appear in the same messages in order to identify emerging topics. We demonstrate our methods by analysing tweets corresponding to events drawn from the worlds of politics and sport, as well as more general mainstream news. To evaluate our methods we created a user-centred “ground truth” based on mainstream media accounts of the events, which helps ensure that our methods remain practical. We compare several clustering and topic-ranking methods to discover the characteristics of news-related collections, and show that different strategies are needed to detect emerging topics within them. We show that our methods successfully detect a range of different topics for each event and can retrieve messages (for example, tweets) that represent each topic for the user.
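The paper's exact weighting is not reproduced here, but a time-dependent tf-idf can be sketched as follows: term frequency is computed over the current timeslot, while document frequency is computed over a trailing window of earlier slots, so phrases that are suddenly common relative to their recent history score highly. The class name, window size and smoothing below are illustrative assumptions.

```python
import math
from collections import Counter, deque

class BurstDetector:
    """Score phrases by a time-dependent tf-idf: term frequency in the
    current timeslot against document frequency over recent slots only."""

    def __init__(self, history_slots=6):
        # Each entry: (per-slot document-frequency Counter, slot size)
        self.history = deque(maxlen=history_slots)

    def score_slot(self, tweets):
        """tweets: list of token lists for the current timeslot."""
        tf = Counter(tok for tweet in tweets for tok in set(tweet))
        hist_df, hist_docs = Counter(), 1          # add-one smoothing
        for df, n in self.history:
            hist_df.update(df)
            hist_docs += n
        scores = {tok: count * math.log(hist_docs / (1 + hist_df[tok]))
                  for tok, count in tf.items()}
        self.history.append((tf, len(tweets)))
        return scores

detector = BurstDetector()
# Toy stream: "goal" is absent from history, then bursts in the third slot
detector.score_slot([["match", "kicks", "off"], ["watching", "the", "match"]])
detector.score_slot([["match", "still", "level"], ["quiet", "match"]])
burst = detector.score_slot([["goal", "what", "a", "goal"], ["goal", "incredible"]])
top = max(burst, key=burst.get)
```

Grouping bursty phrases that co-occur in the same messages (the clustering step) would then operate on the top-scoring phrases from each slot.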
Lightness illusions are fundamental to human perception, and yet why we see them is still the focus of much research. Here we address the question by modelling not human physiology or perception directly, as is typically the case, but our natural visual world and the need for robust behaviour. Artificial neural networks were trained to predict the reflectance of surfaces in a synthetic ecology consisting of 3-D "dead-leaves" scenes under non-uniform illumination. The networks learned to solve this task accurately and robustly given only ambiguous sense data. In addition, and as a direct consequence of their experience, the networks also made systematic "errors" in their behaviour commensurate with human illusions, including brightness contrast and assimilation, although assimilation (specifically White's illusion) only emerged when the virtual ecology included 3-D, as opposed to 2-D, scenes. Subtle variations in these illusions, also found in human perception, were observed, such as the asymmetry of brightness contrast. These data suggest that "illusions" arise in humans because (i) natural stimuli are ambiguous, and (ii) this ambiguity is resolved empirically by encoding the statistical relationship between images and scenes in past visual experience. Since resolving stimulus ambiguity is a challenge faced by all visual systems, a corollary of these findings is that human illusions must be experienced by all visual animals regardless of their particular neural machinery. The data also provide a more formal definition of illusion: the condition in which the true source of a stimulus differs from its most likely (and thus perceived) source. As such, illusions are not fundamentally different from non-illusory percepts, all being direct manifestations of the statistical relationship between images and scenes.
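The core of the training task is that luminance is the product of reflectance and illumination, so a single pixel is ambiguous and the network must exploit spatial context. The sketch below is a toy 1-D analogue of that setup, not the study's 3-D dead-leaves ecology or network architecture: a piecewise-constant reflectance strip under a smooth illumination gradient, and a small hand-rolled network trained to recover the centre pixel's reflectance from a window of luminances. All sizes and parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_scene(n=64):
    """1-D 'dead-leaves' strip: piecewise-constant reflectance under a
    smooth illumination gradient. Luminance = reflectance * illumination."""
    refl = np.repeat(rng.uniform(0.1, 0.9, n // 8), 8)
    illum = np.linspace(*rng.uniform(0.2, 1.0, 2), n)
    return refl, refl * illum

def windows(lum, refl, k=4):
    """Sliding luminance windows paired with the centre reflectance."""
    X = np.array([lum[i - k:i + k + 1] for i in range(k, len(lum) - k)])
    return X, refl[k:len(lum) - k]

# Training set drawn from many random scenes
Xs, ys = zip(*(windows(*make_scene()) for _ in range(200)))
X, y = np.vstack(Xs), np.concatenate(ys)

# One-hidden-layer network trained by plain full-batch gradient descent
W1 = rng.normal(0, 0.5, (X.shape[1], 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, 16); b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

_, pred0 = forward(X)
loss0 = np.mean((pred0 - y) ** 2)       # mean squared error before training
lr = 0.05
for _ in range(300):
    h, pred = forward(X)
    err = pred - y                       # output-layer error
    gW2 = h.T @ err / len(y); gb2 = err.mean()
    dh = np.outer(err, W2) * (1 - h ** 2)   # backprop through tanh
    gW1 = X.T @ dh / len(y); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(X)
loss = np.mean((pred - y) ** 2)
```

In this toy form the network can only discount the illumination gradient using its spatial window, which is the same kind of context-dependence that produces contrast-like "errors" on ambiguous probe stimuli.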
This thesis compares the performance of machine learning techniques with that of traditional statistics in the analysis of food design data. The goal of the analysis is to understand what makes people like (or dislike) a product, by building models relating sensory features (such as flavour or texture) to consumer preferences. One difficulty in analysing these data sets is that they are extremely small, owing to the taste fatigue of consumer preference panels. Feature selection is essential because food sensory data sets typically have many features and few records. Several feature selection algorithms are compared, and the results highlight the need to limit the number of features used; we therefore apply model order selection to feature selection. A semi-supervised feature selection method is introduced and compared with more traditional methods. After a suitable set of features has been selected, the relationship between those features and consumer preferences must be modelled. Two regression techniques are compared, focussing on their relative performance on very small data sets. A semi-supervised ensemble learning algorithm is introduced and analysed. Consumers have individual preferences, so rather than producing a single generic product, food designers must first discover homogeneous groups of consumers and then target each group with a different product. Several clustering techniques are compared, and consideration of their inherent biases reveals further information about the structure of the data. A combination of regression and clustering is proposed, which allows clustering results to be evaluated using the predictive power of the resultant models. Preference data sets contain a significant number of misleading outliers owing to the way they are collected. An algorithm that combines clustering and outlier detection is therefore introduced; it aims to produce an outlier-free cluster model and also provides heuristic estimates of the number of outliers present.
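The thesis's combined clustering-and-outlier-detection algorithm is not reproduced in this summary; the sketch below only illustrates the general idea of alternating a clustering fit with robust trimming, so that the final model is fitted without outliers and the number of trimmed points serves as a heuristic outlier count. The trimming rule (median + c·MAD of nearest-centre distances), the deterministic k-means initialisation and the toy data are all invented for illustration.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with a deterministic init: starting centres are
    taken from evenly spaced ranks along the first feature."""
    order = np.argsort(X[:, 0])
    picks = np.linspace(0, len(X) - 1, k + 2)[1:-1].astype(int)
    centres = X[order[picks]].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centres[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def cluster_with_outliers(X, k, c=4.0, rounds=5):
    """Alternate clustering and trimming: points whose distance to the
    nearest centre exceeds median + c*MAD are flagged as outliers and
    excluded from the next fit. Returns centres and an outlier mask,
    whose sum is a heuristic estimate of the number of outliers."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(rounds):
        centres = kmeans(X[keep], k)
        d = np.linalg.norm(X[:, None] - centres[None], axis=2).min(axis=1)
        med = np.median(d[keep])
        mad = np.median(np.abs(d[keep] - med))
        keep = d <= med + c * mad
    return centres, ~keep

# Invented preference-style data: two consumer groups plus a few outliers
rng = np.random.default_rng(2)
groups = np.vstack([rng.normal([0.0, 0.0], 0.5, (40, 2)),
                    rng.normal([8.0, 8.0], 0.5, (40, 2))])
outliers = np.array([[30.0, 30.0], [-25.0, 10.0], [20.0, -20.0],
                     [-15.0, -15.0], [35.0, 5.0], [5.0, 35.0]])
X = np.vstack([groups, outliers])

centres, is_outlier = cluster_with_outliers(X, k=2)
n_outliers = int(is_outlier.sum())
```

Because each round re-scores every point against the refitted centres, points trimmed by an early, outlier-contaminated fit can be readmitted once the centres settle.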
Overall, machine learning techniques show performance similar to that of traditional statistical techniques, with small improvements in accuracy in some cases. Machine learning brings the benefit of typically depending on fewer assumptions: where those assumptions are invalid, results may be improved. Furthermore, machine learning makes use of the considerable computational power that is now cheaply available in the search for improved solutions. In this thesis, we examine the efficacy of machine learning techniques when analysing food design data sets. In summary, the main contributions of this thesis are: a semi-supervised feature selection algorithm; a semi-supervised ensemble for regression; a clustering evaluation technique; and an outlier detection technique for clustering.
The SNOW 2014 Data Challenge aimed to create a public benchmark and evaluation resource for the problem of topic detection in streams of social content. In particular, given a set of tweets spanning a time interval of interest, the Challenge required the extraction of the most significant news topics in short timeslots within the selected interval. Here, we provide details of the Challenge definition, the data collection and evaluation process, and the results achieved by the 11 teams that participated, along with a concise retrospective analysis of the main conclusions and the issues that arose.