    Low-Shot Classification: A Comparison of Classical and Deep Transfer Machine Learning Approaches
    Abstract:
    Despite the recent success of deep transfer learning approaches in NLP, there is a lack of quantitative studies demonstrating the gains these models offer in low-shot text classification tasks over existing paradigms. Deep transfer learning approaches such as BERT and ULMFiT demonstrate that they can beat state-of-the-art results on larger datasets; however, when one has only 100-1000 labelled examples per class, the choice of approach is less clear, with classical machine learning and deep transfer learning both representing valid options. This paper compares the current best transfer learning approach with top classical machine learning approaches on a trinary sentiment classification task to assess the best paradigm. We find that BERT, representing the best of deep transfer learning, is the best performing approach, outperforming top classical machine learning algorithms by 9.7% on average when trained with 100 examples per class, narrowing to 1.8% at 1000 labels per class. We also show the robustness of deep transfer learning in moving across domains, where the maximum loss in accuracy is only 0.7% in similar domain tasks and 3.2% cross domain, compared to classical machine learning, which loses up to 20.6%.
    Keywords:
    Transfer of learning
    Robustness
    The use of meta-learning and transfer learning in the task of few-shot image classification is a well-researched area, with many papers showcasing the advantages of transfer learning over meta-learning in cases where data is plentiful and there are no major limitations on computational resources. In this paper we showcase our experimental results from testing various state-of-the-art transfer learning weights and architectures against similar state-of-the-art works in the meta-learning field for image classification utilizing Model-Agnostic Meta-Learning (MAML). Our results show that both practices provide adequate performance when the dataset is sufficiently large, but that both struggle to maintain sufficient performance when data sparsity is introduced. This problem is moderately reduced with the use of image augmentation and the fine-tuning of hyperparameters. In this paper we (1) describe our process of developing a robust multi-class convolutional neural network (CNN) for the task of few-shot image classification, (2) demonstrate that transfer learning is the superior method for building an image classification model when the dataset is large, and (3) show that MAML outperforms transfer learning in the case where data is very limited. The code is available here: github.com/JBall1/Few-Shot-Limited-Data
    Transfer of learning
    Inductive transfer
    Contextual image classification
    Hyperparameter
    Citations (2)
    We address the problem of learning cross-modal representations. We propose an instance-based deep metric learning approach in joint visual and textual space. The key novelty of this paper is that it shows that using per-image semantic supervision leads to substantial improvement in zero-shot performance over using class-only supervision. We also provide a probabilistic justification and empirical validation for a metric rescaling approach to balance the seen/unseen accuracy in the GZSL task. We evaluate our approach on two fine-grained zero-shot datasets: CUB and Flowers.
    Zero (linguistics)
    Novelty Detection
    There is more to learning stochastic concepts for robust statistical pattern recognition than the learning itself: computational resources must be allocated and information must be obtained. Therein lies the key to a learning strategy that is efficient, requiring the fewest resources and the least information necessary to produce classifiers that generalize well. Probabilistic learning strategies currently used with connectionist (as well as most traditional) classifiers are inefficient, requiring high classifier complexity and large training sample sizes to ensure good generalization. An asymptotically efficient differential learning strategy is set forth, which guarantees Bayesian (i.e., minimum probability-of-error) discrimination with the minimum-complexity classifier. Moreover, differential learning guarantees the best generalization allowed by the choice of classifier paradigm as long as the training sample size is large. When the training sample size is small, differential learning usually guarantees the best generalization allowed by the choice of classifier paradigm. The theory is demonstrated in several real-world machine learning/pattern recognition tasks associated with optical character recognition, medical diagnosis, airborne remote sensing imagery interpretation, and adaptive digital telecommunications. These applications focus on the implementation of differential learning and illustrate its advantages and limitations in a series of experiments that complement the theory. The experiments demonstrate that differentially-generated classifiers consistently generalize better than their probabilistically-generated counterparts across a wide range of real-world learning-and-classification tasks. The discrimination improvements range from moderate to significant, depending on the statistical nature of the learning task and its relationship to the functional basis of the classifier used.
    Statistical learning theory
    Connectionism
    Citations (17)
    Deep learning has had remarkable success in several applications such as classification, clustering, regression, etc. Several assumptions are made during the learning process which may not be apt for all real-world applications due to changes in the feature space. For the classification task, deep learning models are most appropriate if a large amount of data is used for training. Therefore, enhancement is made from deep learning to transfer learning by knowledge transfer from feature space. In this paper, the accuracy obtained, number of iterations, and time taken for classification by various pre-trained networks are compared through transfer learning. The results reveal that the accuracy is higher when the training data is large compared to that with a small dataset.
    Transfer of learning
    Feature (linguistics)
    Inductive transfer
    The superior performance of deep learning algorithms in fields such as computer vision and natural language processing has fueled an increased interest towards these algorithms in both research and in practice. Ever since, many studies have applied these algorithms to other machine learning contexts with other types of data in the hope of achieving comparable superior performance. This study departs from the latter motivation and investigates the application of deep learning classification techniques on big behavioral data while comparing its predictive performance with 11 widely-used shallow classifiers. In addition to the application on a new type of data and a structured comparison of its performance with commonly used classifiers, this study attempts to shed light onto when and why deep learning techniques perform better. Regarding the specific characteristics of applying deep learning on this unique class of data, we demonstrate that an unsupervised pretraining step does not improve classification performance and that a tanh nonlinearity achieves the best predictive performance. The results from applying deep learning on 15 big behavioral data sets demonstrate as good as or better results compared to traditionally-used, shallow classifiers. However, no significant performance improvement can be recorded. Investigating when deep learning performs better, we find that worse performance is obtained for data sets with low signal-from-noise separability. In order to gain insight into why deep learning generally performs well on this type of data, we investigate the value of the distributed, hierarchical characteristic of the learning process. The neurons in the distributed representation seem to identify more nuances in the many behavioral features as compared to shallow classifiers. We demonstrate these nuances in an intuitive manner and validate them through comparison with feature engineering techniques. This is the first study to apply and validate the use of nonlinear deep learning classification on fine-grained, human-generated data while proposing efficient configuration settings for its practical implementation. As deep learning classification is often characterized by being a black-box approach, we also provide a first attempt towards the disentanglement regarding when and why these techniques perform well.
    Instance-based learning
    Citations (2)
    Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, typical meta-learning models use shallow neural networks, thus limiting their effectiveness. In order to achieve top performance, some recent works tried to use DNNs pre-trained on large-scale datasets, but mostly in straightforward manners, e.g., (1) taking their weights as a warm start of meta-training, and (2) freezing their convolutional layers as the feature extractor of base-learners. In this paper, we propose a novel approach called meta-transfer learning (MTL), which learns to transfer the weights of a deep NN for few-shot learning tasks. Specifically, meta refers to training multiple tasks, and transfer is achieved by learning scaling and shifting functions of DNN weights (and biases) for each task. To further boost the learning efficiency of MTL, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum of few-shot classification tasks. We conduct experiments for five-class few-shot classification tasks on three challenging benchmarks, miniImageNet, tieredImageNet, and Fewshot-CIFAR100 (FC100), in both supervised and semi-supervised settings. Extensive comparisons to related works validate that our MTL approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy.
    Transfer of learning
    Citations (79)
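As a rough illustration of the scaling-and-shifting idea described in the meta-transfer learning abstract above, the sketch below applies a frozen linear layer whose weights are multiplied by a learned scale and whose biases are offset by a learned shift. The function name `scale_shift_linear`, the per-output-unit parametrization, the identity starting values, and the toy numbers are all assumptions made for illustration; this is not the authors' implementation.

```python
def scale_shift_linear(x, weights, biases, scales, shifts):
    """Linear layer with frozen `weights`/`biases`, modulated per output unit.

    Hypothetical sketch: only `scales` and `shifts` would be trained for a
    new task, while the pre-trained `weights` and `biases` stay fixed.
    """
    outputs = []
    for row, bias, scale, shift in zip(weights, biases, scales, shifts):
        # Scale the frozen weights, then shift the frozen bias.
        activation = sum(w * scale * xi for w, xi in zip(row, x))
        outputs.append(activation + bias + shift)
    return outputs


# Toy pre-trained parameters (illustrative values only).
frozen_w = [[1.0, 2.0], [3.0, 4.0]]
frozen_b = [0.5, -0.5]
x = [1.0, 1.0]

# With scales of 1 and shifts of 0 the layer behaves like the original one.
identity_out = scale_shift_linear(x, frozen_w, frozen_b, [1.0, 1.0], [0.0, 0.0])

# Task-specific scales and shifts change the output without touching frozen_w.
adapted_out = scale_shift_linear(x, frozen_w, frozen_b, [2.0, 0.5], [0.1, 0.0])
```

Because only one scale and one shift are stored per output unit in this sketch, the number of task-specific parameters is far smaller than the number of frozen weights, which is the usual motivation for this kind of lightweight adaptation.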
    The ability of deep neural networks to extract complex statistics and learn high-level features from vast datasets is proven. Yet current deep learning approaches suffer from poor sample efficiency, in stark contrast to human perception. Few-shot learning algorithms such as matching networks or Model-Agnostic Meta-Learning (MAML) mitigate this problem, enabling fast learning with few examples. In this paper, we extend the MAML algorithm to point cloud data using a PointNet architecture. We construct N × K-shot classification tasks from the ModelNet40 point cloud dataset to show that this method performs classification as well as supervised deep learning methods, with the added benefit of being able to adapt after a single gradient step on a single N × K task. We empirically search for optimal values of N and K for few-shot classification and show that our method achieves 90% meta-test accuracy, compared to 89.2% for traditional PointNet. We also adapt a meta-trained PointNet to a support set of 9 (N = 3, K = 3) never-before-seen point clouds drawn from an entirely different dataset, ShapeNet. Once adapted, the model achieves 7.1/9 classification accuracy on average across 100 query sets of the same classes with new, unique instances. This result far exceeds the supervised Stochastic Gradient Descent (SGD) training result of 3.1/9 accuracy on the query sets, which is equivalent to a random baseline.
    Stochastic Gradient Descent
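The N × K-shot task construction mentioned in the point cloud abstract above can be sketched as follows: pick N classes, then draw K support examples (plus some held-out query examples) from each. The function name `sample_episode`, the `n_query` parameter, and the toy dataset are hypothetical and chosen only for illustration; the paper's actual sampling procedure may differ.

```python
import random


def sample_episode(items_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task as (support, query) lists of (item, label).

    Hypothetical sketch: `items_by_class` maps each class label to a list of
    examples; support and query examples for a class never overlap.
    """
    classes = rng.sample(sorted(items_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query stay disjoint.
        drawn = rng.sample(items_by_class[label], k_shot + n_query)
        support += [(item, label) for item in drawn[:k_shot]]
        query += [(item, label) for item in drawn[k_shot:]]
    return support, query


# Toy dataset: 5 classes with 10 placeholder examples each.
toy_data = {c: [f"{c}_{i}" for i in range(10)] for c in "abcde"}

# N = 3, K = 3 gives a support set of 9 examples, as in the abstract.
support, query = sample_episode(toy_data, n_way=3, k_shot=3, n_query=2,
                                rng=random.Random(0))
print(len(support), len(query))  # 9 support examples, 6 query examples
```

Passing an explicit `random.Random` instance keeps episodes reproducible, which helps when comparing methods on the same set of sampled tasks.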