Unsupervised Disentanglement Learning by Intervention
Citations: 0 | References: 0 | Related Papers: 20
Abstract:
Recently there has been increased interest in the unsupervised learning of disentangled representations from data generated by factors of variation. Existing works rely on the assumption that the generative factors are independent, even though this assumption is often violated in real-world scenarios. In this paper, we focus on the unsupervised learning of disentanglement in a general setting in which the generative factors may be correlated. We propose an intervention-based framework to tackle this problem. In particular, we first apply a random intervention operation to a selected feature of the learnt image representation; we then propose a novel metric that measures disentanglement through a downstream image translation task and show experimentally that it is consistent with existing metrics that require ground truth; finally, we design an end-to-end model that learns disentangled representations from the self-supervision signal of the downstream translation task. We evaluate our method quantitatively on benchmark datasets and give qualitative comparisons on a real-world dataset. Experiments show that our algorithm outperforms baselines on benchmark datasets when faced with correlated data and, unlike the baselines, can disentangle semantic factors on the real-world dataset.
Keywords: Benchmark, Generative model, Representation, Feature Learning, Ground truth, Feature
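As a minimal illustration of the random intervention operation described in the abstract, the sketch below overwrites one coordinate of a learnt code with the value of that coordinate taken from another sample in the batch and hands the result to a decoder. The encoder, decoder, and latent size are hypothetical placeholders, not the paper's actual interface.

import torch

def intervene(z: torch.Tensor, dim: int) -> torch.Tensor:
    """Replace feature `dim` of every code in the batch with the same feature
    taken from a randomly chosen donor sample, leaving all other features intact."""
    perm = torch.randperm(z.size(0))          # random donor for each code
    z_new = z.clone()
    z_new[:, dim] = z[perm, dim]              # intervene on the selected factor only
    return z_new

# Hypothetical usage with placeholder encoder/decoder modules:
# z = encoder(x)                              # (batch, latent_dim)
# d = torch.randint(0, z.size(1), (1,)).item()
# x_translated = decoder(intervene(z, d))     # input to the downstream translation task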
Continual learning has become increasingly important as it enables NLP models to constantly learn and gain knowledge over time. Previous continual learning methods are mainly designed to preserve knowledge from previous tasks, without much emphasis on how to well generalize models to new tasks. In this work, we propose an information disentanglement based regularization method for continual learning on text classification. Our proposed method first disentangles text hidden spaces into representations that are generic to all tasks and representations specific to each individual task, and further regularizes these representations differently to better constrain the knowledge required to generalize. We also introduce two simple auxiliary tasks: next sentence prediction and task-id prediction, for learning better generic and specific representation spaces. Experiments conducted on large-scale benchmarks demonstrate the effectiveness of our method in continual text classification tasks with various sequences and lengths over state-of-the-art baselines. We have publicly released our code at https://github.com/GT-SALT/IDBR.
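As an informal sketch (not the released IDBR code at the repository linked above), the snippet below splits a shared text encoding into a task-generic and a task-specific part and regularizes the two parts with different strengths; the module names, hidden sizes, and weightings are assumptions made for illustration.

import torch
import torch.nn as nn

class DisentangledHead(nn.Module):
    def __init__(self, hidden: int = 768, dim: int = 128):
        super().__init__()
        self.generic = nn.Linear(hidden, dim)    # shared across all tasks
        self.specific = nn.Linear(hidden, dim)   # task-specific in practice

    def forward(self, h: torch.Tensor):
        return self.generic(h), self.specific(h)

def regularize(g, g_old, s, s_old, lam_g: float = 1.0, lam_s: float = 0.1):
    """Constrain the generic space more strongly than the specific space, so that
    generic knowledge is preserved while task-specific knowledge remains free to move."""
    return lam_g * (g - g_old).pow(2).mean() + lam_s * (s - s_old).pow(2).mean()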
The effective application of representation learning to real-world problems requires both techniques for learning useful representations and robust ways to evaluate properties of those representations. Recent work in disentangled representation learning has shown that unsupervised representation learning approaches rely on fully supervised disentanglement metrics, which assume access to labels for the ground-truth factors of variation. In many real-world cases ground-truth factors are expensive to collect or difficult to model, as in perception. Here we empirically show that a weakly-supervised downstream task based on odd-one-out observations is suitable for model selection, as it correlates strongly with performance on a difficult downstream abstract visual reasoning task. We also show that a bespoke metric-learning VAE that performs well on this task also outperforms other standard unsupervised models and a weakly-supervised disentanglement model across several metrics.
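A hedged illustration of an odd-one-out probe of the kind described above: given the embeddings of a triplet in which two items share a factor, the item not involved in the closest pair is predicted to be the odd one, and the probe's accuracy can serve as a label-free proxy for ranking candidate representation models. The distance choice and triplet construction are assumptions, not the paper's exact protocol.

import torch

def odd_one_out(z: torch.Tensor) -> int:
    """z: (3, d) embeddings of a triplet; returns the index of the predicted odd item."""
    d01 = (z[0] - z[1]).norm()
    d02 = (z[0] - z[2]).norm()
    d12 = (z[1] - z[2]).norm()
    # The item outside the closest pair is predicted to be the odd one.
    pair = torch.argmin(torch.stack([d01, d02, d12])).item()
    return [2, 1, 0][pair]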
Text classification tends to struggle when data is deficient or when it needs to adapt to unseen classes. In such challenging scenarios, recent studies have used meta-learning to simulate the few-shot task, in which new queries are compared to a small support set at the sample-wise level. However, this sample-wise comparison may be severely disturbed by the various expressions in the same class. Therefore, we should be able to learn a general representation of each class in the support set and then compare it to new queries. In this paper, we propose a novel Induction Network to learn such a generalized class-wise representation, by innovatively leveraging the dynamic routing algorithm in meta-learning. In this way, we find the model is able to induce and generalize better. We evaluate the proposed model on a well-studied sentiment classification dataset (English) and a real-world dialogue intent classification dataset (Chinese). Experiment results show that on both datasets, the proposed model significantly outperforms the existing state-of-the-art approaches, proving the effectiveness of class-wise generalization in few-shot text classification.
Text classification tends to struggle when data is deficient or when it needs to adapt to unseen classes. In such challenging scenarios, recent studies often use meta-learning to simulate the few-shot task, in which new queries are compared to a small support set on a sample-wise level. However, this sample-wise comparison may be severely disturbed by the various expressions in the same class. Therefore, we should be able to learn a general representation of each class in the support set and then compare it to new queries. In this paper, we propose a novel Induction Network to learn such generalized class-wise representations, innovatively combining the dynamic routing algorithm with the typical meta-learning framework. In this way, our model is able to induce from particularity to universality, which is a more human-like learning approach. We evaluate our model on a well-studied sentiment classification dataset (English) and a real-world dialogue intent classification dataset (Chinese). Experiment results show that, on both datasets, our model significantly outperforms existing state-of-the-art models and improves the average accuracy by more than 3%, which proves the effectiveness of class-wise generalization in few-shot text classification.
Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to the separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to utilize complementary information across views and address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We are among the first to explore exploiting inter-instance relationships to drive learning. We experimentally evaluate our representations on the downstream task of action recognition. Our method achieves competitive performance on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships.
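A simplified sketch of the cooperative idea in that abstract: neighbours found in one view (here assumed to be optical flow) are treated as positives for a contrastive loss computed in another view (RGB). The function name, temperature, and neighbour rule are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def coop_contrastive(z_rgb: torch.Tensor, z_flow: torch.Tensor, tau: float = 0.1):
    """z_rgb, z_flow: (n, d) embeddings of the same n clips in two views."""
    z_rgb = F.normalize(z_rgb, dim=1)
    z_flow = F.normalize(z_flow, dim=1)
    n = z_rgb.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    sim_rgb = (z_rgb @ z_rgb.t() / tau).masked_fill(eye, float('-inf'))   # RGB logits, self excluded
    flow_sim = (z_flow @ z_flow.t()).masked_fill(eye, -1.0)
    pos = flow_sim.argmax(dim=1)          # nearest neighbour in the flow view is the positive
    return F.cross_entropy(sim_rgb, pos)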
As a subset of unsupervised representation learning, self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as object detection and image captioning. Many proposed approaches for self-supervised learning naturally follow a multi-view perspective, where the input (e.g., original images) and the self-supervised signals (e.g., augmented images) can be seen as two redundant views of the data. Building from this multi-view perspective, this paper provides an information-theoretical framework to better understand the properties that encourage successful self-supervised learning. Specifically, we demonstrate that self-supervised learned representations can extract task-relevant information and discard task-irrelevant information. Our theoretical framework paves the way to a larger space of self-supervised learning objective design. In particular, we propose a composite objective that bridges the gap between prior contrastive and predictive learning objectives, and introduce an additional objective term to discard task-irrelevant information. To verify our analysis, we conduct controlled experiments to evaluate the impact of the composite objectives. We also explore our framework's empirical generalization beyond the multi-view perspective, where the cross-view redundancy may not be clearly observed.
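A hedged sketch of a composite objective in the spirit described above: an InfoNCE (contrastive) term between two view embeddings plus a predictive (reconstruction) term from one view's representation to the other view. The encoder/predictor outputs, temperature, and weightings are assumptions made for illustration.

import torch
import torch.nn.functional as F

def composite_loss(z1, z2, x2_pred, x2, alpha: float = 1.0, beta: float = 1.0):
    """z1, z2: (n, d) view embeddings; x2_pred: prediction of view 2 produced from z1."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / 0.1
    targets = torch.arange(z1.size(0))
    contrastive = F.cross_entropy(logits, targets)   # keep information shared across views
    predictive = F.mse_loss(x2_pred, x2)             # predict the other view from this one
    return alpha * contrastive + beta * predictive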
Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors and get combined based on factors they share. This principle has motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We empirically show on two versions of multimodal MNIST and a multimodal brain imaging dataset that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task, and (3) maximization of the similarity between representations has a regularizing effect on a neural network, which sometimes can lead to reduced downstream performance but still can reveal multimodal relations. Consequently, we outperform previous unsupervised encoder-decoder methods based on CCA or the variational mixture MMVAE on various datasets under a linear evaluation protocol.
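Purely as an illustration of composing several pairwise similarity objectives between model components, the sketch below sums a contrastive term over every pair of components (here assumed to be two modality encoders and a fused representation). The specific pairing and loss are assumptions, not the composition the paper recommends.

import itertools
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.1):
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    return F.cross_entropy(a @ b.t() / tau, torch.arange(a.size(0)))

def multimodal_loss(components: dict) -> torch.Tensor:
    """components: name -> (n, d) tensor, e.g. {'image': zi, 'audio': za, 'fused': zf}."""
    pairs = itertools.combinations(components.values(), 2)
    return sum(info_nce(a, b) + info_nce(b, a) for a, b in pairs)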
We study few-shot learning in natural language domains. Compared to many existing works that apply either metric-based or optimization-based meta-learning to the image domain with low inter-task variance, we consider a more realistic setting where tasks are diverse. However, this diversity imposes tremendous difficulties on existing state-of-the-art metric-based algorithms, since a single metric is insufficient to capture complex task variations in the natural language domain. To alleviate the problem, we propose an adaptive metric learning approach that automatically determines the best weighted combination from a set of metrics obtained on meta-training tasks for a newly seen few-shot task. Extensive quantitative evaluations on real-world sentiment analysis and dialog intent classification datasets demonstrate that the proposed method performs favorably against state-of-the-art few-shot learning algorithms in terms of predictive accuracy. We make our code and data available for further study.
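A rough sketch of the combination step described above: distances from queries to support items under several meta-trained metrics are mixed with task-specific weights. In the paper the weights are determined automatically per task; here they are simply passed in, and the shapes and softmax mixing are assumptions for illustration.

import torch

def combine_metrics(dists: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """
    dists: (m, q, k) distances from q queries to k support samples under m metrics.
    weights: (m,) task-specific mixing logits (assumed given for this sketch).
    Returns the (q, k) distances under the weighted combination of metrics.
    """
    w = torch.softmax(weights, dim=0)            # convex combination of the m metrics
    return torch.einsum('m,mqk->qk', w, dists)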
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. Second, given the generality of the approach, we try to realize further gains with minor modifications. We show that learning additional invariances -- through the use of multi-scale cropping, stronger augmentations and nearest neighbors -- improves the representations. Finally, we observe that MoCo learns spatially structured representations when trained with a multi-crop strategy. The representations can be used for semantic segment retrieval and video instance segmentation without finetuning. Moreover, the results are on par with specialized models. We hope this work will serve as a useful study for other researchers. The code and models will be available at this https URL.
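A minimal multi-crop augmentation sketch in the spirit of the "learning additional invariances" point above: each image yields two large crops and several smaller crops, all treated as views of the same instance. The crop sizes and scale ranges are illustrative assumptions, not the paper's exact configuration.

from torchvision import transforms

large = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),   # global views
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
small = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),    # local views
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(img, n_small: int = 4):
    """Return two global views plus `n_small` local views of one PIL image."""
    return [large(img), large(img)] + [small(img) for _ in range(n_small)]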