Improving Few-Shot Learning with Auxiliary Self-Supervised Pretext Tasks
Figure 1: We train the embedding model with both annotated images and unlabeled images in a multi-task setting. Self-supervised tasks such as rotation prediction or representation prediction (BYOL) act as data-dependent regularizers for the shared feature extractor. Although additional unlabeled data can be used for the self-supervised tasks, in this work we sample images from the annotated set.
Recent work on few-shot learning (Tian et al., 2020a) showed that the quality of learned representations plays an important role in few-shot classification performance. On the other hand, the goal of self-supervised learning is to recover useful semantic information about the data without the use of class labels. In this work, we exploit the complementarity of both paradigms via a multi-task framework where we leverage recent self-supervised methods as auxiliary tasks. We found that combining multiple tasks is often beneficial, and that solving them simultaneously can be done efficiently. Our results suggest that self-supervised auxiliary tasks are effective data-dependent regularizers for representation learning.
In this section, we first describe in §3.1 the few-shot learning problem addressed and introduce in §3.2 the proposed multi-task approach to improve few-shot performance with self-supervised auxiliary tasks.
Standard few-shot learning benchmarks evaluate models in episodes of N-way, K-shot classification tasks. Each task consists of a small number N of classes with K training examples per class. Meta-learning approaches for few-shot learning aim to minimize the generalization error across tasks sampled from a task distribution. This can be thought of as learning over a collection of tasks, commonly referred to as the meta-training set.
In practice, a task is constructed on the fly during the meta-training phase and sampled as follows. For each task, N classes from the set of training classes are first sampled (with replacement), from which the training (support) set $\mathcal{D}_i^{\text{train}}$ of K images per class is sampled, and finally the test (query) set $\mathcal{D}_i^{\text{test}}$ consisting of Q images per class is sampled. The support set is used to learn how to solve this specific task, and the additional examples from the query set are used to evaluate the performance for this task. Once the meta-training phase of a model is finished, its performance is evaluated on a set of held-out tasks $\mathcal{S} = \{(\mathcal{D}_j^{\text{train}}, \mathcal{D}_j^{\text{test}})\}_{j=1}^{J}$, called the meta-test set. During meta-training, an additional held-out meta-validation set can be used for hyperparameter selection and model selection. Training examples $\mathcal{D}^{\text{train}} = \{(x_t, y_t)\}_{t=1}^{T}$ and testing examples $\mathcal{D}^{\text{test}} = \{(x_q, y_q)\}_{q=1}^{Q}$ are sampled from the same distribution, and are mapped to a feature space using an embedding model $F_\theta$. A base learner is trained on $\mathcal{D}^{\text{train}}$ and used as a predictor on $\mathcal{D}^{\text{test}}$.
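The episode construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `sample_episode`, the dictionary layout `images_by_class`, and the toy dataset are all hypothetical. It also assumes that the N classes are distinct within a single task (classes can still recur across tasks) and that support and query images of a class do not overlap.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, q_queries, rng=random):
    """Sample one N-way, K-shot task: a support set and a disjoint query set."""
    # Assumption: the N classes are distinct within one task.
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K + Q distinct images so support and query never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + q_queries)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Hypothetical toy dataset: 10 classes with 20 image identifiers each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, q_queries=15,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 75
```

Labels are re-indexed from 0 to N-1 inside each episode, since a base learner trained on the support set only needs to distinguish the classes of the current task.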
In BYOL (Grill et al., 2020), the online network directly predicts the output of one view from another view given by the target network. Essentially, this is a representation prediction task in the latent space, similar to contrastive learning except that it relies only on positive pairs. In this task, the online network is composed of the shared encoder, the MLP projection head and the predictor (also an MLP). The target network has the same architecture as the online network (minus the predictor), but its parameters are an exponential moving average (EMA) of the online network parameters, as illustrated in Figure 1. Denoting the parameters of the online network as θ, those of the target network as ξ and the target decay rate τ ∈ [0, 1), the update rule for ξ is:

$$\xi \leftarrow \tau \xi + (1 - \tau)\,\theta$$
The self-supervised loss is the mean squared error between the normalized predictions and target projections as defined in Grill et al. (2020). Effectively, this task enforces the representations for different views of positive pairs to be closer together in latent space, which provides transformation invariance to the pre-defined set of data augmentations used for BYOL.
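To make the two ingredients of the BYOL task concrete, the sketch below shows the EMA update of the target parameters and the loss between normalized vectors, using plain Python lists as stand-ins for tensors. It assumes the standard update from Grill et al. (2020), ξ ← τξ + (1 − τ)θ, and sums the squared error over feature dimensions, which equals 2 − 2·cos(prediction, target). The function names are illustrative and not taken from the paper.

```python
import math

def ema_update(target_params, online_params, tau):
    """Return updated target parameters: xi <- tau * xi + (1 - tau) * theta."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    return [tau * xi + (1.0 - tau) * theta
            for xi, theta in zip(target_params, online_params)]

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def byol_loss(prediction, target_projection):
    """Squared error between L2-normalized vectors (2 - 2 * cosine similarity)."""
    p = l2_normalize(prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))

# One EMA step with tau = 0.5 moves the target halfway toward the online weights.
print(ema_update([0.0, 0.0], [1.0, -2.0], tau=0.5))  # [0.5, -1.0]

# Vectors pointing in the same direction give zero loss regardless of scale.
print(byol_loss([3.0, 4.0], [6.0, 8.0]))  # 0.0

# Orthogonal vectors give the value 2 (cosine similarity of zero).
print(byol_loss([1.0, 0.0], [0.0, 1.0]))  # 2.0
```

Because both vectors are normalized first, the loss depends only on the angle between prediction and target, which is what pulls the representations of two views of the same image together.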
In §4.4, we explore this more efficient setting by combining the supervised and BYOL tasks using the stronger data augmentation strategy from BYOL in both. More concretely, we generate one augmented view of an input image and compute the supervised loss on it, while a second augmented view of the same input is generated to solve the representation prediction task in BYOL.
To ensure that the performance improvement from BYOL is not solely due to the stronger data augmentation strategy used by the task, we conduct experiments using the same data augmentations for the supervised baseline. An additional experiment without data augmentation for the supervised task is presented in Appendix A. On both CIFAR-FS (Table 1) and miniImageNet (Table 2), we find that stronger data augmentation improves the supervised baseline. This is in line with much recent work on strong data augmentation techniques (DeVries & Taylor, 2017; Zhang et al., 2018; Yun et al., 2019; Cubuk et al., 2019; 2020). Effectively, data augmentation is an important regularization technique that has been shown to improve generalization. Furthermore, we show that in this setting the addition of BYOL as a self-supervised auxiliary task still boosts performance. As mentioned in §3.2, when used in combination with BYOL, both tasks share the same transformed inputs as part of our multi-task framework. Additional experiments in which both augmented views are used to compute both the supervised and BYOL losses can be found in Table 4 (Appendix C).