With large-scale visual datasets like ImageNet [1] and CIFAR-10 [2], most computer vision tasks are solved by supervised learning methods. Such an approach requires large amounts of labeled data, which is expensive to obtain. It is impractical to hire experts to label thousands of images, and institutions and companies are unwilling to share their data due to its commercial value. To overcome this issue, researchers have focused on semi-supervised and unsupervised methods.
In unsupervised learning, the model tries to uncover naturally occurring patterns in the data without supervision, i.e., without pre-labeled samples. These models can be used in a generative sense or combined with supervised methods to improve performance on recognition tasks. In both cases, the model learns a latent space representation, i.e., a hyperplane on which moving in a specific direction changes the samples in a specific manner. In generative tasks, the latent space representation is used as the starting point for generating random samples. In recognition tasks, by contrast, task-specific layers are attached to the model, and the representation is desired to be invariant to a pre-defined set of augmentations. However, enforcing this invariance discards augmentation-specific information, which reduces how well the model generalizes. Hence, researchers try to prevent this information loss and obtain more generalizable representations.
Another crucial question to answer before using unsupervised learning methods is when it is possible to recover the independent factors that specify movement directions on the hyperplane. While training an unsupervised model, one must make strong assumptions to ensure that the latent variables do not move away from the desired independent factors, i.e., that they remain disentangled. Some of the most widely made assumptions are the smoothness, low-density, and manifold assumptions [3]. Several methods that try to minimize the information loss due to augmentation invariance, and that propose generalized assumptions for disentanglement, are covered in the following sections.
Augmentation Invariant Feature Representations
Many unsupervised representation learning models try to solve the instance discrimination task, where several transformations are applied to each image in the dataset. The model is expected to treat each image, together with its transformed versions, as a class of its own that is distinct from every other image. One approach to solving this task is to use a contrastive loss. MOCO [4] randomly transforms the image twice and takes the two results as the positive pair for the contrastive loss. Then, another image from the dataset is selected as the negative sample. A queue of encoded negative samples is kept to reduce the GPU memory requirement. Since the memory requirement for storing encodings of every sample is large, MOCO divides the data into mini-batches, which may cause inconsistencies between the encodings of different batches. Therefore, the encoder that produces the queued samples is updated with momentum, i.e., as a slowly moving average of the main encoder, so that the encoders used for different mini-batches do not differ much. Once the model is trained, different heads are attached, and the combined model is fine-tuned for the specific task. It is shown that MOCO improves results for various tasks, from object detection to segmentation and pose estimation.
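The contrastive loss and the momentum update described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the MOCO implementation: the function names `info_nce` and `momentum_update`, the use of plain lists for embeddings and parameters, and the default values of `temperature` and `momentum` are all chosen for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss for one query, one positive key, and queued negatives."""
    q = normalize(query)
    # Cosine similarities scaled by the temperature; the positive is at index 0.
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    # Numerically stable negative log-softmax of the positive logit.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


def momentum_update(key_params, query_params, momentum=0.999):
    """Move the key encoder's parameters slowly toward the query encoder's."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]
```

The loss is small when the query is closer to its positive than to any queued negative, and a momentum value close to 1 keeps the encodings in the queue comparable across mini-batches.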
In 2021, Lee et al. [5] improved the results of MOCO by introducing an auxiliary self-supervised loss. With this modification, the model learns to predict the difference between the augmentation parameters of two randomly augmented versions of an image. Although the modified model, MOCO + AugSelf, performs better on 10 out of 11 datasets, a separate term must be added to the loss for each augmentation type used to train the model. Consequently, defining a differentiable loss for some augmentation types, such as GAN-based ones, might not be possible.
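The structure of such an auxiliary objective can be sketched as follows. This is a hedged illustration only: the function name `augself_total_loss`, the dictionary layout keyed by augmentation name, the squared-error penalty, and the single `weight` are assumptions made for the example and are not taken from Lee et al. [5].

```python
def augself_total_loss(contrastive_loss, predicted_diffs,
                       params_view1, params_view2, weight=1.0):
    """Add one auxiliary term per augmentation type to a contrastive loss.

    predicted_diffs, params_view1, and params_view2 map an augmentation
    name (hypothetical, e.g. "crop") to a list of numbers.
    """
    total = contrastive_loss
    for name, prediction in predicted_diffs.items():
        # Target: difference between the two views' augmentation parameters.
        target = [a - b for a, b in zip(params_view1[name], params_view2[name])]
        # Assumed penalty: mean squared error between prediction and target.
        error = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
        total += weight * error
    return total
```

The loop makes the limitation visible: every augmentation type needs its own parameterization and its own penalty, which is straightforward for a crop box but unclear for augmentations without explicit parameters.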
Figure 1. Contrastive instance learning on the left vs. Swapping Assignments between Views (SwAV) on the right. [6]
Since comparing all pairwise combinations on large datasets is impractical, methods like MOCO approximate the loss by reducing the number of negative samples to a random subset of images. One may also approximate the task by relaxing the instance discrimination problem into a clustering problem. However, clustering-based methods that follow this approach do not scale well, as the model requires a pass over the entire dataset to assign the clusters that are used as targets. SwAV proposes solving the “swapped” prediction problem, where a view's cluster assignment is predicted from another view's representation [6]. The relation between contrastive learning and swapped prediction is given in Figure 1. In contrastive learning, different views of the same image are compared directly, whereas in SwAV, a code is obtained for each view by matching its features to a set of prototypes. Then, the prediction task is swapped: the code obtained from one augmented view is predicted using the other view. This method improves the results of MOCO by 15% in top-1 accuracy for linear classification tasks on ImageNet.
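The swapped prediction problem can be sketched in a few lines. The example below is a simplified pure-Python illustration, not the SwAV implementation: it takes prototype scores (similarities between each view's features and each prototype) as given, computes codes with a few Sinkhorn-style normalization steps so that prototypes are used evenly within the batch, and then predicts each view's code from the other view. The function names and the values of `epsilon`, `temperature`, and `iterations` are assumptions made for the example.

```python
import math


def softmax(scores, temperature):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sinkhorn_codes(scores, epsilon=0.05, iterations=3):
    """Turn prototype scores (batch x prototypes) into balanced soft codes."""
    batch, n_protos = len(scores), len(scores[0])
    q = [[math.exp(s / epsilon) for s in row] for row in scores]
    for _ in range(iterations):
        # Give every prototype (column) the same total mass.
        col_sums = [sum(row[k] for row in q) for k in range(n_protos)]
        q = [[row[k] / (col_sums[k] * n_protos) for k in range(n_protos)]
             for row in q]
        # Give every sample (row) the same total mass.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * batch) for v in row]
             for i, row in enumerate(q)]
    # Rescale so that each sample's code sums to 1.
    return [[v * batch for v in row] for row in q]


def cross_entropy(code, probs):
    return -sum(c * math.log(p) for c, p in zip(code, probs))


def swapped_prediction_loss(scores_a, scores_b, temperature=0.1):
    """Predict view b's code from view a's scores, and vice versa."""
    codes_a, codes_b = sinkhorn_codes(scores_a), sinkhorn_codes(scores_b)
    total = 0.0
    for i in range(len(scores_a)):
        probs_a = softmax(scores_a[i], temperature)
        probs_b = softmax(scores_b[i], temperature)
        total += cross_entropy(codes_b[i], probs_a)
        total += cross_entropy(codes_a[i], probs_b)
    return total / len(scores_a)
```

Because the targets are codes computed within the current batch, no pass over the entire dataset is needed, and no pairwise comparison between images is made.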
While contrastive self-supervised learning models learn global (image-level) representations, many visual understanding tasks require dense (pixel-level) representations. Pinheiro et al. [7] propose View-Agnostic Dense Representation (VADeR), which learns pixelwise representations. VADeR relies on perceptual constancy, meaning that local visual representations should remain constant over different viewing conditions. Similar to the regular contrastive learning scheme, different views of an image are generated, and another image is selected randomly as the negative sample. However, instead of directly comparing the embeddings of these three samples, VADeR decodes the embeddings and makes the comparisons on pixel-level representations. Whereas SwAV is evaluated on linear classification, Pinheiro et al. show that VADeR improves the results of MOCO for semantic segmentation and keypoint detection.
Latent Variable Disentanglement
While unsupervised representation learning methods are useful for various tasks in computer vision, explaining the learned representations is not trivial. Several studies have tried to build disentanglement approaches on VAEs [8] and GANs [9] by introducing a loss term for regularization. However, the losses such models use to generate realistic images compete with the additional disentanglement regularization loss, and the performance of such models on downstream tasks deteriorates [10]. Moreover, the recent literature suggests that disentanglement is impossible without additional strong assumptions on the architecture [11].
Figure 2. Images generated by CD-GAN on Fashion-MNIST. Each row corresponds to one latent factor value. The results of the model trained in the unsupervised setting are on the left. The images in each row have the same category. The results of the model trained with few labels are on the right. [12]
For example, contrastive disentanglement in generative adversarial networks (CD-GAN) [12] utilizes the correlation between the class of the generated images and the latent code. The model is forced to embed the class-related features so that they are as similar as possible. Also, the intra-class variation and the latent code, which are assumed to be independent, are kept as distinct factors by enforcing that the content does not change with the class. During training, Pan et al. use two strategies for constructing positive and negative pairs: (i) for each image, augmented views are taken as positive samples, and the other images are regarded as negative samples; (ii) the image-level class labels are used as the basis for constructing positive and negative pairs, as in supervised contrastive learning. For each strategy, the number of positive pairs in the contrastive loss formulation is increased to weaken the contrast. CD-GAN results show that the performance deteriorates as the dataset becomes more complex. Figure 2 presents some images generated by CD-GAN in the unsupervised and supervised settings. The results show that more obvious disentanglement is achieved in the supervised setting. Thus, considering these results, the researchers claim that disentanglement learning should have at least a limited amount of supervision.
Horan et al. [13] recently showed that the assumptions of local isometry and non-Gaussianity of the factors are sufficient to recover disentangled representations. Local isometry is satisfied if, whenever the latent representation z is changed by ε, the observed vector x also changes by ε. This assumption is often used in manifold learning, and any manifold can be approximated by one that satisfies local isometry. The Hessian Auto-Encoder, which learns the latent representations, is encouraged to satisfy local isometry by an additional term in the loss function. In addition, a variable z is sufficiently non-Gaussian if linear independent component analysis is guaranteed to recover z from infinite training examples. The latent variables are sampled from independent uniform distributions to satisfy this assumption. Hence, Horan et al. [13] combine Hessian Eigenmaps (HLLE) with the fastICA algorithm to show that these two assumptions are sufficient to obtain disentangled representations.
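The local isometry condition can be checked numerically on a toy map. The sketch below is only an illustration of the condition as stated above and is not the Hessian Auto-Encoder or its loss term: the maps `decoder_arc` and `decoder_stretched`, the helper `isometry_ratio`, and the step size are assumptions made for the example.

```python
import math


def decoder_arc(z):
    """Toy map from a 1-D latent to a 2-D observation: a point on the unit circle."""
    return [math.cos(z[0]), math.sin(z[0])]


def decoder_stretched(z):
    """Toy map that doubles every displacement, so it is not locally isometric."""
    return [2.0 * z[0]]


def isometry_ratio(f, z, direction, step=1e-5):
    """Ratio of the change in x to the change in z for a small step along direction.

    A value close to 1 means f preserves small distances at z in that direction.
    """
    shifted = [a + step * d for a, d in zip(z, direction)]
    moved = math.dist(f(shifted), f(z))
    step_length = step * math.sqrt(sum(d * d for d in direction))
    return moved / step_length
```

For example, `isometry_ratio(decoder_arc, [0.3], [1.0])` is close to 1 because the circle is parameterized by arc length, whereas `decoder_stretched` gives a ratio close to 2.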
Discussion
Learning augmentation-invariant representations enables models to reflect the global features of samples. Using a contrastive learning scheme with efficient solutions to the memory requirements on large datasets improves the representation potential and enhances performance on downstream tasks. However, such approaches have two limitations: they do not scale well, and they cannot learn local features. SwAV introduces prototypes and solves the swapped prediction problem to ensure scalability. In contrast, VADeR decodes the representation and calculates a pixelwise loss to ensure local features are kept constant across different views. While both methods improve on the baseline results, each covers and solves only one aspect. Hence, the next step for higher-quality representations should be to utilize the strengths of both approaches by introducing prototypes and the swapped loss to the model at the pixel level.
To obtain human-controllable representations, achieving disentanglement is crucial. Although studies in the literature suggest that disentanglement is possible only with additional strong assumptions on the architecture or with at least limited supervision, it was recently shown that the assumptions of local isometry and non-Gaussianity of the factors are sufficient to acquire disentangled representations [13]. However, this method assumes unlimited training data in theory. When the dataset is relatively small or sparse, the model may be fooled by noise, as it is based on spectral-domain operations such as eigenmaps. A similar issue also holds for graph neural networks. However, recent studies have proposed practical spatial approximations for graph neural networks. Similarly, a shift from spectral operations to spatial approaches could overcome the limitations of current models for disentanglement.
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
[2] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
[3] J. E. Van Engelen and H. H. Hoos, “A survey on semisupervised learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
[4] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
[5] H. Lee, K. Lee, K. Lee, H. Lee, and J. Shin, “Improving transferability of representations via augmentation-aware self-supervision,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[6] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” arXiv preprint arXiv:2006.09882, 2020.
[7] P. O. Pinheiro, A. Almahairi, R. Y. Benmalek, F. Golemo, and A. Courville, “Unsupervised learning of dense visual representations,” arXiv preprint arXiv:2011.05499, 2020.
[8] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2180–2188, 2016.
[10] A. Burns, A. Sarna, D. Krishnan, and A. Maschinot, “Unsupervised disentanglement without autoencoding: Pitfalls and future directions,” arXiv preprint arXiv:2108.06613, 2021.
[11] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” in International Conference on Machine Learning, pp. 4114–4124, PMLR, 2019.
[12] L. Pan, P. Tang, Z. Chen, and Z. Xu, “Contrastive disentanglement in generative adversarial networks,” arXiv preprint arXiv:2103.03636, 2021.
[13] D. Horan, E. Richardson, and Y. Weiss, “When is unsupervised disentanglement possible?” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.