A Cookbook of Self-Supervised Learning: #
Initial Notes from: https://arxiv.org/abs/2304.12210
Intro: #
- NLP advanced largely due to SSL -> no need for labelled data to train the model, unlike supervised learning
- SSL -> define a pretext task -> train on unlabelled data -> learn a useful representation
- NLP: Word2Vec is SSL: in a sentence, predict a word from its surrounding words (CBOW) or the surrounding words from a given word (skip-gram); either way the model learns context (a toy skip-gram sketch follows this list)
- CV: 2 currently popular ways:
  - mask a patch and predict the masked patch (see the masked-autoencoding sketch in the Origin section below)
  - create two augmented views of the same sample -> train the model so that the embeddings of these two views are closer to each other than to the embedding of any other image (joint-embedding sketch below)
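A minimal sketch of the Word2Vec-style pretext task, assuming PyTorch and a toy corpus; the embedding size, window, and training loop are illustrative choices, not from the paper:

```python
# Minimal skip-gram sketch: given a center word, predict a surrounding word.
# Corpus, vocab, and dimensions are toy values for illustration.
import torch
import torch.nn as nn

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

# Build (center, context) training pairs with a window of 1.
pairs = []
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            pairs.append((word2id[w], word2id[corpus[j]]))

embed = nn.Embedding(len(vocab), 16)      # learned word vectors
out = nn.Linear(16, len(vocab))           # scores over the context vocabulary
opt = torch.optim.Adam(list(embed.parameters()) + list(out.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for _ in range(100):
    opt.zero_grad()
    logits = out(embed(centers))          # predict the context word from the center word
    loss = loss_fn(logits, contexts)
    loss.backward()
    opt.step()
# After training, the rows of embed.weight are context-aware word representations.
```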
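And a sketch of the second CV recipe, using a SimCLR-style InfoNCE loss as one common instance of "pull two views of the same image together"; the encoder and random tensors are stand-ins, not the paper's exact setup:

```python
# Joint-embedding sketch: embeddings of two augmented views of the same image
# should be more similar to each other than to embeddings of other images.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (N, N) cosine similarities
    targets = torch.arange(z1.size(0))        # positive pair sits on the diagonal
    # Cross-entropy pulls matching views together, pushes mismatched ones apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: any encoder producing (N, D) embeddings works here.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
view1 = torch.rand(8, 3, 32, 32)                   # stand-ins for two augmentations
view2 = view1 + 0.05 * torch.randn_like(view1)
loss = info_nce(encoder(view1), encoder(view2))
loss.backward()
```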
Why SSL is hard and why a cookbook is needed:
- computational cost
- papers rarely detail their implementations and hyperparameters
- no unified vocabulary across methods
Origin of SSL: #
Discussion of several pretext tasks used in the early years of SSL:
- Information restoration:
  - remove something from an image and restore it, or convert the image to grayscale and train a model to predict the colors; this helps the model learn object semantics and boundaries (colorization sketch below)
  - newer method: masked autoencoding (MAE), transformer-based, where image patches are masked and reconstructed (sketched below)
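A minimal sketch of the colorization pretext above, assuming PyTorch; the tiny CNN and plain L2 regression are illustrative (published colorization work often classifies over quantized color bins instead):

```python
# Colorization pretext sketch: input is a grayscale image, target is its colors.
import torch
import torch.nn as nn

colorizer = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),  # predict 3 color channels from 1 gray channel
)
opt = torch.optim.Adam(colorizer.parameters(), lr=1e-3)

rgb = torch.rand(4, 3, 64, 64)                       # unlabelled color images
gray = rgb.mean(dim=1, keepdim=True)                 # "remove" the color information
loss = nn.functional.mse_loss(colorizer(gray), rgb)  # train the model to restore it
loss.backward()
opt.step()
```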
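And a sketch of the masked-autoencoding mechanism: patchify the image, hide a random subset of patches, and reconstruct only the hidden pixels. The small MLP stands in for MAE's transformer encoder-decoder, and zeroing masked patches is a simplification (MAE's encoder actually skips them entirely):

```python
# Masked autoencoding sketch: mask random image patches, reconstruct their pixels.
import torch
import torch.nn as nn

P, MASK_RATIO = 8, 0.75
img = torch.rand(1, 3, 32, 32)

# Split the image into non-overlapping P x P patches: (num_patches, 3*P*P).
patches = img.unfold(2, P, P).unfold(3, P, P).reshape(1, 3, -1, P, P)
patches = patches.permute(0, 2, 1, 3, 4).reshape(-1, 3 * P * P)

n = patches.size(0)
perm = torch.randperm(n)
n_masked = int(MASK_RATIO * n)
masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

model = nn.Sequential(nn.Linear(3 * P * P, 128), nn.ReLU(), nn.Linear(128, 3 * P * P))
inputs = patches.clone()
inputs[masked_idx] = 0.0                      # hide the masked patches (simplified)
recon = model(inputs)
# As in MAE, the loss is computed on the masked patches only.
loss = nn.functional.mse_loss(recon[masked_idx], patches[masked_idx])
loss.backward()
```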
- Video temporal relationships:
  - train a model with a triplet loss so that two representations of the same object in 2 different frames are similar (triplet-loss sketch below)
  - remove the audio track and predict it from the video input
  - predict the depth mapping between unlabelled image pairs
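A sketch of the triplet-loss setup above, assuming PyTorch's built-in `TripletMarginLoss`; the encoder and the "two frames" tensors are stand-ins for real video crops:

```python
# Triplet-loss sketch for temporal consistency: the same object in two nearby
# frames (anchor, positive) should embed closer than an unrelated crop (negative).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
triplet = nn.TripletMarginLoss(margin=1.0)

frame_t = torch.rand(8, 3, 32, 32)                      # object crop at time t
frame_t1 = frame_t + 0.05 * torch.randn_like(frame_t)   # same object at time t+1
other = torch.rand(8, 3, 32, 32)                        # crop of a different object

loss = triplet(encoder(frame_t), encoder(frame_t1), encoder(other))
loss.backward()
```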
- Learning spatial context:
  - randomly rotate the image -> predict the amount of rotation (rotation sketch after this list)
  - Jigsaw: split the image into blocks and form pairs of blocks -> predict the relative position of each pair (see the sketch after this list)
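A sketch of the rotation pretext (RotNet-style): rotate each image by a random multiple of 90 degrees and train a 4-way classifier to recover the rotation; the linear classifier is a stand-in for a real backbone:

```python
# Rotation pretext sketch: apply a random 0/90/180/270-degree rotation and
# train a 4-way classifier to predict which rotation was applied.
import torch
import torch.nn as nn

images = torch.rand(16, 3, 32, 32)
k = torch.randint(0, 4, (16,))                       # rotation label: k * 90 degrees
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(images, k)])

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))
loss = nn.functional.cross_entropy(classifier(rotated), k)
loss.backward()
```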
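And a sketch of the pairwise relative-position task as described in the note: given a center patch and one of its 8 neighbours, predict the neighbour's position. The patch size and MLP head are illustrative; the classic jigsaw variant instead permutes all 9 tiles and predicts the permutation index:

```python
# Relative-position sketch: given a center patch and one of its 8 neighbours,
# predict which neighbour position it came from (8-way classification).
import torch
import torch.nn as nn

P = 8
img = torch.rand(3, 3 * P, 3 * P)                 # a 3x3 grid of P x P patches
patch = lambda r, c: img[:, r * P:(r + 1) * P, c * P:(c + 1) * P].reshape(-1)

center = patch(1, 1)
offsets = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]
label = torch.randint(0, 8, (1,))                 # pick a random neighbour position
r, c = offsets[int(label)]
neighbour = patch(r, c)

# Concatenate the two flattened patches and classify the relative position.
head = nn.Sequential(nn.Linear(2 * 3 * P * P, 64), nn.ReLU(), nn.Linear(64, 8))
logits = head(torch.cat([center, neighbour]).unsqueeze(0))
loss = nn.functional.cross_entropy(logits, label)
loss.backward()
```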