BERT

BERT (Bidirectional Encoder Representations from Transformers) #


Before BERT: #

  • LSTMs were used for language modeling and other NLP tasks.

  • Problems:

    • Slow, since words are processed one at a time (sequentially)
    • Not truly bidirectional (a bidirectional LSTM runs a left-to-right pass and a right-to-left pass separately and then combines them, rather than seeing both directions at once)
  • BERT architecture: multiple Transformer encoder layers stacked on top of each other (a small sketch of this layer stack appears at the end of this list)

  • BERT is trained in two phases: pretraining and fine-tuning

  • Pretraining teaches the model the language and its context. It is done using two tasks:

    • Masked Language Model (MLM):

      • Random tokens in the input are replaced with a [MASK] token (fill in the blanks), and the model predicts the original tokens from the surrounding words

      • Helps BERT learn the bidirectional context of a sentence

    • Next Sentence Prediction (NSP):

      • Predict whether a given sentence actually follows the current sentence; a binary classification task.

      • It helps BERT understand context across different sentences.

    • Usually, the MLM and NSP tasks are trained simultaneously during pretraining (a minimal MLM sketch is given after this list).

  • Fine-tuning:

    • Fine-tune the pretrained model on task-specific data.

    • Fast and compute-efficient compared to pretraining from scratch

    • Only the last few layers of the original architecture are replaced or extended with a small task-specific head; the pretrained encoder weights are reused (see the fine-tuning sketch below)
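
A small sketch of the stacked-encoder idea, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (neither is named in the notes above); it simply loads a pretrained BERT encoder and inspects how many encoder layers are stacked.

```python
# Sketch: BERT is a stack of identical Transformer encoder layers.
# Assumes the Hugging Face `transformers` library; the model name is illustrative.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

print(model.config.num_hidden_layers)  # 12 encoder layers in BERT-base
print(model.config.hidden_size)        # 768-dimensional hidden states
print(len(model.encoder.layer))        # the stacked encoder layers themselves
```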
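A minimal sketch of the MLM "fill in the blanks" objective at inference time, again assuming Hugging Face `transformers`; the example sentence and model name are illustrative. NSP can be exercised analogously with `BertForNextSentencePrediction`.

```python
# Sketch: mask a token and let BERT predict it from both left and right context.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # typically "paris"
```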
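A minimal fine-tuning sketch, assuming Hugging Face `transformers` and a toy two-sentence sentiment dataset invented purely for illustration; `BertForSequenceClassification` stacks a small, randomly initialized classification head on the pretrained encoder, matching the "replace only the last layers" idea above.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# A fresh classification head is added on top of the pretrained encoder.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy task-specific data (placeholder labels: 1 = positive, 0 = negative).
texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes are usually enough when fine-tuning
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)  # loss is computed internally
    outputs.loss.backward()
    optimizer.step()
```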