self-attention – Notes of a Neuropsychiatry Amateur

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based neural network architecture for natural language processing tasks such as text classification, question answering, and language inference. One important feature of BERT is its use of word embeddings, which are mathematical representations of words in a continuous vector space.

In BERT, word embeddings are learned during the pre-training phase and are fine-tuned during the task-specific fine-tuning phase. These embeddings are learned by training the model on a large corpus of text, and they are able to capture semantic and syntactic properties of words.

The BERT model architecture is composed of multiple layers of transformer blocks, with the input being a sequence of tokens (e.g., words or subwords) and the output being a contextualized representation of each token in the sequence. The model also includes a pooled output which is used for many down stream task, which are generated by applying a pooling operation over the entire sequence representation.

How does BERT differ from Word2Vec or GloVe?

Training objective: The main difference between BERT and Word2Vec/GloVe is the training objective. BERT is trained to predict missing words in a sentence (masked language modeling) and predict the next sentence (next sentence prediction), this way the model learns to understand the context of the words. Word2Vec and GloVe, on the other hand, are trained to predict a word given its context or to predict the context given a word, this way the model learn the association between words.
Inputs: BERT takes a pair of sentences as input, and learns to understand the relationship between them, Word2Vec and GloVe only take a single sentence or a window of context words as input.
Directionality: BERT is a bidirectional model, meaning that it takes into account the context of a word before and after it in a sentence. This is achieved by training on both the left and the right context of the word. Word2Vec is unidirectional model which can be trained on either the left or the right context, and GloVe is also unidirectional but it is trained on the global corpus statistics.
Pre-training: BERT is a pre-trained model that can be fine-tuned on specific tasks, Word2Vec and GloVe are also pre-trained models, but the main difference is that their pre-training is unsupervised with no downstream task, this means that the fine-tuning of BERT can provide better performance on some tasks because it is pre-trained with the task objective in mind.

BERT Architecture

The core component of BERT is a stack of transformer encoder layers, which are based on the transformer architecture introduced by Vaswani et al. in 2017. Each transformer encoder layer in BERT consists of multiple self-attention heads. The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence when generating the representation of each token. This allows the model to understand the relationships between words in a sentence and to capture the context in which a word is used.

The transformer architecture also includes a feed-forward neural network, which is applied to the output of each self-attention head to produce the final representation of each token.

Transformer Encoder Layer

In the transformer encoder layer, for every input token in a sequence, the self-attention mechanism computes key, value, and query vectors, which are used to create a weighted representation of the input. The key, value and query vectors are computed by applying different linear transformations (matrix multiplications) on the input embeddings, these linear transformations are learned during the training process.

In BERT, the input representations are computed by combining multiple embedding layers. The input is first tokenized into word pieces, which is a technique that allows the model to handle out-of-vocabulary words by breaking them down into subword units. The tokenized input is then passed through three embedding layers:

Token Embedding: Each word piece is represented as a token embedding
Position Embedding: Each word piece is also represented by a position embedding, which encodes information about the position of the word piece in the input sequence.
Segment Embedding: BERT also uses a segment embedding to represent the input segments, this embedding helps BERT to distinguish between the two sentences when the input is a pair of sentences.

These embeddings are concatenated or added together to obtain a fixed-length vector representation of each word piece. Special tokens [CLS] and [SEP] are used to indicate the beginning and the end of the input segments and the classification prediction respectively, [CLS] token is used as the representation of the entire input sequence, which is used in the classification tasks, while [SEP] token is used to separate the input segments, in the case of the input being a pair of sentences.

Masked Language Modeling

If you try to predict each word of the input sequence using the training data with cross-entropy loss, the learning task becomes trivial for the network. Since the network knows beforehand what it has to predict, it can easily learn weights to reach a 100% classification accuracy.

The masked language modeling (MLM) approach, also known as the “masked word prediction” task, addresses this problem by randomly masking a portion of the input words (e.g., 15%) during training and requiring the network to predict the original value of the masked words based on the context. By masking a portion of the input words, the network is forced to understand the context of the words and to learn meaningful representations of the input.

In the MLM approach, the network is only required to predict the value of the masked words, and the loss is calculated only over the masked words. This means that the model is not learning to predict words it has already seen, but instead, it is learning to predict words it hasn’t seen while seeing all the context around those words.

In addition to MLM, BERT also uses another objective during pre-training called “next sentence prediction” this objective is a binary classification task that is applied on the concatenation of two sentences, the model is trained to predict whether the second sentence is the real next sentence in the corpus or not. This objective helps BERT to understand the relationship between two sentences and how they are related.

Tag: self-attention

NLP – Word Embeddings – BERT