NLP – Word Embeddings – ELMo

ELMo (Embeddings from Language Models) is a deep learning approach for representing words as vectors (also called word embeddings). It was developed by researchers at Allen Institute for Artificial Intelligence and introduced in a paper published in 2018.

ELMo represents words as contextualized embeddings, meaning that the embedding for a word can change based on the context in which it is used. For example, the word “bank” could have different embeddings depending on whether it is used to refer to a financial institution or the edge of a river.

ELMo has been shown to improve the performance of a variety of natural language processing tasks, including language translation, question answering, and text classification. It has become a popular approach for representing words in NLP models, and the trained ELMo embeddings are freely available for researchers to use.

How does ELMo differ from Word2Vec or GloVe?

ELMo (Embeddings from Language Models) is a deep learning approach for representing words as vectors (also called word embeddings). It differs from other word embedding approaches, such as Word2Vec and GloVe, in several key ways:

  • Contextualized embeddings: ELMo represents words as contextualized embeddings, meaning that the embedding for a word can change based on the context in which it is used. In contrast, Word2Vec and GloVe represent words as static embeddings, which do not take into account the context in which the word is used.
  • Deep learning approach: ELMo uses a deep learning model, specifically a bidirectional language model, to generate word embeddings. Word2Vec and GloVe, on the other hand, use more traditional machine learning approaches based on a neural network (Word2Vec) and matrix factorization (GloVe).

To generate context-dependent embeddings, ELMo uses a bi-directional Long Short-Term Memory (LSTM) network trained on a specific task (such as language modeling or machine translation). The LSTM processes the input sentence in both directions (left to right and right to left) and generates an embedding for each word based on its context in the sentence.

Overall, ELMo is a newer approach for representing words as vectors that has been shown to improve the performance of a variety of natural language processing tasks. It has become a popular choice for representing words in NLP models.

What is the model for training ELMo word embeddings?

The model used to train ELMo word embeddings is a bidirectional language model, which is a type of neural network that is trained to predict the next word in a sentence given the context of the words that come before and after it. To train the ELMo model, researchers at Allen Institute for Artificial Intelligence used a large dataset of text, such as news articles, books, and websites. The model was trained to predict the next word in a sentence given the context of the words that come before and after it. During training, the model learns to represent words as vectors (also called word embeddings) that capture the meaning of the word in the context of the sentence.

Explain in details the bidirectional language model

A bidirectional language model is a type of neural network that is trained to predict the next word in a sentence given the context of the words that come before and after it. It is called a “bidirectional” model because it takes into account the context of words on both sides of the word being predicted.

To understand how a bidirectional language model works, it is helpful to first understand how a unidirectional language model works. A unidirectional language model is a type of neural network that is trained to predict the next word in a sentence given the context of the words that come before it.

A unidirectional language model can be represented by the following equation:

P(w[t] | w[1], w[2], …, w[t-1]) = f(w[t-1], w[t-2], …, w[1])

This equation says that the probability of a word w[t] at time t (where time is the position of the word in the sentence) is determined by a function f of the words that come before it (w[t-1], w[t-2], …, w[1]). The function f is learned by the model during training.

A bidirectional language model extends this equation by also taking into account the context of the words that come after the word being predicted:

P(w[t] | w[1], w[2], …, w[t-1], w[t+1], w[t+2], …, w[n]) = f(w[t-1], w[t-2], …, w[1], w[t+1], w[t+2], …, w[n])

This equation says that the probability of a word w[t] at time t is determined by a function f of the words that come before it and the words that come after it. The function f is learned by the model during training.

In practice, a bidirectional language model is implemented as a neural network with two layers: a forward layer that processes the input words from left to right (w[1], w[2], …, w[t-1]), and a backward layer that processes the input words from right to left (w[n], w[n-1], …, w[t+1]). The output of these two layers is then combined and used to predict the next word in the sentence (w[t]). The forward and backward layers are typically implemented as recurrent neural networks (RNNs) or long short-term memory (LSTM) networks, which are neural networks that are designed to process sequences of data.

During training, the bidirectional language model is fed a sequence of words and is trained to predict the next word in the sequence. The model uses the output of the forward and backward layers to generate a prediction, and this prediction is compared to the actual next word in the sequence. The model’s weights are then updated to minimize the difference between the prediction and the actual word, and this process is repeated for each word in the training dataset. After training, the bidirectional language model can be used to generate word embeddings by extracting the output of the forward and backward layers for each word in the input sequence.

ELMo model training algorithm

  1. Initialize the word vectors:
  • The word vectors are usually initialized randomly using a Gaussian distribution.
  • Alternatively, you can use pre-trained word vectors such as Word2Vec or GloVe.
  1. Process the input sequence:
  • Input the sequence of words w[1], w[2], ..., w[t-1] into the forward layer and the backward layer.
  • The forward layer processes the words from left to right, and the backward layer processes the words from right to left.
  • Each layer has its own set of weights and biases, which are updated during training.
  1. Compute the output:
  • The output of the forward layer and the backward layer are combined to form the final output o[t].
  • The final output is used to predict the next word w[t].
  1. Compute the loss:
  • The loss is computed as the difference between the predicted word w[t] and the true word w[t].
  • The loss function is usually the cross-entropy loss, which measures the difference between the predicted probability distribution and the true probability distribution.
  1. Update the weights and biases:
  • The weights and biases of the forward and backward layers are updated using gradient descent and backpropagation.
  1. Repeat steps 2-5 for all words in the input sequence.

ELMo generates contextualized word embeddings by combining the hidden states of a bi-directional language model (BLM) in a specific way.

The BLM consists of two layers: a forward layer that processes the input words from left to right, and a backward layer that processes the input words from right to left. The hidden state of the BLM at each position t is a vector h[t] that represents the context of the word at that position.

To generate the contextualized embedding for a word, ELMo concatenates the hidden states from the forward and backward layers and applies a weighted summation. The hidden states are combined using a task-specific weighting of all biLM layers. The weighting is controlled by a set of learned weights γ_task and a bias term s_task. The ELMo embeddings for a word at position k are computed as a weighted sum of the hidden states from all L layers of the biLM:

ELMo_task_k = E(R_k; Θtask) = γ_task_L * h_LM_k,L + γ_task{L-1} * h_LM_k,{L-1} + … + γ_task_0 * h_LM_k,0 + s_task

Here, h_LM_k,j represents the hidden state at position k and layer j of the biLM, and γ_task_j and s_task are the task-specific weights and bias term, respectively. The task-specific weights and bias term are learned during training, and are used to combine the hidden states in a way that is optimal for the downstream task.

Using ELMo for NLP tasks

ELMo can be used to improve the performance of supervised NLP tasks by providing context-dependent word embeddings that capture not only the meaning of the individual words, but also their context in the sentence.

To use a pre-trained bi-directional language model (biLM) for a supervised NLP task, the first step is to run the biLM and record the layer representations for each word in the input sequence. These layer representations capture the context-dependent information about the words in the sentence, and can be used to augment the context-independent token representation of each word.

In most supervised NLP models, the lowest layers are shared across different tasks, and the task-specific information is encoded in the higher layers. This allows ELMo to be added to the model in a consistent and unified manner, by simply concatenating the ELMo embeddings with the context-independent token representation of each word.

The model then combines the ELMo embeddings with the context-independent token representation to form a context-sensitive representation h_k, typically using either bidirectional RNNs, CNNs, or feed-forward networks. The context-sensitive representation h_k is then used as input to the higher layers of the model, which are task-specific and encode the information needed to perform the target NLP task. It can be helpful to add a moderate amount of dropout to ELMo and to regularize the ELMo weights by adding a regularization term to the loss function. This can help to prevent overfitting and improve the generalization ability of the model.