NLP – Word Embeddings – FastText

What is the FastText method for word embeddings?

FastText is a library for efficient learning of word representations and sentence classification. It was developed by Facebook AI Research (FAIR).

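FastText represents each word in a document as a bag of character n-grams. The word is first wrapped in the boundary symbols “<” and “>”, and then every substring of the chosen lengths is extracted. For example, with n = 3 the word “apple” becomes “<apple>” and is represented by the character n-grams “<ap”, “app”, “ppl”, “ple”, “le>”, together with the whole word “<apple>” as a special sequence. This representation has two advantages: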

  1. It can handle spelling mistakes and out-of-vocabulary words. For example, the model would still be able to understand the word “apple” even if it was misspelled as “appel” or “aple”.
  2. It shares information between words that have n-grams in common. This helps with inflected forms of the same word, and related words in languages written in the same script (e.g., English and French) can share part of their representation.

FastText uses a shallow neural network to learn the word representations from this character n-gram representation. It is typically trained using the skip-gram model with negative sampling, similar to word2vec, although the CBOW model can also be used.

FastText can also be used for sentence classification by averaging the word vectors for the words in the sentence and training a linear classifier on top of the averaged vector. It is particularly useful for languages with a large number of rare words, or in cases where using a word’s subwords (also known as substrings or character n-grams) as features can be helpful.

How are word embeddings trained in FastText?

Word embeddings in FastText can be trained using either the skip-gram model or the continuous bag-of-words (CBOW) model.

In the skip-gram model, the goal is to predict the context words given a target word. For example, given the input sequence “I have a dog” and a context window of two words on each side, the goal would be to predict “have” and “a” given the target word “I”, and to predict “I”, “a”, and “dog” given the target word “have”. The skip-gram model learns to predict the context words by minimizing the negative log likelihood of the context words given the target word.

In the CBOW model, the goal is to predict the target word given the context words. For example, given the input sequence “I have a dog” and a context window of two words on each side, the goal would be to predict “I” given the context words “have” and “a”, and to predict “have” given the context words “I”, “a”, and “dog”. The CBOW model learns to predict the target word by minimizing the negative log likelihood of the target word given the context words.

Both the skip-gram and CBOW models are trained using stochastic gradient descent (SGD) and backpropagation to update the model’s parameters. The model is trained by minimizing the negative log likelihood of the words in the training data, given the model’s parameters.
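The difference between the two kinds of training examples can be sketched in a few lines of Python. This is only an illustration: the function names and the window size of 2 are assumptions made for the example, not part of the FastText library.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of index i, excluding i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_examples(tokens, window=2):
    # Skip-gram: one (target, context_word) pair per context word.
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

def cbow_examples(tokens, window=2):
    # CBOW: one (context_words, target) example per position.
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

tokens = "I have a dog".split()
sg = skipgram_examples(tokens)
cb = cbow_examples(tokens)
print(sg[:5])  # pairs for the targets "I" and "have"
print(cb[1])   # the CBOW example whose target is "have"
```

Skip-gram produces one training pair per context word, whereas CBOW produces one example per position, which is one reason CBOW needs fewer updates per sentence.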

Explain how FastText represents each word in a document as a bag of character n-grams

To represent a word as a bag of character n-grams, FastText first adds the boundary symbols “<” and “>” to the start and end of the word, and then breaks it down into overlapping substrings (also known as character n-grams). For example, the word “apple” would be represented by the following character 3-grams (trigrams): [“<ap”, “app”, “ppl”, “ple”, “le>”]. The whole word “<apple>” is also included as a special sequence. The range of n-gram lengths is specified by the user and is typically set to between 3 and 6 characters.

For example, consider the following sentence:

“I have a dog”

If we set the number of characters in each substring to 3 and add the boundary symbols “<” and “>” to each word, FastText would represent each word in the sentence as follows:

“I”: [“<I>”]
“have”: [“<ha”, “hav”, “ave”, “ve>”]
“a”: [“<a>”]
“dog”: [“<do”, “dog”, “og>”]

The use of character n-grams allows FastText to learn good vector representations for rare words, as it can use the vector representations of the character n-grams that make up the rare word to compute its own vector representation. This is particularly useful for handling out-of-vocabulary words that may not have a pre-trained vector representation available.
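The extraction of character n-grams can be sketched as follows. This is a minimal sketch, assuming the boundary symbols “<” and “>” described in the FastText paper; the function name and its `minn`/`maxn` parameters are chosen for this example, and the real library handles some edge cases differently.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of `word` with lengths minn..maxn."""
    wrapped = "<" + word + ">"  # mark the start and end of the word
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("apple", 3, 3))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
print(char_ngrams("dog", 3, 3))    # ['<do', 'dog', 'og>']
```

The boundary symbols let the model distinguish a prefix such as “<do” from the same letters appearing in the middle of another word.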

How are vector representations for each word computed from n-gram vectors?

In FastText, the vector representation for each word is computed as the sum of the vector representations of the character n-grams (subwords) that make up the word. For example, consider the following sentence:

“I have a dog”

If we set the number of characters in each substring to 3 and add the boundary symbols “<” and “>” to each word, FastText would represent each word in the sentence as a bag of character 3-grams (trigrams) as follows:

“I”: [“<I>”]
“have”: [“<ha”, “hav”, “ave”, “ve>”]
“a”: [“<a>”]
“dog”: [“<do”, “dog”, “og>”]

FastText would then learn a vector representation for each character n-gram and use these vector representations to compute the vector representation for each word. For example, the vector representation for the word “have” would be computed as the sum of the vector representations for the character n-grams [“<ha”, “hav”, “ave”, “ve>”] and the vector for the whole word “<have>”.
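The summation can be sketched with a small lookup table. This is an illustration under stated assumptions: the n-grams use the boundary symbols “<” and “>” and include the whole word, as in the FastText paper; the vectors are random placeholders rather than trained values; and `ngram_vectors`, `vector_for`, and `word_vector` are hypothetical names.

```python
import random

DIM = 4                  # embedding size, kept tiny for readability
rng = random.Random(0)   # fixed seed so the example is repeatable
ngram_vectors = {}       # hypothetical table: n-gram -> vector

def char_ngrams(word, n=3):
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

def vector_for(unit):
    # Untrained placeholder: create a random vector on first use.
    if unit not in ngram_vectors:
        ngram_vectors[unit] = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
    return ngram_vectors[unit]

def word_vector(word, n=3):
    wrapped = "<" + word + ">"
    # The whole word plus its n-grams (avoiding a duplicate for short words).
    units = [wrapped] + [g for g in char_ngrams(word, n) if g != wrapped]
    total = [0.0] * DIM
    for unit in units:
        total = [t + x for t, x in zip(total, vector_for(unit))]
    return total

print(word_vector("have"))
print(word_vector("haves"))  # unseen word, still gets a vector
```

Because “haves” shares the n-grams “<ha”, “hav”, and “ave” with “have”, its vector is built partly from the same pieces, which is how out-of-vocabulary words obtain a representation.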

Since there can be a huge number of unique n-grams, how does FastText deal with the memory requirement?

One of the ways that FastText deals with the large number of unique character n-grams is by using hashing to map the character n-grams to a fixed number of buckets (2 million by default) rather than storing each n-gram in a dictionary. N-grams that hash to the same bucket share a single vector, so the memory used by the n-gram vectors is bounded by the number of buckets, regardless of how many distinct n-grams appear in the data.
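The hashing step can be sketched as below. The FastText paper describes using the FNV-1a hash function, so that is what this sketch implements, but it is not the library's actual code: the function names are made up for the example and the bucket count is a parameter that defaults to the library's usual value of 2 million.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime

def fnv1a_32(text):
    """32-bit FNV-1a hash of a string (UTF-8 encoded)."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) % 2**32  # keep the value within 32 bits
    return h

def ngram_bucket(ngram, num_buckets=2_000_000):
    # The bucket index selects a row in the n-gram embedding matrix.
    return fnv1a_32(ngram) % num_buckets

for g in ["<ap", "app", "ppl", "ple", "le>"]:
    print(g, ngram_bucket(g))
```

With this scheme the embedding matrix for n-grams has exactly `num_buckets` rows, and two different n-grams may occasionally land on the same row.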

What is hashing? How are character sequences hashed to integer values?

Hashing is the process of converting a given input (called the ‘key’) into a fixed-size integer value (called the ‘hash value’ or ‘hash code’). The key is typically some sort of string or sequence of characters, but it can also be a number or other data type.

There are many different ways to hash a character sequence, but most algorithms work by taking the input key, performing some mathematical operations on it, and then returning the hash value as an integer. The specific mathematical operations used will depend on the specific hashing algorithm being used.

One simple example of a hashing algorithm is the ‘modulo’ method, which works as follows:

  1. Take the input key and convert it into a numerical value, for example by assigning each character in the key a numerical value based on its ASCII code.
  2. Divide this numerical value by the size of the hash table (the data structure in which the hashed keys will be stored).
  3. The remainder of this division is the hash value for the key.

This method is simple and fast, but it is not very robust and can lead to a high number of collisions (when two different keys produce the same hash value). More sophisticated algorithms are typically used in practice to improve the performance and reliability of hash tables.
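The modulo method above can be written in a few lines. This is a toy sketch (the function name is an assumption), and it also shows why collisions are common: any two keys with the same letters in a different order get the same hash value.

```python
def modulo_hash(key, table_size):
    # Step 1: turn the key into a number by summing its character codes.
    numeric_value = sum(ord(ch) for ch in key)
    # Steps 2 and 3: the remainder after dividing by the table size.
    return numeric_value % table_size

print(modulo_hash("abc", 10))  # (97 + 98 + 99) % 10 = 4
print(modulo_hash("cba", 10))  # same letters, same hash: a collision
```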

How is the Skip-gram with negative sampling applied in FastText?

The skip-gram with negative sampling (SGNS) algorithm is used to learn word embeddings (i.e., dense, low-dimensional representations of words that capture the meaning and context of the words). It works by training a model to predict the context words (i.e., the words that appear near a target word in a given text) given the target word. During training, the model is given a sequence of word pairs (a target word and a context word) and learns to tell these true pairs apart from randomly generated ones.

To train the model, the SGNS algorithm uses a technique called negative sampling, which involves sampling a small number of negative examples (random words that are not the true context words) and using them to train the model along with the positive examples (the true context words). This makes training more efficient, because each update only involves the true context word and a handful of negative samples rather than every word in the vocabulary.

The SGNS algorithm steps are as follows:

  • The embedding for a target word (also called the ‘center word’) is calculated by taking the sum of the embeddings for the word itself and the character n-grams that make up the word.
  • The context words are represented by their word embeddings, without adding the character n-grams.
  • Negative samples are selected randomly from the vocabulary during training, with the probability of selecting a word being proportional to the square root of its unigram frequency (i.e., the number of times it appears in the text).
  • The dot product of the embedding for the center word and the embedding for the context word is calculated as a similarity score. With negative sampling, these scores are not normalized over the whole vocabulary. Instead, the sigmoid function is applied to each score independently to give the probability that the pair is a true (center, context) pair.
  • Compute the binary cross-entropy loss over the true context word and the sampled negatives. Use an optimization algorithm such as stochastic gradient descent (SGD) to update the embedding vectors in order to minimize this loss. This involves bringing the actual context words closer to the center word (i.e., the target word) and increasing the distance between the center word and the negative samples.

The cross-entropy loss function can be expressed as:

L = – ∑i (y_i log(p(w_i|c)) + (1 – y_i) log(1 – p(w_i|c)))

where:

  • L is the cross-entropy loss.
  • y_i is a binary variable indicating whether word i is a positive example (y_i = 1) or a negative example (y_i = 0).
  • p(w_i|c) is the sigmoid of the dot product between the embedding of word i and the embedding of the center word c, i.e., the predicted probability that word i is a true context word of c.
  • ∑i indicates that the sum is taken over the true context word and the sampled negative words.
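The steps above can be sketched for a single training example. This is a minimal illustration with hand-picked vectors, not FastText's implementation: the helper names are assumptions, and p(w_i|c) is taken to be the sigmoid of the dot product between the word's vector and the center vector.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def center_vector(word_vec, ngram_vecs):
    # Center word = its own vector plus the vectors of its character n-grams.
    total = list(word_vec)
    for g in ngram_vecs:
        total = [t + x for t, x in zip(total, g)]
    return total

def sgns_loss(center, context_vec, negative_vecs):
    # Label 1 for the true context word, 0 for each sampled negative.
    examples = [(context_vec, 1)] + [(n, 0) for n in negative_vecs]
    loss = 0.0
    for vec, y in examples:
        p = sigmoid(dot(vec, center))
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

center = center_vector([1, 0], [[0, 1], [1, 1]])  # -> [2, 2]
print(sgns_loss(center, [1, 1], [[-1, -1]]))  # small: context is aligned with center
print(sgns_loss(center, [-1, -1], [[1, 1]]))  # large: context points the other way
```

The loss is low when the true context word has a high score and the negatives have low scores, which is exactly what the gradient updates push towards.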

FastText and hierarchical softmax

FastText can use a technique called hierarchical softmax to reduce the computation time during training. Hierarchical softmax works by organizing the vocabulary into a binary tree in which every word is a leaf, with the position of each word in the tree determined by its probability of occurrence.

During training, the model uses the hierarchical structure of the tree to compute the loss and update the model weights more efficiently. This is done by traversing the tree from the root to the appropriate leaf node for each word, rather than computing the loss and updating the weights for every word in the vocabulary separately.

The standard softmax function has a computational complexity of O(Kd), where K is the number of classes (i.e., the size of the vocabulary) and d is the number of dimensions in the hidden layer of the model. This complexity arises from the need to normalize the probabilities over all potential classes in order to obtain a valid probability distribution. The hierarchical softmax reduces the computational complexity to O(d*log(K)). Huffman coding can be used to construct a binary tree structure for the softmax function, where the lowest frequency classes are placed deeper into the tree and the highest frequency classes are placed near the root of the tree.
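The effect of Huffman coding on path length can be sketched with Python's `heapq` module. This is an illustrative construction that only tracks how deep each word ends up in the tree; the function name and the toy frequencies are assumptions made for the example.

```python
import heapq
from itertools import count

def huffman_code_lengths(frequencies):
    """Return the depth of each word in a Huffman tree built from counts."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(freq, next(tie), {word: 0}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)  # the two least frequent subtrees
        f2, _, d2 = heapq.heappop(heap)
        # Merging puts every word in both subtrees one level deeper.
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

depths = huffman_code_lengths({"the": 50, "cat": 10, "sat": 5, "mat": 2})
print(depths)  # frequent words such as "the" end up closest to the root
```

Since frequent words account for most training examples, giving them short paths reduces the average number of inner nodes visited per update.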

In the hierarchical softmax function, the probability of a word is calculated along its path through the Huffman coding tree. At each inner node n on the path, the dot product of the node's vector v_n_i and the output of the hidden layer of the model, h, is computed. The sigmoid function is then applied to this dot product to obtain a probability between 0 and 1 of taking a particular branch at that node.

The idea of this method is to represent the output classes (i.e., the words in the vocabulary) as the leaves on the tree and to use a random walk through the tree to assign probabilities to the classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as the product of the probabilities along the path from the root to the leaf node corresponding to the class.

This allows the hierarchical softmax function to compute the probability of each class more efficiently, since it only needs to consider the path through the tree rather than the entire vocabulary. This can significantly reduce the computational complexity of the model, particularly for large vocabularies, making it practical to train word embeddings on very large datasets.

Hierarchical softmax and conditional probabilities

To compute the probability of each context word given the center word and its embedding using the hierarchical softmax function, we first organize the vocabulary into a binary tree, with the words at the leaves of the tree and the depth of each word determined by its probability of occurrence.

We then compute the probability of each context word by traversing the tree from the root to the appropriate leaf node for the word. For each inner node n in the tree, we compute the probability of traversing the left or right branch of the tree as follows:

p(left|n) = sigmoid(v_n_i · h)
p(right|n) = 1 – p(left|n)

where:

  • v_n_i is the vector representation of inner node n
  • h is the output value of the hidden layer of the model

The probability of a context word w is then computed as the product of the probabilities of the branches along the path from the root to the leaf node corresponding to w.
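The path product can be sketched on a tiny hand-built tree with three words. The tree layout, the node vectors, and the hidden vector `h` below are all made-up values for illustration; in a trained model they would be learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical tree: root n0 has leaf "the" on the left and inner node n1
# on the right; n1 has leaves "cat" (left) and "dog" (right).
node_vectors = {"n0": [0.5, -0.2], "n1": [-0.3, 0.8]}
paths = {
    "the": [("n0", "left")],
    "cat": [("n0", "right"), ("n1", "left")],
    "dog": [("n0", "right"), ("n1", "right")],
}

def word_probability(word, h):
    prob = 1.0
    for node, direction in paths[word]:
        p_left = sigmoid(dot(node_vectors[node], h))
        prob *= p_left if direction == "left" else 1.0 - p_left
    return prob

h = [1.0, 2.0]
probs = {w: word_probability(w, h) for w in paths}
print(probs)
print(sum(probs.values()))  # the leaf probabilities sum to 1 (up to rounding)
```

Because the two branch probabilities at every inner node sum to 1, the probabilities of all leaves form a valid distribution without ever normalizing over the whole vocabulary.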

Chatting with GPT – Natural Language Processing

Let’s talk about natural language processing. How does input text data get transformed to numerical vectors?

In natural language processing (NLP), input text data is often transformed into numerical vectors in order to make it more suitable for machine learning algorithms to work with. There are several methods that can be used to transform text data into numerical vectors, including:

  1. One-hot encoding: This is a simple method that converts each word in the text into a unique integer index, and then represents each word as a binary vector, with a “1” in the position corresponding to the index of the word and “0”s in all other positions. One-hot encoding is a common method for converting text data into numerical vectors, but it can result in very large and sparse vectors, which may not be efficient to work with.
  2. Word embeddings: Word embeddings are numerical vectors that represent the meaning of a word in a continuous, low-dimensional space. Word embeddings can be trained on large datasets of text data using techniques such as word2vec or GloVe, and they can capture the contextual relationships between words and the similarity between words. Word embeddings are a popular method for representing text data in NLP, as they can capture the meaning of words in a compact and dense form.
  3. Bag of words: The bag of words model represents text data as a fixed-length vector, with each element in the vector representing the number of times a particular word appears in the text. The bag of words model is a simple and commonly used method for representing text data, but it does not capture the order or context of the words in the text.
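One-hot encoding and the bag of words model can be illustrated with a short example. This is a toy sketch using only the standard library; the helper names are assumptions, and real projects would normally rely on an existing library for this.

```python
from collections import Counter

def build_vocab(tokens):
    # Assign each unique word an index (alphabetical order for repeatability).
    return {word: i for i, word in enumerate(sorted(set(tokens)))}

def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(tokens, vocab):
    counts = Counter(tokens)
    # vocab was built in index order, so iterating it yields positions 0, 1, ...
    return [counts[word] for word in vocab]

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
print(vocab)                        # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(one_hot("cat", vocab))        # [1, 0, 0, 0, 0]
print(bag_of_words(tokens, vocab))  # [1, 1, 1, 1, 2]
```

Both vectors have one position per vocabulary word, which is why they become very large and sparse for realistic vocabularies, unlike dense word embeddings.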

There are many other methods for converting text data into numerical vectors, and the specific method that is used may depend on the specific NLP task and the goals of the model. It is important to choose an appropriate method for representing text data in order to effectively capture the meaning and context of the text.

What are word2vec word embeddings? How are they trained?

Word2vec is a method for training word embeddings, which are numerical vectors that represent the meaning of a word in a continuous, low-dimensional space. Word embeddings are a popular method for representing text data in natural language processing (NLP), as they can capture the meaning of words in a compact and dense form and can be used as input to machine learning algorithms.

Word2vec is a shallow neural network model that is trained either to predict the context words that surround a given word (skip-gram), or to predict a target word given a set of context words (CBOW). During training, the model is presented with a large dataset of text and learns these predictions from the words that appear near each other. The word embeddings are then read off from the learned weights between the input layer and the hidden layer, with one row of the weight matrix per word.

There are two main variants of the word2vec model: continuous bag of words (CBOW) and skip-gram. In the CBOW model, the goal is to predict the target word given a set of context words, while in the skip-gram model, the goal is to predict the context words given a target word.

Word2vec word embeddings are trained on large datasets of text using stochastic gradient descent (SGD). To avoid computing a full softmax over the vocabulary at every step, training typically uses negative sampling or hierarchical softmax.

Provide the exact steps for the Word2Vec algorithm, include equations.

Word2Vec is a technique for learning vector representations of words, also known as “word embeddings”, in a way that captures the context and meaning of the words. The basic idea behind Word2Vec is to use the context of words in a text corpus to predict a target word, based on a sliding window of surrounding words. The prediction task is accomplished using a shallow neural network with a single hidden layer.

Here are the steps for the Word2Vec algorithm:

  1. Preprocess the text data: This involves cleaning and formatting the text data, such as lowercasing, stemming, and removing stop words.
  2. Tokenize the text: This involves dividing the text into individual words or tokens.
  3. Build a vocabulary: This involves creating a list of all the unique words in the text data and assigning each word a unique index.
  4. Create word-context pairs: For each word in the text, create a word-context pair by considering a window of surrounding words. For example, if the window size is 2, the word-context pairs for the word “cat” in the sentence “The cat sat on the mat” would be (“cat”, “The”), (“cat”, “sat”), and (“cat”, “on”). The second “the” is three positions away from “cat”, so it falls outside the window.
  5. Train the model: Use the word-context pairs to train a shallow neural network with a single hidden layer. The input layer consists of the context words, and the output layer consists of the target word. The hidden layer contains a set of “neurons” that represent the word embeddings.

The training process involves minimizing the error between the predicted output (a probability distribution over the vocabulary) and the true output (the actual target word) using stochastic gradient descent. The error is measured using a loss function, such as the negative log likelihood loss.

Here is the equation for the negative log likelihood loss:

Loss = -log P(w | c)

where w is the target word and c is the context.

  6. Extract the word embeddings: Once the model is trained, the word embeddings can be extracted from the hidden layer of the neural network. These word embeddings can then be used for various natural language processing tasks, such as text classification, machine translation, and information retrieval.
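Steps 2 to 4 above (tokenizing, building a vocabulary, and creating word-context pairs) can be sketched as follows. This is a simplified illustration: it splits on whitespace, skips the optional cleaning from step 1, and uses function names invented for the example.

```python
def build_vocab(tokens):
    # Assign each unique token the next free index, in order of appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def word_context_pairs(tokens, window=2):
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
vocab = build_vocab(tokens)
pairs = word_context_pairs(tokens, window=2)
cat_pairs = [p for p in pairs if p[0] == "cat"]
print(cat_pairs)  # [('cat', 'The'), ('cat', 'sat'), ('cat', 'on')]
```

Words near the start or end of the sentence get fewer pairs because part of their window falls outside the text. Note that “The” and “the” are separate vocabulary entries here because no lowercasing was applied.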

You mentioned that word2vec can be trained using negative sampling. How does that work?

Yes, that’s correct! In addition to the traditional Word2Vec algorithm that uses the negative log likelihood loss function, there is also a variant called “negative sampling” that can be used to train the model. Negative sampling is a simplified training method that can be used to learn word embeddings more efficiently, especially for large text corpora.

Here’s how negative sampling works:

  1. Preprocess and tokenize the text data as described in the previous steps.
  2. Build a vocabulary and create word-context pairs as described in the previous steps.
  3. Train the model: For each word-context pair, randomly sample a number of negative examples (i.e., words that are not the target word). The objective is to maximize the probability of the target word given the context words, while minimizing the probability of the negative examples given the context words.

The training process involves minimizing the following loss function:

Loss = -log(sigmoid(u_w · v_c)) - ∑ log(sigmoid(-u_n · v_c))

where u_w and v_c are the embeddings for the target word and the context words, respectively, u_n is the embedding for a negative example, · denotes the dot product, and the sum runs over the sampled negative examples. The sigmoid function is used to squash each score to a value between 0 and 1.

  4. Extract the word embeddings: Once the model is trained, the word embeddings can be extracted from the hidden layer of the neural network as described in the previous steps.

So when word2vec is trained using negative sampling, that means that just the loss function changes, otherwise stochastic gradient descent is still used?

Yes, that’s correct. When training word2vec using negative sampling, the loss function changes, but the underlying training algorithm, stochastic gradient descent (SGD), remains the same.

In word2vec, the goal is to learn a set of word vectors that can predict the surrounding words in a text. Negative sampling is a variant of the word2vec model that uses a different loss function to achieve this goal. Specifically, it uses a simplified version of the traditional word2vec loss function that only considers a small, randomly-selected subset of the negative examples (words that are not the target word) at each training step, rather than considering all of the negative examples. This allows the model to be trained more efficiently, as it doesn’t need to compute the loss for all of the negative examples at each training step.

SGD is still used to optimize the model’s parameters (the word vectors) by updating them in the direction that reduces the loss. The key difference with negative sampling is that the loss function being optimized is different from the traditional word2vec loss function.
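A single SGD update on the negative sampling loss can be sketched as below. The vectors, the learning rate, and the function names are illustrative assumptions; the gradient used is the standard one for a logistic loss, (sigmoid(score) - label) times the other vector.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ns_loss(v_c, u_pos, u_negs):
    loss = -math.log(sigmoid(dot(u_pos, v_c)))
    for u_n in u_negs:
        loss -= math.log(sigmoid(-dot(u_n, v_c)))
    return loss

def sgd_step(v_c, u_pos, u_negs, lr=0.1):
    """One gradient step on every vector involved in the example."""
    grad_v = [0.0] * len(v_c)
    new_us = []
    for u, label in [(u_pos, 1.0)] + [(u_n, 0.0) for u_n in u_negs]:
        g = sigmoid(dot(u, v_c)) - label  # derivative of the loss w.r.t. the score
        grad_v = [gv + g * ui for gv, ui in zip(grad_v, u)]
        new_us.append([ui - lr * g * vi for ui, vi in zip(u, v_c)])
    new_v = [vi - lr * gv for vi, gv in zip(v_c, grad_v)]
    return new_v, new_us[0], new_us[1:]

v_c = [0.1, -0.2]
u_pos = [0.3, 0.1]
u_negs = [[-0.1, 0.2], [0.2, 0.2]]
before = ns_loss(v_c, u_pos, u_negs)
v_c, u_pos, u_negs = sgd_step(v_c, u_pos, u_negs)
after = ns_loss(v_c, u_pos, u_negs)
print(before, after)  # the loss goes down after the update
```

Only the vectors of the words in this example are touched, which is why each update is cheap compared with a full softmax over the vocabulary.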

But basically word2vec is trained using a pretty simple neural network, with just one hidden layer

Yes, that’s correct. Word2vec is typically implemented as a shallow neural network with just one hidden layer. The input to the network is a one-hot encoded word, and the hidden layer is a dense (fully-connected) layer with a fixed number of units and no nonlinear activation function, so it simply selects the embedding of the input word. The output of the hidden layer is multiplied by an output weight matrix and passed through a softmax function to produce a probability distribution over the vocabulary, which represents the model’s prediction of the probability of each word in the vocabulary given the input word.
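The forward pass of this network can be sketched with plain Python lists. The three-word vocabulary and the weight values are made up for illustration, and the sketch uses the full softmax output rather than negative sampling.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat"]
W_in = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.4]]   # one input vector per word
W_out = [[0.2, 0.1], [-0.3, 0.5], [0.0, 0.1]]  # one output vector per word

def forward(word_index):
    one_hot = [1 if i == word_index else 0 for i in range(len(vocab))]
    # Multiplying a one-hot vector by W_in just selects one row of W_in.
    hidden = [sum(one_hot[i] * W_in[i][d] for i in range(len(vocab)))
              for d in range(len(W_in[0]))]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    return hidden, softmax(scores)

hidden, probs = forward(vocab.index("cat"))
print(hidden)  # equal to the "cat" row of W_in
print(probs)   # one probability per vocabulary word
```

In practice the one-hot multiplication is replaced by a direct row lookup, and the rows of `W_in` are the word embeddings that are kept after training.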

You mentioned that there are two variants of word2vec – CBOW and skip-gram, what is the difference between the two?

In the Continuous Bag-of-Words (CBOW) model, the goal is to predict the current word given the surrounding context. For example, in the sentence “the cat sat on the mat”, given the context words “the”, “cat”, “on”, and “the”, the model would try to predict the word “sat.” CBOW uses a sliding window approach to build up the context from the words on both sides of the current word.

On the other hand, the Skip-Gram model does the opposite: it tries to predict the surrounding context given a current word. So in the same example, given the word “sat,” the model would try to predict the context words “the”, “cat”, “on”, and “the”. Skip-gram works well with small amounts of data and represents rare words well, but it is slower to train than CBOW.

What is the algorithm for the CBOW model?

The CBOW model uses a neural network with a single hidden layer to predict the current word based on the surrounding context. The input to the model is a one-hot encoded vector representing the context words, and the output is a probability distribution over the vocabulary, with the goal of predicting the target word with the highest probability.

The model is trained using the negative sampling loss function, which tries to maximize the probability of the target word given the context and minimize the probability of randomly sampled negative examples (words that are not the target word).

Here’s a summary of the CBOW algorithm:

  1. Initialize the input and output weight matrices (word2vec does not use bias terms).
  2. Iterate through each training example, which consists of a target word and a context of surrounding words.
  3. Look up the input vectors of the context words. This is equivalent to multiplying their one-hot encoded vectors by the input weight matrix.
  4. Calculate the hidden layer output by averaging the context word vectors. No nonlinear activation function is applied.
  5. Sample a small number of negative words from the vocabulary.
  6. Calculate the scores by taking the dot product of the hidden layer output with the output weight vectors of the target word and of each negative sample.
  7. Apply the sigmoid function to each score to get a probability. (With the full softmax loss, the scores for all words in the vocabulary would instead be normalized with the softmax function.)
  8. Calculate the loss using the negative sampling loss function, which compares the predicted probabilities to the actual labels (1 for the target word, 0 for the negative samples).
  9. Backpropagate the error and update the weights using gradient descent.
  10. Repeat steps 2-9 until the model has converged or a fixed number of passes over the data has been completed.

The CBOW algorithm uses the negative sampling loss function to train the model. The negative sampling loss function is defined as:

Loss = -(log(sigmoid(u_o · v_c)) + ∑ log(sigmoid(-u_i · v_c)))

where:

  • u_o is the output weight vector for the target word.
  • v_c is the average of the input weight vectors for the context words.
  • u_i is the output weight vector for a negative sample (a word that is not the target word).

The first term in the loss function maximizes the probability of the target word given the context, while the second term minimizes the probability of the negative samples.

The sigmoid function is applied to the dot product of the weight vectors to squash the output to a value between 0 and 1, which can be interpreted as a probability. The log function is applied to the sigmoid output so that the product of these probabilities becomes a sum of log probabilities, which is numerically more stable and easier to optimize.

The negative sampling loss function is used because it is computationally more efficient than the full softmax loss function, which involves calculating the probabilities of all words in the vocabulary.
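The part that is specific to CBOW, combining several context words into one vector before scoring, can be sketched as follows. The vectors are made-up values, the function names are assumptions, and averaging is assumed as the way to combine the context vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def average(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cbow_ns_loss(context_vecs, target_out, negative_outs):
    v_c = average(context_vecs)  # hidden layer: mean of the context vectors
    loss = -math.log(sigmoid(dot(target_out, v_c)))
    for u_i in negative_outs:
        loss -= math.log(sigmoid(-dot(u_i, v_c)))
    return loss

context = [[1.0, 2.0], [3.0, 4.0]]
print(average(context))  # [2.0, 3.0]
print(cbow_ns_loss(context, [0.5, 0.5], [[-0.5, -0.5]]))
```

However many context words there are, CBOW makes one prediction per position, which keeps the cost per training example low.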

Why is the dot product used? What does it represent?

In the word2vec algorithm, the dot product of the input and output weight vectors is used to measure the similarity between the context words and the target word. The dot product is defined as the product of the magnitudes of the two vectors and the cosine of the angle between them, so it depends on both the lengths of the vectors and the angle between them.

In the case of word2vec, the input and output weight vectors are learned embeddings for the context words and the target word, respectively. These embeddings are learned during training, and they represent the underlying relationships between the words in the vocabulary.

The dot product of the input and output weight vectors is then used to predict the probability of the target word given the context. If the dot product is large and positive, the vectors point in similar directions, indicating that the context and target words are similar. This results in a high probability for the target word. On the other hand, if the dot product is low or negative, the vectors point in different directions, indicating that the context and target words are dissimilar. This results in a low probability for the target word.

Overall, the dot product is used in the word2vec algorithm as a measure of similarity between the context words and the target word, which is used to predict the probability of the target word given the context.
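The relationship between the dot product and the angle between two vectors can be checked with a small example. This is a sketch with hand-picked two-dimensional vectors; the helper names are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Dot product divided by both magnitudes leaves only the angle term.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 0.0]
b = [3.0, 0.0]  # same direction as a, three times as long
c = [0.0, 2.0]  # at a right angle to a

print(dot(a, b), cosine(a, b))  # 3.0 1.0
print(dot(a, c), cosine(a, c))  # 0.0 0.0
```

The vectors `a` and `b` point the same way, so their cosine is 1, but their dot product also grows with the length of `b`. This is why cosine similarity is often preferred when comparing trained embeddings, while the raw dot product is what the training objective uses.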

What is the algorithm for the skip-gram model?

The Skip-Gram model is a neural network that uses a single hidden layer to predict the surrounding context given a current word. The input to the model is a one-hot encoded vector representing the current word, and the output is a probability distribution over the vocabulary, with the goal of predicting the context words with the highest probability.

The Skip-Gram model is trained using the negative sampling loss function, which tries to maximize the probability of the context words given the current word and minimize the probability of randomly sampled negative examples (words that are not in the context).

Here’s a summary of the Skip-Gram algorithm:

  1. Initialize the input and output weight matrices (word2vec does not use bias terms).
  2. Iterate through each training example, which consists of a current word and a context of surrounding words.
  3. Look up the input vector of the current word. This is equivalent to multiplying its one-hot encoded vector by the input weight matrix.
  4. Use this vector directly as the hidden layer output. No nonlinear activation function is applied.
  5. For each context word, sample a small number of negative words from the vocabulary.
  6. Calculate the scores by taking the dot product of the hidden layer output with the output weight vectors of each context word and of each negative sample.
  7. Apply the sigmoid function to each score to get a probability. (With the full softmax loss, the scores for all words in the vocabulary would instead be normalized with the softmax function.)
  8. Calculate the loss using the negative sampling loss function, which compares the predicted probabilities to the actual labels (1 for the context words, 0 for the negative samples).
  9. Backpropagate the error and update the weights using gradient descent.
  10. Repeat steps 2-9 until the model has converged or a fixed number of passes over the data has been completed.

The negative sampling loss function used in the Skip-Gram model is defined as:

Loss = -(∑ log(sigmoid(u_c · v_o)) + ∑ log(sigmoid(-u_i · v_o)))

where:

  • u_c is the output weight vector for a context word.
  • v_o is the input weight vector for the current word.
  • u_i is the output weight vector for a negative sample (a word that is not in the context).

The first term in the loss function maximizes the probability of the context words given the current word, while the second term minimizes the probability of the negative samples.

The sigmoid function is applied to the dot product of the weight vectors to squash the output to a value between 0 and 1, which can be interpreted as a probability. The log function is applied to the sigmoid output so that the product of these probabilities becomes a sum of log probabilities, which is numerically more stable and easier to optimize.

The negative sampling loss function is used because it is computationally more efficient than the full softmax loss function, which involves calculating the probabilities of all words in the vocabulary.
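Drawing the negative samples themselves can be sketched with `random.choices`. This is an illustration under assumptions: the word counts are made up, the function names are invented for the example, and the exponent defaults to 0.75, the value used in the original word2vec work (setting `power=0.5` gives the square-root weighting mentioned in the FastText section).

```python
import random

def make_negative_sampler(counts, power=0.75, seed=0):
    words = list(counts)
    # Raising counts to a power below 1 boosts the chance of rarer words.
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)  # fixed seed so the example is repeatable

    def sample(k, exclude):
        negatives = []
        while len(negatives) < k:
            w = rng.choices(words, weights=weights, k=1)[0]
            if w not in exclude:  # skip the current word and its context
                negatives.append(w)
        return negatives

    return sample

counts = {"the": 100, "cat": 10, "sat": 8, "on": 30, "mat": 5, "dog": 7}
sample = make_negative_sampler(counts)
negatives = sample(5, exclude={"cat", "sat"})
print(negatives)
```

Frequent words are still drawn most often, but less often than their raw counts would suggest, which gives rarer words a chance to act as negatives.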

In which cases is it better to use the CBOW model vs. Skip-Gram for word2vec?

The Continuous Bag-of-Words (CBOW) model and the Skip-Gram model are two approaches for learning vector representations of words, also known as word embeddings, in natural language processing tasks. Both are trained on a large corpus of text by predicting words from the words that appear near them.

The main difference between the CBOW and Skip-Gram models is the direction of the prediction. The CBOW model predicts the target word based on the context provided by the surrounding words, while the Skip-Gram model predicts the surrounding context words based on the target word.

In general, the CBOW model is faster to train and gives slightly better accuracy for frequent words, while the Skip-Gram model is slower to train but works well with small amounts of training data and represents rare words better. Skip-Gram handles rare words better because every (target, context) pair produces its own update for the target word, whereas CBOW averages the context words together, which dilutes the contribution of a rare word.

In practice, the choice between the CBOW and Skip-Gram models will depend on the specific application and the available resources. If rare words matter or the dataset is small, and you have the time and computational resources to train a model, the Skip-Gram model may provide better results. If you have a large dataset and need to train a model quickly, the CBOW model may be a better choice.