NLP – Word Embeddings – FastText

What is the FastText method for word embeddings?

FastText is a library for efficient learning of word representations and sentence classification. It was developed by Facebook AI Research (FAIR).

FastText represents each word in a document as a bag of character n-grams, together with the word itself. Each word is wrapped in the boundary markers “<” and “>”, and all character n-grams between a minimum and maximum length (3 to 6 characters by default) are extracted. For example, using 3-grams only, the word “apple” would be represented as “<ap”, “app”, “ppl”, “ple”, “le>”, plus the special sequence “<apple>” for the whole word. This representation has two advantages:

  1. It can handle spelling mistakes and out-of-vocabulary words. For example, the model can still produce a sensible vector for the word “apple” even if it is misspelled as “appel” or “aple”, because the misspelled forms share most of their character n-grams with the correct spelling.
  2. It can handle words in different languages with the same script (e.g., English and French) without the need for a separate model for each language.

FastText uses a shallow neural network to learn the word representations from this character n-gram representation. By default it is trained using the skip-gram model with negative sampling, similar to word2vec (a CBOW variant is also available, as described below).

FastText can also be used for sentence classification by averaging the word vectors for the words in the sentence and training a linear classifier on top of the averaged vector. It is particularly useful for languages with a large number of rare words, or in cases where using a word’s subwords (also known as substrings or character n-grams) as features can be helpful.
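As a concrete illustration, here is a minimal sketch of sentence classification with the official fasttext Python package (assuming the package is installed; “train.txt” is a hypothetical file in fastText’s “__label__<class> <text>” format, one example per line):

    import fasttext

    # Train a supervised classifier; under the hood, fastText averages the
    # word (and subword) vectors of each sentence and trains a linear
    # classifier on top of that average.
    model = fasttext.train_supervised(input="train.txt", epoch=10, lr=0.5, wordNgrams=2)

    # Predict the label (and its probability) for a new sentence.
    labels, probs = model.predict("I have a dog")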

How are word embeddings trained in FastText?

Word embeddings in FastText can be trained using either the skip-gram model or the continuous bag-of-words (CBOW) model.

In the skip-gram model, the goal is to predict the context words given a target word. For example, given the input sequence “I have a dog” and a context window of one word on each side, the goal would be to predict “have” given the target word “I”, and to predict “I” and “a” given the target word “have”. The skip-gram model learns to predict the context words by minimizing the negative log likelihood of the context words given the target word.

In the CBOW model, the goal is to predict the target word given the context words. For the same input sequence “I have a dog” and the same window, the goal would be to predict “I” given the context word “have”, and to predict “have” given the context words “I” and “a”. The CBOW model learns to predict the target word by minimizing the negative log likelihood of the target word given the context words.

Both the skip-gram and CBOW models are trained using stochastic gradient descent (SGD) and backpropagation to update the model’s parameters. The model is trained by minimizing the negative log likelihood of the words in the training data, given the model’s parameters.
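For illustration, both training modes are exposed by the official fasttext Python package; a rough sketch (assuming “corpus.txt” is a hypothetical plain-text training file) looks like this:

    import fasttext

    # Skip-gram: predict context words from the target word.
    sg_model = fasttext.train_unsupervised("corpus.txt", model="skipgram",
                                           dim=100, minn=3, maxn=6)

    # CBOW: predict the target word from its context words.
    cbow_model = fasttext.train_unsupervised("corpus.txt", model="cbow",
                                             dim=100, minn=3, maxn=6)

    # Either model can return a vector for any word, including rare or
    # unseen ones, because vectors are built from character n-grams.
    vec = sg_model.get_word_vector("dog")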

Explain how FastText represents each word in a document as a bag of character n-grams

To represent a word as a bag of character n-grams, FastText breaks the word down into overlapping substrings (also known as character n-grams). For example, the word “apple” could be represented as the following character 3-grams (trigrams): [“app”, “ppl”, “ple”]. (For simplicity, the “<” and “>” boundary markers that FastText adds around each word are omitted in this and the following examples.) The range of n-gram lengths is specified by the user and is typically set to between 3 and 6 characters.

For example, consider the following sentence:

“I have a dog”

If we set the number of characters in each substring to 3, FastText would represent each word in the sentence as follows:

“I”: [“I”]
“have”: [“hav”, “ave”]
“a”: [“a”]
“dog”: [“dog”]

The use of character n-grams allows FastText to learn good vector representations for rare words, as it can use the vector representations of the character n-grams that make up the rare word to compute its own vector representation. This is particularly useful for handling out-of-vocabulary words that may not have a pre-trained vector representation available.
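The extraction itself is easy to sketch in a few lines of Python. The helper below is an illustration, not the library’s code; it follows the description in the FastText paper, including the “<” and “>” boundary markers that the simplified examples above leave out:

    def char_ngrams(word, minn=3, maxn=6):
        """Return the bag of character n-grams for a word, plus the whole
        word itself, after wrapping it in '<' and '>' boundary markers."""
        wrapped = "<" + word + ">"
        ngrams = set()
        for n in range(minn, maxn + 1):
            for i in range(len(wrapped) - n + 1):
                ngrams.add(wrapped[i:i + n])
        ngrams.add(wrapped)  # keep the full word as its own token
        return sorted(ngrams)

    print(char_ngrams("apple", minn=3, maxn=3))
    # ['<ap', '<apple>', 'app', 'le>', 'ple', 'ppl']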

How are vector representations for each word computed from n-gram vectors?

In FastText, the vector representation for each word is computed as the sum of the vector representations of the character n-grams (subwords) that make up the word. For example, consider the following sentence:

“I have a dog”

If we set the number of characters in each substring to 3, FastText would represent each word in the sentence as a bag of character 3-grams (trigrams) as follows:

“I”: [“I”]
“have”: [“hav”, “ave”]
“a”: [“a”]
“dog”: [“dog”]

FastText would then learn a vector representation for each character n-gram and use these vector representations to compute the vector representation for each word. For example, the vector representation for the word “have” would be computed as the sum of the vector representations for the character n-grams [“hav”, “ave”].
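A minimal sketch of this composition step, using made-up n-gram vectors rather than learned ones:

    import numpy as np

    # Hypothetical learned vectors for the trigrams of the word "have".
    rng = np.random.default_rng(0)
    ngram_vectors = {ng: rng.normal(size=5) for ng in ["hav", "ave"]}

    # The word vector is the sum of its character n-gram vectors.
    word_vector = sum(ngram_vectors[ng] for ng in ["hav", "ave"])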

Since there can be a huge number of unique n-grams, how does FastText deal with the memory requirement?

One of the ways that FastText deals with the large number of unique character n-grams is by hashing them into a fixed-size table of buckets (2 million by default) rather than storing each n-gram in a dictionary. Each n-gram is mapped to an integer bucket id with the FNV-1a hash function, and all n-grams that land in the same bucket share a single vector. This keeps the memory footprint bounded no matter how many distinct n-grams appear in the corpus, at the cost of occasional collisions.
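A minimal sketch of this bucketing scheme (the hash function and the bucket count below follow the FastText paper and the library’s default settings; the real implementation works on raw bytes and has further details):

    def fnv1a(s):
        """32-bit FNV-1a hash of a string (a sketch of the function
        FastText uses to turn character n-grams into integers)."""
        h = 2166136261
        for byte in s.encode("utf-8"):
            h ^= byte
            h = (h * 16777619) & 0xFFFFFFFF
        return h

    bucket = 2_000_000                 # default size of the n-gram table
    ngram_id = fnv1a("app") % bucket   # all n-grams share this fixed table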

What is hashing? How are character sequences hashed to integer values?

Hashing is the process of converting a given input (called the ‘key’) into a fixed-size integer value (called the ‘hash value’ or ‘hash code’). The key is typically some sort of string or sequence of characters, but it can also be a number or other data type.

There are many different ways to hash a character sequence, but most algorithms work by taking the input key, performing some mathematical operations on it, and then returning the hash value as an integer. The specific mathematical operations used will depend on the specific hashing algorithm being used.

One simple example of a hashing algorithm is the ‘modulo’ method, which works as follows:

  1. Take the input key and convert it into a numerical value, for example by assigning each character in the key a numerical value based on its ASCII code.
  2. Divide this numerical value by the size of the hash table (the data structure in which the hashed keys will be stored).
  3. The remainder of this division is the hash value for the key.

This method is simple and fast, but it is not very robust and can lead to a high number of collisions (when two different keys produce the same hash value). More sophisticated algorithms are typically used in practice to improve the performance and reliability of hash tables.
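As an illustration only, the toy modulo method described above could be written like this (summing character codes is one simple, collision-prone way to turn the key into a number):

    def modulo_hash(key, table_size):
        """Toy hash: convert the key to a number via its character codes,
        then take the remainder modulo the table size."""
        numeric = sum(ord(ch) for ch in key)
        return numeric % table_size

    slot = modulo_hash("apple", 101)   # index into a table of 101 slots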

How is the Skip-gram with negative sampling applied in FastText?

The skip-gram with negative sampling (SGNS) algorithm is used to learn high-quality word embeddings (i.e., dense, low-dimensional representations of words that capture the meaning and context of the words). It works by training a predictive model to predict the context words (i.e., the words that appear near a target word in a given text) given the target word. During training, the model is presented with a sequence of word pairs (a target word and a context word) drawn from the text and learns to score true pairs highly.

To train the model, the SGNS algorithm uses a technique called negative sampling, which involves sampling a small number of negative examples (random words that are not the true context words) and using them to train the model along with the positive examples (the true context words). This helps the model to learn the relationship between the target and context words more efficiently by focusing on the most informative examples.

In FastText, the SGNS algorithm proceeds as follows:

  • The embedding for a target word (also called the ‘center word’) is calculated by taking the sum of the embeddings for the word itself and the character n-grams that make up the word.
  • The context words are represented by their word embeddings, without adding the character n-grams.
  • Negative samples are selected randomly from the vocabulary during training, with the probability of selecting a word being proportional to the square root of its unigram frequency (i.e., the number of times it appears in the text).
  • The dot product of the embedding for the center word and the embedding for each context word (and for each negative sample) is calculated and passed through a sigmoid function, giving the probability that the pair is a genuine (target, context) pair. Because negative sampling turns the problem into a handful of binary classifications, there is no need to normalize scores over the entire vocabulary as the full softmax would require.
  • Compute the cross-entropy loss between the predicted and true labels for these pairs, and use an optimization algorithm such as stochastic gradient descent (SGD) to update the embedding vectors in order to minimize this loss (see the sketch after this list). This has the effect of bringing the actual context words closer to the center word (i.e., the target word) and pushing the negative samples further away from it.

    The cross-entropy loss function for one center word c can be expressed as:

    L = −∑_i [ y_i log(p(w_i|c)) + (1 − y_i) log(1 − p(w_i|c)) ]

    where:

  • L is the cross-entropy loss.
  • y_i is a binary variable indicating whether word i is a positive example (y_i = 1, a true context word) or a negative example (y_i = 0, a sampled word).
  • p(w_i|c) is the predicted probability that word i is a context word of the center word c, obtained by applying the sigmoid to the dot product of their embeddings.
  • ∑_i indicates that the sum runs over the true context words and the sampled negative words.
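The sketch below spells this loss out for a single (center word, context word) pair with five negative samples, using random vectors in place of learned embeddings:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    dim = 50
    rng = np.random.default_rng(0)
    center = rng.normal(scale=0.1, size=dim)          # sum of word + subword vectors
    positive = rng.normal(scale=0.1, size=dim)        # true context word
    negatives = rng.normal(scale=0.1, size=(5, dim))  # 5 sampled negative words

    # Binary cross-entropy: y = 1 for the true context word, y = 0 for negatives.
    loss = -np.log(sigmoid(center @ positive))
    loss -= np.sum(np.log(1.0 - sigmoid(negatives @ center)))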

FastText and hierarchical softmax

FastText can use a technique called hierarchical softmax to reduce the computation time during training. Hierarchical softmax works by organizing the vocabulary into a binary tree in which every word is a leaf; the tree (typically a Huffman tree) is built so that frequent words sit close to the root and rare words sit deeper down.

During training, the model uses the hierarchical structure of the tree to compute the loss and update the model weights more efficiently. This is done by traversing the tree from the root to the appropriate leaf node for each word, rather than computing the loss and updating the weights for every word in the vocabulary separately.

The standard softmax function has a computational complexity of O(Kd), where K is the number of classes (i.e., the size of the vocabulary) and d is the number of dimensions in the hidden layer of the model. This complexity arises from the need to normalize the probabilities over all potential classes in order to obtain a valid probability distribution. The hierarchical softmax reduces the computational complexity to O(d*log(K)). Huffman coding can be used to construct a binary tree structure for the softmax function, where the lowest frequency classes are placed deeper into the tree and the highest frequency classes are placed near the root of the tree.

In the hierarchical softmax function, a probability is calculated for each path through the Huffman coding tree, based on the dot product of the vector v_n_i of each inner node n along the path and the output value of the hidden layer of the model, h. The sigmoid function is then applied to each dot product to obtain a probability between 0 and 1.

The idea of this method is to represent the output classes (i.e., the words in the vocabulary) as the leaves on the tree and to use a random walk through the tree to assign probabilities to the classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as the product of the probabilities along the path from the root to the leaf node corresponding to the class.

This allows the hierarchical softmax function to compute the probability of each class more efficiently, since it only needs to consider the path through the tree rather than the entire vocabulary. This can significantly reduce the computational complexity of the model, particularly for large vocabularies, making it practical to train word embeddings on very large datasets.
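For illustration, a Huffman tree over a tiny vocabulary can be built with a few lines of Python (a sketch, not the library’s implementation; the nested tuples stand in for the binary tree, with words as leaves):

    import heapq
    import itertools

    def build_huffman_tree(word_counts):
        """Repeatedly merge the two least frequent nodes, so rare words
        end up deep in the tree and frequent words near the root."""
        tie = itertools.count()  # tie-breaker so heapq never compares trees
        heap = [(count, next(tie), word) for word, count in word_counts.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (c1 + c2, next(tie), (left, right)))
        return heap[0][2]

    tree = build_huffman_tree({"the": 100, "dog": 20, "barked": 5, "loudly": 2})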

Hierarchical softmax and conditional probabilities

To compute the probability of each context word given the center word and its embedding using the hierarchical softmax function, we first organize the vocabulary into a binary tree, with the words at the leaves of the tree and the tree structured so that frequent words lie close to the root.

We then compute the probability of each context word by traversing the tree from the root to the appropriate leaf node for the word. For each inner node n in the tree, we compute the probability of traversing the left or right branch of the tree as follows:

p(left|n) = sigmoid(v_n_i · h)
p(right|n) = 1 − p(left|n)

where:

  • v_n_i is the vector representation of inner node n
  • h is the output value of the hidden layer of the model

The probability of a context word w is then computed as the product of the probabilities of the branches along the path from the root to the leaf node corresponding to w.
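Putting the two formulas together, the probability of a word is a product of sigmoids along its path. The sketch below assumes the path is given as a list of (inner-node vector, branch direction) pairs; the vectors are random stand-ins for learned parameters:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def path_probability(h, path):
        """Probability of reaching one leaf (word), given the hidden vector h
        and the (v_n_i, go_left) pairs for the inner nodes on its path."""
        p = 1.0
        for v_n, go_left in path:
            p_left = sigmoid(np.dot(v_n, h))
            p *= p_left if go_left else 1.0 - p_left
        return p

    rng = np.random.default_rng(0)
    h = rng.normal(size=10)                  # hidden-layer output
    path = [(rng.normal(size=10), True),     # go left at the root
            (rng.normal(size=10), False)]    # go right at the next node
    prob = path_probability(h, path)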
