CBOW – Notes of a Neuropsychiatry Amateur

Let’s talk about natural language processing. How does input text data get transformed to numerical vectors?

In natural language processing (NLP), input text data is often transformed into numerical vectors in order to make it more suitable for machine learning algorithms to work with. There are several methods that can be used to transform text data into numerical vectors, including:

One-hot encoding: This is a simple method that converts each word in the text into a unique integer index, and then represents each word as a binary vector, with a “1” in the position corresponding to the index of the word and “0”s in all other positions. One-hot encoding is a common method for converting text data into numerical vectors, but it can result in very large and sparse vectors, which may not be efficient to work with.
Word embeddings: Word embeddings are numerical vectors that represent the meaning of a word in a continuous, low-dimensional space. Word embeddings can be trained on large datasets of text data using techniques such as word2vec or GloVe, and they can capture the contextual relationships between words and the similarity between words. Word embeddings are a popular method for representing text data in NLP, as they can capture the meaning of words in a compact and dense form.
Bag of words: The bag of words model represents text data as a fixed-length vector, with each element in the vector representing the number of times a particular word appears in the text. The bag of words model is a simple and commonly used method for representing text data, but it does not capture the order or context of the words in the text.

There are many other methods for converting text data into numerical vectors, and the specific method that is used may depend on the specific NLP task and the goals of the model. It is important to choose an appropriate method for representing text data in order to effectively capture the meaning and context of the text.

What are word2vec word embeddings? How are they trained?

Word2vec is a method for training word embeddings, which are numerical vectors that represent the meaning of a word in a continuous, low-dimensional space. Word embeddings are a popular method for representing text data in natural language processing (NLP), as they can capture the meaning of words in a compact and dense form and can be used as input to machine learning algorithms.

Word2vec is a type of neural network model that is trained to predict the context of a word given a set of surrounding words, or to predict a target word given a set of context words. During training, the model is presented with a large dataset of text data and learns to predict the context of a word based on the words that appear around it. The learned model can then be used to generate word embeddings by taking the weights of the model’s hidden layer as the word vectors.

There are two main variants of the word2vec model: continuous bag of words (CBOW) and skip-gram. In the CBOW model, the goal is to predict the target word given a set of context words, while in the skip-gram model, the goal is to predict the context words given a target word.

Word2vec word embeddings can be trained on large datasets of text data using techniques such as stochastic gradient descent (SGD) or negative sampling.

Provide the exact steps for the Word2Vec algorithm, include equations.

Word2Vec is a technique for learning vector representations of words, also known as “word embeddings”, in a way that captures the context and meaning of the words. The basic idea behind Word2Vec is to use the context of words in a text corpus to predict a target word, based on a sliding window of surrounding words. The prediction task is accomplished using a shallow neural network with a single hidden layer.

Here are the steps for the Word2Vec algorithm:

Preprocess the text data: This involves cleaning and formatting the text data, such as lowercasing, stemming, and removing stop words.
Tokenize the text: This involves dividing the text into individual words or tokens.
Build a vocabulary: This involves creating a list of all the unique words in the text data and assigning each word a unique index.
Create word-context pairs: For each word in the text, create a word-context pair by considering a window of surrounding words. For example, if the window size is 2, the word-context pairs for the word “cat” in the sentence “The cat sat on the mat” would be (“cat”, “The”), (“cat”, “sat”), (“cat”, “on”), and (“cat”, “the”).
Train the model: Use the word-context pairs to train a shallow neural network with a single hidden layer. The input layer consists of the context words, and the output layer consists of the target word. The hidden layer contains a set of “neurons” that represent the word embeddings.

The training process involves minimizing the error between the predicted output (the target word) and the true output (the context words) using stochastic gradient descent. The error is measured using a loss function, such as the negative log likelihood loss.

Here is the equation for the negative log likelihood loss:

Loss = -log P(w | c)

where w is the target word and c is the context.

Extract the word embeddings: Once the model is trained, the word embeddings can be extracted from the hidden layer of the neural network. These word embeddings can then be used for various natural language processing tasks, such as text classification, machine translation, and information retrieval.

You mentioned that word2vec can be trained using negative sampling. How does that work?

Yes, that’s correct! In addition to the traditional Word2Vec algorithm that uses the negative log likelihood loss function, there is also a variant called “negative sampling” that can be used to train the model. Negative sampling is a simplified training method that can be used to learn word embeddings more efficiently, especially for large text corpora.

Here’s how negative sampling works:

Preprocess and tokenize the text data as described in the previous steps.
Build a vocabulary and create word-context pairs as described in the previous steps.
Train the model: For each word-context pair, randomly sample a number of negative examples (i.e., words that are not the target word). The objective is to maximize the probability of the target word given the context words, while minimizing the probability of the negative examples given the context words.

The training process involves minimizing the following loss function:

Loss = -log(sigmoid(u_w * v_c)) – ∑log(sigmoid(-u_n * v_c))

where u_w and v_c are the embeddings for the target word and the context words, respectively, and u_n is the embedding for a negative example. The sigmoid function is used to squash the output to a value between 0 and 1.

Extract the word embeddings: Once the model is trained, the word embeddings can be extracted from the hidden layer of the neural network as described in the previous steps.

So when word2vec is trained using negative sampling, that means that just the loss function changes, otherwise stochastic gradient descent is still used?

Yes, that’s correct. When training word2vec using negative sampling, the loss function changes, but the underlying training algorithm, stochastic gradient descent (SGD), remains the same.

In word2vec, the goal is to learn a set of word vectors that can predict the surrounding words in a text. Negative sampling is a variant of the word2vec model that uses a different loss function to achieve this goal. Specifically, it uses a simplified version of the traditional word2vec loss function that only considers a small, randomly-selected subset of the negative examples (words that are not the target word) at each training step, rather than considering all of the negative examples. This allows the model to be trained more efficiently, as it doesn’t need to compute the loss for all of the negative examples at each training step.

SGD is still used to optimize the model’s parameters (the word vectors) by updating them in the direction that reduces the loss. The key difference with negative sampling is that the loss function being optimized is different from the traditional word2vec loss function.

But basically word2vec is trained using a pretty simple neural network, with just one hidden layer

Yes, that’s correct. Word2vec is typically implemented as a shallow neural network with just one hidden layer. The input to the network is a one-hot encoded word, and the hidden layer is a dense (fully-connected) layer with a fixed number of units. The output of the hidden layer is passed through a softmax activation function to produce a probability distribution over the vocabulary, which represents the model’s prediction of the probability of each word in the vocabulary given the input word.

You mentioned that there are two variants of word2vec – CBOW and skip-gram, what is the difference between the two?

In the Continuous Bag-of-Words (CBOW) model, the goal is to predict the current word given the surrounding context. For example, given the context “the cat sat on,” the model would try to predict the word “mat.” CBOW uses a sliding window approach to build up the context from the surrounding words.

On the other hand, the Skip-Gram model does the opposite: it tries to predict the surrounding context given a current word. So in the same example, given the word “mat,” the model would try to predict the context “the cat sat on.” Skip-gram works well with small amounts of data and is faster to train than CBOW.

What is the algorithms for the CBOW model?

The CBOW model uses a neural network with a single hidden layer to predict the current word based on the surrounding context. The input to the model is a one-hot encoded vector representing the context words, and the output is a probability distribution over the vocabulary, with the goal of predicting the target word with the highest probability.

The model is trained using the negative sampling loss function, which tries to maximize the probability of the target word given the context and minimize the probability of randomly sampled negative examples (words that are not the target word).

Here’s a summary of the CBOW algorithm:

Initialize the model weights and biases.
Iterate through each training example, which consists of a target word and a context of surrounding words.
Convert the context words and target word to one-hot encoded vectors.
Calculate the hidden layer activations by taking the dot product of the input vector (one-hot encoded context words) and the input weight matrix, and adding the bias.
Apply the activation function (e.g. sigmoid or ReLU) to the hidden layer activations to get the hidden layer output.
Calculate the output layer activations by taking the dot product of the hidden layer output and the output weight matrix, and adding the bias.
Apply the softmax function to the output layer activations to get the predicted probability distribution over the vocabulary.
Calculate the loss using the negative sampling loss function, which compares the predicted probability of the target word to the actual label (1 for the target word, 0 for the negative samples).
Backpropagate the error and update the model weights and biases using gradient descent.
Repeat steps 2-9 until the model has converged (i.e. the loss is minimized).

The CBOW algorithm uses the negative sampling loss function to train the model. The negative sampling loss function is defined as:

Loss = -(log(sigmoid(u_o * v_c)) + ∑ log(sigmoid(-u_i * v_c)))
where:

u_o is the output weight vector for the target word.
v_c is the input weight vector for the context words.
u_i is the output weight vector for a negative sample (a word that is not the target word).

The first term in the loss function maximizes the probability of the target word given the context, while the second term minimizes the probability of the negative samples.

The sigmoid function is applied to the dot product of the weight vectors to squash the output to a value between 0 and 1, which can be interpreted as a probability. The log function is applied to the sigmoid output to make the loss function smoother and easier to optimize.

The negative sampling loss function is used because it is computationally more efficient than the full softmax loss function, which involves calculating the probabilities of all words in the vocabulary.

Why is the dot product used? What does it represent?

In the word2vec algorithm, the dot product of the input and output weight vectors is used to measure the similarity between the context words and the target word. The dot product is a measure of the angle between two vectors, and it is defined as the product of the magnitudes of the vectors and the cosine of the angle between them.

In the case of word2vec, the input and output weight vectors are learned embeddings for the context words and the target word, respectively. These embeddings are learned during training, and they represent the underlying relationships between the words in the vocabulary.

The dot product of the input and output weight vectors is then used to predict the probability of the target word given the context. If the dot product is high, it means that the angle between the vectors is small, indicating that the context and target words are similar. This results in a high probability for the target word. On the other hand, if the dot product is low, it means that the angle between the vectors is large, indicating that the context and target words are dissimilar. This results in a low probability for the target word.

Overall, the dot product is used in the word2vec algorithm as a measure of similarity between the context words and the target word, which is used to predict the probability of the target word given the context.

What is the algorithm for the skip-gram model?

The Skip-Gram model is a neural network that uses a single hidden layer to predict the surrounding context given a current word. The input to the model is a one-hot encoded vector representing the current word, and the output is a probability distribution over the vocabulary, with the goal of predicting the context words with the highest probability.

The Skip-Gram model is trained using the negative sampling loss function, which tries to maximize the probability of the context words given the current word and minimize the probability of randomly sampled negative examples (words that are not in the context).

Here’s a summary of the Skip-Gram algorithm:

Initialize the model weights and biases.
Iterate through each training example, which consists of a current word and a context of surrounding words.
Convert the current word and context words to one-hot encoded vectors.
Calculate the hidden layer activations by taking the dot product of the input vector (one-hot encoded current word) and the input weight matrix, and adding the bias.
Apply the activation function (e.g. sigmoid or ReLU) to the hidden layer activations to get the hidden layer output.
Calculate the output layer activations by taking the dot product of the hidden layer output and the output weight matrix, and adding the bias.
Apply the softmax function to the output layer activations to get the predicted probability distribution over the vocabulary.
Calculate the loss using the negative sampling loss function, which compares the predicted probabilities of the context words to the actual labels (1 for the context words, 0 for the negative samples).
Backpropagate the error and update the model weights and biases using gradient descent.
Repeat steps 2-9 until the model has converged (i.e. the loss is minimized).

The negative sampling loss function used in the Skip-Gram model is defined as:

Loss = -(∑ log(sigmoid(u_c * v_o)) + ∑ log(sigmoid(-u_i * v_o)))

where:

u_c is the output weight vector for a context word.
v_o is the input weight vector for the current word.
u_i is the output weight vector for a negative sample (a word that is not in the context).

The first term in the loss function maximizes the probability of the context words given the current word, while the second term minimizes the probability of the negative samples.

In which cases is it better to use the CBOW model vs. Skip-Gram for word2vec?

The Continuous Bag-of-Words (CBOW) model and the Skip-Gram model are two approaches for learning vector representations of words, also known as word embeddings, in natural language processing tasks. These models are trained to predict a target word given the context of surrounding words, using a large corpus of text as input.

The main difference between the CBOW and Skip-Gram models is the way in which they define the context of a word. The CBOW model predicts the target word based on the context provided by the surrounding words, while the Skip-Gram model predicts the surrounding context words based on the target word.

In general, the CBOW model is faster to train and performs better on smaller datasets, while the Skip-Gram model is slower to train but can provide better results on larger datasets. The CBOW model is also better at handling rare words, as it takes into account the context provided by multiple surrounding words, while the Skip-Gram model only considers the context provided by a single target word.

In practice, the choice between the CBOW and Skip-Gram models will depend on the specific application and the available resources. If you have a large dataset and the time and computational resources to train a model, the Skip-Gram model may provide better results. If you have a smaller dataset and need to train a model quickly, the CBOW model may be a better choice.

Tag: CBOW

Chatting with GPT – Natural Language Processing