Build a Model with Word2Vec
- Explain what a vector gradient is.
- Compare gradient descent and stochastic gradient descent.
We have the linguistics. We have the math. What’s next on our NLP journey?
Training a Model
Now that you know the objective function, it’s time to train the model. Let’s start by reviewing the objective function.
Remember that theta in the objective function stands for all the vectors in the vocabulary. This includes both the vectors for each word as the center word and the vectors for each word as the context word. For a simple vocabulary, you’d have some center word vectors (like v_aardvark and v_zebra), and some context word vectors (like u_aardvark and u_zebra).
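For reference, here is one common way to write this objective for the skip-gram model with a naive softmax. Treat it as a sketch: the symbols T (the number of positions in the text), m (the window size), and V (the vocabulary) are assumptions introduced here for illustration, and the notation may differ from what you saw earlier in this module.

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta),
\qquad
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```

Here v_c is the vector for the center word c, and u_o is the vector for the context word o.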
To train the model, we need to calculate the vector gradients for every vector in theta.
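What’s a vector gradient? It’s the set of partial derivatives of the objective function with respect to each number in a vector. In other words, it tells you how quickly the objective function changes, and in which direction, when you nudge each component of that vector. This information is what gives the machine a way to adjust each word’s vector toward values that better predict the words around it. Thankfully, Word2vec calculates all the gradients for you!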
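To make the idea concrete, here is a minimal sketch that computes the gradient for a single center word vector by hand, assuming the naive softmax loss for one (center word, context word) pair. The three-word vocabulary, the vector values, and the function names are all made up for illustration. The finite-difference loop at the end is only a sanity check that the formula matches the actual rate of change.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(v_c, U, o):
    """Naive softmax loss, -log P(o | c), for one (center, context) pair."""
    probs = softmax([dot(u, v_c) for u in U])
    return -math.log(probs[o])

def grad_center(v_c, U, o):
    """Gradient of the loss with respect to v_c: sum_w P(w | c) * u_w - u_o."""
    probs = softmax([dot(u, v_c) for u in U])
    grad = [0.0] * len(v_c)
    for p, u in zip(probs, U):
        for i, u_i in enumerate(u):
            grad[i] += p * u_i
    return [g - u_i for g, u_i in zip(grad, U[o])]

# Hypothetical toy values: 3 words, 2-dimensional vectors.
U = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]  # context vectors u_w
v_c = [0.1, 0.6]                             # center vector v_c
o = 1                                        # index of the observed context word

analytic = grad_center(v_c, U, o)

# Sanity check: nudge each component of v_c and measure how the loss changes.
eps = 1e-6
numeric = []
for i in range(len(v_c)):
    bumped = list(v_c)
    bumped[i] += eps
    numeric.append((loss(bumped, U, o) - loss(v_c, U, o)) / eps)
```

The gradient has one entry per component of v_c, which is why it is itself a vector.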
Optimize Parameters with Gradient Descent
To optimize your model, you need to minimize the objective function J(θ). In other words, you want your model to have the smallest amount of error possible. One way to minimize J(θ) is an algorithm called gradient descent.
Here’s the basic idea: for the current value of theta (the current word vectors), calculate the gradient of J(θ). Then use that gradient information to take a small step in the direction of the negative gradient by changing theta. Repeat these steps over and over, until you reach the minimum value for J(θ) (where the gradient of J(θ) with respect to theta is zero).
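The loop described above can be sketched in a few lines. This example does not use the Word2vec objective. It assumes a simple stand-in function whose minimum is known in advance, so you can see the steps converge. The names `target`, `learning_rate`, and `steps` are hypothetical choices for this illustration.

```python
def J(theta, target):
    """Stand-in objective: squared distance from theta to a known target."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def grad_J(theta, target):
    """Gradient of the stand-in objective with respect to theta."""
    return [2 * (t - g) for t, g in zip(theta, target)]

def gradient_descent(theta, target, learning_rate=0.1, steps=100):
    """Repeatedly take a small step in the direction of the negative gradient."""
    for _ in range(steps):
        grad = grad_J(theta, target)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

target = [3.0, -1.0]                       # made-up location of the minimum
theta = gradient_descent([0.0, 0.0], target)
```

The learning rate controls the size of each step. If it is too large, the updates can overshoot the minimum. If it is too small, training takes many more steps.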
Stochastic Gradient Descent
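Gradient descent makes a lot of sense. However, J(θ) itself is an extremely large function. Remember, J(θ) includes all the windows in the input text, and there could be billions of them!
This means that computing the gradient of J(θ) is very expensive. You would have to wait a very long time to make each update, because every update requires a pass over the entire text. Stochastic gradient descent addresses this problem. Instead of using every window, you repeatedly sample a window (or a small batch of windows), calculate the gradient for just that sample, and update theta right away. Each update is a noisier estimate of the true gradient, but it is far cheaper, so you can make many more updates in the same amount of time.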
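Here is a small sketch that contrasts the two approaches on a toy problem. It assumes a stand-in objective (the average squared distance from theta to each data point) instead of the Word2vec objective, and the five data points stand in for windows. All names and values are hypothetical.

```python
import random

def full_gradient(theta, data):
    """Gradient of the average of (theta - x)^2 over ALL data points."""
    return sum(2 * (theta - x) for x in data) / len(data)

def sample_gradient(theta, x):
    """Gradient of (theta - x)^2 for ONE sampled data point."""
    return 2 * (theta - x)

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # stand-ins for windows; the minimum is their mean, 3.0
learning_rate = 0.01

# Gradient descent: every update reads the whole dataset.
theta_gd = 0.0
for _ in range(2000):
    theta_gd -= learning_rate * full_gradient(theta_gd, data)

# Stochastic gradient descent: every update reads one randomly chosen point.
rng = random.Random(0)  # fixed seed so the run is repeatable
theta_sgd = 0.0
for _ in range(2000):
    x = rng.choice(data)
    theta_sgd -= learning_rate * sample_gradient(theta_sgd, x)
```

Gradient descent settles almost exactly on the minimum, while stochastic gradient descent hovers near it. With five data points the cost difference is negligible, but with billions of windows each stochastic update is dramatically cheaper.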
Other Approaches with Word2vec
Up to this point, this module has discussed predicting which context words you find around a specific center word. This approach is called the skip-gram model. Word2vec also supports the continuous bag of words model. With the continuous bag of words model, rather than predicting context words from a center word, you predict the center word, given its context words.
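One way to see the difference between the two models is to look at the training examples each one draws from the same text. The sketch below is an illustration only: the function names, the sample sentence, and the window size are assumptions, and a real Word2vec implementation handles tokenization and windowing in its own way.

```python
def context_indices(i, length, window):
    """Positions within `window` words of position i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window):
    """(center, context) pairs: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in context_indices(i, len(tokens), window):
            pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, center) examples: predict the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j] for j in context_indices(i, len(tokens), window)]
        examples.append((context, center))
    return examples

tokens = ["the", "zebra", "chased", "the", "aardvark"]  # made-up sentence
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

With a window of 1, the skip-gram version produces 8 pairs, starting with ("the", "zebra"). The continuous bag of words version produces one example per position, such as (["zebra", "the"], "chased") for the third word.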
There are lots of ways to train the model using Word2vec. In addition to the naive softmax method we have discussed in detail, Word2vec supports a technique called negative sampling. With negative sampling, rather than updating all the word vectors after each optimization step (as with gradient descent), or updating just the word vectors in your sample (as with stochastic gradient descent), you update both the word vectors in your sample and the vectors for a small set of negative words. This means that you pick a set of words that are very unlikely to be found near your center word (if you’re using the skip-gram model), and push their predicted probability toward zero. (If you’re using the continuous bag of words model, you pick a set of words very unlikely to be the center word, and push their predicted probability toward zero.)
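The sketch below shows what a single negative sampling update might look like for the skip-gram model. It assumes the commonly used loss for this technique, -log σ(u_o · v_c) minus the sum of log σ(-u_k · v_c) over the negative words k, where σ is the sigmoid function. The vector values, the learning rate, and the function names are hypothetical, and choosing which negative words to sample is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_c, u_o, negatives):
    """Loss for one center word, one true context word, and K negative words."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

def neg_sampling_step(v_c, u_o, negatives, learning_rate=0.1):
    """One gradient step. Only v_c, u_o, and the sampled negatives change."""
    pos_err = sigmoid(dot(u_o, v_c)) - 1.0                    # pushes the true pair toward 1
    neg_errs = [sigmoid(dot(u_k, v_c)) for u_k in negatives]  # pushes negatives toward 0

    # Gradient for the center vector collects a term from every word involved.
    grad_v = [pos_err * x for x in u_o]
    for err, u_k in zip(neg_errs, negatives):
        grad_v = [g + err * x for g, x in zip(grad_v, u_k)]

    # All updates use the old vectors, so they are applied simultaneously.
    new_u_o = [x - learning_rate * pos_err * y for x, y in zip(u_o, v_c)]
    new_negs = [[x - learning_rate * err * y for x, y in zip(u_k, v_c)]
                for err, u_k in zip(neg_errs, negatives)]
    new_v_c = [x - learning_rate * g for x, g in zip(v_c, grad_v)]
    return new_v_c, new_u_o, new_negs

# Hypothetical toy vectors.
v_c = [0.1, 0.6]                         # center word
u_o = [0.4, 0.3]                         # observed context word
negatives = [[0.2, -0.1], [-0.5, 0.1]]   # two sampled negative words

before = neg_sampling_loss(v_c, u_o, negatives)
new_v_c, new_u_o, new_negs = neg_sampling_step(v_c, u_o, negatives)
after = neg_sampling_loss(new_v_c, new_u_o, new_negs)
```

Notice that the update touches only a handful of vectors, no matter how large the vocabulary is. That is what makes negative sampling much cheaper than computing the full softmax over every word.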
Get Hands On with Natural Language Processing
In this trail, you’ll complete problem sets using a Google product called Colaboratory. That means you need a Google account to complete the challenges. If you don’t have a Google account, or you want to use a separate one, create an account before you begin.
Now you’re ready to start coding! Each piece of code is contained in a cell. When you click into a cell, a play button appears. This button lets you run the code in that cell.