# Build a Model with Word2Vec

## Learning Objectives

- Explain what a vector gradient is.
- Compare gradient descent and stochastic gradient descent.

We have the linguistics. We have the math. What’s next on our NLP journey?

## Training a Model

Now that you know the objective function, it’s time to train the model. Let’s start with a quick review.

Remember that theta in the objective function stands for all the vectors in the vocabulary.
This includes both the vectors for each word as the center word and the vectors for each word as
the context word. For a simple vocabulary, you’d have some center word vectors (like `v_{aardvark}` and
`v_{zebra}`), and some context word vectors (like `u_{aardvark}` and `u_{zebra}`).

To train the model, we need to calculate the vector gradients for every vector in theta.

What’s a vector gradient? It’s the collection of partial derivatives of the objective function with respect to each component of a vector—in other words, how fast the error changes as you nudge that vector in each direction. This is what gives the machine a way to adjust every word vector toward its most appropriate values. Thankfully, Word2vec calculates all the gradients for you!
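To make the idea concrete, here’s a minimal numeric sketch of a gradient as a vector of partial derivatives. The 2-D “word vector” and the quadratic objective `J` below are made-up stand-ins for illustration, not the real Word2vec objective:

```python
# Hypothetical toy objective over a 2-D "vector" v = (x, y),
# minimized at v = (3, -1). This stands in for J(theta).
def J(v):
    x, y = v
    return (x - 3) ** 2 + (y + 1) ** 2

def gradient(f, v, h=1e-6):
    """Approximate the gradient of f at v with central differences:
    one partial derivative per component of the vector."""
    grad = []
    for i in range(len(v)):
        vp, vm = list(v), list(v)
        vp[i] += h
        vm[i] -= h
        grad.append((f(vp) - f(vm)) / (2 * h))
    return grad

g = gradient(J, [0.0, 0.0])
# Analytically, the gradient at (0, 0) is (2*(0 - 3), 2*(0 + 1)) = (-6, 2).
```

Each entry of `g` tells you how quickly the error grows if you nudge that component of the vector, which is exactly the information an optimizer needs.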

## Optimize Parameters with Gradient Descent

To optimize your model, you need to minimize the objective function `J(θ)`. In other words, you want your model to have the smallest amount of error
possible. One way to minimize `J(θ)` is an algorithm called
gradient descent.

Here’s the basic idea: for the current value of theta (the current word vectors), calculate the
gradient of `J(θ)`. Then use that gradient information to take
a small step in the direction of the negative gradient by changing theta. Repeat these steps over and
over, until you reach the minimum value of `J(θ)` (where the
gradient of `J(θ)` is zero).

## Stochastic Gradient Descent

Gradient descent makes a lot of sense. However, `J(θ)` is a function of every window in the input text,
and a corpus can contain billions of windows!

This means that computing the full gradient of `J(θ)` is very
expensive: you would have to wait a very long time to make each update, as you calculate `J(θ)` over the entire corpus again and again. Stochastic gradient descent solves this by estimating the gradient from a small sample—typically one window (or a small batch of windows) at a time—and updating theta after each sample. The individual updates are noisier, but they’re so cheap that you can make many of them quickly.
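Here’s a sketch of the stochastic approach on a toy problem. Instead of computing the gradient over the whole dataset at every step, each update uses one randomly sampled point (the dataset and numbers are illustrative assumptions, with data points standing in for context windows):

```python
import random

random.seed(0)
# Toy dataset: many noisy observations centered near 3.0,
# standing in for the billions of windows in a real corpus.
data = [3.0 + random.uniform(-0.5, 0.5) for _ in range(10_000)]

# Full objective: J(theta) = average over ALL points of (theta - x)^2.
# Full gradient descent would touch every point per step; stochastic
# gradient descent estimates the gradient from one sampled point.
theta = 0.0
lr = 0.05
for _ in range(2_000):
    x = random.choice(data)   # sample one "window"
    grad = 2 * (theta - x)    # gradient on just that sample
    theta -= lr * grad        # cheap, noisy update
# theta ends up hovering near the minimizer of the full
# objective (the mean of the data, about 3.0).
```

Each step here costs one data point instead of ten thousand, which is the whole appeal: many cheap, noisy steps beat a few slow, exact ones.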

## Other Approaches with Word2vec

Up to this point, this module has discussed predicting what context words you find around a
specific center word. This approach is called the *skip-gram* model. Word2Vec also supports
the continuous bag of words model. With the *continuous bag of words* model, rather than
finding context words from a center word, you work to predict the center word, given context
words.

There are lots of ways to train the model using Word2vec. In addition to the naive softmax
method we have discussed in detail, Word2vec supports a technique called *negative
sampling*. With negative sampling, rather than updating all the word vectors after each
optimization step (as with gradient descent), or updating just the word vectors in your sample
(as with stochastic gradient descent), you update both the word vectors in your sample **and**
a small set of negative words. This means that you identify a set of words that are very unlikely to be
found near your center word (if you’re using the skip-gram model), and push their predicted probability
toward zero. (If you’re using the continuous bag of words model, you identify a set of words very
unlikely to be the center word, and push their predicted probability toward zero.)
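To make the idea concrete, here’s a minimal pure-Python sketch of one skip-gram negative-sampling update. The tiny vocabulary, vector dimension, learning rate, and sampling scheme are all illustrative assumptions, not Word2vec’s actual implementation:

```python
import math
import random

random.seed(1)
dim = 8
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "far"]
# One center vector (v) and one context vector (u) per word, as in theta.
v = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
u = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_step(center, context, k=3, lr=0.1):
    """One skip-gram update: raise the true (center, context) pair's
    score, and push k sampled negative words' scores toward zero."""
    negatives = random.sample(
        [w for w in vocab if w not in (center, context)], k)
    for word, label in [(context, 1.0)] + [(w, 0.0) for w in negatives]:
        score = sigmoid(dot(v[center], u[word]))
        g = lr * (label - score)  # gradient of the log-likelihood
        for i in range(dim):
            u_old = u[word][i]
            u[word][i] += g * v[center][i]
            v[center][i] += g * u_old

before = sigmoid(dot(v["cat"], u["sat"]))
for _ in range(50):
    negative_sampling_step("cat", "sat")
after = sigmoid(dot(v["cat"], u["sat"]))
# after > before: the true pair's predicted probability went up,
# while only 1 + k vectors were touched per step.
```

Notice that each step touches only `1 + k` context vectors plus one center vector, instead of the whole vocabulary—that’s what makes negative sampling so much cheaper than the naive softmax.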

## Get Hands On with Natural Language Processing

In this trail, you’ll complete problem sets using a Google product called Colaboratory, which means you need a Google account to complete the challenges. If you don’t have a Google account, or you want to use a separate one, create a new account before you start.

- Download the source code.
- Make sure you’re logged in to your Google account.
- Go to Colaboratory.
- In the dialog menu, click **Upload**.
- Choose the source code file (`.ipynb`) and click **Open**.

Now you’re ready to start coding! Each piece of code is contained in cells. When you click into a cell, a play button appears. This button lets you run the code in the cell.

Have fun!