Use GloVe for Natural Language Processing
Learning Objectives
- Explain how GloVe trains word vectors.
- Describe the performance of GloVe compared to other NLP models.
Can We Do Better?
So far, we’ve discussed two methods of constructing word vectors: direct prediction (like Word2vec), and co-occurrence matrices with SVD. Here’s a quick recap of the differences between these two methods.
|  | Global co-occurrence approaches (matrices & SVD) | Local co-occurrence approaches (Word2vec) |
| --- | --- | --- |
| Training speed | Fast | Slow (scales with corpus size) |
| Use of statistics | Efficient | Inefficient |
| Tasks they excel at | Capturing word similarity | Other tasks (e.g. analogical reasoning) |
| Pattern capture | Overemphasizes frequently occurring words | Captures complex patterns beyond word similarity |
Co-occurrence matrices are quick to train, but they mostly capture word similarity, give too much weight to common words, and require extra work (like SVD) to produce practical word vectors. Direct prediction captures patterns in text beyond word similarity, but training can be cumbersome and slow because it scales with the size of your corpus. Can we do any better?
GloVe Is the Best of Both Worlds
A third technique, known as GloVe (short for Global Vectors for Word Representation), combines some of the speed and simplicity of co-occurrence matrices with the power and task performance of direct prediction.
Like the simple co-occurrence matrices we discussed in the previous unit, GloVe is a co-occurrence-based model. It starts by going through the entire corpus and constructing a co-occurrence matrix. While Word2vec learns to represent words by trying to predict context words given a center word (or vice versa), GloVe learns by looking at each pair of words in the corpus that might co-occur.
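To make the first step concrete, here is a minimal sketch of how a co-occurrence matrix can be built from a corpus. The function name, window size, and toy corpus are illustrative, not GloVe's reference implementation; GloVe itself also weights counts by distance within the window, which this sketch omits for simplicity.

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Count how often each ordered pair of words co-occurs within a context window."""
    counts = defaultdict(float)
    for sentence in corpus:
        for i, center in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the center word itself
                    counts[(center, sentence[j])] += 1.0
    return counts

# Toy corpus: with window=1, "cat" co-occurs once each with "the" and "sat"
corpus = [["the", "cat", "sat", "on", "the", "mat"]]
counts = cooccurrence_counts(corpus, window=1)
```

The resulting counts are the P_{ij} values that GloVe's objective function consumes.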
When it comes to producing good word vectors, GloVe is more similar to Word2vec than to our simple co-occurrence matrices. Rather than relying on SVD and other tricks, GloVe uses an objective function to train word vectors from the co-occurrence matrix.
To understand this objective function, it helps to build it up, piece by piece.
The core idea is this: given two words i and j that co-occur, you can optimize their word vectors by minimizing the squared difference between the dot product of the vectors for i and j and the log of the number of times i and j co-occur.
You can express that core calculation for a single pair of words (i, j) like this:

J(Θ) = (u_{i}^{T}v_{j} − log P_{ij})^{2}
- J(Θ) is the objective function, given Θ. Like with Word2vec, Θ is all of the word vectors you can change to minimize the objective function.
- u_{i}^{T}v_{j} is the dot product of the vectors u_{i} and v_{j}. Like with Word2vec, GloVe uses two vectors for each word in the vocabulary (u and v). These separate vectors are also known as the input and output vectors, and they correspond roughly to a word’s row and column in the co-occurrence matrix. After training, GloVe sums the two vectors for each word, which gives slightly better performance in the final model. To read more about how u and v work in GloVe, check out GloVe: Global Vectors for Word Representation.
- P_{ij} is the number of times i and j co-occur.
But minimizing that difference as-is isn’t quite enough. For example, what happens if i and j never co-occur? Then P_{ij} = 0, and the log of 0 is undefined.
Having an undefined value pop up in the objective function is a problem! To solve it, you introduce a weighting function, f(P_{ij}). Adding f(P_{ij}) to the objective function gives this:

J(Θ) = f(P_{ij})(u_{i}^{T}v_{j} − log P_{ij})^{2}
f(P_{ij}) has a few useful properties. First, f(P_{ij}) = 0 when P_{ij} = 0. This means that when i and j don’t co-occur, you don’t need to calculate (u_{i}^{T}v_{j} − log P_{ij})^{2} at all; the whole term is zero.
Second, f(P_{ij}) helps balance the influence of very common and very uncommon words. As we discussed in the context of co-occurrence matrices, some words are very common in most corpora. For example, “this,” “is,” “of,” and “a” are all very common English words. They have very high co-occurrence counts with many words, which gives them undue influence on word vectors. There are also some very rare words, which have low co-occurrence counts but should still have some influence. f(P_{ij}) weights these very common and very uncommon words to balance their influence. For more about f(P_{ij}), check out GloVe: Global Vectors for Word Representation.
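The properties above can be sketched in a few lines. The GloVe paper uses the weighting function f(x) = (x/x_max)^α for x < x_max and f(x) = 1 otherwise, with x_max = 100 and α = 0.75 reported as working well; the function name here is just for illustration.

```python
def weight(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting: zero for zero counts, capped at 1 for frequent pairs."""
    if x == 0:
        return 0.0          # pairs that never co-occur contribute nothing
    if x < x_max:
        return (x / x_max) ** alpha  # rare pairs get a reduced (but nonzero) weight
    return 1.0              # very common pairs are capped, limiting their influence
```

Note how the cap at 1 keeps extremely frequent pairs like ("of", "the") from dominating the objective, while the power α < 1 still lets rare pairs contribute.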
Then we sum that expression over every pair of words i and j in the vocabulary (of size W), which lets us train over the entire vocabulary. As a final touch, we multiply the entire objective function by ½:

J(Θ) = ½ Σ_{i=1}^{W} Σ_{j=1}^{W} f(P_{ij})(u_{i}^{T}v_{j} − log P_{ij})^{2}

The factor of ½ is there so that when you take the derivative of the objective function (for example, when using gradient descent to optimize), it cancels the factor of 2 that comes from differentiating the squared term.
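Putting the pieces together, here is a minimal NumPy sketch of the full objective and its gradients, assuming the paper's weighting function (x_max = 100, α = 0.75). The function name and toy dimensions are illustrative; this is not the reference implementation, which uses stochastic updates rather than full-matrix ones.

```python
import numpy as np

def glove_loss_and_grads(U, V, P, x_max=100.0, alpha=0.75):
    """J(theta) = 1/2 * sum_ij f(P_ij) (u_i^T v_j - log P_ij)^2, plus gradients."""
    # f(P_ij): zero for zero counts, (P_ij / x_max)^alpha capped at 1 otherwise
    F = np.where(P > 0, np.minimum(P / x_max, 1.0) ** alpha, 0.0)
    # mask zero counts before taking logs so log(0) never appears
    logP = np.log(np.where(P > 0, P, 1.0))
    diff = U @ V.T - logP              # entry (i, j) is u_i^T v_j - log P_ij
    loss = 0.5 * np.sum(F * diff ** 2)
    grad_U = (F * diff) @ V            # the 1/2 cancels the 2 from the square
    grad_V = (F * diff).T @ U
    return loss, grad_U, grad_V

# Toy example: random counts, small random input/output vectors
rng = np.random.default_rng(0)
W, d = 5, 4
P = rng.integers(0, 50, size=(W, W)).astype(float)
U = 0.1 * rng.standard_normal((W, d))
V = 0.1 * rng.standard_normal((W, d))
loss, grad_U, grad_V = glove_loss_and_grads(U, V, P)
U -= 0.01 * grad_U                     # one gradient-descent step
V -= 0.01 * grad_V
```

Note how the clean gradient expressions (no stray factor of 2) are exactly the payoff of the ½ in front of the objective.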
How Does GloVe Perform?
Here’s how GloVe performs on the semantic and syntactic portions of an analogical reasoning task, with different-sized datasets, compared to some of the other methods we’ve discussed.
| Model | Dim. | Size | Sem. | Syn. | Tot. |
| --- | --- | --- | --- | --- | --- |
| CBOW | 300 | 6B | 63.6 | **67.4** | 65.7 |
| SG | 300 | 6B | 73.0 | 66.0 | 69.1 |
| SVD-L | 300 | 6B | 56.6 | 63.0 | 60.1 |
| GloVe | 300 | 6B | **77.4** | 67.0 | **71.7** |
| CBOW | 1000 | 6B | 57.3 | **68.9** | 63.7 |
| SG | 1000 | 6B | **66.1** | 65.1 | **65.6** |
| SVD-L | 300 | 42B | 38.4 | 58.2 | 49.2 |
| GloVe | 300 | 42B | **81.9** | **69.3** | **75.0** |
The bolded values indicate the best performance for a particular dimension and dataset size. As you can see, GloVe performs very well. On the smaller dataset, the continuous bag-of-words (CBOW) model outperformed GloVe on the syntactic measure. On the larger dataset, however, GloVe outperformed CBOW, singular value decomposition (SVD-L), and skip-gram (SG) across the board.
The takeaway here is that GloVe combines the best parts of the co-occurrence matrix models and the Word2vec models we’ve discussed.