Use GloVe for Natural Language Processing
- Explain how GloVe trains word vectors.
- Describe the performance of GloVe compared to other NLP models.
So far, we’ve discussed two methods of constructing word vectors: direct prediction (like Word2Vec), and co-occurrence matrices with SVD. Here’s a quick recap of the differences between these two methods.
| | Global Co-Occurrence Approaches (Matrices & SVD) | Local Co-Occurrence Approaches (Word2Vec) |
|---|---|---|
| Training speed | Fast | Slow (scales with corpus size) |
| Prime performance tasks | Capturing word similarity | Other tasks (e.g. analogical reasoning) |
| Pattern capture | Overemphasizes frequently occurring words | Complex capture that goes beyond word similarity |
Co-occurrence matrices are quick to train, but they mostly capture word similarity, they give too much weight to common words, and they require extra work (like SVD) to produce practical word vectors. Direct prediction captures patterns in text beyond just word similarity, but training can be cumbersome and slow because it scales with the size of your corpus. Can we do any better?
A third technique, known as GloVe (short for Global Vectors for Word Representation), combines some of the speed and simplicity of co-occurrence matrices with the power and task performance of direct prediction.
Like the simple co-occurrence matrices we discussed in the previous unit, GloVe is a co-occurrence-based model. It starts by going through the entire corpus and constructing a co-occurrence matrix. While Word2vec learns how to represent words by trying to predict context words given a center word (or vice versa), GloVe learns by looking at each pair of words in the corpus that might co-occur.
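To make the first step concrete, here is a minimal sketch of building co-occurrence counts from a toy corpus. The window size, tokenization, and corpus are illustrative assumptions, not the paper's implementation (which also weights counts by distance and handles much larger vocabularies):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window_size=2):
    """Count how often each ordered pair of words co-occurs within a window."""
    counts = defaultdict(float)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for center, word in enumerate(tokens):
            # Look at context words within window_size of the center word.
            lo = max(0, center - window_size)
            hi = min(len(tokens), center + window_size + 1)
            for ctx in range(lo, hi):
                if ctx != center:
                    counts[(word, tokens[ctx])] += 1.0
    return counts

corpus = ["the cat sat on the mat", "the dog sat on the log"]
P = cooccurrence_counts(corpus)
print(P[("sat", "on")])  # "on" appears within 2 words of "sat" once per sentence
```

Each entry of `P` plays the role of one cell of the co-occurrence matrix: `P[(i, j)]` is how often word j appears near word i.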
When it comes to producing good word vectors, GloVe is more similar to Word2vec than it is to our simple co-occurrence matrices. Rather than using SVD and other hacks, GloVe uses an objective function to train word vectors from the co-occurrence matrix.
To understand this objective function, it helps to build it up, piece by piece.
The core idea is this: given two words i and j that co-occur, you train the word vectors by minimizing the squared difference between the dot product of the word vectors for i and j and the log of the number of times i and j co-occur.
You can express that core calculation for a single pair of words (i, j) like this:

J(Θ) = (uiTvj - log Pij)2
- J(Θ) is the objective function, given Θ. Like with Word2vec, Θ is all of the word vectors you can change to minimize the objective function.
- uiTvj is the dot product of the vectors ui and vj for words i and j.
Like with Word2vec, GloVe uses two vectors for each word in the vocabulary (u and v). These separate vectors are also known as the input and output vectors, and they correspond roughly to a word’s row and column in the co-occurrence matrix. After training these two vectors for each word, GloVe sums them up, which gives slightly better performance in the final model. To read more about how u and v work with GloVe, check out GloVe: Global Vectors for Word Representation.
- Pij is the count of the number of times i and j co-occur.
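The single-pair calculation above can be sketched numerically. This is a minimal illustration: the embedding dimension, random vectors, and co-occurrence count are all made-up values for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                    # embedding dimension (an arbitrary choice)
u_i = rng.normal(size=d)  # "input" vector for word i
v_j = rng.normal(size=d)  # "output" vector for word j
P_ij = 12.0               # co-occurrence count for the pair (i, j)

# Squared difference between the dot product and the log co-occurrence count.
J_pair = (u_i @ v_j - np.log(P_ij)) ** 2
print(J_pair)
```

Training adjusts `u_i` and `v_j` (for every word pair) to push this squared difference toward zero, so the dot product of two words' vectors comes to encode how often they co-occur.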
But minimizing that difference alone isn't quite enough. For example, what happens if i and j never co-occur? Then Pij = 0, and the log of 0 is undefined.
Having an undefined value pop up in the objective function is a problem! To solve it, you introduce a weighting function, which we call f(Pij), that multiplies the squared difference. With f(Pij) in place, the objective for a single pair becomes:

J(Θ) = f(Pij)(uiTvj - log Pij)2
f(Pij) has a few properties. First, f(Pij) = 0 when Pij = 0. This means that when i and j don't co-occur, you don't need to calculate (uiTvj - log Pij)2 at all; you can just stop at f(Pij).
Secondly, f(Pij) helps counteract the problem of balancing the weight of very common or very uncommon words. As we discussed in the context of co-occurrence matrices, some words are very common in most vocabularies. For example, “this,” “is,” “of,” and “a” are all very common English words. This means they have very high co-occurrence counts with many words, which gives them undue influence on word vectors. There are also some very rare words, which have low co-occurrence counts, but which should still have some influence. f(Pij) applies weighting for these very common and very uncommon words to balance their influence. For more about f(Pij), check out GloVe: Global Vectors for Word Representation.
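For reference, the GloVe paper's choice of weighting function raises the count to a fractional power and caps the weight at 1. The values x_max = 100 and alpha = 0.75 below are the paper's defaults:

```python
def f(x, x_max=100.0, alpha=0.75):
    """Weighting function from the GloVe paper: grows with the count but
    caps at 1 so very frequent pairs don't dominate, and is 0 at x = 0."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

print(f(0))     # 0.0 -- pairs that never co-occur contribute nothing
print(f(10))    # ~0.178 -- rare pairs still get some weight
print(f(1000))  # 1.0 -- very common pairs are capped
```

The fractional exponent is what keeps rare pairs from being drowned out entirely, while the cap keeps words like "the" and "of" from dominating the objective.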
Then we take the sum of that expression over every pair of words i and j in the vocabulary (of size W). This allows us to train over the entire vocabulary.
As a final touch, we multiply the entire objective function by ½. When you take the derivative of the objective function (for example, when using gradient descent to optimize), the factor of 2 produced by differentiating the squared term cancels with the ½, which keeps the gradients tidy.
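Putting the pieces together, the full objective can be sketched as a direct (unoptimized) double loop over the vocabulary. The vocabulary size, vectors, counts, and the inline weighting function are all illustrative assumptions:

```python
import numpy as np

def glove_objective(U, V, P, weight):
    """J(theta) = 1/2 * sum over (i, j) of f(P_ij) * (u_i . v_j - log P_ij)^2.
    U, V: (W, d) arrays of input/output vectors; P: (W, W) co-occurrence counts."""
    J = 0.0
    W = P.shape[0]
    for i in range(W):
        for j in range(W):
            if P[i, j] > 0:  # f(0) = 0, so zero-count pairs are skipped entirely
                diff = U[i] @ V[j] - np.log(P[i, j])
                J += weight(P[i, j]) * diff ** 2
    return 0.5 * J

# Tiny example with a 3-word vocabulary and made-up counts.
rng = np.random.default_rng(1)
W, d = 3, 5
U = rng.normal(size=(W, d))
V = rng.normal(size=(W, d))
P = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])
weight = lambda x, x_max=100.0, alpha=0.75: min((x / x_max) ** alpha, 1.0)
print(glove_objective(U, V, P, weight))
```

A real implementation would iterate only over the nonzero entries of P (which is very sparse) and update U and V with gradient descent, but the objective being minimized is exactly this quantity.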
Here’s an example of GloVe’s performance on semantic and syntactic tasks with different sized datasets compared to some of the other methods we’ve discussed. This evaluation was done for an analogical reasoning task.
The underlined values indicate the best performance for a particular dimension and size of data. As you can see, GloVe performs very well. On smaller datasets, the continuous bag-of-words (CBOW) model outperformed GloVe on syntactic measures. For large datasets, however, GloVe outperformed CBOW, singular value decomposition (SVD), and skip-gram (SG).
The takeaway here is that GloVe combines the best parts of the two approaches we've discussed: co-occurrence matrix models and Word2vec models.