
Use GloVe for Natural Language Processing

Learning Objectives

After completing this unit, you’ll be able to:
  • Explain how GloVe trains word vectors.
  • Describe the performance of GloVe compared to other NLP models.

Can We Do Better?

So far, we’ve discussed two methods of constructing word vectors: direct prediction (like Word2vec), and co-occurrence matrices with SVD. Here’s a quick recap of the differences between these two methods.

                         Global Co-Occurrence Approaches (Matrices & SVD)  Local Co-Occurrence Approaches (Word2vec)
Training speed           Fast                                              Slow (scales with corpus size)
Statistics use           Efficient                                         Inefficient
Prime performance tasks  Capturing word similarity                         Other tasks (e.g., analogical reasoning)
Pattern capture          Overemphasizes frequently occurring words         Complex patterns that go beyond word similarity

Co-occurrence matrices are quick to train, but they mostly capture word similarity, they can give too much weight to common words, and they require extra work (like SVD) to make practical word vectors. Direct prediction captures patterns in text beyond just word similarity, but training can be cumbersome and slow because it scales with the size of your corpus. Can we do any better?

GloVe Is the Best of Both Worlds

A third technique, known as GloVe (short for Global Vectors for Word Representation), combines some of the speed and simplicity of co-occurrence matrices with the power and task performance of direct prediction.

Like the simple co-occurrence matrices we discussed in the previous unit, GloVe is a co-occurrence-based model. It starts by going through the entire corpus and constructing a co-occurrence matrix. While Word2vec learns how to represent words by trying to predict context words given a center word (or vice versa), GloVe learns by looking at each pair of words in the corpus that might co-occur.
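To make that first step concrete, here’s a minimal Python sketch of building a co-occurrence matrix with a symmetric context window. It’s an illustration rather than GloVe’s actual implementation; the toy corpus, the window size, and the function name build_cooccurrence are all our own.

```python
from collections import defaultdict

def build_cooccurrence(corpus, window_size=2):
    """Count how often each pair of words appears within `window_size`
    positions of each other, across every sentence in the corpus."""
    counts = defaultdict(float)
    for sentence in corpus:
        for i, center in enumerate(sentence):
            # Look at neighbors up to window_size positions to either side.
            for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
                if i != j:
                    counts[(center, sentence[j])] += 1.0
    return counts

# Toy corpus: each sentence is a list of tokens.
corpus = [["deep", "learning", "is", "fun"],
          ["learning", "nlp", "is", "fun"]]
cooc = build_cooccurrence(corpus)
print(cooc[("learning", "is")])  # co-occurrence count for this word pair
```

The GloVe paper additionally down-weights each count by how far apart the two words are within the window, which this sketch omits for simplicity.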

When it comes to producing good word vectors, GloVe is more similar to Word2vec than it is to our simple co-occurrence matrices. Rather than using SVD and other hacks, GloVe uses an objective function to train word vectors from the co-occurrence matrix.

J(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij}) \left( u_i^{T} v_j - \log P_{ij} \right)^2

To understand this objective function, it helps to build it up, piece by piece.

The core idea is this: given that i and j are two words that co-occur, you can optimize a word vector by minimizing the difference between the dot product of the word vectors for i and j, and the log of the number of times i and j co-occur, squared.

You can express that core calculation for a single pair of words (i,j) like this:

J(\theta) = \left( u_i^{T} v_j - \log P_{ij} \right)^2
  • J(θ) is the objective function, given θ. Like with Word2vec, θ is all of the word vectors you can change to minimize the objective function.
  • u_i^T v_j is the dot product of the vectors u_i and v_j.

    Like with Word2vec, GloVe uses two vectors for each word in the vocabulary (u and v). These separate vectors are also known as the input and output vectors, and they correspond roughly to a word’s row and column in the co-occurrence matrix. After training these two vectors for each word, GloVe sums them up, which gives slightly better performance in the final model. To read more about how u and v work with GloVe, check out GloVe: Global Vectors for Word Representation.

  • P_ij is the number of times i and j co-occur.
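
As an illustration (not part of the original unit), here’s how that single-pair calculation might look in NumPy; the vectors and the count below are made-up values.

```python
import numpy as np

def pair_loss(u_i, v_j, P_ij):
    """Squared difference between the dot product of the two word
    vectors and the log of their co-occurrence count."""
    return (np.dot(u_i, v_j) - np.log(P_ij)) ** 2

u_i = np.array([0.2, -0.1, 0.4])     # "input" vector for word i (made-up)
v_j = np.array([0.3, 0.5, -0.2])     # "output" vector for word j (made-up)
print(pair_loss(u_i, v_j, P_ij=12))  # P_ij: times i and j co-occur
```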
Note

Why do we square the difference? We want to know the difference between the dot product and the log of the number of times i and j co-occur. However, we don’t care which of those values is bigger (and therefore whether the equation returns a positive or negative value). Squaring the difference makes sure the equation always returns a positive value that describes the gap between the dot product of the two word vectors and the log of the number of times those words co-occur.

But, unlike with Word2vec, minimizing that difference on its own isn’t quite enough. For example, what happens if i and j never co-occur? Then P_ij = 0, and the log of 0 is undefined.

Having an undefined value pop up in the objective function is a problem! To solve it, you introduce a weighting function, which we call f(P_ij). When we fold f(P_ij) into the objective function, we get this:

J(\theta) = f(P_{ij}) \left( u_i^{T} v_j - \log P_{ij} \right)^2

f(P_ij) has a few properties. First, f(P_ij) = 0 when P_ij = 0. This means that when i and j never co-occur, you don’t need to calculate (u_i^T v_j - log P_ij)^2 at all; the whole term is simply zero.

Second, f(P_ij) helps balance the influence of very common and very uncommon words. As we discussed in the context of co-occurrence matrices, some words are very common in most corpora. For example, “this,” “is,” “of,” and “a” are all very common English words. They have very high co-occurrence counts with many other words, which gives them undue influence on the word vectors. There are also some very rare words with low co-occurrence counts that should still have some influence. f(P_ij) weights these very common and very uncommon words to balance their influence. For more about f(P_ij), check out GloVe: Global Vectors for Word Representation.
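
For reference, the weighting function proposed in the GloVe paper has exactly these properties. Here’s a small Python version of it, using the paper’s suggested cutoff x_max = 100 and exponent α = 3/4.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f from the GloVe paper: rises smoothly from 0,
    then plateaus at 1 so very frequent pairs don't dominate training."""
    if x == 0:
        return 0.0          # pairs that never co-occur contribute nothing
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

print(glove_weight(0))    # 0.0   -> skip pairs that never co-occur
print(glove_weight(10))   # ~0.18 -> rare pairs get a reduced weight
print(glove_weight(500))  # 1.0   -> very common pairs are capped
```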

Then we take the sum of that term over every pair of words i and j in the vocabulary (of size W), which lets us train on the whole vocabulary rather than a single pair.

As a final touch, we multiply the entire objective function by ½. That way, when you take the derivative of the objective function (for example, when using gradient descent to optimize it), the ½ cancels the factor of 2 produced by differentiating the squared difference between the dot product and the log co-occurrence count of i and j.
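
To see the cancellation, here’s the gradient of a single weighted term with respect to u_i (a standard calculus step that the unit doesn’t spell out):

\frac{\partial}{\partial u_i} \left[ \frac{1}{2} f(P_{ij}) \left( u_i^{T} v_j - \log P_{ij} \right)^2 \right] = f(P_{ij}) \left( u_i^{T} v_j - \log P_{ij} \right) v_j

Without the ½, every gradient would carry an extra factor of 2.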

J(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij}) \left( u_i^{T} v_j - \log P_{ij} \right)^2
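
Putting the pieces together, here’s a compact NumPy sketch of the full objective and one plain gradient-descent pass. It’s a didactic toy (tiny random vocabulary, made-up counts, fixed learning rate) rather than the optimized GloVe implementation, and all the helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 5, 3                              # toy vocabulary size and vector size
P = rng.integers(0, 10, size=(W, W))       # pretend co-occurrence counts
U = rng.normal(scale=0.1, size=(W, dim))   # "input" vectors u_i
V = rng.normal(scale=0.1, size=(W, dim))   # "output" vectors v_j

def weight(x, x_max=100.0, alpha=0.75):
    """f(P_ij): 0 for pairs that never co-occur, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if 0 < x < x_max else float(x >= x_max)

def objective(U, V, P):
    """J(theta) = 1/2 * sum over i, j of f(P_ij) * (u_i . v_j - log P_ij)^2."""
    total = 0.0
    for i in range(W):
        for j in range(W):
            if P[i, j] > 0:                # f(0) = 0, so skip pairs that never co-occur
                diff = U[i] @ V[j] - np.log(P[i, j])
                total += weight(P[i, j]) * diff ** 2
    return 0.5 * total

def gradient_step(U, V, P, lr=0.05):
    """One pass of gradient descent over every co-occurring pair."""
    for i in range(W):
        for j in range(W):
            if P[i, j] > 0:
                diff = weight(P[i, j]) * (U[i] @ V[j] - np.log(P[i, j]))
                grad_u, grad_v = diff * V[j], diff * U[i]   # the 1/2 and the 2 cancel
                U[i] -= lr * grad_u
                V[j] -= lr * grad_v

print("objective before:", objective(U, V, P))
gradient_step(U, V, P)
print("objective after: ", objective(U, V, P))
# After training, GloVe reports U + V (the sum of the two vectors) as the final word vectors.
```

In this sketch a pass touches every co-occurring pair once with a fixed learning rate; the released GloVe implementation instead shuffles the nonzero matrix entries and trains with AdaGrad, which is part of why it trains quickly.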
Note

GloVe has several advantages over other approaches to word vectors. The first is that it is very fast to train compared to approaches like Word2vec. GloVe scales up efficiently to very large corpora, and it also works well in practice even with small corpora and small word vectors.

How Does GloVe Perform?

Here’s an example of GloVe’s performance on semantic and syntactic tasks with differently sized datasets, compared to some of the other methods we’ve discussed. This evaluation was done on an analogical reasoning task. In the table, Dim. is the dimension of the word vectors, Size is the number of tokens in the training corpus, and Sem., Syn., and Tot. are percent accuracy on the semantic, syntactic, and combined analogy questions.

Model Dim. Size Sem. Syn. Tot.
CBOW 300 6B 63.6 67.4 65.7
SG 300 6B 73.0 66.0 69.1
SVD-L 300 6B 56.6 63.0 60.1
GloVe 300 6B 77.4 67.0 71.7
CBOW 1000 6B 57.3 68.9 63.7
SG 1000 6B 66.1 65.1 65.6
SVD-L 300 42B 38.4 58.2 49.2
GloVe 300 42B 81.9 69.3 75.0

As you can see, GloVe performs very well. On the smaller 6B-token dataset, the continuous bag-of-words (CBOW) model outperformed GloVe on the syntactic measure. On the large 42B-token dataset, however, GloVe outperformed CBOW, skip-gram (SG), and the SVD-based model on every measure.

The takeaway here is that GloVe combines the best parts of the two approaches we’ve discussed: the co-occurrence matrix models and the Word2vec models.