Capture Co-Occurrence Counts
Learning Objectives
- Explain how Word2vec captures co-occurrence.
- Describe a few methods to efficiently capture co-occurrence counts.
- Define singular value decomposition (SVD).
Word2vec and Local Co-Occurrence Statistics
In linguistics, two words co-occur when they have a higher-than-average chance of appearing near each other. Word2vec goes through an entire body of text and, one word at a time, predicts the words around each word, capturing co-occurrence indirectly. At each step it updates word vectors so that words that co-occur end up with more similar vectors. Word2vec never explicitly records how many times words appear near each other; instead, over many steps, it implicitly encodes co-occurrence by refining the word vectors.
There’s another, more direct approach. We can capture those counts directly.
Global Co-Occurrence Statistics
Capturing co-occurrence counts directly is actually an older method of finding word similarities than Word2vec.
The simplest way to capture co-occurrence counts directly is to build a co-occurrence matrix. You go through a body of text, set a window around each word, and keep track of which words appear in that window. For example, consider this tiny corpus:
- I like deep learning.
- I like NLP.
- I enjoy Trailhead.
A simple co-occurrence matrix with a window size of one for that dataset looks like this:
|           | I | like | enjoy | deep | learning | NLP | Trailhead | . |
|-----------|---|------|-------|------|----------|-----|-----------|---|
| I         | 0 | 2    | 1     | 0    | 0        | 0   | 0         | 0 |
| like      | 2 | 0    | 0     | 1    | 0        | 1   | 0         | 0 |
| enjoy     | 1 | 0    | 0     | 0    | 0        | 0   | 1         | 0 |
| deep      | 0 | 1    | 0     | 0    | 1        | 0   | 0         | 0 |
| learning  | 0 | 0    | 0     | 1    | 0        | 0   | 0         | 1 |
| NLP       | 0 | 1    | 0     | 0    | 0        | 0   | 0         | 1 |
| Trailhead | 0 | 0    | 1     | 0    | 0        | 0   | 0         | 1 |
| .         | 0 | 0    | 0     | 0    | 1        | 1   | 1         | 0 |
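Here’s a minimal sketch of how you might build this matrix in Python with NumPy. The tokenization, variable names, and window handling are illustrative assumptions, not a production pipeline.

```python
import numpy as np

# Toy corpus, already split into tokens (an illustrative assumption; a real
# pipeline would use a proper tokenizer).
corpus = [
    ["I", "like", "deep", "learning", "."],
    ["I", "like", "NLP", "."],
    ["I", "enjoy", "Trailhead", "."],
]

# Vocabulary listed in the same order as the table above.
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "Trailhead", "."]
index = {word: i for i, word in enumerate(vocab)}

# Count co-occurrences within a window of one word on either side.
window = 1
counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in corpus:
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[center], index[sentence[j]]] += 1

print(counts)  # reproduces the matrix shown above
```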
This approach has some significant drawbacks, though.
- Co-occurrence matrices grow very quickly. The matrix above is tiny, but it covers only three short sentences and a vocabulary of eight tokens (seven words plus the period). A typical NLP corpus easily contains 20,000 distinct words or more, and because the matrix has a row and a column for every word, it grows with the square of the vocabulary size. The word vectors you take from it (one row per word) also get longer with every word you add.
- Because co-occurrence matrices grow so quickly and are high dimensional, they need a lot of storage. For comparison, a typical Word2vec word vector has 25 to 1,000 dimensions, while a word vector taken from a co-occurrence matrix can have 20,000 dimensions or more. (A dense 20,000 × 20,000 matrix of 64-bit floats alone takes roughly 3.2 GB.)
- In practice, word vectors taken straight from a simple co-occurrence matrix don’t produce as robust a model as Word2vec’s word vectors; the raw counts just don’t work as well.
You can solve some of these problems by creating low-dimensional vectors. Rather than using a word’s full row of the co-occurrence matrix as its vector, you can use singular value decomposition (SVD) to produce vectors with 25 to 1,000 dimensions that keep only the most important information. These vectors are similar in size to the vectors Word2vec produces.
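As a rough sketch of that reduction, you can take the top singular vectors of the counts matrix built earlier and use them as word vectors. The function name and the choice of k here are assumptions; on a real corpus you would keep far more dimensions and use a truncated or randomized SVD rather than a full one.

```python
import numpy as np

def reduce_with_svd(counts: np.ndarray, k: int) -> np.ndarray:
    """Return one k-dimensional vector per word from a co-occurrence matrix."""
    # Full SVD: counts = U @ diag(s) @ Vt, with singular values sorted
    # from largest to smallest.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    # Keep only the k largest singular values and their directions.
    return U[:, :k] * s[:k]

word_vectors = reduce_with_svd(counts.astype(float), k=2)  # toy-sized example
```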
Simple Hacks for Co-Occurrence Matrices
In addition to reducing the size and increasing the density of word vectors, you can also improve their performance with a few tweaks.
One common issue with co-occurrence matrices is that “function words” show up so frequently that they end up with too much influence. In English, these function words include articles (a, an, the), pronouns (he, she, they, and so on), and conjunctions (for, and, nor, or, but, yet, so). To manage this disproportionate influence, you can cap frequency for function words, or ignore them altogether.
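As a quick sketch, capping can be as simple as clamping the rows and columns of the counts matrix for a list of function words. The cap value and word list below are arbitrary examples, and `index` and `counts` refer to the earlier sketch.

```python
import numpy as np

cap = 100  # arbitrary ceiling on function-word counts
function_words = ["a", "an", "the", "he", "she", "they", "and", "or", "but"]

for word in function_words:
    if word in index:
        i = index[word]
        counts[i, :] = np.minimum(counts[i, :], cap)  # cap the word's row
        counts[:, i] = np.minimum(counts[:, i], cap)  # and its column
```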
You can improve performance even more by using ramped windows. With a ramped window, rather than giving equal importance to every word in the window around your center word, you give greater weight to closer words and less weight to words at the edges of the window. For example, when a word appears right next to your center word, you add 1 to the count, but when a word appears three words away, you add only 0.5.
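Here’s one possible ramp, sketched as a variation of the counting loop above. With a window of four, a word right next to the center adds 1 and a word three positions away adds 0.5, matching the example; the exact weighting scheme is an assumption, and other ramps (such as 1/distance) are common too.

```python
import numpy as np

def ramped_cooccurrence(corpus, index, window=4):
    """Count co-occurrences, weighting nearby context words more heavily."""
    counts = np.zeros((len(index), len(index)))
    for sentence in corpus:
        for i, center in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    distance = abs(i - j)
                    # Linear ramp: 1.0 for an adjacent word, down to
                    # 1/window at the far edge of the window.
                    weight = (window - distance + 1) / window
                    counts[index[center], index[sentence[j]]] += weight
    return counts
```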
What’s the Problem with SVD?
Compared to local co-occurrence approaches like Word2vec, SVD and other global co-occurrence approaches are fast to train and make efficient use of corpus-wide statistics.
Even with these improvements, there are still some significant problems with using SVD. First, the computational cost of SVD grows quadratically with the smaller dimension of the matrix: a full SVD of an n × m matrix (with n ≥ m) takes on the order of nm² operations. That makes SVD extremely computationally expensive for the co-occurrence matrix of a typical NLP corpus.
It’s also time consuming to incorporate new words and documents when you use co-occurrence matrices and SVD: every addition changes the matrix, so you have to redo the decomposition. With that in mind, let’s explore another solution.