Capture Co-Occurrence Counts

Learning Objectives

After completing this unit, you’ll be able to:
  • Explain how Word2vec captures co-occurrence.
  • Describe a few methods to efficiently capture co-occurrence counts.
  • Define singular value decomposition (SVD).


Be sure you’ve completed the Word Meaning and Word2vec module before beginning this unit.

While Word2vec is very effective in helping machines learn the meaning of human language through distributed representation, there are other methods that get at the same challenge in different ways. Co-occurrence statistics is one of them.

Word2vec and Local Co-Occurrence Statistics

In linguistics, two words co-occur when they have a higher-than-average chance of appearing near each other. Word2vec goes through an entire body of text and predicts the words around each word, one at a time, capturing their co-occurrence indirectly. At each step it updates the word vectors so that words that co-occur end up with more similar vectors. Word2vec never explicitly records how many times words appear near each other. Instead, over many steps, it implicitly encodes co-occurrence by refining the word vectors.

There’s another, more direct approach. We can capture those counts directly.

Global Co-Occurrence Statistics

Capturing co-occurrence directly is actually an older method of finding word similarities than Word2vec.

The simplest way to capture co-occurrence counts is to use a co-occurrence matrix. To create a co-occurrence matrix, you go through a body of text setting a window size around each word. You then keep track of which words appear in that window.

Rather than using the words around each center word to update a word vector like Word2vec does, you create a matrix to store co-occurrence counts. For example, let’s say your corpus is these three sentences:
  • I like deep learning.
  • I like NLP.
  • I enjoy Trailhead.

A simple co-occurrence matrix with a window size of one for that dataset looks like this:

Counts     I  like  enjoy  deep  learning  NLP  Trailhead  .
I          0  2     1      0     0         0    0          0
like       2  0     0      1     0         1    0          0
enjoy      1  0     0      0     0         0    1          0
deep       0  1     0      0     1         0    0          0
learning   0  0     0      1     0         0    0          1
NLP        0  1     0      0     0         0    0          1
Trailhead  0  0     1      0     0         0    0          1
.          0  0     0      0     1         1    1          0

Co-occurrence matrices capture a lot of the information you need for natural language processing (NLP). They contain both semantic and syntactic information. For example, terms with related meanings (a semantic relationship) often have high co-occurrence with each other (for example, sporting terms tend to cluster together). Words with a similar syntactic role often have similar co-occurrence patterns (for example, in English, verbs often co-occur with the word “to”).
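
As a rough illustration, the matrix for the three example sentences can be built with a few lines of Python. This is a minimal sketch, not a production tokenizer: it assumes the sentences are already split into tokens with the period as its own token, and the function name build_cooccurrence is made up for this example.

```python
import numpy as np

# The three example sentences, pre-tokenized (assumption: "." is its own token).
corpus = [
    ["I", "like", "deep", "learning", "."],
    ["I", "like", "NLP", "."],
    ["I", "enjoy", "Trailhead", "."],
]

# Vocabulary in the same order as the table's rows and columns.
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "Trailhead", "."]


def build_cooccurrence(sentences, vocab, window=1):
    """Count how often each pair of tokens appears within `window` positions."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=int)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            # Positions inside the window, clipped to the sentence boundaries.
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[index[center], index[tokens[j]]] += 1
    return counts


counts = build_cooccurrence(corpus, vocab, window=1)
print(counts)
```

Because every pair is counted once from each side, the resulting matrix is symmetric, which is why rows and columns can be used interchangeably.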


Another benefit that comes with using co-occurrence matrices is speed. It’s very fast to train a model based on this approach.

When you look at a co-occurrence matrix, you might notice that it has a lot of similarities with a word vector. You could even choose to use the columns of a co-occurrence matrix as word vectors. However, there are some drawbacks to this simple approach:
  • Co-occurrence matrices grow very quickly. The matrix above is small, but it covers only three short sentences (seven distinct words plus the period). The vocabulary of a typical NLP corpus easily contains 20,000 words or more. Because the matrix has one row and one column for every word in the vocabulary, its size grows quadratically with vocabulary size, and each word vector taken from it gains a dimension for every new word.
  • Because co-occurrence matrices grow so quickly and are high dimensional, they require a lot of storage space. For illustration, a typical word vector from Word2vec has 25 to 1,000 dimensions. A word vector made from a co-occurrence matrix can have 20,000 dimensions or more!
  • In practice, word vectors made from a simple co-occurrence matrix don’t produce as robust a model as Word2vec’s word vectors—it just doesn’t work as well.
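
To get a feel for the storage problem, here is a back-of-the-envelope sketch. It assumes the matrix is stored densely with 4 bytes per count, which is an assumption made for this example (sparse storage would need less). The helper name dense_matrix_bytes is also made up for illustration.

```python
def dense_matrix_bytes(vocab_size, bytes_per_count=4):
    """Bytes needed to store a dense vocab_size x vocab_size count matrix."""
    return vocab_size * vocab_size * bytes_per_count


for vocab_size in (8, 20_000, 40_000):
    gigabytes = dense_matrix_bytes(vocab_size) / 1e9
    print(f"{vocab_size:>6} words -> {gigabytes:.6f} GB")
```

Under these assumptions a 20,000-word vocabulary needs about 1.6 GB, and doubling the vocabulary to 40,000 words quadruples that to about 6.4 GB.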

You can solve some of the problems of co-occurrence vectors by creating low-dimensional vectors. Rather than keeping one dimension for every word in the vocabulary, you can use singular value decomposition (SVD) to create vectors with 25 to 1,000 dimensions that store only the most important information. These vectors are similar in size to vectors built with Word2vec.



SVD is a technique in linear algebra that lets you decompose (or factor) a matrix. Using SVD on a co-occurrence matrix breaks down that matrix into three matrices of simpler vectors. For our purposes, you don’t need to worry about doing the math yourself. You can run SVD in Python and plot the simpler vectors. If you’d like to learn more about SVD, check out this Stats and Bots blog post.
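
Here is a minimal sketch of what running SVD in Python can look like, using NumPy’s np.linalg.svd on the example matrix from earlier in this unit. Keeping k = 2 dimensions and scaling the retained columns by their singular values are choices made for this illustration, not the only way to build the reduced vectors.

```python
import numpy as np

vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "Trailhead", "."]
counts = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# Factor the matrix into three matrices: counts = U @ diag(S) @ Vt.
# NumPy returns the singular values in S sorted from largest to smallest.
U, S, Vt = np.linalg.svd(counts)

# Keep only the k most important dimensions (k = 2 is an arbitrary choice here).
k = 2
word_vectors = U[:, :k] * S[:k]

for word, vector in zip(vocab, word_vectors):
    print(f"{word:>10} {vector.round(2)}")
```

Each word now has a 2-dimensional vector instead of an 8-dimensional one, so the vectors can be plotted directly on a 2D chart.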

Simple Hacks for Co-Occurrence Matrices

In addition to reducing the size and increasing the density of word vectors, you can also improve their performance with a few tweaks.

One common issue with co-occurrence matrices is that “function words” show up so frequently that they end up with too much influence. In English, these function words include articles (a, an, the), pronouns (he, she, they, and so on), and conjunctions (for, and, nor, or, but, yet, so). To manage this disproportionate influence, you can cap frequency for function words, or ignore them altogether.
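
Both options can be sketched in a few lines. The counts below are invented purely for illustration, and the cap of 100, the ignore list, and the helper names cap_counts and drop_words are all assumptions of this example rather than recommended settings.

```python
import numpy as np

# Toy vocabulary and made-up counts where "the" dominates every row.
vocab = ["the", "cat", "sat"]
counts = np.array([
    [0, 250, 180],
    [250, 0, 3],
    [180, 3, 0],
])


def cap_counts(counts, cap):
    """Option 1: clip every count so no pair contributes more than `cap`."""
    return np.minimum(counts, cap)


def drop_words(vocab, counts, ignore):
    """Option 2: remove the rows and columns of the ignored words."""
    keep = [i for i, word in enumerate(vocab) if word.lower() not in ignore]
    kept_vocab = [vocab[i] for i in keep]
    return kept_vocab, counts[np.ix_(keep, keep)]


capped = cap_counts(counts, cap=100)
small_vocab, small_counts = drop_words(vocab, counts, ignore={"the"})
print(capped)
print(small_vocab, small_counts)
```

Capping keeps the function words in the vocabulary but limits their pull, while dropping them shrinks the matrix as well.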

You can improve performance even more by using ramped windows. With a ramped window, rather than assigning equal importance to every word found in the window around your center word, you give greater importance to closer words, and less importance to words at the edges of the window. For example, when a word appears right next to your center word, you add 1 to the count, but when a word appears three words away from your center word, you add 0.5 to the count.
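
One way to get those numbers is a linear ramp over a window of four words, sketched below. The linear shape, the window size of four, and the function names are assumptions chosen so that the weights match the example (1 for an adjacent word, 0.5 for a word three positions away). Other ramps are possible.

```python
from collections import defaultdict


def ramp_weight(distance, window):
    """Linear ramp: 1.0 for adjacent words, falling to 1/window at the edge."""
    return (window - distance + 1) / window


def ramped_cooccurrence(tokens, window=4):
    """Weighted co-occurrence counts for one tokenized sentence."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        # Look only to the right and add the weight in both directions,
        # so each pair of positions is visited exactly once.
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            weight = ramp_weight(distance, window)
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return counts


tokens = ["I", "like", "deep", "learning", "."]
counts = ramped_cooccurrence(tokens, window=4)
print(counts[("I", "like")], counts[("I", "learning")])
```

With these settings, “like” (one word from “I”) adds 1 to the count and “learning” (three words from “I”) adds 0.5.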

What’s the Problem with SVD?

Compared to local co-occurrence approaches like Word2vec, SVD and other global co-occurrence approaches are faster to train and make efficient use of statistics.

Even with these improvements, there are still some significant problems with using SVD. First, the computational cost of SVD scales quadratically with the size of the matrix: for an n × m matrix with n smaller than m, it takes on the order of m × n² operations. This makes SVD extremely computationally expensive for the co-occurrence matrix of a typical NLP corpus.

It’s also very time consuming to incorporate new words and documents into your corpus when you’re using co-occurrence matrices and SVD. With that in mind, let’s explore another solution.
