
Represent Word Meaning

Learning Objectives

After completing this unit, you’ll be able to:
  • Explain the challenges of representing word meaning computationally.
  • Define discrete representation.
  • Describe how distributional similarity conveys word meaning.
Note

Be sure you complete the Deep Learning and Natural Language Processing module before beginning this unit.

Let’s Talk Word Vectors

We use vectors to represent words for natural language processing, but how do we decide what those vectors should be? To design the best word vectors for natural language processing, you have to find the best method for representing what each word means.

In this module, we explore various methods and challenges of representing word meaning. Then we learn a tried-and-tested technique, Word2vec, and get hands-on by building a model that trains a machine to understand the meaning of words.

Theories of Meaning

Most of the time, when we say “meaning,” we’re talking about the idea or object indicated by a particular word or phrase. We sometimes also use it to describe what a person intends to communicate.

Linguists call this relationship between the words we use and the idea or thing we’re talking about denotation. A dictionary is a good example. Every word, or signifier, in the dictionary corresponds to the idea or thing described in the definition.

So how can we make the concept of a dictionary useful for a computer? The answer to this question doesn’t fully solve our problem, but it’s a good place to start.

One solution is to use a resource like WordNet to provide additional information to a computer. In addition to a definition for each word, WordNet includes information like synonym sets and hypernyms.

Hypernyms describe the categorical relationships between words. You can express a hypernym as “____ is a ____.” So for example, “an apple is a fruit,” or “a duck is a bird.” Simply put, hypernyms add context about a word.
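To make this concrete, here's a minimal sketch of querying synonym sets and hypernyms through NLTK's interface to WordNet. NLTK isn't part of this unit; it's just one common way to look this information up, and it assumes the WordNet data has been downloaded.

import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet data if it isn't already installed
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "proficient": each one groups words that share a sense.
for synset in wn.synsets("proficient"):
    print(synset.name(), "-", synset.lemma_names())

# Hypernyms express the "____ is a ____" relationship, for example "an apple is a fruit."
apple = wn.synsets("apple")[0]   # take the first sense of "apple"
print(apple.hypernyms())         # broader categories, such as a synset for edible fruit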

Defining words this way is called discrete representation. Each word has a separate definition, synonym set, and hypernym set that distinguishes it from other words.

As we said, WordNet is an improvement over a simple dictionary, but it’s not a perfect solution.

  • What about subtle degrees of similarity? For example, expert and good are both synonyms for proficient, but expert is more similar to proficient than good is. WordNet captures that they’re both synonyms, but it doesn’t tell us which synonyms are close and which are further apart.
  • What about the learning aspect? WordNet still requires manual work by human beings to create it and to update it. For NLP, it’s critical to let the machine do the work for us.
  • How do we remain objective? Hypernym relationships can be subjective. We can agree that an apple is a fruit, but is an Apple a computer as well?
  • How about word similarity and relationships? Definitions, synonyms, and hypernyms don’t give us the full picture about how words relate.

To sum up, for NLP we have the concept of denotation, or using words to signify ideas and things. We need to add relevant context and account for the many subtleties and relationships in language. Let’s keep going and figure out how we accomplish these difficult tasks.

Discrete Representations of Meaning

Traditional natural language processing represents words as discrete symbols. In vector terms, this means representing each word as a distinct one-hot vector. A one-hot vector is a vector with a single 1 and many 0s. For example, these are both 15-dimensional one-hot vectors:
  • [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
  • [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]

If you use one-hot vectors to represent an entire vocabulary of words, each unique vector must have as many dimensions as you have words in the vocabulary. So if you have 500,000 words in your vocabulary, each word vector has 499,999 zeros and a single one. Just like using WordNet, representing words as discrete symbols using one-hot vectors is discrete representation.
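Here's a rough Python sketch of what that looks like, using a toy five-word vocabulary instead of 500,000 words. The vocabulary and the use of numpy are just for illustration.

import numpy as np

# Toy vocabulary; a real one could have hundreds of thousands of words.
vocab = ["apple", "duck", "fruit", "hotel", "motel"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # One-hot vector: as many dimensions as vocabulary words, a single 1, the rest 0s.
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("hotel"))   # [0. 0. 0. 1. 0.]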

Simple one-hot vectors don’t naturally convey anything about similarities between words. For example, let’s say you’re building search functionality and you want to use NLP. You want your search to return content with similar words as well as exact matches.

If you have the words “hotel” and “motel” in your vocabulary, for instance, you’d want searches for “Seattle hotel” to also return results that include “Seattle motel,” since the terms are similar. If you use one-hot vectors to represent the words in your vocabulary, though, there’s no relationship between the vectors for “hotel” and “motel.” They’re completely independent.
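A quick check shows why: the dot product (and therefore the cosine similarity) of any two different one-hot vectors is always zero, so the representation itself says nothing about how related the words are. This toy example assumes "hotel" and "motel" sit at the last two indices of a five-word vocabulary.

import numpy as np

hotel = np.array([0, 0, 0, 1, 0])   # one-hot vector for "hotel"
motel = np.array([0, 0, 0, 0, 1])   # one-hot vector for "motel"

# Different one-hot vectors are always orthogonal, so their similarity is 0.
print(hotel @ motel)   # 0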

To solve this, you could combine one-hot vectors with a resource like WordNet and take the synonym or hypernym lists for each word in the search into account, in addition to the word’s vector. This approach gives NLP systems the ability to handle more nuance.

But there’s one more piece to this puzzle—how do we take the hard coding out of our hands and set up a way for the machine to take care of it? Glad you asked! You can encode similarities between words directly into the word vectors themselves.

Distributed Representations of Meaning

The linguist John Rupert Firth is famous for saying, “You shall know a word by the company it keeps.” Firth was a major contributor to an area of linguistics known as distributional semantics. Distributional semantics analyzes the similarities of words based on their distribution in large volumes of text. The idea here is that words that often appear in each other’s company are more similar than words that rarely appear together.

For NLP, we can get a lot of value by representing words in terms of their neighbors. To do this, you look at thousands of instances of a word in real text and keep track of the environments where you find the word. You can teach the machine that frequent neighbors of your target word are related and let it do the work.

For example, let’s say that you want to build a vector for the word Salesforce. First, you determine a window size for your context. The window size is how many words on each side of your target word you consider to be part of the context. If you pick a window size of five, the five words before “Salesforce” and the five words after “Salesforce” in your text are the context.

Here are three sentences where Salesforce is the target word, with a window size of five on each side.

  • …will guide you in configuring Salesforce for your company. From custom…
  • …they’re your only customer with Salesforce CRM. Understand their needs, solve …
  • …applications to the cloud with Salesforce after rigorously testing the security.…
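If you wanted to collect those contexts yourself, a sketch might look like this. The tokenization is deliberately naive, just splitting on whitespace, and the snippet uses only the first example sentence above.

def context_windows(tokens, target, window=5):
    # Collect the words within `window` positions on each side of every
    # occurrence of the target word.
    contexts = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts

text = "will guide you in configuring Salesforce for your company From custom"
print(context_windows(text.split(), "Salesforce", window=5))
# [['will', 'guide', 'you', 'in', 'configuring', 'for', 'your', 'company', 'From', 'custom']]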

From this distributional data, you can build a vector for each word so that words that frequently appear near the target word have similar vectors to the target word. These vectors represent word meaning in terms of distributional similarity between words.
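One simple (if crude) way to turn that idea into vectors is to count co-occurrences: for every word, tally how often each other word appears inside its window. This isn't the Word2vec approach the module builds toward, just a rough illustration of distributional counting over a tiny, made-up corpus; real models work over millions of tokens.

from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=5):
    # For every word, count how often each neighbor appears within the window.
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

corpus = ("configure salesforce for your company "
          "test salesforce with your customer").split()
vectors = cooccurrence_counts(corpus, window=2)
print(vectors["salesforce"])   # the count vector for "salesforce", keyed by neighbor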

How, exactly, do you build those vectors? We dive into that in the next unit.

That Was A Lot

Our journey through linguistics and meaning really got us to the right place.

It turns out that distributed representations check all our big boxes.
  • Denotation—the understanding of the things or ideas signified by our words (think of how WordNet offers a dictionary, synonyms, and hypernyms).
  • Word Similarity—the understanding of context and intention (think of how a search for “hotel” should also return results for “motel”).
  • A framework for learning denotation and similarity—offering a way for machines to learn these things (think of how you can take text, point out a word, and map it against the words next to it).
So, how do we do this programmatically? We’ll dive into that in the next unit.