
Get Real with NLP

Learning Objectives

After completing this unit, you’ll be able to:
  • Define intrinsic and extrinsic evaluation.
  • Describe the advantages and disadvantages of intrinsic and extrinsic evaluation.
  • Perform sentiment analysis on a corpus of text.

Evaluating Word Vectors

Over the past couple of modules, we’ve talked a lot about how various algorithms for training word vectors perform, but how do you actually evaluate a word vector? What makes a good word vector? And how do we know when our model outperforms other models?

There are two basic categories of word vector evaluations: intrinsic and extrinsic.

Intrinsic evaluations are usually made on a specific sub-task that’s part of a larger end-task. For example, determining how well word vector similarities correlate with human ideas about word similarity is an intrinsic evaluation that you could make on a larger text classification system (the end-task).
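
To make the word-similarity sub-task concrete, here's a minimal sketch of that kind of intrinsic evaluation in Python. The human similarity scores and the random "trained" vectors below are placeholders we invented for illustration; a real evaluation would use a dataset like WordSim-353 and actual trained vectors.

```python
# Minimal sketch of an intrinsic evaluation: compare cosine similarities
# of word vectors against human similarity judgments.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical human judgments on a 0-10 scale (placeholder values).
human_scores = {
    ("cat", "dog"): 8.0,
    ("cat", "car"): 2.0,
    ("cup", "mug"): 9.0,
}

# Toy random vectors stand in for trained word vectors.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for pair in human_scores for w in pair}

# Rank correlation between the model's similarities and the human scores:
# higher means the vectors agree more with human intuition.
model_scores = [cosine(vectors[a], vectors[b]) for a, b in human_scores]
rho, _ = spearmanr(model_scores, list(human_scores.values()))
print(f"Spearman correlation with human judgments: {rho:.2f}")
```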

Extrinsic evaluations are ones you make on real, external tasks. How well a particular set of word vectors works for machine translation or sentiment analysis is an extrinsic evaluation.

There are advantages and disadvantages to both methods. Intrinsic evaluations are generally fast to compute and give insight into your specific word vectors. However, it’s not clear from an intrinsic evaluation how your word vectors actually work in a real application. Improvements you can measure using an intrinsic evaluation don’t necessarily guarantee better performance on a real task.

Extrinsic evaluations, on the other hand, take longer to compute. And most final tasks, like machine translation, involve more than one system. It’s possible for a factor other than your word vectors to influence how well or poorly the task performs. With an extrinsic evaluation, you only know that your word vectors are an improvement if replacing a previous set of word vectors with your new vectors (without any other changes) improves task performance.

Who’s up for some examples?

Example: Intrinsic Evaluation

Let’s check out this intrinsic evaluation from Stanford University’s original GloVe paper, which evaluates GloVe on a bunch of different benchmarks. Here’s one table of results that shows how intuitively the GloVe algorithm works on an analogy task for some familiar words. This task was the one used to compare the various models in the previous unit.
Probability & Ratio k = solid k = gas k = water
P(k | ice) 1.9 x 10-4 6.6 x 10-5 3.0 x 10-3
P(k | steam) 2.2 x 10-5 7.8 x 10-4 2.2 x 10-3
P(k | ice)P(k | steam) 8.9 8.5 x 10-2 1.36
In this table, we’re looking at the probability of seeing the words “solid,” “gas,” and “water” given that you’ve just seen either the word “ice” or the word “steam.” With an intrinsic evaluation, we want to measure how well the algorithm’s output lines up with our own knowledge about language. In this case, you can see that GloVe did pretty well! Here’s what you can take away from the table:
  • GloVe is much more likely (8.9 times more) to predict the word “solid” after seeing the word “ice” than after seeing the word “steam.”
  • GloVe is much more likely to predict the word “gas” after seeing the word “steam” than seeing the word “ice.”
  • The probability of seeing the word “water” is roughly equal after seeing the word “ice” or the word “steam.”

These results are pretty intuitive. Ice is solid, steam is gaseous, and both ice and steam are made up of water. This intrinsic evaluation tells us that GloVe performed well on this set of word vectors.
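
If you want a feel for where ratios like these come from, here's a rough sketch in Python. The co-occurrence counts below are toy values we made up for illustration; they aren't the counts behind the table above, which come from the GloVe paper's much larger corpus.

```python
# Sketch: computing P(k | w) and the ratio P(k | ice) / P(k | steam)
# from raw co-occurrence counts (toy values, not real corpus counts).
from collections import Counter

# counts[w][k] = how often k appears in a context window around w.
counts = {
    "ice":   Counter({"solid": 190, "gas": 66, "water": 3000, "cold": 500}),
    "steam": Counter({"solid": 22, "gas": 780, "water": 2200, "hot": 400}),
}

def p(k, w):
    """P(k | w): co-occurrence count of k with w, normalized by w's total."""
    total = sum(counts[w].values())
    return counts[w][k] / total

for k in ("solid", "gas", "water"):
    ratio = p(k, "ice") / p(k, "steam")
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
```

A ratio much greater than 1 means the word is tied to “ice,” a ratio much less than 1 means it’s tied to “steam,” and a ratio near 1 (like “water”) means it relates to both.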

Notice, however, that this doesn’t tell us anything about whether we’d see results in a final task, like search or machine translation.

Example: Extrinsic Evaluation

To evaluate word vectors extrinsically, we need to look at how a model performs on a real NLP task with one set of vectors compared to another.

Let’s think about sentiment analysis. When we do sentiment analysis, we’re trying to identify the general sentiment, or feeling, of a particular text. In the real world, a company could use sentiment analysis to parse a bunch of tweets and see whether people are talking about its brand in a positive or negative light.

Take these two sample tweets:
  • “I’m having a blast learning about NLP on Trailhead!”—positive
  • “Writing code for NLP is very hard :(”—negative

A system that can make these types of classifications has a lot of components. There’s a pre-processing phase, where data is collected and cleansed. There could be systems to segment tweets into different parts, a feature extraction stage, and, of course, the model itself.

In these cases, it’s often unclear which of these sub-systems is causing problems in your larger system. Generally, if you replace a sub-system and your results improve, then you made a good change.
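
Here's a rough sketch of what “replace the vectors, hold everything else fixed” looks like in code. The four example texts, the two random “vector sets,” and the averaged-vector features are all placeholders, not the setup from a real benchmark.

```python
# Sketch of an extrinsic evaluation: keep the whole sentiment pipeline
# fixed and swap only the word vectors, then compare scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

texts = ["i am having a blast learning nlp",
         "writing code for nlp is very hard",
         "this tutorial is great",
         "this bug is awful"]
labels = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative

def featurize(text, vectors, dim=50):
    """Average the vectors of the words in the text (the fixed part)."""
    vecs = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def evaluate(vectors):
    """Train and score the same model; only `vectors` changes."""
    X = np.array([featurize(t, vectors) for t in texts])
    model = LogisticRegression().fit(X, labels)
    return model.score(X, labels)  # a real evaluation uses held-out data

# Two placeholder vector sets stand in for "old" and "new" embeddings.
vocab = {w for t in texts for w in t.split()}
rng = np.random.default_rng(0)
old_vectors = {w: rng.normal(size=50) for w in vocab}
new_vectors = {w: rng.normal(size=50) for w in vocab}

print("old vectors:", evaluate(old_vectors))
print("new vectors:", evaluate(new_vectors))
```

Because the features and the classifier stay identical between the two runs, any difference in the score can be attributed to the word vectors themselves.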

Get Hands On with Natural Language Processing

In this trail, you’ll complete problem sets using a Google product called Colaboratory. That means you need a Google account to complete the challenges. If you don’t have a Google account, or you want to use a separate one, you can create an account here.

Once you have a Google account:
  1. Download the source code.
  2. Make sure you’re logged in to your Google account.
  3. Go to Colaboratory.
  4. In the dialog menu, click Upload.
  5. Choose the source code file (.ipynb) and click Open.

Now you’re ready to start coding! Each piece of code is contained in a cell. When you click into a cell, a play button appears that lets you run the code in that cell.

Throughout the worksheet, you’ll find exercise markers that let you know you need to do something. After you complete an exercise, come back to Trailhead to answer the corresponding question about your results.
Note

Before you run a cell, make sure you run all the cells above it first.

Have fun!