Get Started with Natural Language Processing

Learning Objectives

After completing this unit, you’ll be able to:
  • Explain the goals and applications of natural language processing.
  • Describe the challenges of natural language processing.

Introduction

This module lays the foundation of deep learning and natural language processing. Before you get going, though, make sure that you have the background you need.

Note

This is an advanced topic, and this module assumes you have a basic understanding of machine learning vocabulary, some experience with Python, and at least a little hands-on experience working with machine learning data and algorithms. If you don’t already have that background, you can get yourself up to speed using the following resources.

This module concludes with a challenge that is unusual on Trailhead. Instead of completing a multiple-choice quiz or a hands-on challenge in a Salesforce org, you complete a lab exercise written in Python. Think of this lab exercise as a problem set. In the lab exercise, you get hands-on with machine learning basics in Python and some of the techniques discussed in the module. Depending on your level of experience with Python and machine learning, the lab exercise can take between 45 minutes and an hour to complete. When you’ve finished the lab exercise, come back to this module in Trailhead and enter your solutions to the exercises for points.

What Is Natural Language Processing?

Natural language processing (NLP) brings together ideas from computer science, linguistics, and artificial intelligence. Its goal is to give computers some form of language understanding so they can process human language. Computers use language processing to perform useful tasks like translation, question answering, or making appointments.

NLP plays a unique role in the field of artificial intelligence because language itself is such an important part of how human beings think and express themselves. But language is complicated! A computer that perfectly understands the meaning of natural language would be "AI-complete." AI-complete refers to a problem that is at the core of artificial intelligence. To develop a system with a true understanding of human language, you would actually have to solve a much bigger problem: artificial intelligence itself!

Even though we’re limited by the central challenges of AI, we’ve developed some robust systems to interpret language. Natural language processing layers increasingly powerful and complex systems to handle tasks that require higher levels of understanding. That hierarchy of layers looks something like this.

[Figure: the five layers of the NLP hierarchy, from speech (phonetic/phonological analysis) or text (OCR/tokenization) input, through morphological analysis, syntactic analysis, and semantic interpretation, to discourse processing.]

  1. Input and initial processing—Taking in speech or text and breaking it up into smaller pieces for processing.

    For speech, this step is called phonetic analysis, and consists of breaking down the speech into individual sounds, called phonemes. For text input, this can include optical character recognition (OCR) and tokenization. OCR is used to recognize the individual characters in text if it’s coming in as an image rather than as words made of characters. Tokenization refers to breaking down a continuous text into individual tokens, often words.

  2. Morphological analysis—Breaking down complex words into their components to better understand their meaning.

    For example, you can break down “incomprehensible” into its component parts.
    • “in”—not
    • “comprehens”—to understand or comprehend
    • “ible”—indicates that this word is an adjective, describing whether something can be comprehended
  3. Syntactic analysis—Trying to understand the structure of sentences by looking at how the words work together. This step is like diagramming a sentence, where you identify the role each word is playing in the sentence.
  4. Semantic interpretation—Working out the meaning of a sentence by combining the meaning of individual words with their syntactic roles in the sentence.
  5. Discourse processing—Understanding the context around a sentence to fully process what it means.
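
To make steps 1 and 2 a little more concrete, here is a minimal Python sketch of tokenization and a very naive morphological split. The affix lists and the function names `tokenize` and `split_morphemes` are hypothetical, chosen only for illustration. Real systems rely on much richer rules or learned models.

```python
import re

# Hypothetical, tiny affix lists for illustration only; a real
# morphological analyzer needs far more than this.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("ible", "able", "ness", "ing")


def tokenize(text):
    """Split text into lowercase word tokens and standalone punctuation marks."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def split_morphemes(word):
    """Naively peel at most one known prefix and one known suffix off a word."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    # Drop empty parts so a word with no known affixes comes back whole.
    return [part for part in (prefix, stem, suffix) if part]


print(tokenize("This sentence is incomprehensible."))
print(split_morphemes("incomprehensible"))
```

For “incomprehensible,” the sketch recovers the same three parts listed in step 2. It also shows why simple rules fall short: it splits “interesting” into “in,” “terest,” and “ing,” because it matches spelling and knows nothing about meaning.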

To successfully help a computer understand and produce natural language, NLP must tackle language at all its levels. It’s not enough just to create a dictionary and define each word; the computer must also understand how those words work together grammatically and in context. In this trail, we spend most of our time talking about steps 3 and 4, syntactic analysis and semantic interpretation.

What Can Natural Language Processing Do?

Natural language processing makes possible many tools and technologies we use every day, from the (relatively) simple to the complex. For example, spell checkers, autocomplete functions, and keyword search (especially when that search automatically includes synonyms) all use simple forms of NLP.
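
As a rough illustration of one of these simple forms, the sketch below suggests corrections for a misspelled word using `difflib.get_close_matches` from the Python standard library. The five-word vocabulary and the `suggest` helper are assumptions made for this example. A real spell checker uses a full dictionary and usually takes word frequency and context into account.

```python
import difflib

# Hypothetical mini-dictionary; a real spell checker uses a full word list.
VOCABULARY = ["language", "processing", "translation", "computer", "natural"]


def suggest(word, vocabulary=VOCABULARY):
    """Return up to three known words that closely resemble `word`."""
    return difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=0.75)


print(suggest("langauge"))
print(suggest("procesing"))
print(suggest("xyzzy"))  # nothing in the vocabulary is close enough
```

The `cutoff` value controls how similar a candidate must be before it is suggested. The value 0.75 is an arbitrary choice for this sketch.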

More complex forms of NLP let us do things like automatically extract addresses and other company information from websites (for example, to display on an online map), or automatically classify documents by sentiment or reading level. We use the most powerful NLP tools for tasks like machine translation (like Google Translate), chatbots with natural speech, and complex virtual assistants (like Siri, Google Assistant, and Amazon’s Alexa).

Since mobile devices became widespread, interest in NLP has exploded. Unlike computers with full keyboards, mobile devices usually have very small keyboards, which make it difficult to enter long strings of text. Interacting by voice with a mobile device hugely expands what that device can do. The field of NLP has made big strides in recent years, but there’s a long way to go before computers have perfect understanding of human language.

Language Encodes Meaning

A lot of machine learning data is made up of real-world data sets without much order. For instance, sales records, movie viewing data, and drive times for common commutes are all examples of data sets that have been fruitful for machine learning. Humans who create this type of data aren’t thinking about the patterns it makes or what it communicates. However, speakers and writers use human language specifically to communicate information. Rather than being a random mixture of data, language is intentionally constructed to convey meaning to other human beings.

Human language can seem simple because you use it every day and learned it as a small child, but it’s also a symbolic system capable of encoding deep, subtle, and nuanced meaning. Language uses symbolic signaling (also sometimes known as discrete or categorical signaling) to deliberately convey the speaker or writer’s meaning. Words correspond to specific concepts. For example, the word dog corresponds to the idea of a dog, the word violin to the idea of a violin, and so forth. Language is composed of symbols.

Language encodes those symbols into continuous substrates like sound (for example, speech), gesture (like sign language), and images (writing). So although language is composed of discrete, separate symbols, we communicate it and understand it as one long continuous encoded pattern.

When one human uses language to communicate to another, they take a continuous pattern of internal thought, convert it into discrete symbols, and encode it in a continuous substrate as speech, writing, or gesture.

We encode thought as symbols, and symbols into a substrate.

The human who receives that communication takes in the continuous communication, parses it as discrete symbols, and understands it in a continuous pattern of thought. So language translates meaning from a continuous pattern into discrete symbols and back to a continuous pattern.

Language Is Ambiguous

Human languages aren’t like programming languages. While a programming language is constructed to be as clear-cut and explicit as possible, human languages are inherently ambiguous. Many sentences leave a lot of interpretation up to context and experience. In other words, we often rely on our audience to decode which "if" goes with which "else."

Here are some real examples of ambiguous language in the headlines.

  • Boy paralyzed after tumor fights back to gain black belt—Who has the black belt here, the boy, or the tumor? Was the boy paralyzed before or after the fighting back? Who won the fight?
  • Republicans Grill IRS Chief Over Lost Emails—Did the Republicans burn the emails to cook the IRS chief, or did they question him intensely?
  • Scientists study whales from space—Did the whales come from space? Are the scientists in space?
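
The whale headline is a case of structural ambiguity: the same words fit more than one sentence structure. The sketch below writes out two possible structures by hand as nested tuples. The tuple layout and the phrase labels (S, NP, VP, PP) are an illustrative convention assumed for this example, not the output of any particular parser.

```python
# Two hand-written bracketings of the same headline. The tuple layout
# (label, child, child, ...) is an illustrative convention, not a standard format.

# Reading 1: "from space" modifies the verb (the studying happens from space).
ATTACH_TO_VERB = (
    "S",
    ("NP", "Scientists"),
    ("VP", ("V", "study"), ("NP", "whales"), ("PP", "from", "space")),
)

# Reading 2: "from space" modifies the noun (the whales are from space).
ATTACH_TO_NOUN = (
    "S",
    ("NP", "Scientists"),
    ("VP", ("V", "study"), ("NP", "whales", ("PP", "from", "space"))),
)


def words(tree):
    """Flatten a parse tree back into its left-to-right list of words."""
    if isinstance(tree, str):
        return [tree]
    flat = []
    for child in tree[1:]:  # tree[0] is the phrase label, not a word
        flat.extend(words(child))
    return flat


print(words(ATTACH_TO_VERB) == words(ATTACH_TO_NOUN))  # same word sequence?
print(ATTACH_TO_VERB == ATTACH_TO_NOUN)                # same structure?
```

Both trees flatten to the same five words, yet the trees themselves differ. A system that only sees the word sequence has to choose between them, which is the choice human readers make from context.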

In addition to ambiguous phrasing, human languages leave a lot unsaid. Where code written in a programming language must include everything the computer needs in order to run, human language often assumes the listener or reader can insert important context and information on their own. In this way, it works sort of like a code snippet. When you speak, you assume your listener is familiar with the "boilerplate code" of language and culture, and include only the minimum information to get your point across.

For example, how do you choose whether to describe yourself as "familiar with," "knowledgeable about," or "an expert in" a topic? Each of those concepts is distinct, but you understand them on a continuum, and you know how they relate to each other. You know, for example, that while there isn’t a sharp line between those descriptors, someone who is familiar with a topic knows much less about it than someone who is an expert. When you distill those ideas down into words, though, you lose a lot of information about how they're related. But rather than spending many more words precisely describing your background, you rely on other humans’ knowledge about the continuum from "familiar with" to "expert in" to interpret how much you know.
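
One way to see the information loss described above is to treat the three descriptors as purely discrete symbols. The sketch below uses one-hot vectors, a common encoding for categorical data, as an illustrative assumption. The helper names `one_hot` and `distance` are hypothetical.

```python
import math

# The three descriptors from the paragraph above, treated as discrete symbols.
LEVELS = ["familiar with", "knowledgeable about", "an expert in"]


def one_hot(symbol, symbols=LEVELS):
    """Encode a symbol as a vector with a single 1 at the symbol's position."""
    return [1 if s == symbol else 0 for s in symbols]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


familiar, knowledgeable, expert = (one_hot(level) for level in LEVELS)
print(distance(familiar, knowledgeable))
print(distance(familiar, expert))
```

Both distances come out the same. As bare symbols, "familiar with" is no closer to "knowledgeable about" than it is to "an expert in," so the continuum that a human reader fills in is simply not present in the encoding.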

Idiom works similarly. When you say you’re "ready to jump right in" at the beginning of a meeting, you don’t mean that you’re going to physically jump into the room, or dive into a pool in the office. You rely on your listeners’ knowledge of culture and idiom to interpret that statement in context, and to know that you mean you’re ready to get right to the important part of the meeting.

All of this ambiguity and context is difficult to capture in code. Most earlier attempts at natural language processing tried to explicitly define all the words in a language and hand-code rules for interpreting meaning. This usually didn’t work very well, though. There are just too many corner cases, and too much unspoken context, to realistically hand-code it all. Using deep learning for natural language processing avoids the problem of having to describe ambiguous language explicitly in code.
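
To get a feel for why hand-coded rules break down, consider this deliberately simple sketch of a rule-based sentiment labeler. The word lists and the `rule_based_sentiment` function are hypothetical, written only for this example. They don’t represent any specific historical system.

```python
# Hypothetical hand-written word lists; any real lexicon would be far larger.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}


def rule_based_sentiment(text):
    """Label text by counting positive and negative words, ignoring context."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(rule_based_sentiment("I love this phone."))        # a case the rules handle
print(rule_based_sentiment("This is not good."))         # negation is missed
print(rule_based_sentiment("Great, it crashed again."))  # sarcasm is missed
```

All three sentences are labeled positive, even though a human reader hears the last two as complaints. You could add a rule for "not," and another for sarcasm, but each new rule brings new exceptions of its own.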

Resources