Create the Dataset

Learning Objectives

After completing this unit, you’ll be able to:
  • Describe why data is important in a deep learning model.
  • Create a dataset that contains intent labels and examples.
  • Use the API to query the status of the dataset.

Data and Deep Learning

Good data—and lots of it—is a key component in successful deep learning. Data is what the algorithms use to “learn.” So you want to make sure your data addresses the type of predictions you expect the model to make. In addition to the right data, ensure that the data is correctly labeled. The amount and quality of data have a direct relationship with the accuracy of the final model.

The Einstein Language APIs support data in these file formats:
  • .csv (comma-separated values)
  • .tsv (tab-separated values)
  • .json

In this unit, you create a dataset from a .csv file. To make things easier, we provide the .csv file that you use to create the dataset. The data file contains the kind of questions that Cloud Kicks gets from the service request form on their site.

The data in the .csv file takes this format: "intent string",label. Here are some examples of data from the file.
"need to reset my password",Password Help
"when will my order arrive?",Shipping Info
"can I buy another pair, these are great?",Sales Opportunity

The intent string—for example, “need to reset my password”—is the text you think a user might type in their service request.

The label—in this case, “Password Help”—is the type of request. In the dataset you create in the next section, an intent string example is associated with a label. In the Cloud Kicks solution, the website uses the label to route the request to the right department.
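
To see how the "intent string",label format breaks down, here is a minimal Python sketch that reads the three sample rows shown above. It assumes the rows are available as plain text; the helper name read_examples is hypothetical and not part of the Einstein API. Notice that the quotes around the intent string keep the comma in the third example from being treated as a field separator.

```python
import csv
import io
from collections import Counter

# The three sample rows from this unit; the real file has about 150.
SAMPLE = '''"need to reset my password",Password Help
"when will my order arrive?",Shipping Info
"can I buy another pair, these are great?",Sales Opportunity
'''


def read_examples(text):
    """Return (intent_string, label) pairs from CSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    # Keep only well-formed rows with exactly two fields.
    return [(row[0], row[1]) for row in reader if len(row) == 2]


examples = read_examples(SAMPLE)
label_counts = Counter(label for _, label in examples)

print(examples[2])
print(label_counts)
```

Counting examples per label this way is a quick check that every label has enough examples before you upload the file.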

Call the API to Create the Dataset

Now that all the prep work is complete, the real work begins. The first step is to create a dataset. You create the dataset from the .csv file called case_routing_intent.csv, referenced in the following cURL command. Each intent string in the .csv file is a single example associated with a label. The dataset is the basis of the model that you create later.
  1. Open a command line window.
  2. Enter this cURL command, replacing <TOKEN> with the token you generated. You can use a text editor throughout these steps to copy the commands and store the various IDs you generate.
    curl -X POST -H "Authorization: Bearer <TOKEN>" -H "Cache-Control: no-cache" -H "Content-Type: multipart/form-data" -F "type=text-intent" -F "path=http://einstein.ai/text/case_routing_intent.csv" https://api.einstein.ai/v2/language/datasets/upload
    Note

    All the cURL commands in this module should work on macOS, Windows, and Linux. Operating systems handle double quotes in different ways, so if you see any cURL errors, you may need to reformat the quotes.

    You just made your first call to the Einstein Intent API and created a dataset! The response from the API looks something like this JSON.
    {
      "id": 1010060,
      "name": "case_routing_intent.csv",
      "createdAt": "2017-08-18T18:20:37.000+0000",
      "updatedAt": "2017-08-18T18:20:37.000+0000",
      "labelSummary": {
        "labels": []
      },
      "totalExamples": 0,
      "available": false,
      "statusMsg": "UPLOADING",
      "type": "text-intent",
      "object": "dataset"
    }

    Make a note of the id field value. This value is the dataset ID, and you use this ID throughout the module to work with the dataset.

    The case routing dataset is quite small—about 150 records. Typically, you would gather much more data so that your model is as accurate as possible.
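
If you would rather capture the dataset ID in a script than copy it by hand, the following Python sketch shows one way to do it. It assumes you saved the response body as a string. The function name dataset_id_from_response is hypothetical, and the check on the object field is an assumption based on the sample response, not documented API behavior.

```python
import json

# Response body copied from the sample create-dataset response in this unit.
CREATE_RESPONSE = """
{
  "id": 1010060,
  "name": "case_routing_intent.csv",
  "createdAt": "2017-08-18T18:20:37.000+0000",
  "updatedAt": "2017-08-18T18:20:37.000+0000",
  "labelSummary": {"labels": []},
  "totalExamples": 0,
  "available": false,
  "statusMsg": "UPLOADING",
  "type": "text-intent",
  "object": "dataset"
}
"""


def dataset_id_from_response(body):
    """Parse a create-dataset response and return its dataset ID (hypothetical helper)."""
    data = json.loads(body)
    # Assumption: a dataset response always carries "object": "dataset".
    if data.get("object") != "dataset":
        raise ValueError("response does not describe a dataset")
    return data["id"]


dataset_id = dataset_id_from_response(CREATE_RESPONSE)
print(dataset_id)
```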

Get the Dataset Status

The call you made to create the dataset is asynchronous. This means that the API returns a response with the dataset ID right away. But in the background, it could still be loading data. This is especially true if you have a large dataset.

When you create a dataset, two fields tell you its status: available and statusMsg.

The available field tells you whether you can use the dataset or not. Calls that reference the dataset succeed only when available is true. The statusMsg field gives you more information about what’s happening with the dataset. A value of UPLOADING means that the API is uploading the data from the .csv file. If you see a statusMsg of QUEUED, it means that the API hasn’t started uploading the data yet, but the request is in line.
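
Because the upload is asynchronous, a script typically checks the status repeatedly until the dataset is available. Here is a minimal Python sketch of that loop. The fetch_status argument is a hypothetical function that you supply; it should return the parsed JSON from the dataset status call as a dictionary. Treating any statusMsg other than QUEUED or UPLOADING as a failure is an assumption of this sketch, not documented API behavior.

```python
import time

# Status values described in this unit that mean "still in progress".
IN_PROGRESS = ("QUEUED", "UPLOADING")


def wait_until_available(fetch_status, max_attempts=10, delay_seconds=2.0,
                         sleep=time.sleep):
    """Poll fetch_status() until the dataset reports available: true.

    fetch_status is a hypothetical caller-supplied function that returns
    the dataset status response as a dict.
    """
    for _ in range(max_attempts):
        dataset = fetch_status()
        if dataset.get("available") is True:
            return dataset
        # Assumption: anything other than QUEUED or UPLOADING is a failure.
        if dataset.get("statusMsg") not in IN_PROGRESS:
            raise RuntimeError("unexpected status: %r" % dataset.get("statusMsg"))
        sleep(delay_seconds)
    raise TimeoutError("dataset was not available after %d attempts" % max_attempts)


# Demonstration with canned responses instead of real API calls.
responses = iter([
    {"available": False, "statusMsg": "QUEUED"},
    {"available": False, "statusMsg": "UPLOADING"},
    {"available": True, "statusMsg": "SUCCEEDED"},
])
result = wait_until_available(lambda: next(responses), sleep=lambda seconds: None)
print(result["statusMsg"])
```

Passing the fetch function in as an argument keeps the loop independent of how you call the API, so you can try it without a token.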

Now you make a call to get the status of the dataset.

  1. In the following cURL command, replace <TOKEN> with your token and <DATASET_ID> with the dataset ID. Then run the command in the command line window.
    curl -X GET -H "Authorization: Bearer <TOKEN>" -H "Cache-Control: no-cache" https://api.einstein.ai/v2/language/datasets/<DATASET_ID>
    The API response looks similar to this JSON. You can see the labels that were added from the .csv file.
    {
      "id": 1010060,
      "name": "case_routing_intent.csv",
      "createdAt": "2017-08-18T18:20:37.000+0000",
      "updatedAt": "2017-08-18T18:20:39.000+0000",
      "labelSummary": {
        "labels": [
          {
            "id": 94034,
            "datasetId": 1010060,
            "name": "Order Change",
            "numExamples": 26
          },
          {
            "id": 94035,
            "datasetId": 1010060,
            "name": "Sales Opportunity",
            "numExamples": 44
          },
          {
            "id": 94036,
            "datasetId": 1010060,
            "name": "Billing",
            "numExamples": 24
          },
          {
            "id": 94037,
            "datasetId": 1010060,
            "name": "Shipping Info",
            "numExamples": 30
          },
          {
            "id": 94038,
            "datasetId": 1010060,
            "name": "Password Help",
            "numExamples": 26
          }
        ]
      },
      "totalExamples": 150,
      "totalLabels": 5,
      "available": true,
      "statusMsg": "SUCCEEDED",
      "type": "text-intent",
      "object": "dataset"
    }

    The dataset name is derived from the name of the file you use to create the dataset. However, you can pass in a name parameter to give the dataset an explicit name.

    When available is true and statusMsg is SUCCEEDED, the dataset is ready for training, and you can proceed to the next unit.
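
The status call and the readiness check can also be sketched in Python. This example builds the same GET request as the cURL command without sending it, then inspects a trimmed copy of the sample response. The helper names build_status_request, label_counts, and is_ready are hypothetical, and the sketch assumes the response has the shape shown above.

```python
import urllib.request

API_ROOT = "https://api.einstein.ai/v2/language/datasets"


def build_status_request(token, dataset_id):
    """Build (but do not send) the GET request for a dataset's status."""
    headers = {"Authorization": f"Bearer {token}", "Cache-Control": "no-cache"}
    return urllib.request.Request(f"{API_ROOT}/{dataset_id}",
                                  headers=headers, method="GET")


def label_counts(dataset):
    """Map each label name to its number of examples."""
    labels = dataset["labelSummary"]["labels"]
    return {label["name"]: label["numExamples"] for label in labels}


def is_ready(dataset):
    """True when the dataset is available and the upload succeeded."""
    return dataset["available"] is True and dataset["statusMsg"] == "SUCCEEDED"


# Trimmed copy of the sample status response from this unit.
SAMPLE_STATUS = {
    "id": 1010060,
    "labelSummary": {"labels": [
        {"name": "Order Change", "numExamples": 26},
        {"name": "Sales Opportunity", "numExamples": 44},
        {"name": "Billing", "numExamples": 24},
        {"name": "Shipping Info", "numExamples": 30},
        {"name": "Password Help", "numExamples": 26},
    ]},
    "totalExamples": 150,
    "totalLabels": 5,
    "available": True,
    "statusMsg": "SUCCEEDED",
}

request = build_status_request("<TOKEN>", SAMPLE_STATUS["id"])
counts = label_counts(SAMPLE_STATUS)
total = sum(counts.values())

print(request.full_url)
print(is_ready(SAMPLE_STATUS), total == SAMPLE_STATUS["totalExamples"])
```

To send the request, you could pass it to urllib.request.urlopen with a real token in place of <TOKEN>. Comparing the per-label counts with totalExamples is a simple way to confirm that every row in the file was loaded.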
