Create the Dataset
Learning Objectives
- Describe why data is important in a deep learning model.
- Create a dataset that contains intent labels and examples.
- Use the API to query the status of the dataset.
Data and Deep Learning
Good data, and lots of it, is a key component in successful deep learning. Data is what the algorithms use to “learn,” so make sure your data covers the kinds of predictions you expect the model to make. The data must also be correctly labeled. Both the amount and the quality of the data directly affect the accuracy of the final model.
You can create a dataset from a file in one of these formats.
- .csv (comma-separated values)
- .tsv (tab-separated values)
- .json
In this unit, you create a dataset from a .csv file. To make things easier, we provide a .csv file for you to use. The data file contains the kinds of questions that Cloud Kicks gets from the service request form on its site. Rows in the file look like this.
"need to reset my password",Password Help
"when will my order arrive?",Shipping Info
"can I buy another pair, these are great?",Sales Opportunity
The intent string—for example, “need to reset my password”—is the text you think a user might type in their service request.
The label—in this case, “Password Help”—is the type of request. In the dataset you create in the next section, an intent string example is associated with a label. In the Cloud Kicks solution, the website uses the label to route the request to the right department.
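Before uploading a file in this format, it can help to count how many examples each label has. The following is a minimal sketch that uses Python's standard csv module. The function name count_labels and the inline sample (three rows taken from this unit) are illustrative and are not part of the Einstein API.

```python
import csv
import io
from collections import Counter

# Three sample rows copied from this unit; the provided file has about 150.
SAMPLE_CSV = '''"need to reset my password",Password Help
"when will my order arrive?",Shipping Info
"can I buy another pair, these are great?",Sales Opportunity
'''


def count_labels(csv_text):
    """Return a Counter mapping each label to its number of examples."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        _intent_text, label = row
        counts[label.strip()] += 1
    return counts


if __name__ == "__main__":
    for label, n in count_labels(SAMPLE_CSV).items():
        print(f"{label}: {n}")
```

The csv module handles the quoted intent string that contains a comma, which a plain split on "," would break into three columns.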
Call the API to Create the Dataset
- Open a command line window.
- Enter this cURL command, replacing <TOKEN> with the token you generated. You can use a text editor throughout these steps to copy the commands and store the various IDs you generate.
curl -X POST -H "Authorization: Bearer <TOKEN>" -H "Cache-Control: no-cache" -H "Content-Type: multipart/form-data" -F "type=text-intent" -F "path=http://einstein.ai/text/case_routing_intent.csv" https://api.einstein.ai/v2/language/datasets/upload
You just made your first call to the Einstein Intent API and created a dataset! The response from the API looks something like this JSON.
{
  "id": 1010060,
  "name": "case_routing_intent.csv",
  "createdAt": "2017-08-18T18:20:37.000+0000",
  "updatedAt": "2017-08-18T18:20:37.000+0000",
  "labelSummary": {
    "labels": []
  },
  "totalExamples": 0,
  "available": false,
  "statusMsg": "UPLOADING",
  "type": "text-intent",
  "object": "dataset"
}
Make a note of the id field value. This value is the dataset ID, and you use this ID throughout the module to work with the dataset.
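If you script these steps rather than copying values by hand, the dataset ID can be read from the response body. This is a small sketch with Python's json module. The helper name dataset_id_from_response is hypothetical, and the sample body is an abridged copy of the response shown in this unit.

```python
import json

# Abridged copy of the create-dataset response shown in this unit.
RESPONSE_BODY = """
{
  "id": 1010060,
  "name": "case_routing_intent.csv",
  "labelSummary": {"labels": []},
  "totalExamples": 0,
  "available": false,
  "statusMsg": "UPLOADING",
  "type": "text-intent",
  "object": "dataset"
}
"""


def dataset_id_from_response(body):
    """Return the dataset ID from a create-dataset response body."""
    data = json.loads(body)
    if data.get("object") != "dataset" or "id" not in data:
        raise ValueError("response does not look like a dataset")
    return data["id"]


if __name__ == "__main__":
    print(dataset_id_from_response(RESPONSE_BODY))
```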
The case routing dataset is quite small—about 150 records. Typically, you would gather much more data so that your model is as accurate as possible.
Get the Dataset Status
When you create a dataset, two fields tell you its status: available and statusMsg.
The available field tells you whether you can use the dataset or not. Calls that reference the dataset succeed only when available is true. The statusMsg field gives you more information about what’s happening with the dataset. A value of UPLOADING means that the API is uploading the data from the .csv file. If you see a statusMsg of QUEUED, it means that the API hasn’t started uploading the data yet, but the request is in line.
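The two fields can be turned into a short, readable summary. The sketch below assumes the response has already been parsed into a dictionary. The describe_status helper and its wording are illustrative, and the table covers only the statusMsg values that this unit mentions; the API may return others.

```python
# Meanings drawn from this unit; other statusMsg values are not covered here.
STATUS_MEANINGS = {
    "QUEUED": "the request is in line and the upload has not started",
    "UPLOADING": "the API is uploading the data from the file",
    "SUCCEEDED": "the data finished loading",
}


def describe_status(dataset):
    """Summarize the available and statusMsg fields of a dataset response."""
    available = bool(dataset.get("available", False))
    status = dataset.get("statusMsg", "")
    meaning = STATUS_MEANINGS.get(status, "status not covered in this unit")
    usable = "usable" if available else "not usable yet"
    return f"{status}: {meaning}; dataset is {usable}"


if __name__ == "__main__":
    print(describe_status({"available": False, "statusMsg": "UPLOADING"}))
    print(describe_status({"available": True, "statusMsg": "SUCCEEDED"}))
```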
Now you make a call to get the status of the dataset.
- In the following cURL command, replace <TOKEN> with your token and <DATASET_ID> with the dataset ID. Then run the command in the command line window.
curl -X GET -H "Authorization: Bearer <TOKEN>" -H "Cache-Control: no-cache" https://api.einstein.ai/v2/language/datasets/<DATASET_ID>
The API response looks similar to this JSON. You can see the labels that were added from the .csv file.
{
  "id": 1010060,
  "name": "case_routing_intent.csv",
  "createdAt": "2017-08-18T18:20:37.000+0000",
  "updatedAt": "2017-08-18T18:20:39.000+0000",
  "labelSummary": {
    "labels": [
      {
        "id": 94034,
        "datasetId": 1010060,
        "name": "Order Change",
        "numExamples": 26
      },
      {
        "id": 94035,
        "datasetId": 1010060,
        "name": "Sales Opportunity",
        "numExamples": 44
      },
      {
        "id": 94036,
        "datasetId": 1010060,
        "name": "Billing",
        "numExamples": 24
      },
      {
        "id": 94037,
        "datasetId": 1010060,
        "name": "Shipping Info",
        "numExamples": 30
      },
      {
        "id": 94038,
        "datasetId": 1010060,
        "name": "Password Help",
        "numExamples": 26
      }
    ]
  },
  "totalExamples": 150,
  "totalLabels": 5,
  "available": true,
  "statusMsg": "SUCCEEDED",
  "type": "text-intent",
  "object": "dataset"
}
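The labelSummary section of this response can be reduced to a per-label count, which makes it easy to see how the examples are spread across labels. The following sketch assumes the response is already parsed into a dictionary. The label_counts helper is hypothetical, and the sample data is an abridged copy of the response shown in this unit.

```python
def label_counts(dataset):
    """Map each label name to its number of examples."""
    labels = dataset.get("labelSummary", {}).get("labels", [])
    return {label["name"]: label["numExamples"] for label in labels}


# Abridged copy of the status response shown in this unit.
DATASET = {
    "labelSummary": {
        "labels": [
            {"name": "Order Change", "numExamples": 26},
            {"name": "Sales Opportunity", "numExamples": 44},
            {"name": "Billing", "numExamples": 24},
            {"name": "Shipping Info", "numExamples": 30},
            {"name": "Password Help", "numExamples": 26},
        ]
    },
    "totalExamples": 150,
    "totalLabels": 5,
}

if __name__ == "__main__":
    counts = label_counts(DATASET)
    # Print labels from most to fewest examples.
    for name, n in sorted(counts.items(), key=lambda item: -item[1]):
        print(f"{name}: {n}")
    print("sum matches totalExamples:",
          sum(counts.values()) == DATASET["totalExamples"])
```

In this sample response, the per-label counts add up to totalExamples (150) and the number of labels matches totalLabels (5).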
The dataset name is derived from the name of the file you use to create the dataset. To give the dataset an explicit name instead, pass in a name parameter when you create it.
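As a rough illustration of passing a name, the sketch below builds the same multipart/form-data upload request as the cURL command in this unit, using only Python's standard library. It assumes the name parameter is sent as a form field alongside type and path. The build_upload_request helper and the example name "Case Routing" are hypothetical. The code only constructs the request; it does not send it.

```python
import urllib.request
import uuid

UPLOAD_URL = "https://api.einstein.ai/v2/language/datasets/upload"


def build_upload_request(token, csv_url, name=None, dataset_type="text-intent"):
    """Build (but do not send) a multipart/form-data dataset upload request."""
    fields = {"type": dataset_type, "path": csv_url}
    if name is not None:
        # Assumption: name is passed as a form field like type and path.
        fields["name"] = name

    boundary = uuid.uuid4().hex
    lines = []
    for field, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{field}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")

    return urllib.request.Request(
        UPLOAD_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Cache-Control": "no-cache",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    req = build_upload_request(
        "<TOKEN>",
        "http://einstein.ai/text/case_routing_intent.csv",
        name="Case Routing",
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send the request with a real token.
```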
When available is true and statusMsg is SUCCEEDED, the dataset is ready for training and you can proceed to the next unit.
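Instead of rerunning the status command by hand, a script can repeat the call until the dataset is ready. The following is a sketch under stated assumptions: fetch_dataset mirrors the cURL status call from this unit, while wait_until_ready, the attempt count, and the delay are hypothetical choices and not API recommendations.

```python
import json
import time
import urllib.request

API_BASE = "https://api.einstein.ai/v2/language/datasets"


def fetch_dataset(token, dataset_id):
    """GET the dataset, mirroring the cURL status call in this unit."""
    req = urllib.request.Request(
        f"{API_BASE}/{dataset_id}",
        headers={
            "Authorization": f"Bearer {token}",
            "Cache-Control": "no-cache",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_until_ready(fetch, attempts=10, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch() until the dataset is ready or the attempts run out."""
    for attempt in range(attempts):
        dataset = fetch()
        if dataset.get("available") and dataset.get("statusMsg") == "SUCCEEDED":
            return dataset
        if attempt < attempts - 1:
            sleep(delay_seconds)  # pause before asking again
    raise TimeoutError(f"dataset not ready after {attempts} attempts")
```

With a real token and dataset ID, a call could look like wait_until_ready(lambda: fetch_dataset(token, dataset_id)). Passing the fetch step in as a function keeps the loop testable without network access. The loop does not interpret any failure status; it simply gives up after the last attempt.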
Resources
- Einstein Platform Developer Guide: Create a Dataset From a File Asynchronously
- Einstein Platform Developer Guide: Get a Dataset