
Understand Data and Its Significance

Learning Objectives

After completing this unit, you’ll be able to:

  • Explain types of data and formats, such as tabular, text, image, audio, and video.
  • Use techniques to identify types of data sources and data collection methods.
  • Understand the impact of bad data on decision-making.

Data Classification and Types

Because data is essential to nearly every industry today, it’s important to understand the different types of data, data sources and collection methods, and the importance of data in AI.

Data Classification

Data can be classified into three main categories: structured, unstructured, and semi-structured. 

  • Structured data is organized and formatted in a specific way, such as in tables or spreadsheets. It has a well-defined format and is easily searchable and analyzable. Examples of structured data include spreadsheets, relational databases, and data warehouses.
  • Unstructured data, on the other hand, is not formatted in a specific way and can include text documents, images, audio, and video. Unstructured data is more difficult to analyze, but it can provide valuable insights into customer behavior and market trends. Examples of unstructured data include social media posts, customer reviews, and email messages.
  • Semi-structured data is a combination of structured and unstructured data. It has some defined structure, but it may also contain unstructured elements. Examples of semi-structured data include XML (Extensible Markup Language) or JSON (JavaScript Object Notation) files.
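To make the three categories concrete, here is a minimal Python sketch. The sample records, field names, and review text are invented for illustration and are not part of this unit. It loads a structured CSV snippet, a semi-structured JSON record, and an unstructured piece of text using only the standard library.

```python
import csv
import io
import json

# Structured: every row follows the same fixed set of columns.
csv_text = "customer_id,plan,monthly_spend\n1,basic,20\n2,premium,55\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys give some structure, but fields can vary per record
# and a field such as "note" can hold free text.
json_text = '{"customer_id": 2, "tags": ["vip"], "note": "Asked about renewal options."}'
record = json.loads(json_text)

# Unstructured: free text with no predefined fields.
review = "Great product, but shipping took longer than expected."

# Structured data is easy to aggregate directly by column name.
structured_total = sum(int(row["monthly_spend"]) for row in rows)

# Unstructured text needs extra processing before analysis;
# splitting on whitespace is only the simplest possible first step.
word_count = len(review.split())

print(structured_total)
print(sorted(record))
print(word_count)
```

Notice that the structured rows can be summed right away, while the review text would need further processing before it yields any insight.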

Data Format

Data can also be classified by its format. 

  • Tabular data is structured data that is organized in rows and columns, such as in a spreadsheet.
  • Text data includes unstructured data in the form of text documents, such as emails or reports.
  • Image data includes visual information, such as brand logos, charts, and infographics.
  • Geospatial data refers to geographic coordinates and shapes, such as country boundaries, that describe locations on the Earth’s surface.
  • Time-series data records values at successive points in time, for example, daily stock prices over the past year.
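As a small illustration of why time-series data is treated as its own format, the following Python sketch sorts a few daily prices by date and computes the change from one day to the next. The prices and dates are hypothetical values chosen for the example.

```python
from datetime import date

# Hypothetical daily closing prices as (date, price) pairs, deliberately out of order.
prices = [
    (date(2024, 1, 3), 101.5),
    (date(2024, 1, 1), 100.0),
    (date(2024, 1, 2), 102.0),
]

# Order matters in a time series, so sort by date before analyzing.
series = sorted(prices)

# Day-over-day change between consecutive observations.
changes = [
    (later[0], round(later[1] - earlier[1], 2))
    for earlier, later in zip(series, series[1:])
]

for day, change in changes:
    print(day.isoformat(), change)
```

Unlike plain tabular data, where rows can usually be shuffled without losing meaning, each value here is interpreted relative to the one before it.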

Types of Data

Another way to classify data is by its type, which can be quantitative or qualitative. 

  • Quantitative data is numerical and can be measured and analyzed statistically. Examples of quantitative data include sales figures, customer counts based on geographical location, and website traffic.
  • Qualitative data, on the other hand, is non-numerical and includes text, images, and videos. In many cases, qualitative data can be more difficult to analyze, but it can provide valuable insights into customer preferences and opinions. Examples of qualitative data include customer reviews, social media posts, and survey responses.
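The difference between the two types shows up in how you summarize them. The sketch below is a minimal Python example using invented survey records: it averages a numeric field and counts the categories in a text field. The helper name is_quantitative is a hypothetical name introduced for this illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records mixing both types of data.
responses = [
    {"visits": 12, "feedback": "positive"},
    {"visits": 7, "feedback": "negative"},
    {"visits": 5, "feedback": "positive"},
]

def is_quantitative(values):
    """Treat a field as quantitative when every value is a number."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )

summary = {}
for field in responses[0]:
    values = [r[field] for r in responses]
    if is_quantitative(values):
        # Numbers can be summarized statistically.
        summary[field] = {"type": "quantitative", "mean": mean(values)}
    else:
        # Categories are summarized by counting how often each one appears.
        summary[field] = {"type": "qualitative", "counts": dict(Counter(values))}

print(summary)
```

Real qualitative data, such as free-text reviews, usually needs more work than counting labels, but the same split between measuring and categorizing applies.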

Both quantitative and qualitative data are important in the field of data analytics across a wide range of industries. For more detail on this topic, check out the Variables and Field Types Trailhead module.

Understanding different data types and classifications is important for effective data analysis. By categorizing data into structured, unstructured, and semi-structured categories, and differentiating between quantitative and qualitative data, organizations can more effectively choose the right analysis approach for gaining insights from it. Exploring different formats, such as tabular, text, and images, makes data analysis and interpretation more effective.

Data Collection Methods

Identifying data sources is an important step in data analysis. Data can be obtained from various sources, including internal, external, and public datasets. Internal data sources include data that is generated within an organization, such as sales data and customer data. External data sources include data that is obtained from outside the organization, such as market research and social media data. Public datasets are freely available datasets that can be used for analysis and research.

Data collection, labeling, and cleaning are important steps in data analysis. 

  • Data collection is the process of gathering data from various sources.
  • Data labeling is assigning tags or labels to data to make it more easily searchable and analyzable. This can include assigning categories to data, such as age groups or product categories.
  • Data cleaning is the process of removing or correcting errors and inconsistencies in the data to improve its quality and accuracy. Data cleaning can include removing duplicate data, correcting spelling errors, and filling in missing data.
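The labeling and cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the SPELLING_FIXES lookup, and the age-group boundary are all made up, and filling a missing value with the mean is just one of several possible strategies.

```python
from statistics import mean

# Hypothetical raw records with a duplicate, a misspelling, and a missing age.
raw = [
    {"id": 1, "city": "Chicago", "age": 34},
    {"id": 2, "city": "Chicgo", "age": None},
    {"id": 1, "city": "Chicago", "age": 34},
    {"id": 3, "city": "Boston", "age": 52},
]

SPELLING_FIXES = {"Chicgo": "Chicago"}  # assumed lookup of known misspellings

def age_group(age):
    """Label an age with a coarse category (the boundary is illustrative)."""
    return "under 40" if age < 40 else "40 and over"

# Cleaning: remove duplicates by id, keeping the first occurrence.
seen, cleaned = set(), []
for rec in raw:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        cleaned.append(dict(rec))

# Cleaning: compute a fill value from the ages that are known.
fill_value = mean(r["age"] for r in cleaned if r["age"] is not None)

for rec in cleaned:
    # Cleaning: correct known spelling errors and fill in missing data.
    rec["city"] = SPELLING_FIXES.get(rec["city"], rec["city"])
    if rec["age"] is None:
        rec["age"] = fill_value
    # Labeling: assign a category so records are easier to search and group.
    rec["age_group"] = age_group(rec["age"])

for rec in cleaned:
    print(rec)
```

Keep in mind that every cleaning choice changes the data. A filled-in age is an estimate, not an observation, so it’s worth recording which values were filled.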

Various techniques can be used for collecting data, such as surveys, interviews, observation, and web scraping. 

  • Surveys collect data from a group of people using a set of questions. They can be conducted online or in person, and are often used to collect data on customer preferences and opinions.
  • Interviews collect data from individuals through one-on-one conversations. They can provide more detailed data than surveys, but they can also be time-consuming.
  • Observation collects data by watching and listening to people or events. This can provide valuable data on customer behavior and product interactions.
  • Web scraping collects data from websites using software tools. It can be used to collect data on competitors, market trends, and customer reviews.

Exploratory data analysis (EDA) is usually the first step in any data project. The goal of EDA is to learn about general patterns in the data and understand its key characteristics, such as the range of values, typical values, and how records are distributed across categories.
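A first pass at EDA often amounts to a handful of summary statistics. The following Python sketch profiles a small set of invented order records using only the standard library. The data and field names are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical order records.
orders = [
    {"region": "West", "amount": 120.0},
    {"region": "East", "amount": 80.0},
    {"region": "West", "amount": 100.0},
    {"region": "South", "amount": 500.0},
]

amounts = [o["amount"] for o in orders]

# A basic profile: size, range, typical values, and category counts.
profile = {
    "rows": len(orders),
    "min": min(amounts),
    "max": max(amounts),
    "mean": mean(amounts),
    "median": median(amounts),
    "orders_by_region": dict(Counter(o["region"] for o in orders)),
}

for name, value in profile.items():
    print(name, value)
```

Here the mean is well above the median, which hints that one unusually large order is pulling the average up. Spotting patterns like this early is the point of EDA.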

The Importance of Data in AI

Data is an essential component of AI, and the quality and validity of data are critical to the success of AI applications. Considerations for data quality and validity include ensuring that the data is accurate, complete, and representative of the population being studied. Bad data can have a significant impact on decision-making and AI, leading to inaccurate or biased results.
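Two of these considerations, completeness and representativeness, can be checked with simple counts. The Python sketch below is a minimal illustration. The records, the expected_share figures, and the helper names completeness and share_gaps are all hypothetical and introduced only for this example.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"segment": "consumer", "email": "a@example.com"},
    {"segment": "consumer", "email": None},
    {"segment": "consumer", "email": "c@example.com"},
    {"segment": "business", "email": "d@example.com"},
]

# Assumed share of each segment in the real customer population.
expected_share = {"consumer": 0.5, "business": 0.5}

def completeness(rows, field):
    """Fraction of rows where the field is present."""
    return sum(1 for r in rows if r[field] is not None) / len(rows)

def share_gaps(rows, field, expected):
    """Observed share minus expected share for each category."""
    counts = Counter(r[field] for r in rows)
    return {key: counts[key] / len(rows) - share for key, share in expected.items()}

email_completeness = completeness(records, "email")
gaps = share_gaps(records, "segment", expected_share)

print(email_completeness)
print(gaps)
```

A positive gap means a group is overrepresented in the dataset and a negative gap means it is underrepresented. Checks like these don’t prove that data is unbiased, but they can flag problems before a model is trained on it.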

Data quality is important from the beginning of an AI project. Here are a few areas of consideration that highlight the importance of data and data quality in AI.

  • Training and performance: The quality of the data used for training AI models directly impacts their performance. High-quality data ensures that the model learns accurate and representative patterns, leading to more reliable predictions and better decision-making.
  • Accuracy and bias: Data quality is vital in mitigating bias within AI systems. Biased or inaccurate data can lead to biased outcomes, reinforcing existing inequalities or perpetuating unfair practices. By ensuring data quality, organizations can strive for fairness and minimize discriminatory outcomes.
  • Generalization and robustness: AI models should be able to handle new and unfamiliar data effectively, and consistently perform well in different situations. High-quality data ensures that the model learns relevant and diverse patterns, enabling it to make accurate predictions and handle new situations effectively.
  • Trust and transparency: Data quality is closely tied to the trustworthiness and transparency of AI systems. Stakeholders must have confidence in the data used and the processes involved. Transparent data practices, along with data quality assurance, help build trust and foster accountability.
  • Data governance and compliance: Proper data quality measures are essential for maintaining data governance and compliance with regulatory requirements. Organizations must ensure that the data used in AI systems adheres to privacy, security, and legal standards.

To achieve high data quality in AI, you need a robust data lifecycle that focuses on data diversity, representativeness, and addressing potential biases. Data quality matters at every stage of that lifecycle: collection, storage, processing, analysis, sharing, retention, and disposal. You’ll get more detail on the data lifecycle in the next unit.

In this unit, you learned about different types of data, data sources and collection methods, and the importance of data in AI. Next, get the basics on machine learning and how it’s different from traditional programming, and learn about AI techniques and their applications in the real world.
