This article is part of a series on Data Science.

Six W's and One H in Data Science

In data science, effective analysis begins with asking the right questions. A common framework for this purpose is the "Six W's and One H": Who, What, When, Where, Why, Which, and How. It helps data scientists frame the problem space, identify relevant variables, and understand the broader context of the data.

Who

This refers to the individuals or entities involved in or associated with the data. It answers questions like: Who generated the data? Who is the subject of the data? Who is the stakeholder or end user?
Examples: Who wrote the article? Who captured the image? Who was the cashier in a transaction?

What

This defines the subject or object of analysis. It can refer to an event, product, document, media item, transaction, or any measurable entity.
Examples: What is the article about? What kind of image is being analyzed? What transaction took place?

When

This covers the temporal context of the data: when the event occurred, or the time period the data represents. Temporal precision varies with context and instrumentation.
Examples: When was the article published? When was the video recorded? When did a sensor take a measurement? The answer may range from a rough time window (e.g., month and year) to an exact timestamp (e.g., to the millisecond or nanosecond).
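
To make the idea of temporal precision concrete, here is a minimal Python sketch that reduces one timestamp to coarser granularities. The reading time and the truncate helper are hypothetical, chosen only for illustration. Python's standard datetime type stops at microseconds, so nanosecond timestamps would need a different representation.

```python
from datetime import datetime, timezone

# Hypothetical sensor reading time, recorded with microsecond precision.
reading_time = datetime(2024, 3, 15, 14, 30, 45, 123456, tzinfo=timezone.utc)


def truncate(ts: datetime, granularity: str) -> datetime:
    """Return ts with every field finer than the given granularity reset."""
    if granularity == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "second":
        return ts.replace(microsecond=0)
    if granularity == "microsecond":
        return ts
    raise ValueError(f"unknown granularity: {granularity}")


month_level = truncate(reading_time, "month")
second_level = truncate(reading_time, "second")

print(month_level.isoformat())   # 2024-03-01T00:00:00+00:00
print(second_level.isoformat())  # 2024-03-15T14:30:45+00:00
```

The same event can therefore answer "when" at several levels of detail, and the coarser versions cannot be turned back into the finer one.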

Where

This is the spatial or geographical location relevant to the data. It can be physical (latitude/longitude), administrative (city, region), or even virtual (URL, IP address).
Examples: Where was the photo taken? Where was the sensor located? Where was the transaction processed?
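
Physical locations can also be stored at different resolutions. The sketch below is an illustration only: the coordinates and the coarsen helper are hypothetical. It rounds a latitude/longitude pair to fewer decimal places. One degree of latitude is roughly 111 km, so two decimal places correspond to about a kilometre.

```python
from typing import Tuple


def coarsen(lat: float, lon: float, decimals: int) -> Tuple[float, float]:
    """Round a coordinate pair to the given number of decimal places."""
    return (round(lat, decimals), round(lon, decimals))


# Hypothetical point, given to six decimal places.
precise = (48.858370, 2.294481)

neighbourhood = coarsen(*precise, decimals=2)
region = coarsen(*precise, decimals=0)

print(neighbourhood)  # (48.86, 2.29)
print(region)         # (49.0, 2.0)
```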

Why

This is the reason or motivation behind the existence or analysis of the data. Understanding the "why" helps frame the goals of the analysis.
Examples: Why was the event documented? Why is the analysis necessary? Is it to monitor market behavior, analyze customer feedback, or understand social trends?

Which

This addresses the specific aspects or features of the data that are of interest. Often, not all available data is relevant for analysis.
Examples: Which features are selected: height, population, brightness, or sentiment score? Which subset of the data is relevant for a given model or query?
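
Answering "which" usually comes down to two choices: which columns to keep and which rows to keep. The following sketch shows both in plain Python. The records, the field names, and the select_subset helper are all made up for illustration and do not come from any particular dataset or library.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical records with more fields than the analysis needs.
records: List[Record] = [
    {"name": "Alpha", "height_m": 12.0, "population": 250000,
     "brightness": 0.8, "sentiment_score": 0.3},
    {"name": "Beta", "height_m": 340.0, "population": 40000,
     "brightness": 0.4, "sentiment_score": -0.1},
    {"name": "Gamma", "height_m": 95.0, "population": 1200000,
     "brightness": 0.9, "sentiment_score": 0.6},
]


def select_subset(rows: List[Record], features: List[str],
                  keep: Callable[[Record], bool]) -> List[Record]:
    """Keep only rows that satisfy `keep`, and only the listed features."""
    return [{f: row[f] for f in features} for row in rows if keep(row)]


# Which features? Name and population. Which rows? Large populations only.
subset = select_subset(
    records,
    features=["name", "population"],
    keep=lambda row: row["population"] >= 100_000,
)

print(subset)
# [{'name': 'Alpha', 'population': 250000}, {'name': 'Gamma', 'population': 1200000}]
```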

How

This concerns the methodology or tools used for collecting, processing, and analyzing the data. It includes statistical methods, machine learning algorithms, data mining techniques, and the instrumentation used.
Examples: How was the data collected, and with what device or API? How was the sensor calibrated? How is the data being processed: by manual cleaning, automated parsing, or deep learning models?

Precision and Granularity

It is important to recognize that the W's and the H are not all captured with equal precision. The granularity of data can vary significantly depending on its source or purpose.
Examples:

The required precision often depends on the use case: high-resolution timestamps might be essential for sensor fusion in autonomous vehicles, whereas city-level location data might suffice for economic trend analysis.

Application in the Data Science Pipeline

Before initiating any data analysis, it's essential to determine which of these seven dimensions are relevant for your problem and what level of detail is required for each. Identifying the key W’s and H helps in:

This structured approach supports not only more accurate analysis but also actionable insights with clear context.
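
One way to put this into practice is to write down, before the analysis starts, which dimensions matter and at what level of detail, and then check the dataset's metadata against that list. The sketch below is a hypothetical illustration: the Requirement class, the missing_dimensions helper, and the example values are assumptions, not part of any established tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DIMENSIONS = ("who", "what", "when", "where", "why", "which", "how")


@dataclass
class Requirement:
    relevant: bool
    detail: Optional[str] = None  # required level of detail, as free text


def missing_dimensions(requirements: Dict[str, Requirement],
                       metadata: Dict[str, str]) -> List[str]:
    """Return relevant dimensions that the metadata leaves undescribed."""
    return [
        d for d in DIMENSIONS
        if d in requirements
        and requirements[d].relevant
        and not metadata.get(d)
    ]


# Hypothetical requirements for an analysis of store transactions.
requirements = {d: Requirement(relevant=False) for d in DIMENSIONS}
requirements["what"] = Requirement(True, "individual transactions")
requirements["when"] = Requirement(True, "timestamp to the second")
requirements["where"] = Requirement(True, "city")

# Hypothetical description of what the dataset actually records.
metadata = {"what": "card transactions", "when": "2024-03-15T14:30:45Z"}

gaps = missing_dimensions(requirements, metadata)
print(gaps)  # ['where']
```

Here the check reports that the analysis needs a location, but the dataset does not record one.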
