Six W's and One H in Data Science
In data science, effective analysis begins with asking the right questions. A common framework for this is the "Six W's and One H": Who, What, When, Where, Why, Which, and How. Working through these seven questions helps data scientists frame the problem space, identify relevant variables, and understand the broader context of the data.
Who
This refers to the individuals or entities involved in or associated with the data. It answers
questions like: Who generated the data? Who is the subject of the data? Who is the stakeholder or
end user?
Examples: Who wrote the article? Who captured the image? Who was the cashier in a
transaction?
What
This defines the subject or object of analysis. It can refer to an event, product, document, media
item, transaction, or any measurable entity.
Examples: What is the article about? What kind of image is being analyzed? What
transaction took place?
When
This is the temporal context of the data: when the event occurred, or the time period the data
represents. Temporal precision may vary based on context and instrumentation.
Examples: When was the article published? When was the video recorded? When did a
sensor take a measurement? The answer may range from rough time windows (e.g., month/year) to
exact timestamps (e.g., milliseconds or nanoseconds).
Where
This is the spatial or geographical location relevant to the data. It can be physical
(latitude/longitude), administrative (city, region), or even virtual (URL, IP address).
Examples: Where was the photo taken? Where was the sensor located? Where was the
transaction processed?
Why
This is the reason or motivation behind the existence or analysis of the data. Understanding the
"why" helps frame the goals of the analysis.
Examples: Why was the event documented? Why is the analysis necessary? Is it to
monitor market behavior, analyze customer feedback, or understand social trends?
Which
This addresses the specific aspects or features of the data that are of interest. Often, not all
available data is relevant for analysis.
Examples: Which features are selected: height, population, brightness, or sentiment
score? Which subset of the data is relevant for a given model or query?
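The "Which" question can be illustrated with a minimal Python sketch. The records, field names, values, and the two helper functions below are all hypothetical and invented for illustration; the sketch only shows the idea of narrowing both the features (columns) and the subset (rows) before analysis.

```python
# Hypothetical records; field names and values are invented for illustration.
records = [
    {"city": "Alpha", "height_m": 120.0, "population": 50_000, "sentiment": 0.4},
    {"city": "Beta", "height_m": 15.0, "population": 1_200_000, "sentiment": -0.1},
    {"city": "Gamma", "height_m": 800.0, "population": 8_000, "sentiment": 0.7},
]


def select_features(rows, features):
    """Keep only the named features (columns) in each record."""
    return [{name: row[name] for name in features} for row in rows]


def select_rows(rows, predicate):
    """Keep only the records (rows) for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Which subset? Only cities with more than 10,000 inhabitants.
large_cities = select_rows(records, lambda row: row["population"] > 10_000)

# Which features? Only population and sentiment.
model_input = select_features(large_cities, ["population", "sentiment"])
print(model_input)
```

Filtering rows before columns is an arbitrary choice here; what matters is that both decisions are made explicitly rather than passing every available field to a model.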
How
This concerns the methodology or tools used for collecting, processing, and analyzing the data. It
includes statistical methods, machine learning algorithms, data mining techniques, and the
instrumentation used.
Examples: How was the data collected, and with what device or API? How was the sensor
calibrated? How is the data being processed: by manual cleaning, automated parsing, or deep
learning models?
Precision and Granularity
It is important to recognize that not every W and H is captured with equal precision. The
granularity of data can vary significantly depending on the source or purpose.
Examples:
- Time may be recorded as a general date or down to nanoseconds, depending on the instrument.
- Location may be identified as a country, city, street, building, or exact GPS coordinates.
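The two examples above can be sketched in Python as a way of viewing one timestamp and one coordinate pair at several levels of granularity. The reading time, the coordinates, and the helper names `truncate_time` and `coarsen_location` are assumptions made for illustration. Note that Python's standard `datetime` stores at most microseconds, so nanosecond data would need a different representation.

```python
from datetime import datetime, timezone

# Hypothetical sensor reading time, chosen only for illustration.
reading = datetime(2024, 3, 15, 14, 7, 32, 123456, tzinfo=timezone.utc)

# strftime patterns for progressively finer time granularity.
TIME_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "second": "%Y-%m-%dT%H:%M:%S",
    "microsecond": "%Y-%m-%dT%H:%M:%S.%f",
}


def truncate_time(ts, level):
    """Format a timestamp at the requested granularity."""
    return ts.strftime(TIME_FORMATS[level])


def coarsen_location(lat, lon, decimals):
    """Round coordinates; fewer decimals means a coarser location."""
    return (round(lat, decimals), round(lon, decimals))


for level in TIME_FORMATS:
    print(level, truncate_time(reading, level))

# Hypothetical coordinates at three levels of spatial precision.
for decimals in (0, 2, 5):
    print(decimals, coarsen_location(48.858370, 2.294481, decimals))
```

Coarsening is lossy: a value stored at month precision cannot be turned back into an exact timestamp, so the required granularity is best decided before the data is collected or aggregated.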
Application in the Data Science Pipeline
Before initiating any data analysis, it is essential to determine which of these seven dimensions are relevant to your problem and what level of detail each one requires. Identifying the key W's and H helps with:
- Data collection: Choosing the right sources and sensors
- Feature selection: Prioritizing relevant attributes
- Modeling: Tailoring algorithms to available and meaningful data
- Interpretation: Understanding the context and significance of results
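One possible way to make this check concrete is to record the seven dimensions as explicit metadata and list the ones still unanswered before analysis starts. The following is a minimal sketch, not a standard API: the `DatasetContext` class, its field values, and `missing_dimensions` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetContext:
    """Hypothetical container for the seven dimensions of a dataset."""
    who: Optional[str] = None
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None
    why: Optional[str] = None
    which: Optional[str] = None
    how: Optional[str] = None


def missing_dimensions(ctx: DatasetContext) -> List[str]:
    """Return the names of the dimensions that have not been filled in."""
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]


# Invented example: a retail dataset with only three dimensions documented.
ctx = DatasetContext(
    who="store cashier",
    what="retail transaction",
    when="2024-03 (month precision)",
)
print(missing_dimensions(ctx))
```

A dimension left empty is not necessarily a problem; the point is to decide deliberately that it is irrelevant rather than to overlook it.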