Data cleaning is the foundational process of identifying and correcting errors, inconsistencies, and gaps in datasets to ensure high-quality, reliable analysis. Poor data quality can compromise the validity of machine learning models, reporting dashboards, and business decisions.
Types of Data Quality Issues
Common data issues can be classified into three broad categories:
- Syntax Errors – Issues related to incorrect data structure, formatting, or type mismatches.
- Semantic Errors – Logical inconsistencies or violations of integrity constraints within the data.
- Coverage Errors – Missing values or entire missing records that reduce the completeness of the dataset.
Syntax Errors
Syntax errors occur when data does not conform to the expected structural rules—such as data types, formatting, or encoding. These issues often arise during data entry, extraction from legacy systems, or file format conversion. They can typically be detected using schema validation, regular expressions, or parsing tools.
- Lexical Errors – These involve the misuse of characters or symbols that violate type expectations. Example: Entering the string "ten" instead of the numeric value 10 in a numeric field.
- Formatting Errors – Errors where the structure or layout deviates from expected standards. Example: Inconsistent name formatting, such as storing "Last, First" instead of "First Last".
- Irregular Data Values – These arise when units or conventions are inconsistently applied. Example: A column for weight includes both 70 kg and 154 lbs (see the normalization sketch after this list).
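For instance, the mixed weight units above can be detected with a regular expression and normalized with Pandas. The sketch below assumes a hypothetical weight column stored as free-text strings; the column names are illustrative only.

import re
import pandas as pd

df = pd.DataFrame({'weight': ['70 kg', '154 lbs', '82 kg']})

def to_kg(value):
    # Parse a free-text weight and convert it to kilograms
    match = re.match(r'\s*([\d.]+)\s*(kg|lbs)\s*$', value, flags=re.IGNORECASE)
    if match is None:
        return None  # flag unparseable entries for manual review
    amount, unit = float(match.group(1)), match.group(2).lower()
    return amount if unit == 'kg' else round(amount * 0.453592, 2)

df['weight_kg'] = df['weight'].apply(to_kg)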
Detection and Correction of Syntax Errors
- Schema Validation – Tools such as XML Schema (XSD) or JSON Schema help enforce structural constraints on data. Example: An XML document can be validated against an XSD file to ensure correct tag order, types, and required fields.
- Data Transformation – When inconsistencies are detected, transformations can normalize the data. Tools like XSLT (for XML) or Pandas (for tabular data) are commonly used.
Example: XSD for Schema Validation
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root" type="rootType"/>
  <xs:complexType name="elementType">
    <xs:sequence>
      <xs:element type="xs:string" name="languageLabel"/>
      <xs:element type="xs:short" name="year"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="rootType">
    <xs:sequence>
      <xs:element type="elementType" name="element" maxOccurs="unbounded" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
This XSD schema defines the expected structure for elements containing a language label and a year. Validation tools will reject any document that violates this structure, for instance by providing text instead of a number in the year field.
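As a rough illustration of schema validation in practice, the following Python sketch checks an XML document against this schema using the lxml library; the file names data.xml and schema.xsd are placeholders.

from lxml import etree

# Load the schema and the document to check (file names are placeholders)
schema = etree.XMLSchema(etree.parse("schema.xsd"))
doc = etree.parse("data.xml")

# validate() returns False if the document breaks the schema; error_log explains why
if not schema.validate(doc):
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")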
Semantic Errors
Semantic errors refer to logical inconsistencies or violations of meaning within the data. These errors often pass format validation but result in contradictions, invalid relationships, or violations of business rules. Correcting semantic errors typically requires deeper understanding of the domain and data dependencies.
- Integrity Constraint Violations – These occur when the data fails to respect constraints that ensure coherence across fields or tables. Example: A rule states that a student can only be enrolled in one course at a time, but a student record lists multiple concurrent courses.
- Contradictory Values – Inconsistent entries that contradict each other, often due to integration from multiple sources. Example: A product is marked as "in stock" in one field and "out of stock" in another (a detection sketch follows this list).
- Duplicate Records – Redundant entries that represent the same real-world entity, degrading data efficiency and leading to misleading aggregates. Example: Two identical customer records differing only by email capitalization.
- Invalid Values – Semantically incorrect entries that may be syntactically valid. Example: Storing a date as free text ("twenty twenty") instead of a date object ("2020-01-01").
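Contradictions like the stock example above can often be surfaced with a simple cross-check. A minimal Pandas sketch, assuming two hypothetical fields status and stock_quantity that should agree:

import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'status': ['in stock', 'out of stock', 'in stock'],
    'stock_quantity': [12, 5, 0]   # units on hand reported by another source
})

# A product cannot be "in stock" with zero units, or "out of stock" with units on hand
contradictory = df[
    ((df['status'] == 'in stock') & (df['stock_quantity'] == 0)) |
    ((df['status'] == 'out of stock') & (df['stock_quantity'] > 0))
]
print(contradictory)   # rows whose two stock fields disagree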
Correction of Semantic Errors
- Functional Dependencies – Functional dependencies describe relationships where one attribute uniquely determines another. These are fundamental to data cleaning and normalization. Example: If languageLabel uniquely determines year, then conflicting years under the same label suggest an error.
- Data Profiling – Profiling identifies relationships, distributions, and anomalies in data. Tools such as Great Expectations or OpenRefine can detect outliers, dependency violations, and duplicates.
Example: Functional Dependencies
| ID | languageLabel       | year |
|----|---------------------|------|
| 1  | ENIAC coding system | 1943 |
From the data: if a second row with the same languageLabel maps to a different year, it indicates a semantic inconsistency that violates the dependency. A Pandas check for this kind of violation is sketched below.
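A minimal sketch of such a check, using a deliberately inconsistent second row for illustration:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2],
    'languageLabel': ['ENIAC coding system', 'ENIAC coding system'],
    'year': [1943, 1944]   # the second row deliberately contradicts the first
})

# If languageLabel uniquely determines year, each label should map to exactly one year
years_per_label = df.groupby('languageLabel')['year'].nunique()
violations = years_per_label[years_per_label > 1]
print(violations)   # labels that break the functional dependency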
Example: Duplicate Detection with Pandas (Python)
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice Smith', 'alice smith', 'Bob Jones'],
    'email': ['alice@example.com', 'ALICE@EXAMPLE.COM', 'bob@example.com']
})

# Normalize casing so that equivalent records become identical, then deduplicate
df['email'] = df['email'].str.lower()
df['name'] = df['name'].str.title()
df = df.drop_duplicates()
This snippet demonstrates deduplication after normalization. In real-world scenarios, fuzzy matching algorithms such as Levenshtein distance or clustering can further refine duplicate detection.
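As a rough sketch of that idea, the snippet below computes the Levenshtein distance between two name strings with a simple dynamic-programming implementation; in practice a dedicated string-matching library would usually be preferred.

def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character edits (insert, delete, substitute) turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Records whose names are within a small edit distance are candidate duplicates
print(levenshtein('Alice Smith', 'alice smith'))  # 2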
Coverage Errors
Coverage errors reduce the completeness of a dataset. These issues arise when values or entire records are absent, preventing full representation of the target population or phenomena. Incomplete data can bias statistical models, compromise machine learning training, and hinder reproducibility.
- Missing Values – Occur when a cell is empty in a column where a value is required. Example: A product catalog where the "Price" field is missing for certain items.
- Missing Records – Refer to the total absence of an entry that should exist in the database. Example: A customer record is missing from the customer table despite related transactions in the sales table (see the sketch after this list).
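As a sketch of how such missing records can be detected, the example below performs an anti-join between hypothetical sales and customers tables with Pandas; the table and column names are assumptions.

import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2]})
sales = pd.DataFrame({'order_id': [10, 11, 12], 'customer_id': [1, 2, 3]})

# Left-join sales onto customers; rows found only in sales point to missing customer records
merged = sales.merge(customers, on='customer_id', how='left', indicator=True)
missing_customers = merged.loc[merged['_merge'] == 'left_only', 'customer_id'].unique()
print(missing_customers)  # customer IDs referenced by sales but absent from the customer table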
Techniques for Handling Coverage Errors
- Interpolation – Estimating missing values using known data points. Common in time series and sensor data. Example: Applying linear interpolation to fill in missing temperature readings from a weather dataset (see the sketch after this list).
- Cross-Referencing External Data – Augmenting or validating datasets by integrating trusted external sources. Example: Completing missing address information using postal service databases or public APIs.
- Imputation – Substituting missing values using statistical estimates (mean, median, mode) or model-based approaches. Example: Using SimpleImputer from scikit-learn to fill missing values in numeric columns.
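Example: Linear Interpolation with Pandas
A minimal sketch of the interpolation technique mentioned above, using a short, hypothetical series of temperature readings:

import pandas as pd

temps = pd.Series([18.2, None, None, 21.5, 22.0])

# Fill the gaps by drawing a straight line between the surrounding known readings
temps_filled = temps.interpolate(method='linear')
print(temps_filled)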
Example: Filling Missing Values with Pandas
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'price': [100, None, 150]
})

imputer = SimpleImputer(strategy='mean')
df[['price']] = imputer.fit_transform(df[['price']])
This code fills the missing price with the mean of the known values. More advanced imputations may use KNN, regression, or domain-specific heuristics.
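For example, scikit-learn's KNNImputer estimates each missing value from the most similar rows. A minimal sketch, with a second weight column assumed purely so that neighbours can be compared on something:

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'price':  [100, None, 150, 145],
    'weight': [1.0, 2.9, 3.0, 3.1]
})

# Each missing price is replaced by the average price of its k nearest rows
imputer = KNNImputer(n_neighbors=2)
df[['price', 'weight']] = imputer.fit_transform(df[['price', 'weight']])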
Administrative Handling of Data Errors
Beyond algorithmic detection, effective data quality management also involves administrative strategies. These rely on both automated systems and human intervention to identify, report, and correct anomalies. Incorporating users, domain experts, and automated controls into the cleaning workflow can significantly enhance data integrity.
- User Feedback and Community-Based Correction – Engaging users to report errors and contribute corrections increases data quality over time. Example: Platforms like OpenStreetMap and Wikipedia allow users to flag and fix erroneous entries collaboratively, often using revision history and moderation layers.
- Automated Alerts and Triggers – Setting up rules to detect inconsistencies at the time of data entry or modification. Example: A database trigger can raise an alert if a new order is inserted without a corresponding customer ID. Alerts can be logged, emailed to administrators, or used to block invalid transactions.
Example: SQL Trigger for Data Validation
-- Reject any order that arrives without a customer reference
CREATE FUNCTION validate_customer_id() RETURNS trigger AS $$
BEGIN
    IF NEW.customer_id IS NULL THEN
        RAISE EXCEPTION 'Customer ID cannot be null';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_customer_id
BEFORE INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION validate_customer_id();
This trigger, shown here in PostgreSQL syntax where the check lives in a separate trigger function, ensures that no order is inserted without an associated customer. Administrative enforcement of rules like this reduces semantic and coverage errors proactively.
Conclusion
Data cleaning is a critical step in any data science or analytics workflow. It involves identifying and correcting errors in syntax, semantics, and coverage to ensure the quality, reliability, and usability of datasets. These errors may appear trivial at first glance but can propagate through pipelines and distort analyses or predictions in profound ways.
Modern data cleaning integrates both automated and manual processes, leveraging schema validation, statistical imputation, domain knowledge, and collaborative user input. Functional dependencies, data profiling techniques, and tools such as Pandas and OpenRefine provide powerful solutions to address the wide spectrum of errors encountered in real-world datasets.
Effective data quality management should not be treated as a one-time process but as an ongoing practice—embedded into the lifecycle of data collection, transformation, modeling, and governance.