Data Science for Chemists
IPL Summer School, CPE Lyon
3. Data Analysis and Visualization
John Samuel
CPE Lyon
Year: 2024-2025
Email: john.samuel@cpe.fr
3.1. Data Acquisition and Storage
Data acquisition
Surveys
Manual surveys
Online surveys
Sensors
Temperature, pressure, humidity, rainfall
Acoustic, navigation
Proximity, presence sensors
Social networks
Video surveillance cameras
Web
https://en.wikipedia.org/wiki/List_of_sensors
3.2. Data Acquisition and Storage
Data storage formats
Binary and Textual Files
CSV/TSV
XML
JSON
Media (Images/Audio/Video)
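To make the difference between two of the textual formats concrete, here is a minimal sketch using only the Python standard library. The column names and values are hypothetical, chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical measurements, written inline so the example is self-contained.
csv_text = "sample,temperature_c,ph\nA1,21.5,7.0\nA2,22.1,6.8\n"

# CSV: every field is read as a string, so numeric columns are converted explicitly.
rows = list(csv.DictReader(io.StringIO(csv_text)))
records = [
    {"sample": r["sample"], "temperature_c": float(r["temperature_c"]), "ph": float(r["ph"])}
    for r in rows
]

# JSON: numbers keep their type, and nested structures are possible.
json_text = json.dumps(records, indent=2)
restored = json.loads(json_text)

print(type(rows[0]["ph"]).__name__, type(restored[0]["ph"]).__name__)
```

The point of the sketch is that CSV carries no type information, whereas JSON distinguishes strings from numbers.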
3.2. Data Acquisition and Storage
Types of data stores
Structured data stores
Relational databases
Object-oriented databases
Unstructured data stores
Filesystems
Content-management systems
Document collections
Semi-structured data stores
Filesystems
NoSQL data stores
Unstructured vs. Structured vs. Semi-structured
3.2. Data Acquisition and Storage
ACID Transactions
Atomicity: Each transaction must be "all or nothing".
Consistency: Any transaction must bring the database from one valid state to another.
Isolation: Concurrent execution of transactions must leave the database in the same state as sequential execution.
Durability: Irrespective of power losses or crashes, a transaction, once committed to the database, must remain committed.
https://en.wikipedia.org/wiki/ACID
3.2. Data Acquisition and Storage
ACID Transactions
Ensure the validity of databases even in case of errors or power failures
Important in banking sector
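Atomicity can be illustrated with a bank transfer, here as a minimal sketch with Python's built-in sqlite3 module and an in-memory database. The table, account names and amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the integrity rule: no account may go negative.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move amount between two accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to bob has already been executed when the debit fails, yet the rollback undoes it, which is the "all or nothing" behaviour.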
3.2. Data Acquisition and Storage
Types of data stores
Relational databases
Object-oriented databases
NoSQL (Not only SQL) data stores
NewSQL
3.2. Data Acquisition and Storage
NoSQL
Compromises consistency
Focus on availability and speed
3.2. Data Acquisition and Storage
Types of NoSQL stores
Column-oriented database
Document-oriented database
Key-value database
Graph-oriented database
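As a rough mental model only, the four families can be mimicked with plain Python containers. This is not how any particular NoSQL product stores data; the records and field names are hypothetical.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"compound:water": '{"formula": "H2O", "molar_mass": 18.015}'}

# Document-oriented: the value is a structured document whose fields can be queried.
documents = [
    {"_id": "water", "formula": "H2O", "molar_mass": 18.015, "tags": ["solvent"]},
    {"_id": "ethanol", "formula": "C2H6O", "molar_mass": 46.069, "tags": ["solvent", "fuel"]},
]
solvents = [d["_id"] for d in documents if "solvent" in d["tags"]]

# Column-oriented: all values of one attribute are stored together,
# which makes aggregates over a single column cheap.
columns = {
    "_id": ["water", "ethanol"],
    "molar_mass": [18.015, 46.069],
}
mean_mass = sum(columns["molar_mass"]) / len(columns["molar_mass"])

# Graph-oriented: nodes connected by labelled edges.
edges = [("ethanol", "dissolves_in", "water"), ("water", "is_a", "solvent")]
neighbours = [target for source, _, target in edges if source == "ethanol"]

print(solvents, mean_mass, neighbours)
```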
3.3. Data Extraction and Integration
Data extraction techniques
Data dumps
Downloading complete data dumps
Downloading selective data dumps
Periodical polling of data feeds (e.g., blogs, news feeds)
Data streams
Subscribing to data streams (push notifications)
3.3. Data Extraction and Integration
Query interfaces
Query endpoints supporting declarative languages
SQL
SPARQL
Automated or manual search (and filter) options
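A declarative query states what result is wanted and leaves the how to the engine. A minimal sketch with SQL through Python's sqlite3 module follows; the table and readings are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (sensor TEXT, day TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [
        ("s1", "2024-07-01", 21.0),
        ("s1", "2024-07-02", 23.0),
        ("s2", "2024-07-01", 19.0),
        ("s2", "2024-07-02", 20.0),
    ],
)

# The query describes the result (mean temperature per sensor from a given day on);
# the database decides how to scan, group and order the rows.
query = """
    SELECT sensor, AVG(temperature) AS mean_temp
    FROM reading
    WHERE day >= ?
    GROUP BY sensor
    ORDER BY sensor
"""
result = conn.execute(query, ("2024-07-01",)).fetchall()
print(result)
```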
3.3. Data Extraction and Integration
Crawlers for web pages
Web crawlers: navigating the entire web by following hyperlinks
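The idea of following hyperlinks can be sketched as a breadth-first traversal. To stay self-contained, the sketch below crawls a hypothetical site held in a dictionary; a real crawler would fetch each page over HTTP and respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical pages: path -> HTML content.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/">home</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Visit every page reachable from start, each exactly once."""
    visited, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:  # avoid revisiting pages and looping forever
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/")
print(order)
```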
3.3. Data Extraction and Integration
Application Programming Interface (API)
Web operations (CRUD) to manipulate externally managed resources
Requires programmers to develop wrappers for web service integration
API (programming interface)
3.4. Pre-treatment of Data
Data Cleaning: Types of Errors
Syntactical errors
Semantic errors
Data coverage errors
3.4. Pre-treatment of Data
Syntactical errors
Lexical errors (e.g., user entered a string instead of a number)
Data format errors (e.g., order of last name, first name)
Irregular data errors (e.g., use of different units of measurement)
3.4. Pre-treatment of Data
Semantic errors
Violation of integrity constraints
Contradiction
Duplication
Invalid data (undetected despite the presence of triggers and integrity constraints)
3.4. Pre-treatment of Data
Coverage errors
Missing values
Missing data
3.4. Pre-treatment of Data
Handling Syntactical errors
Validation using schema (e.g., XSD, JSON Schema)
Data transformation
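Validation followed by transformation can be sketched as below. This is a hand-written stand-in for a real schema language such as XSD or JSON Schema; the schema, field names and unit table are hypothetical.

```python
# Hypothetical schema: field name -> converter that raises ValueError on bad input.
SCHEMA = {"operator": str, "mass": float, "unit": str}
TO_GRAMS = {"g": 1.0, "mg": 1e-3, "kg": 1e3}


def validate(record):
    """Return (clean_record, errors) for one raw record whose values are strings."""
    clean, errors = {}, []
    for field, convert in SCHEMA.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        try:
            clean[field] = convert(record[field])
        except ValueError:  # lexical error, e.g. text where a number is expected
            errors.append(f"{field}: cannot parse {record[field]!r}")
    return clean, errors


def transform(clean):
    """Normalise name order and convert every mass to grams."""
    name = clean["operator"]
    if "," in name:  # "Last, First" -> "First Last"
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # An unknown unit raises KeyError here; this sketch does not handle that case.
    return {"operator": name, "mass_g": clean["mass"] * TO_GRAMS[clean["unit"]]}


raw = [
    {"operator": "Dupont, Marie", "mass": "250", "unit": "mg"},
    {"operator": "Martin, Paul", "mass": "twelve", "unit": "g"},
]
checked = [validate(r) for r in raw]
valid = [transform(clean) for clean, errors in checked if not errors]
rejected = [errors for _, errors in checked if errors]
print(valid, rejected)
```

Only records that pass validation are transformed; the rejected ones keep their error messages so they can be reported back.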
3.4. Pre-treatment of Data
Handling Semantic errors
Duplicate elimination, e.g., by specifying integrity constraints such as functional dependencies
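A functional dependency X -> Y says that records agreeing on X must agree on Y. The sketch below checks one assumed dependency (cas_number -> name) and drops exact duplicates; the records are illustrative and the helper names are hypothetical.

```python
from collections import defaultdict

records = [
    {"cas_number": "7732-18-5", "name": "water"},
    {"cas_number": "64-17-5", "name": "ethanol"},
    {"cas_number": "7732-18-5", "name": "water"},        # exact duplicate
    {"cas_number": "64-17-5", "name": "ethyl alcohol"},  # contradicts the dependency
]


def check_dependency(records, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for r in records:
        seen[r[determinant]].add(r[dependent])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


def drop_exact_duplicates(records):
    """Keep the first occurrence of each identical record, preserving order."""
    unique, seen = [], set()
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


violations = check_dependency(records, "cas_number", "name")
deduplicated = drop_exact_duplicates(records)
print(violations, len(deduplicated))
```

Exact duplicates can be removed automatically, whereas a violated dependency is a contradiction that usually needs a human or an external source to resolve.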
3.4. Pre-treatment of Data
Handling Coverage errors
Interpolation techniques
External data sources
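One simple interpolation technique is linear interpolation between the nearest known neighbours. The sketch below assumes equally spaced samples and marks missing values with None; the series is hypothetical.

```python
def interpolate_missing(values):
    """Fill None entries lying between two known values by linear interpolation.

    Gaps at the start or end of the series have only one neighbour and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


series = [20.0, None, None, 26.0, None, 30.0, None]
filled = interpolate_missing(series)
print(filled)
```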
3.4. Pre-treatment of Data
Administrators and handling errors
User feedback
Alerts and triggers
3.5. Data Transformation
Languages
Template languages
XSLT
AWK
sed
Programming languages like Perl
3.6. ETL
ETL (Extraction, Transformation and Loading)
Data Extraction
Data Cleaning
Data Transformation
Loading data to information stores
3.6. ETL
Models for data analysis
Multidimensional data analysis
Dimensions
Attributes
Levels
Hierarchies
Facts
Measures
3.6. ETL
Models for data analysis
Multidimensional data analysis: Examples
Dimensions (e.g., spatio-temporal dimensions, product)
Attributes (e.g., name, manufacturer)
Levels (e.g., Day, Month, Quarter, Store, City, Country etc.)
Hierarchies (e.g., Day-Month-Quarter-Year, Store-City-Country etc.)
Facts
Measures (e.g., Number of products sold/unsold)
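The vocabulary above can be sketched with two dimension tables, each holding the levels of one hierarchy, and a fact table holding one measure. All names and numbers are hypothetical.

```python
from collections import defaultdict

# Dimension tables: the lowest level (day, store) maps to its higher levels.
DATE_DIM = {
    "2024-01-15": {"month": "2024-01", "quarter": "2024-Q1"},
    "2024-02-03": {"month": "2024-02", "quarter": "2024-Q1"},
    "2024-04-20": {"month": "2024-04", "quarter": "2024-Q2"},
}
STORE_DIM = {
    "store_1": {"city": "Lyon", "country": "France"},
    "store_2": {"city": "Paris", "country": "France"},
    "store_3": {"city": "Geneva", "country": "Switzerland"},
}

# Facts: (day, store, units_sold), where units_sold is the measure.
FACTS = [
    ("2024-01-15", "store_1", 10),
    ("2024-02-03", "store_2", 5),
    ("2024-02-03", "store_3", 7),
    ("2024-04-20", "store_1", 4),
]


def aggregate(facts, date_level, store_level):
    """Sum the measure at the chosen level of each hierarchy."""
    totals = defaultdict(int)
    for day, store, units in facts:
        key = (DATE_DIM[day][date_level], STORE_DIM[store][store_level])
        totals[key] += units
    return dict(totals)


by_quarter_country = aggregate(FACTS, "quarter", "country")
by_month_city = aggregate(FACTS, "month", "city")
print(by_quarter_country)
```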
3.6. ETL
Star Schema
3.6. ETL
Data Cubes
Data cubes for online analytical processing (OLAP)
OLAP Cube operations
Slice
Dice
Drill up/down
Pivot
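The cube operations can be sketched on a small cube stored as a dictionary from coordinates to a measure. The function names, axes and data are hypothetical; OLAP engines expose these operations through their own query languages.

```python
from collections import defaultdict

AXES = ("year", "product", "city")
# Cube cells: (year, product, city) -> units sold.
CUBE = {
    (2023, "acetone", "Lyon"): 12,
    (2023, "ethanol", "Lyon"): 8,
    (2023, "ethanol", "Paris"): 5,
    (2024, "acetone", "Lyon"): 9,
    (2024, "acetone", "Paris"): 4,
    (2024, "ethanol", "Paris"): 11,
}


def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value and drop it from the key."""
    i = AXES.index(axis)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Dice: keep cells whose coordinates lie in the allowed sets; no axis is dropped."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[AXES.index(axis)] in values for axis, values in allowed.items())
    }


def drill_up(cube, keep):
    """Drill up: sum the measure over every dimension not listed in keep."""
    indexes = [AXES.index(axis) for axis in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in indexes)] += v
    return dict(totals)


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional table."""
    return {(b, a): v for (a, b), v in table.items()}


sales_2024 = slice_cube(CUBE, "year", 2024)
lyon_2024 = dice(CUBE, year={2024}, city={"Lyon"})
per_year = drill_up(CUBE, ["year"])
city_by_product = pivot(sales_2024)
print(per_year)
```

Drilling down is the reverse of drilling up: it requires going back to the finer-grained cells, so it cannot be computed from the aggregated table alone.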
3.6. ETL
Snowflake Schema
3.6. ETL
ETL: From one data store to another
From: Data sources
Internal or external databases
Web Services
To: Data warehouses
Enterprise warehouses
Web warehouses
3.7. Data Analysis
Activities of data analysis
Retrieve values
Filter
Compute derived values
Find extremum
Sort
Determine range
Characterize distribution
Find anomalies
Cluster
Correlate
Contextualization
https://en.wikipedia.org/wiki/Data_analysis
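Several of these activities fit in a few lines of standard-library Python. The measurements are hypothetical, and the anomaly threshold of 1.5 standard deviations is an arbitrary choice for this sketch.

```python
import statistics

# Hypothetical measurements: (temperature in °C, reaction yield in %).
data = [(20, 51.0), (25, 55.5), (30, 61.0), (35, 64.5), (40, 71.0), (45, 40.0)]
yields = [y for _, y in data]

filtered = [(t, y) for t, y in data if t >= 30]           # filter
best = max(data, key=lambda p: p[1])                      # find extremum
ranked = sorted(data, key=lambda p: p[1], reverse=True)   # sort
value_range = (min(yields), max(yields))                  # determine range
mean = statistics.mean(yields)                            # characterize distribution
spread = statistics.stdev(yields)

# Find anomalies: points far from the mean (threshold is an assumption).
anomalies = [(t, y) for t, y in data if abs(y - mean) > 1.5 * spread]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Correlate, once with all points and once without the flagged anomalies.
clean = [p for p in data if p not in anomalies]
r_all = pearson([t for t, _ in data], yields)
r_clean = pearson([t for t, _ in clean], [y for _, y in clean])
print(best, value_range, anomalies, round(r_all, 3), round(r_clean, 3))
```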
3.8. Data Visualization
Data Visualization
Time-series
Ranking
Part-to-whole
Deviation
Sort
Frequency distribution
Correlation
Nominal comparison
Geographic or geospatial
https://en.wikipedia.org/wiki/Data_visualization
3.8. Data Visualization
Data Visualization: Examples
Bar-chart (Nominal comparison)
Pie-chart (part-to-whole)
Histograms (frequency-distribution)
Scatter-plot (correlation)
Network
Line-chart (time-series)
Treemap
Gantt chart
Heatmap
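A histogram is a frequency distribution drawn as bars. The sketch below computes the bins and renders them as text so that it needs no plotting library; in practice a library such as matplotlib would draw the chart. The values and bin width are hypothetical.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count integer values per bin; each bin is labelled by its lower bound."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def render(hist, bin_width):
    """Draw one text bar per bin, one '#' per counted value."""
    return "\n".join(
        f"{low:>3}-{low + bin_width - 1:<3} {'#' * count}" for low, count in hist.items()
    )


values = [12, 15, 21, 22, 25, 31, 33, 34, 38, 47]
hist = histogram(values, 10)
print(render(hist, 10))
```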
3.8. Data Visualization
Pie Chart
3.8. Data Visualization
Programming Language Paradigms (Bubble Chart)
3.8. Data Visualization
Timeline of Programming Languages (using Histropedia)
3.8. Data Visualization
Influence Graph of Programming Languages
3.8. Data Visualization
k Predominant colours
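The slide does not say how the k colours are obtained. One common approach is k-means clustering of the pixel values in RGB space, sketched below in plain Python with hypothetical pixels and a simplified, deterministic initialisation.

```python
def k_predominant_colours(pixels, k, iterations=10):
    """Return k representative RGB colours found by k-means clustering."""
    # Simplified initialisation (an assumption of this sketch):
    # k pixels taken at even intervals from the input list.
    centroids = [pixels[i * len(pixels) // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each pixel joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return [tuple(round(channel) for channel in c) for c in centroids]


# Hypothetical image: three reddish and three bluish pixels.
pixels = [
    (250, 10, 10), (240, 20, 15), (255, 0, 5),
    (10, 10, 250), (20, 15, 240), (5, 0, 255),
]
colours = k_predominant_colours(pixels, k=2)
print(colours)
```

For real images, a library implementation with a better initialisation (for example k-means++) is the more usual choice.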
3.8. Data Visualization
RGB Scatter plots (Comparison)
References
Websites
https://jupyter.org/
https://www.wikidata.org/
Colours
Color Tool - Material Design
Images
Wikimedia Commons