Data Science for Chemists

IPL Summer School, CPE Lyon

1. Introduction to Data Science

John Samuel
CPE Lyon

Year: 2024-2025
Email: john.samuel@cpe.fr

Creative Commons License

Goals: Introduction to Data Science

1.1. History of Data Science and computing

Sundial

  • Time Measurement: Used to indicate the time based on the position of the sun and to track seasonal changes and equinoxes.
  • Basic Principle: Shadow cast by a gnomon on a graduated surface.

Sundials are a testament to the scientific ingenuity of antiquity. They influenced the subsequent development of astronomical instruments.

Ancient Egyptian sundial (1500 B.C.)

Numeration System

Examples of Ancient Systems

Applications

Transition to Modern Systems

Typewriter

Electronic Typewriter

Partial Automation: Reduction of manual tasks in data entry.

Blaise Pascal's six-digit calculating machine
Charles Babbage's Difference Engine

Blaise Pascal's Calculating Machine

Charles Babbage's Difference Engine

Automation of Calculations: Reduction in the time required to perform complex calculations.

Scientific Advances: Facilitation of scientific research through more efficient calculation tools.

The ENIAC (photo taken between 1947 and 1955).
IBM PC 5150 in 1983

ENIAC (completed in 1945, retired in 1955)

IBM PC 5150 (introduced in 1981)

Democratization of Computing: Transition to widespread accessibility and use of computers.

Predecessors of Current Technologies: Foundation of modern computer systems.

Dot Matrix Printer

  • Development: Introduced in the 1970s.
  • Impact Technology: A print head drives pins against an ink ribbon to form characters from dots.
  • Versatility: Adapted for printing documents and reports.

Data Output: Facilitation of visualization of processed information.

Commercial Use: Widely adopted in professional environments.

Dot matrix printer (Panasonic)

8-inch, 5.25-inch, and 3.5-inch floppy disks
The inside of a hard disk drive

Floppy Disks (8-inch, 5.25-inch, and 3.5-inch)

Hard Disk Drive

Portable and Mass Storage: Floppy disks for portable data; hard disk drives for high-capacity permanent storage.

Servers: Impact on data storage methods and contribution to centralized data management.

  • Origins: Emergence of servers in the early days of computing.
  • Centralization of Resources: Use of servers to centralize storage and data management.
  • Network Connectivity: Integration of servers into network environments.
Storage: Servers

Evolution of Server Technologies

  • Capacity Improvements: Increase in server storage capacity over time.
  • Virtualization: Introduction of virtualization technologies for more efficient resource utilization.
  • Cloud Storage: Transition to cloud-based storage solutions.

Impact

  • Centralization and Sharing: Facilitation of data centralization and sharing.
  • Security and Redundancy: Use of servers to ensure data security and redundancy.
  • Precursors to Modern Data Infrastructures: Foundation of current storage systems.

The global growth in data storage capacity and information

1.2. Computer Architecture and Systems

Growth of Storage Capacities

Emerging Storage Technologies

Systems

Distributed Systems (a,b)

  • Origins: Development of distributed systems concepts in the 1960s.
  • Characteristics: Distribution of tasks across network-connected machines.
  • Modern Advancements: Use in contemporary cloud applications and distributed networks.
Distributed Systems
https://commons.wikimedia.org/wiki/File:Distributed-parallel.svg

Parallel Systems (c)

  • Development: Emergence of parallel systems to execute simultaneous tasks.
  • Parallel Processing: Use of multiple processors to accelerate processing.
  • Current Applications: Integration into supercomputers and intensive computing environments.
Parallel Systems
https://commons.wikimedia.org/wiki/File:Distributed-parallel.svg
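
The idea of splitting one task across several workers can be sketched in a few lines of Python. This is a minimal illustration, not material from the course: the function names (`split`, `sum_of_squares`, `parallel_sum_of_squares`) and the sum-of-squares workload are assumptions chosen for brevity. It uses a thread pool from the standard library to show the pattern; in CPython, threads do not speed up CPU-bound pure-Python code, so a process pool (`ProcessPoolExecutor`, which offers the same `map` interface) would be the usual choice when real speed-up on multiple processors is the goal.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, n_chunks):
    """Divide a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work carried out independently by each worker (illustrative workload)."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Split the input, process the chunks on a pool of workers, combine the results."""
    chunks = split(values, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1000)))
print(total)
```

The three stages (split, process independently, combine) are the part that carries over to larger parallel systems; only the kind of worker changes.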

Distributed Computing

The following projects have utilized the processing power of personal computers for various purposes:

Google Search Trends (November 2020): Big Data

Google Search Trends (November 2020): Big Data and Artificial Intelligence

1.3. Major phases of data analysis

Lifecycle of Data

  1. Data
  2. Knowledge
  3. Insights
  4. Actions
Data Lifecycle

From Data to Knowledge

  1. Data Acquisition
  2. Data Extraction
  3. Data Cleaning
  4. Data Transformation
  5. Data Analysis Modeling
  6. Data Storage
  7. Analysis
  8. Visualization
Major steps of data analysis
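
As a rough illustration of how a few of these steps can chain together, the sketch below cleans a small set of readings, converts them to numbers, and fits a straight line. Everything in it is an assumption made for the example: the concentration/absorbance values are invented, and the function names (`clean`, `transform`, `fit_line`) are hypothetical rather than part of any library.

```python
from statistics import mean

# Hypothetical raw readings (acquisition): concentration in mol/L, absorbance.
# Values arrive as text, as they often do when read from a file.
RAW = [
    ("0.10", "0.21"),
    ("0.20", "0.39"),
    ("0.30", ""),  # missing absorbance reading
    ("0.40", "0.80"),
    ("0.50", "1.01"),
]


def clean(rows):
    """Cleaning: drop rows in which any field is empty."""
    return [row for row in rows if all(field.strip() for field in row)]


def transform(rows):
    """Transformation: convert text fields to floating-point numbers."""
    return [(float(c), float(a)) for c, a in rows]


def fit_line(points):
    """Analysis: ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


points = transform(clean(RAW))
slope, intercept = fit_line(points)
print(f"{len(points)} usable points, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Keeping each step in its own function makes it easy to inspect or replace one stage, for instance a stricter cleaning rule, without touching the others.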

Data Acquisition

ETL (Extraction, Transformation and Loading)

  1. Data Extraction
  2. Data Cleaning
  3. Data Transformation
  4. Loading data to information stores
ETL (Extraction, Transformation and Loading)
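
The four ETL steps can be sketched with Python's standard library alone. The example below is a minimal illustration under stated assumptions: the CSV content, the column names, the `measurements` table, and the 0 to 14 plausibility filter on pH are all invented for the example, and an in-memory SQLite database stands in for the information store.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this might be an instrument export file.
SOURCE_CSV = """sample_id,ph,temperature_c
S1,7.2,21.5
S2,,22.0
S3,6.8,20.9
S4,15.3,21.1
"""


def extract(text):
    """Extraction: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: keep rows whose pH is present and within 0-14 (assumed filter)."""
    kept = []
    for row in rows:
        try:
            ph = float(row["ph"])
        except ValueError:
            continue  # missing or non-numeric value
        if 0.0 <= ph <= 14.0:
            kept.append(row)
    return kept


def transform(rows):
    """Transformation: convert types and express temperature in kelvin."""
    return [
        (row["sample_id"], float(row["ph"]), float(row["temperature_c"]) + 273.15)
        for row in rows
    ]


def load(records, connection):
    """Loading: write the records into a table of the target store."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS measurements "
        "(sample_id TEXT PRIMARY KEY, ph REAL, temperature_k REAL)"
    )
    connection.executemany("INSERT INTO measurements VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(clean(extract(SOURCE_CSV))), connection)
stored = connection.execute(
    "SELECT sample_id, ph FROM measurements ORDER BY sample_id"
).fetchall()
print(stored)
```

Samples S2 (missing pH) and S4 (value outside the assumed range) are dropped during cleaning, so only S1 and S3 reach the store.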

Data Analysis

Data Visualization

1.4. Algorithms for data acquisition and process control

Sampling Techniques

Signal Processing Techniques

Chemometric Methods

Data Fusion and Integration

Chemical Reaction Control

Quality Control and Monitoring

Modeling and Simulation

Machine Learning

Importance

1.5. Applications

Sustainable Cities

  1. Urban Planning and Design: Spatial analysis, predictive modeling, simulation, urban digital twins.
  2. Energy Efficiency and Management: Smart grids, building energy management, renewable energy integration.
  3. Transportation and Mobility: Traffic management, public transport optimization, urban mobility applications.
  4. Waste Management and Recycling: Smart waste collection, waste tracking.
  5. Water Management: Smart water systems, water conservation.
  6. Environmental Monitoring and Air Quality: Sensor networks, early warning systems.
  7. Community Engagement and Decision Support: Data-driven systems, citizen feedback systems.

Energy Transition

  1. Renewable Energy Integration and Optimization
  2. Smart Grids and Energy Management
  3. Energy Efficiency and Conservation
  4. Policy Making and Decision Support
  5. Consumer Engagement and Behavior Change
  6. Financial and Investment Decisions
