Data Science for Chemists

IPL Summer School, CPE Lyon

4. Data Mining

John Samuel
CPE Lyon

Year: 2024-2025

Creative Commons License

Data Mining


1. Patterns

Patterns in Nature

Patterns by Humans

Pattern creation

Pattern Recognition: Approaches

Examples of Features

Feature construction, the derivation of new features from existing ones, is an essential step in the data preprocessing pipeline in machine learning because it can make the data more informative for learning algorithms.
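As an illustration of feature construction, the following minimal sketch derives new columns from raw measurements. The sample values, the column names (`mass_g`, `volume_ml`, `temperature_c`) and the helper `construct_features` are hypothetical and chosen only for this example; which derived features are useful depends on the dataset and the learning task.

```python
# Hypothetical raw measurements for three samples (values invented for illustration).
samples = [
    {"mass_g": 10.0, "volume_ml": 5.0, "temperature_c": 25.0},
    {"mass_g": 8.0, "volume_ml": 10.0, "temperature_c": 30.0},
    {"mass_g": 27.0, "volume_ml": 10.0, "temperature_c": 20.0},
]


def construct_features(sample):
    """Return a copy of the sample extended with derived features."""
    enriched = dict(sample)
    # Ratio feature: density combines two raw columns into one quantity.
    enriched["density_g_per_ml"] = sample["mass_g"] / sample["volume_ml"]
    # Unit conversion: temperature expressed in kelvin.
    enriched["temperature_k"] = sample["temperature_c"] + 273.15
    # Polynomial feature: squared volume, for targets that are nonlinear in volume.
    enriched["volume_ml_squared"] = sample["volume_ml"] ** 2
    return enriched


enriched_samples = [construct_features(s) for s in samples]
for row in enriched_samples:
    print(row)
```

Each enriched sample keeps its raw columns, so a later feature selection step can still decide which of the raw and derived features to keep.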

Formalization: Supervised learning

Formalization: Unsupervised learning

Formalization: Semi-supervised learning

2. Data Mining

Common Tasks in Data Mining

  1. Classification
  2. Clustering
  3. Regression

2.1. Classification


Formal definition

Confusion Matrix

Confusion Matrix for an SVM classifier of handwritten digits (MNIST)

Confusion Matrix plot for a Perceptron classifier of handwritten digits (MNIST)
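To make the idea of a confusion matrix concrete, here is a minimal sketch in plain Python. It assumes the common convention that rows correspond to true classes and columns to predicted classes; the labels are a small invented example, not MNIST results, and the helper names `confusion_matrix` and `accuracy` are chosen for this illustration.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical true and predicted digits for eight samples.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred, classes=[0, 1, 2])
for row in cm:
    print(row)
# [1, 1, 0]
# [0, 3, 0]
# [1, 0, 2]
print(accuracy(cm))  # 0.75
```

Off-diagonal entries show which classes are confused with each other: here one sample of class 0 was predicted as 1, and one sample of class 2 was predicted as 0.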

Multiclass classification


2.2. Clustering

Formal definition

Cluster models

2.3. Regression

Formal definition

Linear regression
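As a sketch of simple linear regression, the code below fits a line y = slope * x + intercept by ordinary least squares using the closed-form solution. The calibration-style data (concentration against absorbance) are invented for illustration, and the function name `fit_line` is an assumption of this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration data: concentration (x) against absorbance (y).
concentration = [0.0, 1.0, 2.0, 3.0]
absorbance = [0.1, 0.6, 1.1, 1.6]

slope, intercept = fit_line(concentration, absorbance)
print(slope, intercept)  # approximately 0.5 and 0.1

# Predict the response for a new, unseen concentration.
predicted = slope * 2.5 + intercept
print(predicted)  # approximately 1.35
```

The data here lie exactly on a line, so the fit recovers it; with noisy measurements the same formulas give the line that minimizes the sum of squared residuals.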


3. Algorithms

  1. Support Vector Machines (SVM)
  2. Decision Trees
  3. Ensemble Methods (Random Forest)

3.1. Support Vector Machines (SVM)


Normal vector

Formal definition


Data mining tasks


3.2. Decision Trees

Decision Trees

A decision tree is a decision support tool that uses a tree-like model to represent decisions and their possible consequences.
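To show how a single decision node in such a tree can be learned from data, the following sketch searches for the threshold on one numeric feature that best separates two classes. It assumes Gini impurity as the split criterion, which is one common choice and is not stated in the slides; the pH values, labels and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best rule 'value <= threshold'."""
    best = (None, float("inf"))
    unique_values = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(unique_values, unique_values[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical samples: pH measurements with an acid/base label.
ph = [2.0, 3.0, 4.0, 8.0, 9.0, 10.0]
label = ["acid", "acid", "acid", "base", "base", "base"]

threshold, impurity = best_split(ph, label)
print(threshold, impurity)  # 6.0 0.0
```

A full tree would apply the same search recursively to the left and right groups, over all features, until the groups are pure or a stopping rule is met.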


3.3. Ensemble Methods (Random Forest)

Ensemble learning

Ensemble learning is a technique that combines multiple learning models to achieve better predictive performance than a single model. Decision tree forests (random forests) are one such method: they are obtained by constructing multiple decision trees during the training phase.
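The following sketch illustrates the ensemble idea with bootstrap aggregation (bagging) and majority voting. It is a deliberate simplification and not a full random forest: each "tree" is a one-level decision stump on a single feature, and the random feature subsampling used by random forests is omitted. The data and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels, default):
    """Most frequent label, or the default for an empty group."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def predict_stump(rule, x):
    threshold, left_label, right_label = rule
    return left_label if x <= threshold else right_label


def train_stump(xs, ys):
    """Fit a one-level tree: pick the threshold with the fewest training errors."""
    default = majority(ys, None)
    best_rule, best_errors = None, len(ys) + 1
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        rule = (threshold, majority(left, default), majority(right, default))
        errors = sum(predict_stump(rule, x) != y for x, y in zip(xs, ys))
        if errors < best_errors:
            best_rule, best_errors = rule, errors
    return best_rule


def train_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        forest.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(rule, x) for rule in forest)
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional data with two classes.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = ["low"] * 4 + ["high"] * 4

forest = train_forest(xs, ys, n_trees=25, seed=0)
prediction_low = predict_forest(forest, 2.0)
prediction_high = predict_forest(forest, 8.0)
print(prediction_low, prediction_high)
```

Because every stump sees a different bootstrap sample, individual stumps may place their threshold differently; the vote averages out these differences, which is the source of the improved stability of ensembles.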

4. Feature Selection

Feature selection is the process of choosing a subset of relevant features from a larger set of available features.
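One simple family of feature selection methods ranks each feature by a relevance score and keeps the best ones. The sketch below assumes the absolute Pearson correlation with the target as that score, which is only one of many possible criteria; the feature names, values and the helper `select_k_best` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)


def select_k_best(features, target, k):
    """Keep the k features with the highest absolute correlation to the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical dataset: three candidate features and one target.
features = {
    "concentration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "batch_id": [3.0, 1.0, 4.0, 1.0, 5.0],
    "constant_offset": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

selected, scores = select_k_best(features, target, k=1)
print(selected)  # ['concentration']
```

A score computed per feature cannot detect features that are only useful in combination, so this kind of filter is usually a first pass rather than a complete selection procedure.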

Formal definition [8]


Research articles

  1. From data mining to knowledge discovery in databases, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, AI Magazine Volume 17 Number 3 (1996)
  2. Survey of Clustering Data Mining Techniques, Pavel Berkhin
  3. Mining association rules between sets of items in large databases, Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD 1993. p. 207.
  4. Comparisons of Sequence Labeling Algorithms and Extensions, Nguyen, Nam, and Yunsong Guo. Proceedings of the 24th international conference on Machine learning. ACM, 2007.


  5. An Analysis of Active Learning Strategies for Sequence Labeling Tasks, Settles, Burr, and Mark Craven. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
  6. Anomaly detection in crowded scenes, Mahadevan, Vijay, et al. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
  7. A Study of Global Inference Algorithms in Multi-Document Summarization, McDonald, Ryan. European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2007.
  8. Feature selection algorithms: A survey and experimental evaluation, Molina, Luis Carlos, Lluís Belanche, and Àngela Nebot. Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
  9. Support vector machines, Hearst, Marti A., et al. IEEE Intelligent Systems and their applications 13.4 (1998): 18-28.


Online resources (English Wikipedia)




