Data Science for Chemists

IPL Summer School, CPE Lyon

4. Data Mining

John Samuel
CPE Lyon

Year: 2023-2024
Email: john.samuel@cpe.fr

Creative Commons License

Data Mining

Goals

1. Patterns

Patterns in Nature

Patterns by Humans

Pattern creation

Synonyms

Pattern Recognition: Approaches

Formalization

Examples of Features

Formalization

Example

Feature construction, the derivation of new features from raw data, is a key step of the data preprocessing pipeline in machine learning: it can make the data more informative for learning algorithms.
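As an illustration of feature construction, the sketch below derives a few numeric features from a raw elemental composition. The function name `construct_features`, the chosen features, and the small atomic-mass table are assumptions made for this example, not part of the course material.

```python
# Approximate standard atomic masses in g/mol (small illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def construct_features(composition):
    """Derive new features from a raw, non-empty elemental composition.

    `composition` maps element symbols to atom counts,
    e.g. {"C": 2, "H": 6, "O": 1} for ethanol.
    """
    molar_mass = sum(ATOMIC_MASS[el] * n for el, n in composition.items())
    heavy_atoms = sum(n for el, n in composition.items() if el != "H")
    carbons = composition.get("C", 0)
    hydrogens = composition.get("H", 0)
    return {
        "molar_mass": molar_mass,
        "heavy_atoms": heavy_atoms,
        # Undefined for carbon-free compounds, so None is returned in that case.
        "h_to_c_ratio": hydrogens / carbons if carbons else None,
        "oxygen_mass_fraction": ATOMIC_MASS["O"] * composition.get("O", 0) / molar_mass,
    }


ethanol = {"C": 2, "H": 6, "O": 1}
features = construct_features(ethanol)
print(features)
```

The raw input (atom counts) is hard for many learning algorithms to use directly; the constructed values are plain numbers that can be placed in a feature vector.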

Formalization: Supervised learning

Formalization: Unsupervised learning

Formalization: Semi-supervised learning

2. Data Mining

Common Tasks in Data Mining

  1. Classification
  2. Clustering
  3. Regression

2.1. Classification

Classifiers

Applications

Formal definition

Classifiers

Let

Then

Confusion Matrix

Confusion matrix for an SVM classifier of handwritten digits (MNIST)

Confusion matrix plot for a Perceptron classifier of handwritten digits (MNIST)
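A confusion matrix counts, for every pair of classes, how many samples of the true class were predicted as the other class. The sketch below builds one from two label lists using only the standard library. The toy labels are hypothetical and are not MNIST results; the convention "rows are true classes, columns are predicted classes" is an assumption of this example.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return matrix[i][j]: number of samples with true class labels[i]
    that were predicted as class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical toy labels for three digit classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for row in cm:
    print(row)
print("accuracy:", accuracy(cm))
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number hides.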

Multiclass classification

2.2. Clustering

Applications

Formal definition

Cluster models

2.3. Regression

Applications

Formal definition

Linear regression
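For a single input variable, linear regression fits a line y = slope * x + intercept. A minimal sketch using the closed-form ordinary least squares solution is shown below. The calibration data (concentration versus absorbance) are hypothetical and noise-free, chosen only so that the result is easy to check; the function name `fit_line` is also an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration points lying exactly on y = 0.2 * x + 0.01.
concentration = [0.0, 1.0, 2.0, 3.0, 4.0]
absorbance = [0.01, 0.21, 0.41, 0.61, 0.81]

slope, intercept = fit_line(concentration, absorbance)
predicted = slope * 2.5 + intercept  # prediction for an unseen concentration
print(slope, intercept, predicted)
```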

3. Algorithms

  1. Support Vector Machines (SVM)
  2. Decision Trees
  3. Ensemble Methods (Random Forest)

3.1. Support Vector Machines (SVM)

Introduction

Hyperplane

Normal vector

Formal definition
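As a hedged illustration of the hyperplane and normal vector used by a linear SVM, the sketch below evaluates a given hyperplane w.x + b = 0: the sign of w.x + b gives the predicted class, and |w.x + b| / ||w|| gives the distance of a point to the hyperplane. The sign convention (+ b), the function names, and the values of `w` and `b` are assumptions for this example; no training is performed.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; w is the normal vector of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Hypothetical hyperplane in two dimensions.
w, b = [3.0, 4.0], -5.0

print(classify(w, b, [3.0, 4.0]), distance_to_hyperplane(w, b, [3.0, 4.0]))
print(classify(w, b, [0.0, 0.0]), distance_to_hyperplane(w, b, [0.0, 0.0]))
```

Training an SVM amounts to choosing `w` and `b` so that the margin between the two classes is as large as possible; that optimization step is outside the scope of this sketch.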

Data mining tasks

Applications

3.2. Decision Trees

Decision Trees

A decision tree is a decision support tool that uses a tree-like model to represent decisions and their possible consequences.
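The tree-like model can be sketched as nested nodes, where each internal node tests one feature against a threshold and each leaf holds a decision. The hand-written tree below, which classifies the state of water at 1 atm from its temperature, and the dictionary layout are assumptions for illustration; in practice the tree is learned from data.

```python
# Internal nodes test `feature < threshold`; leaves carry a label.
tree = {
    "feature": "temperature_c",
    "threshold": 0.0,
    "left": {"label": "solid"},
    "right": {
        "feature": "temperature_c",
        "threshold": 100.0,
        "left": {"label": "liquid"},
        "right": {"label": "gas"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return the leaf label."""
    while "label" not in node:
        goes_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


for temperature in (-5.0, 25.0, 120.0):
    print(temperature, predict(tree, {"temperature_c": temperature}))
```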

Applications

3.3. Ensemble Methods (Random Forest)

Ensemble learning

Ensemble learning combines multiple learning models to achieve better predictive performance than a single model. Decision tree forests (random forests) are one such method: multiple decision trees are constructed during the training phase, and their predictions are combined, typically by majority vote for classification or by averaging for regression.
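The idea of building several trees and combining them can be sketched with the standard library alone. The example below is a simplified, bagging-style sketch: each "tree" is a one-split stump trained on a bootstrap sample, and the forest predicts by majority vote. The random feature subsets that a full random forest also uses are omitted because the toy data has a single feature. All names and the data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(samples):
    """Fit a one-split tree on (x, label) pairs by minimizing training errors."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        left = [y for x, y in samples if x < threshold]
        right = [y for x, y in samples if x >= threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority label
        label = majority([y for _, y in samples])
        return (float("-inf"), label, label)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x < threshold else right_label


def fit_forest(samples, n_trees, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    return [fit_stump([rng.choice(samples) for _ in samples]) for _ in range(n_trees)]


def forest_predict(forest, x):
    """Combine the individual predictions by majority vote."""
    return majority([stump_predict(stump, x) for stump in forest])


# Hypothetical, well-separated one-dimensional training data.
data = [(x, "low") for x in (1.0, 2.0, 3.0, 4.0)] + [(x, "high") for x in (10.0, 11.0, 12.0, 13.0)]
forest = fit_forest(data, n_trees=25, rng=random.Random(0))

print(forest_predict(forest, 2.5), forest_predict(forest, 20.0))
```

Because each stump sees a different bootstrap sample, individual stumps may place their split differently; the vote smooths out these differences.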

4. Feature Selection

Feature selection is the process of choosing a subset of relevant features from a larger set of available features.
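One simple way to choose such a subset is a filter method: score every feature independently and keep the k best. The sketch below ranks features by the absolute Pearson correlation with the target. This scoring rule is only one of many possible choices, and the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_k_best(columns, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical feature columns and target values.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = {
    "molar_mass": [2.0, 4.0, 6.0, 8.0, 10.0],
    "polarity_index": [5.0, 3.0, 4.0, 1.0, 2.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
    "batch_id": [7.0, 7.0, 7.0, 7.0, 7.0],
}

selected, scores = select_k_best(columns, target, k=2)
print(selected)
```

A filter of this kind ignores interactions between features, so it is best seen as a cheap first pass.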

Applications

Formal definition [8]

References

Research articles

  1. From data mining to knowledge discovery in databases, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, AI Magazine Volume 17 Number 3 (1996)
  2. Survey of Clustering Data Mining Techniques, Pavel Berkhin
  3. Mining association rules between sets of items in large databases, Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD 1993. p. 207.
  4. Comparisons of Sequence Labeling Algorithms and Extensions, Nguyen, Nam, and Yunsong Guo. Proceedings of the 24th international conference on Machine learning. ACM, 2007.
  5. An Analysis of Active Learning Strategies for Sequence Labeling Tasks, Settles, Burr, and Mark Craven. Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008.
  6. Anomaly detection in crowded scenes, Mahadevan, Vijay, et al. 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010.
  7. A Study of Global Inference Algorithms in Multi-Document Summarization, McDonald, Ryan. European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2007.
  8. Feature selection algorithms: A survey and experimental evaluation, Molina, Luis Carlos, Lluís Belanche, and Àngela Nebot. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002). IEEE, 2002.
  9. Support vector machines, Hearst, Marti A., et al. IEEE Intelligent Systems and their Applications 13.4 (1998): 18-28.

References

Online resources (English Wikipedia)
