Data Mining

John Samuel
CPE Lyon

Year: 2019-2020
Email: john(dot)samuel(at)cpe(dot)fr

Creative Commons License

Goals

1. Patterns

Patterns in Nature

Patterns by Humans

Pattern creation

Synonyms

Pattern Recognition

Formalization

Examples: Features

Example

  1. https://en.wikipedia.org/wiki/Feature_vector

Formalization: Supervised learning

Formalization: Unsupervised learning

Formalization: Semi-supervised learning

2. Data Mining

Tasks in Data Mining

  1. Classification
  2. Clustering
  3. Regression
  4. Sequence Labeling
  5. Association Rules
  6. Anomaly Detection
  7. Summarization

2.1. Classification

Applications

Formal definition

Classifiers

Linear Classifiers

Classifiers

Let

Then

Confusion Matrix

Confusion Matrix for a SVM classifier of handwritten digits (MNIST)

Confusion Matrix plot for a Perceptron classifier of handwritten digits (MNIST)
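
A confusion matrix counts, for each true class (row), how often each class was predicted (column). The sketch below builds one in plain Python. The function name `confusion_matrix` and the toy labels are illustrative assumptions; they are not the MNIST data behind the figures.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return matrix[i][j] = number of samples of true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(y_true, y_pred):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


# Hypothetical toy labels for a 3-class problem.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Correct predictions sit on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)

for row in cm:
    print(row)
print("accuracy:", accuracy)
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy figure hides.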

Multiclass classification

One-vs.-rest (One-vs.-all) strategy

One-vs.-rest strategy for Multiclass classification
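
The one-vs.-rest strategy trains one binary classifier per class (that class against all the others) and predicts the class whose classifier is most confident. The sketch below is a minimal illustration. The base classifier is a hypothetical centroid-distance scorer chosen only to keep the example short, and all names and data points are assumptions.

```python
from math import dist


def centroid(points):
    """Mean point of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def train_binary(positives, negatives):
    """Toy binary scorer: a higher score means closer to the positive centroid."""
    c_pos, c_neg = centroid(positives), centroid(negatives)
    return lambda x: dist(x, c_neg) - dist(x, c_pos)


def train_one_vs_rest(X, y):
    """Train one 'this class vs. all other classes' scorer per class."""
    scorers = {}
    for label in sorted(set(y)):
        positives = [x for x, t in zip(X, y) if t == label]
        negatives = [x for x, t in zip(X, y) if t != label]
        scorers[label] = train_binary(positives, negatives)
    return scorers


def predict_one_vs_rest(scorers, x):
    """Pick the class whose binary scorer is most confident."""
    return max(scorers, key=lambda label: scorers[label](x))


# Hypothetical 2-D training data with three classes.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = ["a", "a", "b", "b", "c", "c"]

ovr_model = train_one_vs_rest(X, y)
ovr_predictions = [predict_one_vs_rest(ovr_model, p) for p in [(0.5, 0.5), (4, 5), (9, 0)]]
print(ovr_predictions)
```

For K classes this trains K binary classifiers, each on the full training set.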

One-vs.-one strategy

One-vs.-one strategy for Multiclass classification
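
The one-vs.-one strategy trains one binary classifier for every pair of classes and lets them vote. The sketch below assumes a hypothetical nearest-centroid rule as the pairwise classifier; the function names and data are illustrative only.

```python
from collections import Counter
from itertools import combinations
from math import dist


def centroid(points):
    """Mean point of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def train_pair(points_a, points_b):
    """Toy pairwise classifier: True means 'closer to the first class'."""
    c_a, c_b = centroid(points_a), centroid(points_b)
    return lambda x: dist(x, c_a) <= dist(x, c_b)


def train_one_vs_one(X, y):
    """Train one classifier per unordered pair of classes."""
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    return {
        (a, b): train_pair(by_class[a], by_class[b])
        for a, b in combinations(sorted(by_class), 2)
    }


def predict_one_vs_one(classifiers, x):
    """Each pairwise classifier casts one vote; the most voted class wins."""
    votes = Counter()
    for (a, b), prefers_a in classifiers.items():
        votes[a if prefers_a(x) else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data with three classes.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = ["a", "a", "b", "b", "c", "c"]

ovo_model = train_one_vs_one(X, y)
ovo_predictions = [predict_one_vs_one(ovo_model, p) for p in [(0.5, 0.5), (4, 5), (9, 0)]]
print(len(ovo_model), ovo_predictions)
```

For K classes this trains K(K-1)/2 classifiers, each on the samples of only two classes.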

2.2. Clustering

Applications

Formal definition

Cluster models

2.3. Regression

Applications

Formal definition

Linear regression
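
As a small illustration of linear regression, the sketch below fits a line y = slope * x + intercept by ordinary least squares for a single input variable. The choice of least squares, the function name `fit_line`, and the data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted value at x = 6
print(slope, intercept, prediction)
```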

2.4. Sequence Labeling

Applications

Formal definition

2.5. Association Rules

Applications

Formal definition

Example
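
A possible worked example, assuming the usual support and confidence measures for a rule X => Y over a set of transactions. The transactions and item names below are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (X and Y together) divided by support of X, for the rule X => Y."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule {bread} => {milk} holds in 3 of the 5 transactions (support 0.6), and in 3 of the 4 transactions that contain bread (confidence about 0.75).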

2.6. Anomaly Detection

Applications

Characteristics

Formalization

2.7. Summarization

Applications

Formalization: Multidocument summarization

Extractive summarization

3. Algorithms

  1. Support Vector Machines (SVM)
  2. Stochastic Gradient Descent (SGD)
  3. Nearest-Neighbours
  4. Naive Bayes
  5. Decision Trees
  6. Ensemble Methods (Random Forest)

3.1. Support Vector Machines (SVM)

Introduction

Hyperplane

Formal definition

Normal vector

Data mining tasks

Applications

3.2. Stochastic Gradient Descent (SGD)

Gradient

Gradient vs Derivative

Gradient descent

Standard gradient descent method
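
Standard (batch) gradient descent repeatedly moves against the gradient of the objective. The sketch below minimizes a one-dimensional function; the objective f(x) = (x - 3)^2, the learning rate, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeat the update x <- x - learning_rate * grad(x) a fixed number of times."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # close to 3, the minimizer of f
```

With this learning rate the distance to the minimizer shrinks by a factor of 0.8 at every step.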

Iterative method
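
In the stochastic variant, each update uses the gradient of the loss on a single training sample instead of the whole data set. The sketch below fits y = w * x + b with a squared-error loss. The model, data, learning rate, and seed are assumptions for illustration.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.01, epochs=1000, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            error = w * xs[i] + b - ys[i]
            # Gradient of error**2 with respect to w and b, for this one sample.
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
w, b = sgd_linear_fit([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(w, b)  # close to 2 and 1
```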

Applications

3.3. Nearest-Neighbours

k-nearest neighbors algorithm
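
A minimal sketch of k-nearest neighbours classification: find the k training points closest to the query and take a majority vote over their labels. Euclidean distance, k = 3, and the toy data are assumptions for this example.

```python
from collections import Counter
from math import dist


def knn_predict(X, y, query, k=3):
    """Majority label among the k training points nearest to the query."""
    neighbours = sorted(zip(X, y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data with two classes.
X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y = ["red", "red", "red", "blue", "blue", "blue"]

label_near_origin = knn_predict(X, y, (2, 2))
label_far = knn_predict(X, y, (6, 5))
print(label_near_origin, label_far)
```

There is no training step: the whole data set is kept and searched at prediction time.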

Applications

3.4. Naive Bayes classifiers

Applications

Bayes' Theorem
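
Bayes' theorem states P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it to a hypothetical spam filter with a single word as evidence; the probabilities are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) from P(A), P(B|A) and P(B|not A), using Bayes' theorem."""
    # Total probability of the evidence B.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Assumed numbers: P(spam) = 0.2, P(word | spam) = 0.5, P(word | not spam) = 0.05.
p_spam_given_word = posterior(prior=0.2, likelihood=0.5, likelihood_given_not=0.05)
print(p_spam_given_word)
```

A naive Bayes classifier applies the same rule with several features, assuming they are conditionally independent given the class.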

3.5. Decision Trees

Applications

3.6. Ensemble Methods (Random Forest)

Definition

4. Feature Selection

Definition

Applications

Formal definition [8]

References

Research articles

  1. From data mining to knowledge discovery in databases, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, AI Magazine Volume 17 Number 3 (1996)
  2. Survey of Clustering Data Mining Techniques, Pavel Berkhin
  3. Mining association rules between sets of items in large databases, Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD 1993. p. 207.
  4. Comparisons of Sequence Labeling Algorithms and Extensions, Nguyen, Nam, and Yunsong Guo. Proceedings of the 24th international conference on Machine learning. ACM, 2007.

  5. An Analysis of Active Learning Strategies for Sequence Labeling Tasks, Settles, Burr, and Mark Craven. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
  6. Anomaly detection in crowded scenes, Mahadevan, Vijay, et al. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
  7. A Study of Global Inference Algorithms in Multi-Document Summarization, McDonald, Ryan. European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2007.
  8. Feature selection algorithms: A survey and experimental evaluation, Molina, Luis Carlos, Lluís Belanche, and Àngela Nebot. Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
  9. Support vector machines, Hearst, Marti A., et al. IEEE Intelligent Systems and their applications 13.4 (1998): 18-28.
