# Data Science for Chemists

## IPL Summer School, CPE Lyon

## 4. Data Mining

**John Samuel**

CPE Lyon

**Year**: 2023-2024

**Email**: john.samuel@cpe.fr

- Understanding Patterns
- Data mining tasks
- Algorithms for data mining
- Feature Selection

- Symmetry
- Trees, Fractals
- Spirals
- Chaos
- Waves
- Bubbles, Foam
- Tessellations
- Cracks
- Spots, stripes

- **Buildings (Symmetry)**: Human-made structures with symmetrical patterns. Example: Gothic cathedrals, modern skyscrapers.
- **Cities**: Planned or organic agglomerations inhabited by humans. Example: Paris, New York.
- **Virtual environments (e.g., video games)**: Digitally created spaces for human interaction. Example: open worlds in video games, virtual simulations.
- **Human artifacts**: Objects made by humans in various fields. Example: prehistoric tools, contemporary artworks.

- Repetition

**Fractals**: Self-similar mathematical structures repeating at different scales.

- Julia set: a fractal set defined by the iterated function *f(z) = z² + c*
- **Characteristics**: produces complex repetitive patterns when visualized
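The Julia set iteration *f(z) = z² + c* can be sketched with a short escape-time routine; the constants below are chosen purely for illustration:

```python
def julia_escape(z, c, max_iter=100, bound=2.0):
    """Iterate f(z) = z**2 + c and return the step at which |z| first
    exceeds the bound, or max_iter if the orbit stays bounded
    (i.e. the point belongs to the filled Julia set)."""
    for n in range(max_iter):
        if abs(z) > bound:
            return n
        z = z * z + c
    return max_iter

# c = -1 gives the period-2 orbit 0 -> -1 -> 0 -> ..., which never escapes.
inside = julia_escape(0j, -1)       # 100: stays bounded, point is in the set
outside = julia_escape(2 + 2j, -1)  # 0: already outside the bound
```

Evaluating `julia_escape` over a grid of complex starting points and coloring by the returned step count produces the familiar fractal images.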

- Pattern recognition
- Knowledge discovery in databases
- Data mining
- Machine learning

- **Supervised Learning**: The model is trained on a labeled dataset, where input examples are associated with desired outputs. The model learns to make predictions on new data based on these associations.
- **Unsupervised Learning**: The model is exposed to unlabeled data and seeks to discover patterns, structures, or intrinsic relationships within the data.
- **Semi-Supervised Learning**: A combination of the two above, using both labeled and unlabeled data for training.
- **Reinforcement Learning**: The model learns to make decisions by interacting with its environment. It receives rewards or penalties based on its actions, which guide its learning.

**Euclidean vector**:
- A Euclidean vector is a geometric object characterized by its magnitude (length) and direction.
- Euclidean vectors are commonly used to represent data as points in a multidimensional space, where each dimension corresponds to a feature or variable.

**Vector space**:
- A vector space is a collection of vectors that can be added together and multiplied by numbers (scalars).

**Feature vector**:
- A feature vector is an n-dimensional vector that represents the features or attributes of an entity.

**Feature space**:
- The feature space is the vector space associated with feature vectors.
- Each dimension of this space represents a particular feature, and vectors are used to position data in this space based on their features.

- **Images**: In the context of images, feature vectors can be constructed from pixel values. Each pixel can be considered a dimension, and a feature vector contains the values of all pixels, thus representing an image as a vector.
- **Texts**: For texts, feature vectors are often constructed from the frequency of words, phrases, or tokens in a document. This allows textual content to be represented with numerical values, which is essential for text analysis and information retrieval.
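The text case can be sketched with a term-frequency feature vector over a fixed vocabulary (the vocabulary and document below are invented for illustration):

```python
def tf_vector(text, vocabulary):
    """Represent a document as a term-frequency feature vector:
    one dimension per vocabulary word, holding that word's count."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["data", "mining", "chemistry"]
tf_vector("data mining finds patterns in data", vocab)  # [2, 1, 0]
```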

**Feature construction**¹:
- Feature construction involves creating new variables or attributes from those already present in the data.
- This step can be crucial for improving the performance of machine learning models by introducing relevant information and eliminating noise.

**Feature construction operators**:
- Construction operators are functions or mathematical operations used to create new features from existing ones.
- Commonly used operators include equality operators (comparisons), arithmetic operators (addition, subtraction, multiplication, division), array operators (min, max, mean, median, etc.), transformation functions, etc.

- https://en.wikipedia.org/wiki/Feature_vector

- Let **Year of Birth** and **Year of Death** be two existing features.
- A new feature called **age** is created: **age** = **Year of Death** - **Year of Birth**

Feature construction is an essential step in the data preprocessing pipeline in machine learning, as it can help make data more informative for learning algorithms.
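The age example above as a short sketch, applying an arithmetic construction operator to two existing features (the records are hypothetical):

```python
# Hypothetical records with two existing features.
records = [
    {"name": "A", "year_of_birth": 1867, "year_of_death": 1934},
    {"name": "B", "year_of_birth": 1822, "year_of_death": 1895},
]

# Construct the new feature: age = year_of_death - year_of_birth
for r in records:
    r["age"] = r["year_of_death"] - r["year_of_birth"]

[r["age"] for r in records]  # [67, 73]
```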

- Let **N** be the number of training examples
- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- Let {*(x₁, y₁), ..., (x_N, y_N)*} be the **N** training examples, where *x_i* is the feature vector of the *i*-th training example and *y_i* is its label.
- The goal of a supervised learning algorithm is to find *g*: **X** → **Y**, where *g* is one of the functions from the set of possible functions *G* (the hypothesis space).
- **Scoring function**: Let *F* denote the space of scoring functions *f*: **X** × **Y** → **R**; *g* returns the highest-scoring label, i.e., *g(x) = argmax_y f(x, y)*.
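The scoring-function view can be sketched as follows; the scoring rule *f* below is a toy invented purely for illustration:

```python
def f(x, y):
    """Toy scoring function f: X x Y -> R. Label 1 scores higher when
    the mean of the feature vector is positive (an assumption made
    only for this sketch)."""
    mean = sum(x) / len(x)
    return mean if y == 1 else -mean

def g(x, labels=(0, 1)):
    """The learned classifier: g(x) = argmax over y of f(x, y)."""
    return max(labels, key=lambda y: f(x, y))

g([0.5, 1.5])    # mean > 0, so label 1 scores higher -> 1
g([-2.0, -1.0])  # mean < 0, so label 0 scores higher -> 0
```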

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- The goal of an unsupervised learning algorithm is to
  - find a mapping **X** → **Y**

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- Let {*(x₁, y₁), ..., (x_l, y_l)*} be the set of **l** labeled training examples
- Let {*x_{l+1}, ..., x_{l+u}*} be the set of **u** unlabeled feature vectors of **X**
- The goal of a semi-supervised learning algorithm is to do either
  - **Transductive learning**, i.e., find the correct labels for {*x_{l+1}, ..., x_{l+u}*}, OR
  - **Inductive learning**, i.e., find the correct mapping **X** → **Y**
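Transductive learning can be sketched with a simple 1-nearest-neighbour labeling rule — a deliberate simplification for illustration, not a method prescribed by the slides:

```python
def transduce_1nn(labeled, unlabeled):
    """Assign each unlabeled point the label of its nearest labeled
    example (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(labeled, key=lambda xy: dist2(xy[0], x))[1] for x in unlabeled]

# Hypothetical data: two labeled points, two unlabeled points.
labeled = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]
transduce_1nn(labeled, [(0.5, 0.2), (4.8, 5.1)])  # ['A', 'B']
```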

- Classification
- Clustering
- Regression

**Algorithmic categorization of objects**: The process of assigning classes or categories to objects via algorithms. The goal is to organize data into distinct groups to facilitate analysis and decision-making.

**Class assignment**: Assigning a class or category to each object (or individual).

**Types of classification**:
- **Binary classification**: assignment to one of two classes.
- **Multi-class classification**: assignment to one of more than two classes.

- Spam vs Non-spam
- Document classification
- Handwriting recognition
- Speech Recognition
- Internet Search Engines

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- The goal of a classification algorithm (or classifier) is to find {*(x₁, y₁), ..., (x_l, y_k)*}, i.e., to assign a known label to every input feature vector, where
  - *x_i* ∈ **X**
  - *y_i* ∈ **Y**
  - |**X**| = *l*
  - |**Y**| = *k*
  - *l ≥ k*

Let

- *tp*: number of true positives
- *tn*: number of true negatives
- *fp*: number of false positives
- *fn*: number of false negatives

Then

- Accuracy: *a = (tp + tn) / (tp + tn + fp + fn)*
- Precision: *p = tp / (tp + fp)*
- Recall: *r = tp / (tp + fn)*
- Specificity: *s = tn / (tn + fp)*
- F1-score: *f1 = 2 * ((p * r) / (p + r))*
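These measures follow directly from the confusion-matrix counts; the counts below are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, specificity and F1-score
    from the counts of true/false positives and negatives."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Hypothetical confusion-matrix counts for a binary classifier:
classification_metrics(tp=40, tn=45, fp=5, fn=10)
# accuracy 0.85, precision ~0.889, recall 0.80, specificity 0.90, f1 ~0.842
```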

- Discovering groups and structures in the data without using known labels or structures
- Objects in a cluster are more similar to each other than to objects in other clusters

- Social network analysis
- Image segmentation
- Recommender systems
- Grouping of shopping items

- Let **X** be the input feature space
- The goal of clustering is to find *k* subsets of **X**, in such a way that
  - *C₁ ∪ ... ∪ C_k ∪ C_outliers* = **X**
  - *C_i ∩ C_j* = ∅, *i ≠ j*; 1 ≤ *i, j* ≤ *k*
  - *C_outliers* may consist of outlier instances (data anomalies)

- **Centroid models**: each cluster represented by a single mean vector
- **Connectivity models**: clusters based on distance connectivity
- **Distribution models**: clusters modeled using statistical distributions
- **Density models**: clusters as connected dense regions in the data space
- **Subspace models**
- **Group models**
- **Graph-based models**
- **Neural models**
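The centroid-model family is exemplified by the classic k-means algorithm; a minimal pure-Python sketch on hypothetical one-dimensional data:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means sketch: alternate between assigning each point to
    its nearest centroid and recomputing centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]   # two obvious groups
centroids, clusters = kmeans(points, k=2)
sorted(round(c, 1) for c in centroids)      # [1.0, 9.5]
```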

- Finding a function which models the data
- Assigns a real-valued output to each input
- Estimating the relationships among variables
- Relationship between a dependent variable ('criterion variable') and one or more independent variables ('predictors').

- Prediction
- Forecasting
- Machine learning
- Finance

- A function that maps a data item to a prediction variable
- Let **X** be the independent variables
- Let **Y** be the dependent variables
- Let **β** be the unknown parameters (scalar or vector)
- The goal of a regression model is to approximate **Y** with **X** and **β**, i.e., **Y** ≅ f(**X**, **β**)

- straight line: *y_i = β₀ + β₁x_i + ε_i*
- parabola: *y_i = β₀ + β₁x_i + β₂x_i² + ε_i*

- straight line: *y_i = β₀ + β₁x_i + ε_i*
- fitted value: *ŷ_i = β₀ + β₁x_i*
- Residual: *e_i = y_i - ŷ_i*
- Sum of squared residuals: *SSE = Σ e_i²*, where 1 ≤ *i* ≤ *n*
- The goal is to minimize SSE
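For the straight-line case, minimizing SSE has a closed-form least-squares solution; a minimal sketch on data invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x: the closed-form
    solution minimizing the sum of squared residuals SSE = sum(e_i**2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]    # exactly y = 1 + 2x, so SSE = 0
b0, b1 = fit_line(xs, ys)    # b0 = 1.0, b1 = 2.0
```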

- Support Vector Machines (SVM)
- Decision Trees
- Ensemble Methods (Random Forest)

- Supervised learning approach
- Binary classification algorithm
- Constructs a hyperplane ensuring the maximum separation between two classes

- A hyperplane of an n-dimensional space is a subspace of dimension *n-1*
- Examples
  - A hyperplane of a 2-dimensional space is a 1-dimensional line
  - A hyperplane of a 3-dimensional space is a 2-dimensional plane

- The goal of an SVM is to estimate a function *f: Rᴺ → {+1, -1}*, i.e.,
  - if *x₁, ..., x_l ∈ Rᴺ* are the *l* input data points,
  - the goal is to learn *f* from the training examples *(x₁, y₁), ..., (x_l, y_l) ∈ Rᴺ × {+1, -1}*
- Any hyperplane can be written as the set of input points **x** satisfying **w**·**x** - *b* = 0, where
  - **w** ∈ *Rᴺ* is a normal vector to the plane
  - *b ∈ R*

- A decision function is given by *f(x)* = sign(**w**·**x** - *b*)

- If the training data are linearly separable, two parallel hyperplanes can be selected
- They separate the two classes of data so that the distance between them is as large as possible.
- The hyperplanes can be given by the equations
  - **w**·**x** - *b* = 1
  - **w**·**x** - *b* = -1

- The distance between the two hyperplanes is given by 2/||**w**||

- The region between these two hyperplanes is called the margin.
- The maximum-margin hyperplane is the hyperplane that lies halfway between them.
- To prevent data points from falling into the margin, the following constraints are added:
  - **w**·**x**_i - *b* ≥ 1, if *y_i* = 1
  - **w**·**x**_i - *b* ≤ -1, if *y_i* = -1
- These can be combined as *y_i*(**w**·**x**_i - *b*) ≥ 1, for 1 ≤ *i* ≤ *n*

- The goal is to minimize ||**w**|| subject to *y_i*(**w**·**x**_i - *b*) ≥ 1 for 1 ≤ *i* ≤ *n*
- Solving for both **w** and *b* gives our classifier *f(x)* = sign(**w**·**x** - *b*)
- The maximum-margin hyperplane is completely determined by the points that lie nearest to it, called the **support vectors**
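A minimal linear-SVM sketch on invented, linearly separable data. Note it uses subgradient descent on the hinge loss with L2 regularization rather than solving the quadratic program exactly, so it only approximates the max-margin **w** and *b*:

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on hinge loss + L2 regularization:
    an approximation of the max-margin optimization above."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b)
            if margin < 1:   # point violates the margin: hinge term is active
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b -= lr * yi
            else:            # only the regularizer shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Decision function f(x) = sign(w.x - b)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) - b >= 0 else -1

# Hypothetical separable data along the diagonal:
X = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predict(w, b, [1.5, 1.5])    # 1
predict(w, b, [-1.5, -1.5])  # -1
```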

- Classification (Multi-class classification)
- Regression
- Anomaly detection

- Text and hypertext categorization
- Image classification
- Handwriting recognition

Decision trees are a powerful decision support tool that uses a tree-like model to represent decisions and their possible consequences.

- **Tree model**: Decision trees represent decisions in a tree structure, where each internal node represents a feature (or attribute), each branch represents a decision based on that feature, and each leaf represents an outcome or class.
- **Easy to interpret**: Decision trees are easy to understand and interpret, making them popular for decision-making in many fields.
- **Adaptability**: They can be used to model both classification and regression problems.
- **Use of simple rules**: Decisions are made by following simple rules based on feature values, making the interpretation of results easy even for non-experts.
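The internal-node / branch / leaf structure can be sketched as nested rules; the features, thresholds, and classes below are hypothetical, invented for illustration:

```python
# A hand-written toy tree deciding whether a sample is "soluble":
# internal nodes hold a feature and threshold, leaves hold a class.
tree = {
    "feature": "polarity", "threshold": 0.5,
    "low":  {"leaf": "insoluble"},
    "high": {"feature": "temperature", "threshold": 25.0,
             "low":  {"leaf": "insoluble"},
             "high": {"leaf": "soluble"}},
}

def decide(node, sample):
    """Follow branches by comparing feature values to thresholds
    until a leaf (the predicted class) is reached."""
    if "leaf" in node:
        return node["leaf"]
    branch = "high" if sample[node["feature"]] > node["threshold"] else "low"
    return decide(node[branch], sample)

decide(tree, {"polarity": 0.9, "temperature": 40.0})  # 'soluble'
decide(tree, {"polarity": 0.2, "temperature": 40.0})  # 'insoluble'
```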

- Classification
- Regression
- Decision Analysis: identifying strategies to reach a goal
- Operations Research

Ensemble learning, particularly decision tree forests, is a technique that combines multiple learning models to improve predictive performance compared to a single model. Decision tree forests are obtained by constructing multiple decision trees during the training phase.

- **Construction of decision trees**: During the training phase, multiple decision trees are constructed using different subsets of the data and/or features. Each tree is trained independently.
- **Majority vote**: For classification, each decision tree votes for the predicted class of a new example; the final class assigned to the example is determined by a majority vote among all the trees in the forest. For regression, the values predicted by each tree are averaged to obtain the final value.
- **Stability**: Decision tree forests are generally more stable than individual decision trees because they are less sensitive to variations in the training data.

- **Multi-class classification**: Decision tree forests are used to classify instances into one of several mutually exclusive classes. For example, classifying images into different categories such as animals, vehicles, household objects, etc.
- **Multi-label classification**: Unlike multi-class classification, multi-label classification allows an instance to be assigned to multiple classes simultaneously. For example, classifying documents where a document can be associated with multiple categories such as "science", "technology", "politics", etc.
- **Regression**: Decision tree forests can also be used for regression tasks, where the output is a continuous value rather than a discrete class. For example, predicting the price of a house based on its features.
- **Anomaly detection**: They can also be employed to detect anomalies or unusual behaviors in data. This can be useful in various domains such as detecting financial fraud or identifying failures in industrial systems.
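The majority-vote step can be sketched on its own; the spam-detection stumps below are hypothetical, each standing in for a tree trained on a different feature subset:

```python
from collections import Counter

def forest_predict(trees, sample):
    """Majority vote: each tree predicts a class; the most common wins."""
    votes = Counter(t(sample) for t in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-rule "trees", each looking at a different feature:
trees = [
    lambda s: "spam" if s["links"] > 3 else "ham",
    lambda s: "spam" if s["caps_ratio"] > 0.5 else "ham",
    lambda s: "spam" if s["length"] < 20 else "ham",
]
forest_predict(trees, {"links": 5, "caps_ratio": 0.7, "length": 100})  # 'spam'
forest_predict(trees, {"links": 0, "caps_ratio": 0.1, "length": 100})  # 'ham'
```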

**Feature selection** is a process aimed at choosing a subset of relevant features from a large number
of available features.

- This technique is widely used in fields where the number of features is high compared to the sample size, as this can lead to overfitting and high computation time.
- Feature selection is also considered a dimensionality reduction method, as it aims to reduce the number of dimensions in the feature space without losing discriminative information.
- Feature selection aims to:
- Identify the most relevant features that contribute the most to data variability or model prediction capability.
- Reduce the dimensionality of the feature space to improve the performance of machine learning models in terms of computation time and overfitting prevention.

- **Analysis of written texts**: In the analysis of written texts, feature selection is used to extract the most relevant and informative elements from textual data. This can include identifying keywords, named entities, specific linguistic patterns, or other features essential for text analysis tasks such as document classification, information extraction, or summary generation. Feature selection in this context aims to reduce the dimensionality of textual data while preserving their relevance and expressiveness for subsequent analysis tasks.
- **Analysis of DNA microarray data**: In genomics and bioinformatics, DNA microarrays generate highly dimensional datasets that often require dimensionality reduction to identify the most significant genes or DNA sequences associated with particular phenotypes, such as diseases or biological responses.

- Let *X* be the original set of *n* features, i.e., |*X*| = *n*
- Let *w_i* be the weight assigned to feature *x_i ∈ X*
- Binary feature selection assigns binary weights, whereas continuous feature selection assigns weights preserving the order of relevance.
- Let *J(X')* be an evaluation measure, defined as *J: X' ⊆ X → R*
- The feature selection problem may be defined in the three following ways:
  - Fix |*X'*| = *m < n*; find *X' ⊂ X* such that *J(X')* is maximum
  - Choose *J₀*; find *X' ⊆ X* such that *J(X') ≥ J₀*
  - Find a compromise between minimizing |*X'*| and maximizing *J(X')*
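The first formulation (fix |*X'*| = *m*, maximize *J*) can be sketched with greedy forward selection; the additive relevance measure *J* and the weights below are hypothetical:

```python
def forward_select(features, J, m):
    """Greedy forward selection: grow X' one feature at a time, always
    adding the feature that most increases J(X'), until |X'| = m."""
    selected = []
    while len(selected) < m:
        best = max((f for f in features if f not in selected),
                   key=lambda f: J(selected + [f]))
        selected.append(best)
    return selected

# Hypothetical evaluation measure: sum of per-feature relevance weights.
weights = {"x1": 0.9, "x2": 0.1, "x3": 0.5, "x4": 0.3}
J = lambda subset: sum(weights[f] for f in subset)
forward_select(list(weights), J, m=2)   # ['x1', 'x3']
```

Greedy selection is only a heuristic: it does not guarantee the globally optimal subset when *J* is not additive.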
