Data Science

John Samuel
CPE Lyon

Year: 2025-2026
Email: john.samuel@cpe.fr

Objectives

Regularities
Data Exploration
Algorithms
Feature Selection

Natural Regularities

Symmetry
Trees, fractals
Spirals
Chaos
Waves
Bubbles, foam
Tilings
Cracks
Spots, stripes

Human Creations

Buildings (Symmetry): Man-made structures with symmetry patterns. Example: Gothic cathedrals, modern skyscrapers.
Cities: Planned or organic agglomerations inhabited by humans. Example: Paris, New York.
Virtual Environment (e.g., video games): Digitally created spaces for human interaction. Example: Open worlds in video games, virtual simulations.
Human Artifacts: Objects made by humans in various fields. Example: Prehistoric tools, contemporary works of art.

Approaches

Supervised Learning: The model is trained on a labeled dataset where input examples are associated with desired outputs. The model learns to make predictions on new data based on these associations.
Unsupervised Learning: The model is exposed to unlabeled data and seeks to discover patterns, structures or intrinsic relationships in the data.
Semi-supervised Learning: A combination of the two preceding approaches, using both labeled and unlabeled data for training.
Reinforcement Learning: The model learns to make decisions by interacting with its environment. It receives rewards or penalties based on its actions, which guides its learning.

Activities

Classification
Clustering
Regression
Anomaly Detection

Formalization

Euclidean Vector:
- A Euclidean vector is a geometric object characterized by its magnitude (length) and direction. Euclidean vectors often represent data points in a multidimensional feature space.
Vector Space:
- A vector space is a collection of vectors that can be added together and multiplied by numbers (scalars).
Feature Vector (features):
- A feature vector is an n-dimensional vector that represents the features or attributes of an entity.
Feature Space:
- The feature space is the vector space associated with feature vectors. Each dimension represents one feature and helps position data points.

Concrete Examples

Euclidean vector: A displacement of 3 m east and 4 m north is represented by \((3, 4)\).
Vector space: All points of the form \((x, y)\), such as \((1, 2)\) or \((5, -1)\), belong to \(\mathbb{R}^2\).
Feature vector: A student can be represented by \((20, 1.75, 14)\) = (age, height, grade).
Feature space: The axes age, height, and grade define the space of vectors like \((19, 1.68, 12)\).

Image and Text Representation

Image: A 2x2 grayscale image with pixel values \(\begin{bmatrix}255 & 0 \\ 128 & 64\end{bmatrix}\) can be represented by the vector \((255, 0, 128, 64)\).
Text: For the vocabulary (data, science, model), the sentence "data science model data" can be represented by \((2, 1, 1)\).
Interpretation: Each component of the vector corresponds to one measurable feature.
Use: These vectors can then be compared, classified, or clustered by learning algorithms.
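
The two representations above can be reproduced in a few lines. This is only a sketch: it assumes row-by-row flattening for the image and simple whitespace tokenization for the text, neither of which is prescribed by the slides.

```python
import numpy as np

# 2x2 grayscale image from the example, flattened row by row
image = np.array([[255, 0], [128, 64]])
image_vector = image.flatten()

# Word counts over the fixed vocabulary (data, science, model)
vocabulary = ["data", "science", "model"]
sentence = "data science model data"
tokens = sentence.split()  # assumption: whitespace tokenization
text_vector = np.array([tokens.count(word) for word in vocabulary])

print(image_vector)
print(text_vector)
```

Both results are ordinary numeric vectors, so the same distance or similarity computations apply to images and to text.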

Formalization

Feature Construction¹:
- Feature construction creates new variables from existing data.
- It can improve model performance by adding useful information and reducing noise.
Construction Operators for Features
- These are functions or operations used to create new features from existing ones.
- Examples include comparisons, arithmetic operations, aggregates such as min or mean, and transformation functions.

¹ https://en.wikipedia.org/wiki/Feature_vector

Formalization: Supervised Learning

This formalization is at the core of supervised learning, where the objective is to learn from labeled examples and find a function that can accurately predict labels for new unseen data.

Let \(N\) be the number of training examples
Let \(X\) be the input feature space
Let \(Y\) be the output feature space (of labels)
Let \(\{(x_1, y_1),\dots,(x_N, y_N)\}\) be the \(N\) training examples, where
- \(x_i\) is the feature vector of the \(i\)-th training example.
- \(y_i\) is its label.

Formalization: Supervised Learning

The objective of the supervised learning algorithm is to find a function \(g: X \rightarrow Y\), where
- \(g\) is chosen from a set of candidate functions \(G\) (the hypothesis space).
Evaluation function: let \(F\) denote the space of scoring functions, where
- each \(f \in F\) is a function \(f: X \times Y \rightarrow \mathbb{R}\), and \(g\) returns the output with the highest score: \(g(x) = \arg\max_{y} f(x, y)\).

Formalization: Unsupervised Learning

Input feature space (\(X\)): This is the set of input vectors given to the algorithm. Example: each customer may be represented by \((age, income, spending)\).
Output space (\(Y\)): In unsupervised learning, \(Y\) is not a set of known labels. It may contain clusters, reduced coordinates, or latent representations.
Clustering example: From customer vectors in \(X\), the algorithm may output groups such as cluster 1, cluster 2, and cluster 3.
Dimensionality reduction example: A vector \((x_1, x_2, x_3, x_4)\) in \(X\) may be mapped to \((z_1, z_2)\) in \(Y\).

Formalization: Unsupervised Learning

Objective: Find a mapping from input space \(X\) to output space \(Y\) without predefined labels.
Possible outputs in \(Y\):
- Dimensionality reduction: Map a large vector to a smaller one, e.g. \((x_1, x_2, x_3, x_4)\rightarrow(z_1, z_2)\).
- Automatic classification of unlabeled data: Group similar points into clusters, e.g. customer groups 1, 2, and 3.
- Anomaly detection: Mark unusual points, e.g. suspicious transactions.
- Segmentation: Split data into coherent parts, e.g. regions of an image.
- Latent representation: Learn a compact hidden code that preserves useful structure.
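
As an illustration of the dimensionality-reduction case, the sketch below maps 4-dimensional vectors to 2 dimensions with PCA. PCA is one possible technique among several, and the data values are hypothetical; neither comes from the slides.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 observations (x_1, x_2, x_3, x_4)
X = np.array([
    [2.0, 1.0, 0.5, 3.0],
    [1.8, 1.1, 0.4, 2.9],
    [5.0, 4.2, 2.1, 7.5],
    [5.2, 4.0, 2.0, 7.8],
    [9.1, 7.9, 4.1, 12.0],
    [8.9, 8.1, 4.0, 12.3],
])

# Learn the mapping X -> Y without any labels
pca = PCA(n_components=2)
Z = pca.fit_transform(X)  # each row is now (z_1, z_2)

print(Z.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the original spread each new coordinate preserves.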

Raw data rarely satisfies the assumptions of ML algorithms. Data preparation transforms the raw feature matrix into a form suitable for learning.

Feature scaling: Many algorithms (SVM, k-NN, gradient descent-based methods) are sensitive to the scale of features.
- Standardization (Z-score): \(x' = \frac{x - \mu}{\sigma}\). Transforms features to zero mean and unit variance.
- Min-Max normalization: \(x' = \frac{x - x_{min}}{x_{max} - x_{min}}\). Scales features to the interval [0, 1].
- Robust scaling: Uses the median and interquartile range, robust to outliers.
Categorical encoding: Convert categorical variables to numerical representations (one-hot encoding, label encoding, target encoding).
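
The three scaling formulas can be compared on a single column. The values below are illustrative and include one extreme value on purpose; the quartiles use NumPy's default linear interpolation, which is an assumption about how the IQR is computed.

```python
import numpy as np

# Illustrative spending values with one extreme observation
x = np.array([420, 510, 760, 680, 980, 4800], dtype=float)

# Standardization: zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# Min-max normalization: values in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: center on the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standardized.round(2))
print(minmax.round(2))
print(robust.round(2))
```

With min-max scaling the extreme value squeezes all other points near 0, while robust scaling keeps the typical values spread out.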

Missing Data

Mechanisms of missingness:
- MCAR (Missing Completely At Random): absence is unrelated to the data.
- MAR (Missing At Random): absence depends on observed variables.
- MNAR (Missing Not At Random): absence depends on the missing value itself.
Imputation strategies:
- Mean / median / mode imputation: simple but can distort distributions.
- k-NN imputation: estimates missing values from the k nearest complete observations.
- Iterative / multiple imputation: models each feature with missing values as a function of the others (e.g., MICE).
Deletion: Listwise deletion (remove rows) or pairwise deletion; acceptable when the proportion of missing data is small.
Missing indicator: Add a binary flag indicating that a value was missing; can preserve information about the missingness pattern.

Outliers and Feature Construction

Outlier detection and treatment:
- Statistical methods: Z-score threshold, IQR-based fences.
- Model-based methods: Isolation Forest, Local Outlier Factor (LOF).
- Treatment options: removal, capping/winsorization, transformation (log, Box-Cox), or use of robust estimators.
Feature construction: Create new informative features from existing ones.
- Interaction terms: products or ratios of existing features.
- Polynomial features: add higher-degree terms to capture non-linear relationships.
- Domain-specific features: derived from expert knowledge (e.g., body mass index from height and weight).
- Temporal features: extract time-of-day, day-of-week, or rolling statistics from timestamps.

Running Example: Customer Campaign Data

We use a small customer dataset for a marketing task: prepare the data before predicting high-value customers.

customer_id	age	income	city	visits	annual_spend
C01	22	32000	Paris	4	420
C02	25	36000	Lyon	5	510
C03	29	missing	Paris	7	760
C04	31	54000	Marseille	6	680
C05	38	61000	missing	9	980
C06	45	58000	Lyon	3	4800

Immediate issues: different scales, missing values, and an extreme spender (C06).

Example: Why Scaling and Encoding Matter

age, income, and annual_spend live on very different scales. Without preprocessing, large-valued columns dominate.

Standardization: center each numerical feature around 0 with unit variance.
Robust scaling: use median and IQR when outliers exist.
One-hot encoding: transform city into binary columns.

customer_id	age	income	annual_spend	problem before preprocessing
C01	22	32000	420	income dominates age
C03	29	missing	760	missing income

Example: Scaling and Encoding in Pandas

The following code prepares the numerical and categorical columns before model training.

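import pandas as pd

# Assumption: df holds the customer table above, with "missing" cells stored as NaN
prep = df.copy()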
prep["income"] = prep["income"].fillna(prep["income"].median())
prep["city"] = prep["city"].fillna("Unknown")

scaled = prep[["age", "income", "annual_spend"]]
scaled = (scaled - scaled.mean()) / scaled.std(ddof=0)

city_dummies = pd.get_dummies(
    prep["city"], prefix="city", dtype=int
)

scaled contains standardized numerical columns, while city_dummies contains binary variables for each city category.

Example: Scaled and Encoded Values

After preprocessing, the numerical variables are comparable and the categorical variable becomes a set of binary columns.

customer_id	age_z	income_z	spend_z	active city column
C01	-1.24	-1.55	-0.61	city_Paris = 1
C04	-0.09	0.44	-0.44	city_Marseille = 1
C06	1.71	0.80	2.22	city_Lyon = 1

Even after standardization, customer C06 still appears unusual because the spending pattern is genuinely extreme, not just large in scale.
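
For readers who want to reproduce these values, here is a self-contained sketch. It assumes the "missing" cells of the running example are stored as NaN/None; the DataFrame construction itself is not shown in the slides.

```python
import numpy as np
import pandas as pd

# Running example, with missing cells encoded as NaN / None (assumption)
df = pd.DataFrame({
    "customer_id": ["C01", "C02", "C03", "C04", "C05", "C06"],
    "age": [22, 25, 29, 31, 38, 45],
    "income": [32000, 36000, np.nan, 54000, 61000, 58000],
    "city": ["Paris", "Lyon", "Paris", "Marseille", None, "Lyon"],
    "visits": [4, 5, 7, 6, 9, 3],
    "annual_spend": [420, 510, 760, 680, 980, 4800],
})

prep = df.copy()
prep["income"] = prep["income"].fillna(prep["income"].median())
prep["city"] = prep["city"].fillna("Unknown")

# Standardize the numerical columns (population standard deviation)
num_cols = ["age", "income", "annual_spend"]
scaled = (prep[num_cols] - prep[num_cols].mean()) / prep[num_cols].std(ddof=0)
scaled.index = prep["customer_id"]

# One binary column per city category
city_dummies = pd.get_dummies(prep["city"], prefix="city", dtype=int)

print(scaled.round(2))
print(city_dummies)
```

Rounding `scaled` to two decimals gives the z-scores listed in the table for C01, C04, and C06.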

Example: Missing Data on the Same Dataset

MCAR: a value is missing for purely random reasons.
MAR: missingness depends on another observed variable.
MNAR: missingness depends on the missing value itself.

Here we keep a missing indicator, impute income with the median, and store missing cities as Unknown.

customer_id	income	income_missing	city	city_missing
C03	54000	1	Paris	0
C05	61000	0	Unknown	1

Observed income median: 54000.

Example: Missing Data in Pandas

prep = df.copy()
prep["income_missing"] = prep["income"].isna().astype(int)
prep["city_missing"] = prep["city"].isna().astype(int)

prep["income"] = prep["income"].fillna(prep["income"].median())
prep["city"] = prep["city"].fillna("Unknown")

clean_subset = prep[[
    "customer_id", "income", "income_missing", "city", "city_missing"
]]

If the amount of missing data were tiny, we could also compare this approach with df.dropna() and discuss the loss of observations.

Example: Outliers and Feature Construction

The value 4800 is much larger than the other spending values. We cap it with the IQR rule, then derive features that are easier for a model to exploit.

q1 = prep["annual_spend"].quantile(0.25)
q3 = prep["annual_spend"].quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

prep["annual_spend_capped"] = prep["annual_spend"].clip(upper=upper)
prep["spend_per_visit"] = (
    prep["annual_spend_capped"] / prep["visits"]
).round(1)
prep["is_high_value"] = (prep["annual_spend_capped"] >= 900).astype(int)

This creates one cleaned variable and two derived variables that are more directly useful for a classifier or a clustering algorithm.

Example: Outlier Treatment Results

customer_id	annual_spend	capped	spend_per_visit	is_high_value
C05	980	980.00	108.9	1
C06	4800	1483.75	494.6	1

Here, Q1 = 552.5, Q3 = 925.0, so the IQR upper bound is 1483.75. The new features summarize behavior more directly than the raw columns.
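
The IQR fences can also be compared with a z-score threshold on the same spending column. The sketch below assumes the common cut-off |z| > 3, which is a convention rather than something fixed by the slides.

```python
import pandas as pd

spend = pd.Series(
    [420, 510, 760, 680, 980, 4800],
    index=["C01", "C02", "C03", "C04", "C05", "C06"],
    name="annual_spend",
)

# Z-score rule (assumed threshold: |z| > 3)
z = (spend - spend.mean()) / spend.std(ddof=0)
z_outliers = spend[z.abs() > 3]

# IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = spend[(spend < lower) | (spend > upper)]

print(z.round(2))
print(z_outliers)
print(iqr_outliers)
```

On six rows the z-score rule flags nothing, because the extreme value inflates the standard deviation and no single point can exceed sqrt(n - 1) ≈ 2.24 standard deviations. The IQR rule, which relies on quartiles, still isolates C06.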

Classification: Introduction

Algorithmic categorization of objects: Process of assigning classes or categories to objects via algorithms. The objective is to organize data into distinct groups to facilitate analysis and decision-making.
Class assignment: Assign a class or category to each object (or individual).
Types of classification:
- Binary classification: Assignment to two classes.
- Multi-class classification: Assignment of each object to exactly one of three or more classes.

Applications

Content filtering (e.g., spam): Identify and filter unwanted or undesirable emails. Example: Spam filtering in email inboxes.
Document classification: Organize and categorize documents based on their content. Example: Automatic classification of news articles by topic.
Handwriting recognition: Automatic interpretation of handwritten characters. Example: Recognition of numbers on bank checks.
Automatic speech recognition: Convert speech to written text automatically. Example: Voice commands for virtual assistants like Siri or Alexa.
Search engines: Rank and organize search results based on relevance. Example: Ranking web pages in search engine results.

Classification: Formal Definition

Let \(X\) be the input feature space
Let \(Y\) be the output feature space (of labels)
The objective of the classification algorithm (or classifier) is to find \(\{(x_1, y_1),\dots,(x_l, y_k)\}\), i.e., the assignment of a known label to each input feature vector, where
- \(x_i \in X\)
- \(y_i \in Y\)
- \(|X| = l\)
- \(|Y| = k\)
- \(l \geq k\)

Binary Classification

Figure: Binary classification

Linear Classifiers

Linear function assigning a score to each possible category by combining an instance's feature vector with a weight vector, using a dot product.
Formalization:
- Let X be the input feature space and x_i ∈ X
- Let β_k be a weight vector for category k
- score(x_i, k) = x_i · β_k is the score for assigning category k to instance x_i. The instance is assigned the category with the highest score.

Example: Linear Scores for Email Classification

Suppose we classify emails into two categories using two features: x = (contains_offer, exclamation_count).

email	contains_offer	exclamation_count	true class
E1	1	4	spam
E2	1	1	spam
E3	0	0	not_spam
E4	0	1	not_spam

Choose one weight vector per class: β_spam = (2.0, 0.8) and β_not_spam = (0.2, 0.1).

Example: Computing the Scores with Pandas

import numpy as np
import pandas as pd

emails = pd.DataFrame({
    "email": ["E1", "E2", "E3", "E4"],
    "contains_offer": [1, 1, 0, 0],
    "exclamation_count": [4, 1, 0, 1],
})

beta_spam = [2.0, 0.8]
beta_not_spam = [0.2, 0.1]
X = emails[["contains_offer", "exclamation_count"]]

emails["score_spam"] = X.dot(beta_spam)
emails["score_not_spam"] = X.dot(beta_not_spam)
emails["predicted_class"] = np.where(
    emails["score_spam"] > emails["score_not_spam"], "spam", "not_spam"
)

This produces one score per class for every email.

Example: Score Comparison and Prediction

email	score_spam	score_not_spam	predicted class
E1	5.2	0.6	spam
E2	2.8	0.3	spam
E3	0.0	0.0	not_spam
E4	0.8	0.1	spam

The class with the highest dot-product score is selected for each email; ties are sent to not_spam here.

Reflection: E4 is predicted as spam even though its true class is not_spam. This error is instructive: a linear classifier is simple and interpretable, but with only two features it may confuse legitimate but emphatic emails with spam.
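
The same rule generalizes to any number of classes as a matrix product followed by an argmax. In this sketch, the class order is chosen so that ties fall on not_spam, matching the tie rule stated above; the matrix name `B` is only illustrative.

```python
import numpy as np

# Feature vectors (contains_offer, exclamation_count) for E1..E4
X = np.array([[1, 4], [1, 1], [0, 0], [0, 1]], dtype=float)

# not_spam is listed first: argmax returns the first maximum, so ties go to not_spam
classes = ["not_spam", "spam"]
B = np.array([
    [0.2, 0.1],  # beta_not_spam
    [2.0, 0.8],  # beta_spam
])

scores = X @ B.T  # shape (4 emails, 2 classes), one dot product per cell
predicted = [classes[k] for k in scores.argmax(axis=1)]

print(scores)
print(predicted)
```

Adding a third class only requires one more row in `B` and one more name in `classes`.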

Evaluation

Figure: True positives and true negatives

Evaluation: confusion matrix

The confusion matrix is an essential tool for evaluating the performance of a classification system. It provides a detailed view of the predictions made by the model relative to the actual classes.

Each row of the matrix represents instances of an actual class.
Each column represents instances of a predicted class (the convention used by scikit-learn and in the examples that follow; some texts transpose it).
All correct predictions are located on the diagonal of the table.
Prediction errors are represented by values located outside the main diagonal.

Evaluation: confusion matrix

Figure: Confusion matrix for a perceptron for handwritten digits (MNIST)

Example: Binary Classification with scikit-learn

We predict whether a student passes using hours_studied and exercise_score.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

binary = pd.DataFrame({...})
Xb = binary[["hours_studied", "exercise_score"]]
yb = binary["passed"]

Example: Fitting and Matrix

model_b = LogisticRegression(random_state=0).fit(Xb, yb)
pred_b = model_b.predict(Xb)
cm_b = confusion_matrix(yb, pred_b)

The result is [[3, 1], [1, 3]]: one false positive and one false negative.

Example: Reading the Binary Confusion Matrix

	Predicted 0	Predicted 1
Actual 0	3	1
Actual 1	1	3

Diagonal values: 3 true negatives and 3 true positives.
Outside the diagonal: 1 false positive and 1 false negative.
Interpretation: the model is useful, but not perfect.
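
To quantify "useful, but not perfect", standard summary metrics can be derived from the four cells. Accuracy, precision, recall, and F1 are not defined in these slides; they are introduced here only as a hedged illustration of what the matrix allows us to compute.

```python
import numpy as np

# Rows = actual class, columns = predicted class (layout of the table above)
cm = np.array([[3, 1],
               [1, 3]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of correct predictions
precision = tp / (tp + fp)        # among predicted 1, how many are truly 1
recall = tp / (tp + fn)           # among actual 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

All four values equal 0.75 here because the matrix is symmetric; with unbalanced errors they would diverge.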

Example: Multiclass Confusion Matrix

The same idea extends to three classes. Here we classify students into Arts, Literature, and Science.

multi = pd.DataFrame({
    "math_score": [2, 3, 4, 6, 7, 8, 4, 5, 6],
    "writing_score": [8, 7, 6, 5, 4, 3, 5, 6, 4],
    "track": ["Literature", "Literature", "Literature", "Science", "Science", "Science", "Arts", "Arts", "Arts"],
})

Xm = multi[["math_score", "writing_score"]]
ym = multi["track"]
model_m = LogisticRegression(random_state=0, max_iter=1000).fit(Xm, ym)
pred_m = model_m.predict(Xm)
cm_m = confusion_matrix(ym, pred_m, labels=["Arts", "Literature", "Science"])

The matrix is [[2, 0, 1], [0, 3, 0], [0, 0, 3]]: most predictions are correct, but one Arts student is confused with Science.

Multi-class Classification

Figure: Multi-class classification

Multi-class Classification [Aly 2005]

Transformation into binary classification:
- One-vs-rest approach (One-vs-all): Each class is treated as a positive class and all others as a negative class.
- One-vs-one approach: A binary classifier is built for each pair of classes.
Extension of binary classification:
- Neural networks: Adapting architectures to predict multiple classes simultaneously.
- k-nearest neighbors: Extension of the algorithm to handle multiple classes.
Hierarchical classification: Organizing classes in a tree structure for finer and more precise classification.
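
The two binary-reduction strategies differ in how many classifiers they train: K for one-vs-rest and K(K-1)/2 for one-vs-one. The sketch below checks this with scikit-learn's wrappers on hypothetical 2D data with four classes; the data and the choice of logistic regression as base classifier are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Hypothetical data: four well-separated classes, three points each
X = np.array([
    [0, 0], [0, 1], [1, 0],
    [5, 5], [5, 6], [6, 5],
    [0, 5], [0, 6], [1, 5],
    [5, 0], [5, 1], [6, 0],
], dtype=float)
y = np.repeat(["A", "B", "C", "D"], 3)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)

print(len(ovr.estimators_))  # one binary classifier per class
print(len(ovo.estimators_))  # one binary classifier per pair of classes
```

One-vs-one trains more models, but each one sees only the data of two classes.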

Introduction

The Support Vector Machine (SVM) is a supervised learning method. SVM seeks to find the best decision boundary that optimizes class separation, allowing accurate classification even in complex data spaces.

It is primarily used for binary classification, although it can be extended to multiclass classification problems.
The main objective of SVM is to build a hyperplane that maximizes the separation margin between the two classes. The hyperplane is the decision boundary that separates the data into two distinct classes.

Hyperplane

The hyperplane in n-dimensional space is an (n-1)-dimensional subspace that separates the data into two classes.

In a two-dimensional space, the hyperplane is a one-dimensional line that separates the data into two regions.
In a three-dimensional space, the hyperplane is a two-dimensional plane that divides the space into two distinct parts.

Formal Definition

SVM learns a classifier \(f: \mathbb{R}^N \rightarrow \{+1,-1\}\) from labeled examples \((x_i, y_i)\).
A separating hyperplane is written as \(w \cdot x - b = 0\).
\(w \in \mathbb{R}^N\) is the normal vector and \(b \in \mathbb{R}\) is the bias.
The decision function is \(f(x) = \operatorname{sign}(w \cdot x - b)\).

Figure: Normal vector (surface normal illustration)

Example: A Small Linear SVM Dataset

point	x1	x2	class
P1	1	7	-1
P2	2	6	-1
P3	2	5	-1
P4	3	6	-1
P5	4	3	+1
P6	5	2	+1
P7	5	3	+1
P8	6	2	+1

Example: Why This Dataset Works

Separable classes: The points labeled -1 and +1 can be split by one straight line.
Margin idea: SVM does not choose any separating line; it chooses the one with the largest safety margin.
Expected outcome: The learned separator should lie between the two groups, with the closest points defining the margin.

Example: Training a Linear SVM with scikit-learn

import pandas as pd
from sklearn.svm import SVC

svm_df = pd.DataFrame({
    "x1": [1, 2, 2, 3, 4, 5, 5, 6],
    "x2": [7, 6, 5, 6, 3, 2, 3, 2],
    "label": [-1, -1, -1, -1, 1, 1, 1, 1],
})

X = svm_df[["x1", "x2"]]
y = svm_df["label"]
svm_model = SVC(kernel="linear", C=1.0).fit(X, y)

w = svm_model.coef_[0]
b = -svm_model.intercept_[0]
support = svm_model.support_vectors_

Interpretation: Here the learned line is close to x2 = x1 + 1 (w ≈ (0.5, -0.5), b ≈ -0.5), with support vectors at (2, 5) and (4, 3).
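
To connect the fitted model with the formal definition, the decision function sign(w · x - b) can be evaluated by hand and the margin width 2 / ||w|| computed. This is a sketch with NumPy arrays instead of the DataFrame; the variable names `manual` and `margin_width` are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Points P1..P8 and their classes
X = np.array([[1, 7], [2, 6], [2, 5], [3, 6],
              [4, 3], [5, 2], [5, 3], [6, 2]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

svm_model = SVC(kernel="linear", C=1.0).fit(X, y)
w = svm_model.coef_[0]
b = -svm_model.intercept_[0]  # slide convention: w . x - b = 0

# Decision function applied manually to every training point
manual = np.sign(X @ w - b).astype(int)

# Distance between the two margin boundaries
margin_width = 2 / np.linalg.norm(w)

print(w, b)
print(manual)
print(margin_width)
print(svm_model.support_vectors_)
```

The margin width matches the distance between the closest points of the two classes, which is what "largest safety margin" means geometrically.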

Data mining

Classification: SVM can be used for binary classification as well as multiclass classification, where it seeks to separate data into several distinct categories by constructing hyperplanes in a multidimensional space.
Regression: SVM can also be applied to regression problems, where it seeks to predict a continuous value rather than classifying data into discrete categories.
Anomaly Detection: SVM can be used to detect anomalies in data by identifying data points that are significantly different from the rest of the dataset, making it a valuable tool for detecting fraud or errors in data.

Applications

Text and hypertext categorization: SVMs are widely used to automatically classify text documents into different categories, such as classifying emails as spam or non-spam, categorizing news articles, etc.
Image classification: SVM is effective for classifying images into predefined categories, such as classifying medical images into different diseases, face recognition, object detection in images, etc.
Handwriting recognition: SVM is also used in handwriting recognition systems to identify handwritten characters or words and transcribe them into digital text.

The k-nearest neighbors (kNN) method and k-means clustering are two important techniques in machine learning and data mining:

K-Nearest Neighbors (kNN): This is a supervised learning algorithm used for classification and regression. The main idea behind kNN is to find the k training samples closest to the test data point and predict the class label based on the majority class among these neighbors. For regression, the prediction is the average of the target values of the k nearest neighbors.
K-means clustering: This is an unsupervised method for partitioning data into k distinct groups. The algorithm works by repeating two steps: first, it assigns each data point to the group whose centroid is closest, then it updates the centroids by computing the average of all points assigned to each group. These steps are repeated until convergence is reached and the centroids no longer change significantly.
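
The two alternating steps of k-means can be written directly in NumPy. This is a minimal sketch on hypothetical points, with hand-picked initial centroids; real implementations use smarter initialization and handle empty clusters, which this version does not.

```python
import numpy as np

# Hypothetical 2D points forming two visible groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
k = 2
centroids = X[[0, 3]].copy()  # assumption: start from two chosen points

for _ in range(100):
    # Step 1: assign each point to the group with the closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 2: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Stop once the centroids no longer change significantly
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
print(centroids)
```

No labels are used anywhere, which is the difference from k-NN: the groups emerge from the distances alone.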

K-Nearest Neighbors Method

The k-nearest neighbors (k-NN) method is a supervised learning algorithm used for both classification and regression.

k-NN Classification: In this case, the output is a class membership. To classify a new object, the k-NN algorithm examines the k closest examples in the training set and determines the majority class among these neighbors. More precisely, each neighbor contributes a vote, and the most frequent class among the k neighbors is assigned to the object being classified. This is an example of majority voting among the nearest neighbors.
k-NN Regression: Unlike classification, in k-NN regression the output is a property value of the object. To predict the value of a new observation, the k-NN algorithm computes the mean (or median) of the target values of the k nearest neighbors. Therefore, instead of voting for a majority class, the target values of the k neighbors are used to predict the target value of the object being estimated.

Example: Customer Dataset

customer	age	visits	segment	spend_per_visit
C1	23	2	Occasional	35
C2	25	3	Occasional	40
C3	28	4	Regular	55
C4	35	7	Regular	75
C5	38	8	Premium	95
C6	42	9	Premium	110

Example: Query Point

We study a new customer with (age=30, visits=6) using k = 3.

The algorithm will compare this point with the stored customers and use the three closest neighbors to make a prediction.

Example: k-NN Classification with scikit-learn

from sklearn.neighbors import KNeighborsClassifier

customers = pd.DataFrame({...})
X = customers[["age", "visits"]]
y_class = customers["segment"]

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X, y_class)

new_customer = pd.DataFrame({"age": [30], "visits": [6]})
predicted_segment = knn_clf.predict(new_customer)[0]

Nearest neighbors: C3, C4, and C2.
Neighbor labels: Regular, Regular, Occasional.
Prediction: the majority vote gives Regular.

Example: k-NN Regression with scikit-learn

from sklearn.neighbors import KNeighborsRegressor

y_reg = customers["spend_per_visit"]
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X, y_reg)

predicted_spend = knn_reg.predict(new_customer)[0]

The same neighbors (C3, C4, C2) have spend values 55, 75, and 40.

The regression prediction is their average: (55 + 75 + 40) / 3 = 56.7.

This is the key difference: classification votes on labels, while regression averages numerical targets.
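
The mechanism behind both predictions fits in a few lines of NumPy. The sketch below assumes plain Euclidean distance on the unscaled features, as in the example; the variable names are illustrative.

```python
from collections import Counter

import numpy as np

# Customer dataset from the table (age, visits)
ids = ["C1", "C2", "C3", "C4", "C5", "C6"]
X = np.array([[23, 2], [25, 3], [28, 4], [35, 7], [38, 8], [42, 9]], dtype=float)
segment = ["Occasional", "Occasional", "Regular", "Regular", "Premium", "Premium"]
spend = np.array([35, 40, 55, 75, 95, 110], dtype=float)

query = np.array([30, 6], dtype=float)
k = 3

# Euclidean distance from the query to every stored customer
distances = np.linalg.norm(X - query, axis=1)
nearest = np.argsort(distances)[:k]
neighbor_ids = [ids[i] for i in nearest]

# Classification: majority vote among the k neighbors
predicted_segment = Counter(segment[i] for i in nearest).most_common(1)[0][0]

# Regression: mean of the neighbors' target values
predicted_spend = spend[nearest].mean()

print(neighbor_ids, predicted_segment, round(predicted_spend, 1))
```

Because age spans a wider range than visits, it weighs more in the distance; standardizing the features first would balance their influence.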

K-Nearest Neighbors Method

Consider labeled 2D training points. We want to classify the new point \((4, 3)\).

points = pd.DataFrame({...})
new_point = pd.DataFrame({"x": [4], "y": [3]})

K-Nearest Neighbors Method

Point	x coordinate	y coordinate	Class
A	2	3	Red
B	4	4	Red
C	3	2	Blue
D	6	5	Red
E	5	3	Blue

K-Nearest Neighbors Method

from sklearn.neighbors import KNeighborsClassifier

points["distance"] = ((points["x"] - 4)**2 + (points["y"] - 3)**2)**0.5
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points[["x", "y"]], points["label"])
prediction = knn.predict(new_point)[0]

Choice of k: Here we choose k = 3.
Distance computation: We compare the new point with every training point.

K-Nearest Neighbors Method

Distances: A = 2, B = 1, C = 1.41, D = 2.83, E = 1.
Nearest neighbors: B, E, and C.
Prediction: the model predicts Blue.
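
A self-contained version of this example, built from the table of points, is sketched below. The column name `label` follows the fitting code; the DataFrame construction is an assumption, since it is elided in the slides.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Points A..E from the table (column name "label" assumed from the fitting code)
points = pd.DataFrame({
    "x": [2, 4, 3, 6, 5],
    "y": [3, 4, 2, 5, 3],
    "label": ["Red", "Red", "Blue", "Red", "Blue"],
}, index=["A", "B", "C", "D", "E"])
new_point = pd.DataFrame({"x": [4], "y": [3]})

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points[["x", "y"]], points["label"])

# kneighbors returns the distances and row positions of the k closest points
dist, idx = knn.kneighbors(new_point)
neighbors = points.index[idx[0]].tolist()
prediction = knn.predict(new_point)[0]

print(neighbors, dist[0].round(2), prediction)
```

B and E are both at distance 1, so their relative order may vary, but the set of three neighbors and the vote are unaffected.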

Applications

Regression: Using the nearest neighbors method for regression, one can estimate the value of a target variable for a new observation by taking the mean of the target values of the k nearest neighbors. For example, in k-NN regression, one can predict the price of a house by taking the average of the prices of the k nearest houses in terms of similar features (area, number of rooms, etc.).
Anomaly Detection: The nearest neighbors method can also be used to detect anomalies in data. Observations that are very different from their nearest neighbors can be considered anomalies. For example, in health monitoring, unusual vital sign values compared to the nearest neighbors may indicate a potential health problem and thus be considered anomalies.

Naive Bayes classification is a simple probabilistic classification method based on the application of Bayes' theorem with a strong independence assumption between features.

Bayes' Theorem: Naive Bayes classification relies on Bayes' theorem, which is a formula for computing conditional probabilities. It allows computing the probability that an observation belongs to a given class using the probabilities of the features given each class.

\[ P(C \mid x)=\frac{P(x \mid C)P(C)}{P(x)} \]

P(C) is the prior, P(x | C) the likelihood, and P(C | x) the posterior.

Naive idea: Given the class C, the features are treated as independent.
Consequence: The joint likelihood becomes a product of simpler one-feature terms.

\[ P(x_1,\dots,x_d \mid C)\approx \prod_{i=1}^{d} P(x_i \mid C) \]

Prediction rule: \(\hat{C}=\arg\max_C P(C)\prod_i P(x_i \mid C)\).

Example: Spam Detection Features

We encode short email messages as binary features and train a Bernoulli Naive Bayes classifier.

email	contains_free	contains_win	many_caps	label
E1	1	1	1	spam
E2	1	0	1	spam
E3	0	0	0	ham
E4	0	0	1	ham
E5	1	1	0	spam

The independence assumption means the model combines the evidence from each feature as if they were conditionally independent given the class.

Example: Training with pandas and scikit-learn

import pandas as pd
from sklearn.naive_bayes import BernoulliNB

emails = pd.DataFrame({
    "contains_free": [1, 1, 0, 0, 1],
    "contains_win": [1, 0, 0, 0, 1],
    "many_caps": [1, 1, 0, 1, 0],
    "label": ["spam", "spam", "ham", "ham", "spam"],
})

X = emails[["contains_free", "contains_win", "many_caps"]]
y = emails["label"]
model = BernoulliNB()
model.fit(X, y)

Example: Predicting New Messages

new_messages = pd.DataFrame({
    "contains_free": [1, 0],
    "contains_win": [0, 1],
    "many_caps": [1, 0],
}, index=["M1", "M2"])

pred = model.predict(new_messages)
proba = model.predict_proba(new_messages)

Example: Predictions and Posterior Intuition

message	features	predicted class	comment
M1	(1, 0, 1)	spam	`free` and capitals push the posterior toward spam
M2	(0, 1, 0)	ham	`win` favors spam, but the absence of `free` outweighs it (posterior ≈ 0.57 for ham with default smoothing)

Naive Bayes estimates \(P(class \mid features)\) and chooses the largest posterior. Even with a tiny dataset, we can explain the prediction by looking at how each observed feature changes the class probability.
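
The posterior can be recomputed by hand from the prediction rule and compared with scikit-learn. This sketch assumes Laplace smoothing with alpha = 1, which is the BernoulliNB default; the smoothing formula is not stated in the slides.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Training emails E1..E5: (contains_free, contains_win, many_caps)
X = np.array([[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 1], [1, 1, 0]])
y = np.array(["spam", "spam", "ham", "ham", "spam"])
new = np.array([[1, 0, 1], [0, 1, 0]])  # M1, M2

alpha = 1.0  # assumption: Laplace smoothing, the BernoulliNB default
classes = np.unique(y)
posteriors = []
for c in classes:
    Xc = X[y == c]
    prior = len(Xc) / len(X)                               # P(C)
    p = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)   # P(x_i = 1 | C)
    # Product of one-feature terms: p if the feature is present, 1 - p otherwise
    likelihood = np.prod(np.where(new == 1, p, 1 - p), axis=1)
    posteriors.append(prior * likelihood)

posteriors = np.array(posteriors).T
posteriors /= posteriors.sum(axis=1, keepdims=True)  # divide by P(x)
manual_pred = classes[posteriors.argmax(axis=1)]

sk_model = BernoulliNB(alpha=alpha).fit(X, y)
print(classes)
print(posteriors.round(3))
print(sk_model.predict_proba(new).round(3))
```

Printing `posteriors` shows how close the two classes are for M2, which is typical when the training set is this small.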

Decision trees represent decisions and consequences as a tree.

Tree model: internal nodes test features, branches represent decisions, and leaves give outcomes.
Easy to interpret: rules are readable and easy to explain.
Adaptable: works for both classification and regression.
Simple rules: predictions follow straightforward feature-based tests.

Example: Loan Approval Rules

applicant	income_k	debt_ratio	late_payments	approved
A1	32	0.52	4	no
A2	58	0.18	0	yes
A3	45	0.30	1	yes
A4	28	0.61	3	no
A5	64	0.25	0	yes

The tree tests candidate splits such as debt_ratio < 0.4 or late_payments < 2 and keeps the split that best separates yes from no.

Example: Fitting and Reading the Rules

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

loan = pd.DataFrame({
    "income_k": [32, 58, 45, 28, 64],
    "debt_ratio": [0.52, 0.18, 0.30, 0.61, 0.25],
    "late_payments": [4, 0, 1, 3, 0],
    "approved": ["no", "yes", "yes", "no", "yes"],
})

X = loan[["income_k", "debt_ratio", "late_payments"]]
y = loan["approved"]
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

At each node, the classifier checks several thresholds and keeps the split with the largest impurity reduction.

Example: Tree Output

rules = export_text(tree, feature_names=list(X.columns))

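# possible simplified output
|--- debt_ratio <= 0.41
|   |--- class: yes
|--- debt_ratio >  0.41
|   |--- class: no

On this tiny dataset one split already yields two pure leaves, so the tree stops at depth 1 even though max_depth=2. income_k and late_payments separate the classes equally well, so the fitted tree may choose one of them at the root instead. In larger trees, the root split is the most informative and later splits refine only the remaining mixed groups.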

Build recursively: Start at the root, choose the best split, then repeat on each child node.
Stop growing: Stop when a node is pure, too small, or the allowed depth is reached.
Leaf meaning: Each leaf stores the majority class of the training examples that reached it.
Prediction: A new sample follows the yes/no tests from root to leaf and inherits that leaf label.

Example of building: at the root, a split like debt_ratio ≤ 0.41 is useful because it puts A2, A3, and A5 on one side, all three labeled yes, while A1 and A4 go to the other side, both labeled no. Child nodes are split again only if they still contain mixed labels, which is not the case here.
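
The "impurity reduction" used to rank candidate splits can be computed by hand with the Gini criterion. In this sketch the second threshold (income_k ≤ 50) is a hypothetical weaker candidate added for comparison; the helper names are illustrative.

```python
import numpy as np

# Loan data for applicants A1..A5
debt_ratio = np.array([0.52, 0.18, 0.30, 0.61, 0.25])
income_k = np.array([32, 58, 45, 28, 64])
approved = np.array(["no", "yes", "yes", "no", "yes"])

def gini(labels):
    """Gini impurity 1 - sum(p_c^2) of a non-empty set of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(feature, threshold, labels):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

good = impurity_reduction(debt_ratio, 0.41, approved)
weak = impurity_reduction(income_k, 50, approved)  # hypothetical candidate

print(round(good, 3), round(weak, 3))
```

The split on debt_ratio removes all impurity (both children are pure), whereas the weaker candidate leaves one mixed child, so the tree prefers the first.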

In the context of decision trees, data is generally represented as vectors where each element of the vector corresponds to a feature or independent variable, and the dependent variable is the target that one seeks to predict or classify.

Data as vectors: Each observation or example in the dataset is represented as a vector, where each component of the vector corresponds to a feature or explanatory variable. For example, if we examine a dataset on bank loans, the features could include income, loan amount, number of years of professional experience, etc.
The data is available in the form \[(\textbf{x},Y) = (x_1, x_2, x_3, ..., x_k, Y)\]
The vector \(\textbf{x}\) is composed of the following features \(x_1, x_2, x_3, ...\)
\(Y\) is the dependent variable that may depend on \(\textbf{x}\)

Applications

Classification: Decision trees are commonly used for classification, where the objective is to categorize observations into predefined classes or categories based on their features. For example, in the medical field, decision trees can be used to classify patients based on their diagnosis.
Regression: Decision trees can also be used for regression, where the objective is to predict a continuous numerical value based on features. For example, in finance, decision trees can be used to predict the price of a house based on its features.
Decision analysis: Decision trees can help identify the most effective strategies or sequences of actions to achieve a specific objective. For example, in business planning, decision trees can be used to determine the best decisions to make in a complex decision-making process.

Ensemble learning, in particular random forests, is a technique that combines multiple learning models to improve predictive performance compared to a single model. Random forests are obtained by building multiple decision trees during the training phase.

Building decision trees: During the training phase, multiple decision trees are built using different subsets of data and/or features. Each tree is trained independently.

Example: Default Prediction with Many Small Trees

Suppose we predict whether a customer will default using spending and repayment behaviour.

customer	income_k	balance_k	missed_payments	default
C1	70	5	0	no
C2	42	16	2	yes
C3	90	7	0	no
C4	38	18	3	yes
C5	55	10	1	no

Each tree sees a slightly different bootstrap sample, so the final model relies on many slightly different decision rules instead of one fragile tree.
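
Bootstrap sampling itself can be sketched in a few lines. The example below assumes plain sampling with replacement over the five customer identifiers from the table; the function name `bootstrap_sample` and the seed are illustrative choices, not part of any library.

```python
import random

customers = ["C1", "C2", "C3", "C4", "C5"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

samples = [bootstrap_sample(customers, rng) for _ in range(3)]
for tree_id, sample in enumerate(samples, start=1):
    left_out = sorted(set(customers) - set(sample))
    print(f"tree {tree_id}: sample={sample}, not drawn={left_out}")
```

Because each sample repeats some customers and omits others, each tree is trained on slightly different data.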

Example: Training a Random Forest

import pandas as pd

credit = pd.DataFrame({
    "income_k": [70, 42, 90, 38, 55, 48, 80, 36],
    "balance_k": [5, 16, 7, 18, 10, 14, 6, 20],
    "missed_payments": [0, 2, 0, 3, 1, 2, 0, 4],
    "default": ["no", "yes", "no", "yes", "no", "yes", "no", "yes"],
})

X = credit[["income_k", "balance_k", "missed_payments"]]
y = credit["default"]

This slide prepares the training data: X contains the input features, and y contains the label to predict, namely whether the customer defaults.

Example: Feature Importance

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)

importance = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

After fitting many trees, the forest estimates which features were most useful for separating default=yes from default=no.

Example: Aggregated Prediction

For a new customer with income_k=50, balance_k=15, missed_payments=2:

A single tree may overreact to one split and predict differently depending on the training sample.
The random forest combines many tree votes and usually produces a more stable class such as yes for default risk in this example.
The feature-importance ranking often places missed_payments and balance_k above income_k, which matches domain intuition.

This is the practical value of ensemble learning: lower variance and more robust predictions.
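
The aggregation can be inspected directly. The sketch below refits the same forest on the eight-row credit table and tallies what each tree predicts for the new customer. It assumes scikit-learn's `estimators_` and `classes_` attributes; note that scikit-learn averages the trees' class probabilities rather than counting hard votes, so the tally is an approximation of what `predict` does.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

credit = pd.DataFrame({
    "income_k": [70, 42, 90, 38, 55, 48, 80, 36],
    "balance_k": [5, 16, 7, 18, 10, 14, 6, 20],
    "missed_payments": [0, 2, 0, 3, 1, 2, 0, 4],
    "default": ["no", "yes", "no", "yes", "no", "yes", "no", "yes"],
})
X = credit[["income_k", "balance_k", "missed_payments"]]
y = credit["default"]

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)

new_customer = pd.DataFrame([{"income_k": 50, "balance_k": 15, "missed_payments": 2}])

# Each fitted tree predicts an index into forest.classes_.
tree_votes = np.array(
    [tree.predict(new_customer.to_numpy())[0] for tree in forest.estimators_]
).astype(int)
vote_counts = pd.Series(forest.classes_[tree_votes]).value_counts()

forest_label = forest.predict(new_customer)[0]
forest_proba = dict(zip(forest.classes_, forest.predict_proba(new_customer)[0]))
print(vote_counts.to_dict(), forest_label, forest_proba)
```

Individual trees may disagree, while the forest reports a single label together with an averaged class probability.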

4. Regression

Process aimed at finding a mathematical function that models relationships between variables. The objective is to estimate relationships and predict values of one variable based on other variables.
Modeling function: Find a function that best represents the observed data with the objective of predicting or estimating values of a target variable based on explanatory variables.
Relationship analysis: Examine the relationship between a target variable and one or more explanatory variables. Methods: Identify trends, correlations and dependencies between variables.
Value assignment: Assign real values to each input to model real-world phenomena.

Applications

Weather: predict temperature, rainfall, or wind from past observations.
Sales: estimate future sales from trends, seasonality, and campaigns.
Machine learning: use regression to predict continuous outputs.
Finance: estimate returns, prices, or credit risk from financial indicators.

Formal Definition

Regression is represented by a function that maps a data element to a prediction variable.
It can be expressed in terms of independent variables \(X\), dependent variables \(Y\) and unknown parameters \(β\).
The regression model aims to approximate the relationship between \(X\) and \(Y\) with a function \(f(X, β)\), where \(β\) represents the model parameters.
The objective is to obtain an approximation \(Y ≈ f(X, β)\) that minimizes the deviation between predicted values and observed values.

Linear Regression

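Linear regression is a mathematical model that represents a relationship between an independent variable \(x_i\) and a dependent variable \(y_i\) that is linear in the coefficients. The fitted curve is a straight line for simple linear regression, or a parabola when a squared term \(x_i^2\) is added (quadratic polynomial regression, which is still linear in the coefficients \(β\)). Multiple linear regression, with several independent variables, fits a plane or hyperplane rather than a parabola.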

Straight line: \(y_i = β_0 + β_1x_i + ε_i\) where \(β_0\) and \(β_1\) are the regression coefficients, \(x_i\) is the independent variable, and \(ε_i\) is the residual error.
Parabola: \(y_i = β_0 + β_1x_i + β_2x_i^2 +ε_i\) where \(β_0\), \(β_1\), and \(β_2\) are the regression coefficients for each term, \(x_i\) is the independent variable, and \(ε_i\) is the residual error.

Linear Regression

To minimize the error:

Computing predictions: \( ŷ_i = β_0 + β_{1}x_i \)
Computing residuals: \(e_i = ŷ_i - y_i\)
Computing the sum of squared residuals (SSE) to evaluate the model fit: \(SSE = \sum_{i=1}^{n} e_i^2\), where \(n\) is the number of observations

The objective is to minimize SSE to obtain the best approximation of the linear relationship between variables.
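
The three steps above (predictions, residuals, sum of squared residuals) can be sketched for a straight line. The data below are hypothetical, and the coefficients are obtained with the standard closed-form least-squares formulas, which the slide does not state explicitly.

```python
# Hypothetical data, chosen only to make the arithmetic easy to follow.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least-squares estimates for a straight line.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
beta_1 = s_xy / s_xx
beta_0 = y_mean - beta_1 * x_mean

predictions = [beta_0 + beta_1 * xi for xi in x]
residuals = [y_hat - yi for y_hat, yi in zip(predictions, y)]  # same sign convention as the slide
sse = sum(e ** 2 for e in residuals)

print(beta_0, beta_1, sse)
```

For these values the fitted line has slope 1.9 and intercept 0, and the sum of squared residuals is about 0.70; any other line gives a larger value.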

Stochastic gradient descent (SGD) is an iterative method that updates the model parameters to reduce a loss function, often one training example at a time.

For regression, the objective is often the average loss over the dataset: \[ J(w)=\frac{1}{n}\sum_{i=1}^{n} L(w;x_i,y_i) \]

\(J(w)\) is the overall objective (the average loss over the \(n\) training examples), and \(L(w;x_i,y_i)\) is the error on one training example.

Example: Predicting Apartment Rent

flat	surface_m2	distance_km	rooms	rent_eur
F1	28	7.5	1	820
F2	35	5.0	2	980
F3	52	3.0	2	1350
F4	65	2.0	3	1680
F5	42	6.5	2	1100

We model rent with a linear prediction such as \(\hat{y}=w_1x_1+w_2x_2+w_3x_3+b\), where the features are surface, distance, and rooms.

A common loss for one apartment is the squared error \(L(w;x_i,y_i)=(\hat{y}_i-y_i)^2\), so large mistakes receive a larger penalty.

Example: Scaling then Fitting SGDRegressor

import pandas as pd
from sklearn.preprocessing import StandardScaler

rent = pd.DataFrame({
    "surface_m2": [28, 35, 52, 65, 42, 58, 31, 47],
    "distance_km": [7.5, 5.0, 3.0, 2.0, 6.5, 4.0, 8.0, 3.5],
    "rooms": [1, 2, 2, 3, 2, 3, 1, 2],
    "rent_eur": [820, 980, 1350, 1680, 1100, 1490, 860, 1280],
})

X = rent[["surface_m2", "distance_km", "rooms"]]
y = rent["rent_eur"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Now SGD uses one training example at a time to reduce the loss: \(w \leftarrow w - \eta \nabla L(w;x_i,y_i)\).

Here, \(\nabla L\) tells us which direction increases error, and \(\eta\) controls how large the correction step is. Scaling keeps one feature from dominating the update.
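
A single update can be worked through by hand. The sketch below assumes the squared-error loss \(L=(\hat{y}-y)^2\) and uses one hypothetical scaled example with three features; the values, the step size, and the function names are illustrative and not taken from the rent table.

```python
def sgd_step(w, b, x, y, eta):
    """One SGD update for squared error L = (y_hat - y)^2 on a single example."""
    y_hat = sum(wj * xj for wj, xj in zip(w, x)) + b
    error = y_hat - y
    # Gradient of (y_hat - y)^2: 2 * error * x_j for each weight, 2 * error for the bias.
    new_w = [wj - eta * 2 * error * xj for wj, xj in zip(w, x)]
    new_b = b - eta * 2 * error
    return new_w, new_b

def squared_error(w, b, x, y):
    """Squared error of the linear prediction on one example."""
    y_hat = sum(wj * xj for wj, xj in zip(w, x)) + b
    return (y_hat - y) ** 2

# Hypothetical scaled example (three features) and target.
x_i, y_i = [1.0, -0.5, 0.0], 2.0
w, b, eta = [0.0, 0.0, 0.0], 0.0, 0.1

loss_before = squared_error(w, b, x_i, y_i)
w, b = sgd_step(w, b, x_i, y_i, eta)
loss_after = squared_error(w, b, x_i, y_i)
print(w, b, loss_before, loss_after)
```

After one step the loss on this example falls from 4.0 to about 1.21. The third weight does not move because its feature value is 0.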

Example: SGD Fit

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=20000, eta0=0.01, learning_rate="adaptive", random_state=0)
sgd.fit(X_scaled, y)

One pass idea: Read one apartment, compute its error, and slightly adjust the weights.
Adaptive learning rate: If progress slows, smaller updates reduce oscillation.
Why SGD is useful: It is fast on large datasets because it avoids solving the full problem at once.

Example: Prediction and Interpretation

For a flat with surface_m2=50, distance_km=4, and rooms=2, we transform the features with the same scaler and call sgd.predict(...).
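A positive coefficient for surface_m2 means the model predicts higher rent for larger flats, while a negative coefficient for distance_km means predicted rent falls as the distance from the center grows. Because the features were standardized, each coefficient is expressed per standard deviation of its feature, not per m² or per km.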
The prediction is a weighted sum of the scaled inputs, updated over many iterations to reduce the regression loss.

5. Clustering

Data partitioning is the process of dividing a dataset into different homogeneous subsets or groups.
Objective: Group data sharing similar characteristics into each subset.

Formal Definition

Let \(X\) be the input feature space
The objective of clustering is to find \(k\) subsets of \(X\), such that

\[ C_1 \cup \dots \cup C_k \cup C_{outliers} = X \] and

\[ C_i \cap C_j = \emptyset, \quad i \neq j; \quad 1 \le i, j \le k \]

\(C_{outliers}\) may consist of extreme cases (data anomalies)
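
The two conditions of the definition can be checked mechanically with sets. The point names and the clustering below are hypothetical, used only to illustrate the coverage and disjointness conditions.

```python
# Hypothetical clustering of seven points; the names are illustrative.
X = {"p1", "p2", "p3", "p4", "p5", "p6", "p7"}
clusters = [{"p1", "p2"}, {"p3", "p4"}, {"p5", "p6"}]
outliers = {"p7"}

# Condition 1: clusters plus outliers cover the whole input space.
covers_X = set().union(*clusters, outliers) == X

# Condition 2: no point belongs to two different clusters.
pairwise_disjoint = all(
    clusters[i].isdisjoint(clusters[j])
    for i in range(len(clusters))
    for j in range(i + 1, len(clusters))
)
print(covers_X, pairwise_disjoint)
```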

Clustering Models

Centroid models: Each group is represented by a single central vector (centroid), and each point joins the nearest center. Example: K-Means, K-Medians.
Connectivity models: Clusters are built from the distances between points, so nearby objects are merged step by step. Example: Hierarchical Agglomerative Clustering.
Distribution models: Clusters are modeled using statistical distributions, so each cluster is viewed as coming from a probability law. Example: Gaussian Mixtures.
Density models: Clusters are defined by connected dense regions in the data space, so sparse regions naturally separate groups. Example: DBSCAN, OPTICS.

K-Means Example

Brief Algorithm

Choose k: Decide how many clusters to build.
Initialize: Place k initial centroids.
Assign: Send each point to the nearest centroid.
Update: Replace each centroid by its cluster mean.
Repeat: Continue until assignments barely change.

Example: Customer Segments

customer	annual_spend	visits_per_month
C1	220	2
C2	260	3
C3	980	12
C4	1020	11
C5	540	6
C6	560	7

The data are unlabeled. We want the algorithm to discover groups such as low-, medium-, and high-value customers.
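
The assign/update loop from the brief algorithm can be sketched without any library on the six customers above. This is a minimal version under two assumptions: the initial centroids are fixed at C1, C5 and C3 instead of being chosen at random, and distances are computed on the raw, unscaled values, so annual_spend dominates.

```python
import math

points = {
    "C1": (220, 2), "C2": (260, 3), "C3": (980, 12),
    "C4": (1020, 11), "C5": (540, 6), "C6": (560, 7),
}

def k_means(data, centroids, max_iter=100):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in data
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Assumption: start from C1, C5 and C3 instead of a random initialization.
initial = [points["C1"], points["C5"], points["C3"]]
labels, centroids = k_means(list(points.values()), list(initial))
print(dict(zip(points, labels)), centroids)
```

With this initialization the loop converges after one update to three pairs: C1 and C2 (low), C5 and C6 (medium), C3 and C4 (high).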

Example: How K-Means Works

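from sklearn.cluster import KMeans

segments = pd.DataFrame({
    "annual_spend": [220, 260, 980, 1020, 540, 560, 2400],
    "visits_per_month": [2, 3, 12, 11, 6, 7, 1],
}, index=["C1", "C2", "C3", "C4", "C5", "C6", "C7"])

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
# Select the feature columns explicitly so that re-running the cell never clusters on a label column.
segments["kmeans_cluster"] = kmeans.fit_predict(segments[["annual_spend", "visits_per_month"]])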

Start with k: Here we ask for 3 clusters before learning begins.
Assignment step: Each customer goes to the cluster with the nearest centroid.
Update step: Each centroid is replaced by the mean of its current members.
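Result here: Customers C1 to C6 suggest low-spend, medium-spend, and high-spend groups. Because the outlier C7 (annual_spend=2400) is included and the features are unscaled, K-Means is likely to give C7 a cluster of its own and merge two of the other groups.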

Example: How DBSCAN Works

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

dbscan = DBSCAN(eps=180, min_samples=2)
segments["dbscan_cluster"] = dbscan.fit_predict(segments[["annual_spend", "visits_per_month"]])

sil = silhouette_score(segments[["annual_spend", "visits_per_month"]], segments["kmeans_cluster"])

Density rule: A point starts a cluster only if enough neighbors lie within eps.
Cluster growth: Neighboring dense points are linked together into one region.
Noise handling: Isolated points are not forced into a cluster and receive label -1.
Result here: C7 is far from the dense customer groups, so DBSCAN can treat it as noise.
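
The density rule can be verified by counting neighbors by hand. The sketch below reuses the seven customers with eps=180 and min_samples=2, and assumes (as scikit-learn documents for DBSCAN) that a point counts itself toward min_samples. It only computes neighbor counts; it is not a full DBSCAN implementation.

```python
import math

points = {
    "C1": (220, 2), "C2": (260, 3), "C3": (980, 12), "C4": (1020, 11),
    "C5": (540, 6), "C6": (560, 7), "C7": (2400, 1),
}
eps, min_samples = 180, 2

# Number of other points within eps of each customer.
neighbor_counts = {
    name: sum(1 for other, q in points.items()
              if other != name and math.dist(p, q) <= eps)
    for name, p in points.items()
}
# Assumption: a point counts itself toward min_samples.
core_points = [n for n, k in neighbor_counts.items() if k + 1 >= min_samples]
# A point with no neighbor at all is neither core nor reachable from a core point.
noise = [n for n, k in neighbor_counts.items() if k == 0]
print(neighbor_counts, core_points, noise)
```

C1 to C6 each have exactly one neighbor within eps, so they form three dense pairs, while C7 has none and is left as noise.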

Example: Evaluating the Clusters

K-Means output: Every point must belong to one cluster, even if a point is unusually far from the others.
Silhouette score: A larger value means points are closer to their own cluster than to neighboring clusters.
DBSCAN output: A label of -1 means "noise" rather than "force into a cluster".
Practical contrast: Use K-Means for compact groups, and DBSCAN when outliers or irregular shapes matter.

Anomaly detection, also known as outlier detection, involves identifying unusual or divergent data in a dataset. Here are some common approaches to detecting anomalies:

Supervised detection: The model is trained on a labeled dataset with examples of anomalies and normal data. The model is then used to predict whether new data is abnormal or normal based on these labels.
Unsupervised detection: Unlike supervised detection, this approach does not use labels in the training dataset. Instead, it identifies anomalies by examining the statistical characteristics of the data and searching for data points that differ significantly from the rest of the dataset.
Semi-supervised detection: This approach combines elements of both previous methods. It uses both labeled and unlabeled data to train the model. This can be useful when only a few anomalies are available for training, but the dataset is primarily unlabeled.

Applications

Intrusion detection: Identify malicious or unauthorized activities in computer networks to protect systems against cyberattacks.
Fraud detection: Spot suspicious financial transactions or fraudulent activities in online transactions, credit cards, or insurance.
System health monitoring: Continuously monitor the health of computer systems, industrial machines or medical equipment to detect potential failures or breakdowns.
Event detection in sensor networks: Identify unusual events or abnormal behaviors in environmental sensor networks, such as air quality monitoring or intrusion detection in security systems.
Abuse detection in information systems: Identify users or activities that abuse or violate security policies in information systems, online applications or social media platforms.

Characteristics

Unexpected spikes: Anomalies can manifest as unexpected spikes or bursts in data. For example, a sudden increase in web traffic may indicate a distributed denial-of-service (DDoS) attack in network traffic monitoring, and an abnormal increase in financial transactions may signal fraud.

The characteristics of data vary depending on the application domain and the specific types of anomalies sought. Identifying unusual patterns or aberrant behaviors in data can help detect anomalies and take appropriate measures to manage them.

Formalization

Let \(Y\) be a set of measurements. This represents the data or observed variables that are monitored to detect anomalies.
Let \(P_Y(y)\) be a statistical model for the distribution of \(Y\) under "normal" conditions. Normal data is typically modeled by a statistical distribution such as the normal (Gaussian) distribution. This model is used to estimate the probability that the observed data is normal.
Let \(T\) be a user-defined threshold. This is a threshold value set by the user that determines at what probability a measurement is considered abnormal. Measurements whose estimated probability is below this threshold are considered anomalies.
A measurement \(x\) is an outlier if \(P_Y(x) < T\). This condition specifies that if the probability of a measurement is below the defined threshold, that measurement is considered isolated or abnormal relative to the other observations.
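
The rule \(P_Y(x) < T\) can be sketched with a Gaussian model from the standard library. The measurements and the threshold value below are hypothetical, and for a continuous model \(P_Y\) is a probability density rather than a probability, so \(T\) is a threshold on the density.

```python
from statistics import NormalDist

# Hypothetical measurements assumed to come from normal operating conditions.
normal_measurements = [18, 22, 25, 31, 27, 19]
model = NormalDist.from_samples(normal_measurements)  # plays the role of P_Y

T = 1e-3  # user-defined threshold on the density (an assumed value)

def is_outlier(x):
    """Flag x when its density under the fitted model falls below T."""
    return model.pdf(x) < T

for x in (24, 410):
    print(x, model.pdf(x), is_outlier(x))
```

A value near the bulk of the measurements (24) stays above the threshold, while an extreme value (410) has a density close to zero and is flagged.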

Anomaly detection algorithms identify rare observations that differ significantly from the majority of the data.

Isolation Forest idea: Instead of modeling normality directly, it tries to isolate each point using many random splits.
Why anomalies stand out: Rare or extreme points are separated after only a few splits, while normal points need deeper paths.
Anomaly score: The average path length across many random trees is converted into a score; shorter paths mean more suspicious observations.
Main advantage: It works well without labeled anomalies and handles high-dimensional data efficiently.
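
The isolation idea can be illustrated in one dimension. The sketch below is a simplified illustration, not the Isolation Forest algorithm itself: it repeatedly draws a random split between the minimum and maximum of the current subset, keeps the side containing the target, and counts the splits until the target is alone. The values, function names, and number of trials are illustrative assumptions.

```python
import random

def isolation_depth(values, target, rng):
    """Number of random splits needed until target is alone in its partition."""
    subset, depth = list(values), 0
    while len(subset) > 1:
        split = rng.uniform(min(subset), max(subset))
        # Keep only the side of the split that still contains the target.
        subset = [v for v in subset if (v < split) == (target < split)]
        depth += 1
    return depth

def average_depth(values, target, trials=500, seed=0):
    """Average isolation depth over many independent random partitions."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Illustrative one-dimensional values with two extreme points.
amounts = [18, 22, 25, 31, 410, 27, 19, 520]
depth_typical = average_depth(amounts, 25)
depth_extreme = average_depth(amounts, 520)
print(depth_typical, depth_extreme)
```

The extreme value is isolated in fewer splits on average than a value inside the dense group, which is the signal the anomaly score is built on.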

Example: Suspicious Transactions

from sklearn.ensemble import IsolationForest

transactions = pd.DataFrame({
    "amount_eur": [18, 22, 25, 31, 410, 27, 19, 520],
    "login_gap_min": [120, 95, 80, 110, 2, 105, 130, 1],
}, index=["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"])

features = transactions.copy()

iso = IsolationForest(contamination=0.25, random_state=0)
transactions["anomaly_flag"] = iso.fit_predict(features)
transactions["anomaly_score"] = iso.decision_function(features)
transactions.sort_values("anomaly_score")

Features: Each transaction is represented by amount and time since login.
Model step: fit_predict builds many random trees and labels likely anomalies as -1.
Interpretation: Large amount plus very short login gap makes T5 and T8 easier to isolate than typical transactions.
Score meaning: Lower scores indicate more suspicious behavior.

Example: Reading the Output

transaction	flag	interpretation
T5	-1	large amount and very short delay after login
T8	-1	another isolated point with extreme behaviour
T1-T4, T6-T7	1	closer to normal transaction patterns

Isolation Forest builds many random trees. Unusual points get isolated after only a few splits, so they have shorter average path lengths and lower scores.

Feature selection is a process aimed at choosing a subset of relevant features from a large number of available features.

This technique is widely used in domains where the number of features is large relative to the size of the data sample, as this can lead to overfitting problems and high computation time.
Feature selection is also considered a dimensionality reduction method, as it aims to reduce the number of dimensions of the feature space while losing as little discriminative information as possible.
Feature selection aims to:
- Identify the most relevant features that contribute most to the variability of the data or the predictive capability of the model.
- Reduce the dimensionality of the feature space to improve the performance of machine learning models in terms of computation time and prevention of overfitting.

Example: Selecting the Most Useful Predictors

from sklearn.feature_selection import SelectKBest, f_classif

customers = pd.DataFrame({
    "visits": [2, 3, 9, 10, 4, 8, 1, 7],
    "avg_basket": [18, 22, 65, 72, 25, 60, 15, 58],
    "support_calls": [5, 4, 1, 0, 3, 1, 6, 1],
    "coupon_clicks": [0, 1, 6, 5, 1, 4, 0, 5],
    "loyal": [0, 0, 1, 1, 0, 1, 0, 1],
})

X = customers.drop(columns="loyal")
y = customers["loyal"]
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

The idea is to keep only the variables that best separate loyal and non-loyal customers.

Example: Interpreting the Selection

If selected_features = ["avg_basket", "coupon_clicks"], these are the two strongest variables under the chosen scoring rule.
We can now train a classifier on fewer columns, which often reduces noise, computation time, and overfitting risk.
Feature selection does not create new variables; it keeps the most informative subset of the original ones.
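
The scoring rule behind this selection can be reproduced by hand. The sketch below assumes that f_classif scores each feature with the one-way ANOVA F-statistic and recomputes it in plain Python on the same eight customers; the function name `anova_f` is illustrative.

```python
from statistics import mean

customers = {
    "visits": [2, 3, 9, 10, 4, 8, 1, 7],
    "avg_basket": [18, 22, 65, 72, 25, 60, 15, 58],
    "support_calls": [5, 4, 1, 0, 3, 1, 6, 1],
    "coupon_clicks": [0, 1, 6, 5, 1, 4, 0, 5],
}
loyal = [0, 0, 1, 1, 0, 1, 0, 1]

def anova_f(values, labels):
    """One-way ANOVA F-statistic: between-class variance over within-class variance."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f_scores = {name: anova_f(col, loyal) for name, col in customers.items()}
top_two = sorted(f_scores, key=f_scores.get, reverse=True)[:2]
print(f_scores, top_two)
```

On this table avg_basket and coupon_clicks obtain the two highest F-scores (about 131 and 81), ahead of visits (about 43) and support_calls (about 29).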

A typical ML project moves from question to monitored model.

Problem: Translate the scientific question into an ML task.
Collect + explore: Gather data and inspect quality, trends, and bias.
Prepare: Clean data and build the feature matrix.
Train: Choose models and fit them on training data.
Evaluate: Measure performance on held-out data.
Iterate: Refine features, data, or model choices.
Deploy + monitor: Use the model and track drift over time.

Translating a scientific question into a well-defined ML problem is a critical first step. The choice of problem type determines which algorithms, metrics, and evaluation strategies are appropriate.

Classification: Predict a discrete class label for each input instance.
- Binary classification: two classes (e.g., diseased vs. healthy).
- Multiclass classification: more than two mutually exclusive classes.
- Multilabel classification: multiple classes can be assigned simultaneously.
Regression: Predict a continuous numerical value (e.g., temperature, concentration, price).
Anomaly Detection: Identify observations that deviate significantly from the expected pattern (e.g., sensor faults, fraud, rare events).

Key Considerations

Data leakage: Occurs when information from the test set (or future data) is inadvertently used during training. It leads to overly optimistic evaluation results and poor generalization in production.
- Common causes: scaling/normalizing with statistics computed on the full dataset, including target-derived features, temporal data split errors.
- Prevention: fit all preprocessing transformers only on training data and apply them to validation/test data without refitting.
Class imbalance: When one class is much more frequent than others, standard accuracy can be misleading. Strategies include oversampling (SMOTE), undersampling, cost-sensitive learning, and appropriate metrics (F1, AUC-ROC).
Target definition: Ambiguity in defining the prediction target can undermine the entire workflow; the target variable must reflect the scientific question precisely.
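
The prevention rule for data leakage (fit preprocessing on training data only) can be sketched with standardization. The feature values below are hypothetical, and the test value is deliberately extreme to make the effect visible.

```python
from statistics import mean, pstdev

# Hypothetical feature values; the test value is deliberately extreme.
train = [10.0, 12.0, 14.0, 16.0]
test = [100.0]

# Correct: the scaling statistics come from the training data only.
mu, sigma = mean(train), pstdev(train)
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

# Leaky variant: the test value shifts the mean used to scale the training data.
leaky_mu = mean(train + test)

print(mu, leaky_mu, test_scaled)
```

With the leaky variant the training features are centered on 30.4 instead of 13, so information about the test point has already influenced the training data.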

Proper data partitioning is essential to obtain unbiased estimates of model performance and to avoid overfitting.

Training set: Used to fit the model parameters. Typically 60–80% of the data.
Validation set: Used to tune hyperparameters and select among competing models. Kept separate from the test set to avoid selection bias.
Test set: Used only once for the final evaluation of the chosen model. Must remain untouched during model development. Typically 10–20% of the data.
Stratification: When splitting, stratified sampling ensures each partition preserves the original class distribution, which is especially important under class imbalance.
Temporal splits: For time-series data, splits must respect chronological order to prevent future data leaking into the training set.
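
A stratified train/validation/test split can be sketched as follows. The 60/20/20 fractions, the seed, and the function name `stratified_split` are assumptions for illustration; the labels are a hypothetical imbalanced dataset with 15 negatives and 5 positives.

```python
import random
from collections import Counter

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split indices into train/validation/test while preserving class proportions."""
    rng = random.Random(seed)
    parts = ([], [], [])
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        parts[0].extend(indices[:n_train])
        parts[1].extend(indices[n_train:n_train + n_val])
        parts[2].extend(indices[n_train + n_val:])
    return parts

labels = [0] * 15 + [1] * 5  # hypothetical imbalanced labels
train_idx, val_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, val_counts, test_counts)
```

Each partition keeps the original 3:1 class ratio. This sketch does not handle temporal data, where the split must follow chronological order instead of shuffling.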

Cross-validation is a resampling technique used to evaluate a model's generalization performance. It is especially useful when the dataset is too small to set aside a separate validation set.

k-Fold Cross-Validation: The dataset is divided into k folds of approximately equal size. The model is trained on k-1 folds and evaluated on the remaining fold; this process is repeated k times so that each fold serves once as the evaluation fold. The final performance estimate is the average over the k runs. Common choices: k = 5 or k = 10.
Stratified k-Fold: Each fold preserves the class distribution; recommended for classification tasks with imbalanced classes.
Leave-One-Out (LOO): A special case of k-fold where k equals the number of samples. Provides an almost unbiased estimate but requires one model fit per sample, which is computationally expensive for large datasets.
Nested Cross-Validation: An outer loop estimates generalization performance while an inner loop tunes hyperparameters, so the performance estimate is not biased by the hyperparameter search.
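
The fold construction behind k-fold cross-validation can be sketched without a library. The version below assumes contiguous folds without shuffling, and the function name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample appears in exactly one validation fold, and the model for that fold never sees it during training.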

This section introduced advanced topics in the data science and machine learning workflow.

ML Workflow: A structured pipeline from problem definition to model deployment ensures reproducibility and sound generalization.
Problem Formulation: Carefully mapping scientific questions to ML task types (classification, regression, anomaly detection) and guarding against data leakage.
Data Partitioning: Train/validation/test splits and cross-validation provide reliable performance estimates and avoid overfitting.
Data Preparation: Scaling, missing-value imputation, outlier treatment, and feature engineering are prerequisites for effective learning.
Clustering: K-Means, Hierarchical Clustering, and DBSCAN offer complementary approaches to discovering structure in unlabeled data.
Anomaly Detection: One-Class SVM and Isolation Forest enable the detection of rare, atypical observations in a variety of scientific domains.
