# Data Mining

**John Samuel**

CPE Lyon

**Year**: 2017-2018

**Email**: john(dot)samuel(at)cpe(dot)fr

- Understanding Patterns
- Data mining tasks
- Algorithms for data mining
- Feature Selection

- Symmetry
- Trees, Fractals
- Spirals
- Chaos
- Waves
- Bubbles, Foam
- Tessellations
- Cracks
- Spots, stripes

- Buildings (Symmetry)
- Cities
- Virtual environments (e.g., video games)
- Human artifacts

- Repetition
- Fractals
- Julia set: *f(z) = z² + c*
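A minimal sketch of the escape-time test for the formula above: a point belongs (approximately) to the filled Julia set when repeated application of *f(z) = z² + c* keeps it bounded. The iteration bound and the sample parameter *c = -1* are illustrative choices, not from the slides.

```python
def in_julia_set(z, c, max_iter=100, bound=2.0):
    """Iterate f(z) = z**2 + c; if z never escapes the bound,
    z is (approximately) in the filled Julia set for parameter c."""
    for _ in range(max_iter):
        if abs(z) > bound:
            return False
        z = z * z + c
    return True

# c = -1 gives the 'basilica' Julia set; the origin cycles 0 -> -1 -> 0 (bounded).
print(in_julia_set(0j, -1))      # True
print(in_julia_set(2 + 0j, -1))  # False: escapes immediately
```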

- Pattern recognition
- Knowledge discovery in databases
- Data mining
- Machine learning

- Goal is to detect patterns and regularities in data
- Approaches
  - **Supervised learning**: labeled training data is available
  - **Unsupervised learning**: no labeled training data is available
  - **Semi-supervised learning**: a small set of labeled training data and a large amount of unlabeled data

- **Euclidean vector**: geometric object with magnitude and direction
- **Vector space**: collection of vectors that can be added together and multiplied by numbers
- **Feature vector**: n-dimensional vector of features describing an object
- **Feature space**: the vector space associated with the feature vectors

- **Images**: pixel values
- **Texts**: frequency of occurrence of textual phrases

- **Feature construction**: construction of new features from already available features^1
- **Feature construction operators**: equality operators, arithmetic operators, array operators (min, max, average, etc.)

- Let **Year of Birth** and **Year of Death** be two existing features.
- A new feature called **Age** = Year of Death - Year of Birth
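The example above can be sketched in a few lines: a new Age feature is constructed from two existing features with an arithmetic operator. The records below are illustrative.

```python
# A minimal sketch of feature construction: deriving a new Age feature
# from two existing features using an arithmetic operator.
people = [
    {"year_of_birth": 1912, "year_of_death": 1954},
    {"year_of_birth": 1815, "year_of_death": 1852},
]
for p in people:
    p["age"] = p["year_of_death"] - p["year_of_birth"]  # constructed feature

print([p["age"] for p in people])  # [42, 37]
```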

- https://en.wikipedia.org/wiki/Feature_vector

- Let **N** be the number of training examples
- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- Let {*(x_1, y_1), ..., (x_N, y_N)*} be the **N** training examples, where *x_i* is the feature vector of the *i*-th training example and *y_i* is its label

- The goal of a supervised learning algorithm is to find *g:* **X** → **Y**, where *g* is one of the functions from the set of possible functions *G* (the hypothesis space)
- **Scoring function**: let *F* denote the space of scoring functions *f:* **X** × **Y** → **R**, such that *g* returns the highest-scoring label, i.e., *g(x) = argmax_y f(x, y)*

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- The goal of an unsupervised learning algorithm is to find a mapping **X** → **Y**

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- Let {*(x_1, y_1), ..., (x_l, y_l)*} be the set of **l** labeled training examples
- Let {*x_{l+1}, ..., x_{l+u}*} be the set of **u** unlabeled feature vectors of **X**
- The goal of a semi-supervised learning algorithm is either
  - **Transductive learning**: find the correct labels for {*x_{l+1}, ..., x_{l+u}*}, OR
  - **Inductive learning**: find the correct mapping **X** → **Y**

- Classification
- Clustering
- Regression
- Sequence Labeling
- Association Rules
- Anomaly Detection
- Summarization

- Generalizing known structure to apply to new data
- Identifying the set of categories to which an object belongs
- Binary vs. Multiclass classification

- Spam vs Non-spam
- Document classification
- Handwriting recognition
- Speech Recognition
- Internet Search Engines

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- The goal of a classification algorithm (or classifier) is to find {*(x_1, y_1), ..., (x_l, y_l)*}, i.e., to assign a known label to every input feature vector, where
  - x_i ∈ **X**, y_i ∈ **Y**
  - |**X**| = l, |**Y**| = k
  - l >= k

- Classifying algorithm
- Two types of classifiers:
  - **Binary classifiers**: assign an object to one of two classes
  - **Multiclass classifiers**: assign an object to one of several classes

- A linear function assigns a score to each possible category by combining the feature vector of an instance with a vector of weights, using a dot product.
- Formalization:
  - Let **X** be the input feature space and **x**_i ∈ **X**
  - Let **β**_k be the vector of weights for category *k*
  - *score(**x**_i, k)* = **x**_i · **β**_k is the score for assigning category *k* to instance **x**_i. The category that gives the highest score is assigned as the category of the instance.
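The scoring rule above can be sketched directly: compute the dot product of the feature vector with each category's weight vector and return the argmax. The weight vectors and category names below are assumed for illustration.

```python
# Sketch of a linear multiclass classifier: score(x, k) = x . beta_k,
# predicting the category with the highest score.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(x, betas):
    """betas maps category -> weight vector; return the argmax category."""
    return max(betas, key=lambda k: dot(x, betas[k]))

betas = {"spam": [2.0, -1.0], "ham": [-1.0, 1.5]}  # assumed weights
print(predict([3.0, 0.5], betas))   # spam: 5.5 vs ham: -2.25 -> "spam"
print(predict([-1.0, 2.0], betas))  # spam: -4.0 vs ham: 4.0  -> "ham"
```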

- Discovering groups and structures in the data without using known structures
- Objects in a cluster are more similar to each other than to objects in other clusters

- Social network analysis
- Image segmentation
- Recommender systems
- Grouping of shopping items

- Let **X** be the input feature space
- The goal of clustering is to find k subsets of **X**, in such a way that
  - *C_1* ∪ ... ∪ *C_k* ∪ *C_outliers* = **X**
  - *C_i* ∩ *C_j* = ∅, i ≠ j; 1 <= i, j <= k
  - *C_outliers* may consist of outlier instances (data anomalies)

- **Centroid models**: cluster represented by a single mean vector
- **Connectivity models**: clusters based on distance connectivity
- **Distribution models**: clusters modeled using statistical distributions
- **Density models**: clusters as connected dense regions in the data space
- **Subspace models**
- **Group models**
- **Graph-based models**
- **Neural models**
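A centroid model such as k-means can be sketched in a few lines: points are assigned to the nearest mean vector, and each mean is then recomputed. The 1-D points and initial centroids below are assumed for brevity.

```python
# Minimal k-means sketch (a centroid model): each cluster is represented
# by its mean vector. 1-D points keep the example short.
def kmeans_1d(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:  # assign each point to its nearest centroid
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
print(centroids)  # roughly [1.0, 9.0]
```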

- Finding a function which models the data
- Assigns a real-valued output to each input
- Estimating the relationships among variables
- Relationship between a dependent variable ('criterion variable') and one or more independent variables ('predictors').

- Prediction
- Forecasting
- Machine learning
- Finance

- A function that maps a data item to a prediction variable
- Let **X** be the independent variables
- Let **Y** be the dependent variables
- Let **β** be the unknown parameters (scalar or vector)
- The goal of a regression model is to approximate **Y** with **X** and **β**, i.e., **Y** ≅ f(**X**, **β**)

- Straight line: *y_i = β_0 + β_1·x_i + ε_i*
- Parabola: *y_i = β_0 + β_1·x_i + β_2·x_i² + ε_i*

- Straight line: *y_i = β_0 + β_1·x_i + ε_i*, with fitted values *ŷ_i = β_0 + β_1·x_i*
- Residual: *e_i = y_i - ŷ_i*
- Sum of squared residuals: SSE = Σ e_i², where 1 <= i <= n
- The goal is to minimize SSE

- Assigning a class to each member of a sequence of values

- Part of speech tagging
- Linguistic translation
- Video analysis
- Handwriting recognition
- Information extraction

- Let **X** be the input feature space
- Let **Y** be the output feature space (of labels)
- Let *〈x_1, ..., x_T〉* be a sequence of length *T*, where x_i ∈ **X**
- The goal of sequence labeling is to generate a corresponding sequence *〈y_1, ..., y_T〉* of labels, where y_j ∈ **Y**

- Searches for relationships between variables

- Web usage mining
- Intrusion detection
- Affinity analysis

- Let **I** = {*i_1, ..., i_n*} be a set of *n* binary attributes called items
- Let **T** = {*t_1, ..., t_m*} be a set of *m* transactions called the database
- The goal of association rule learning is to find rules **X** ⇒ **Y**, where
  - **X**, **Y** ⊆ **I**
  - **X** is the antecedent
  - **Y** is the consequent

- Support: how frequently an itemset appears in the database
  - *supp(**X**) = |{t ∈ **T** ; **X** ⊆ t}| / |**T**|*

- Confidence: how frequently the rule has been found to be true
  - *conf(**X** ⇒ **Y**) = supp(**X** ∪ **Y**) / supp(**X**)*

- Lift: the ratio of the observed support to that expected if X and Y were independent
  - *lift(**X** ⇒ **Y**) = supp(**X** ∪ **Y**) / (supp(**X**) × supp(**Y**))*

- Example: {**bread**, **butter**} ⇒ {**milk**}
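The three measures can be computed directly for the rule above; the five-transaction database below is assumed for illustration.

```python
# Computing support, confidence, and lift for {bread, butter} => {milk}
# on a tiny assumed transaction database.
T = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter"},
]

def supp(itemset):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in T if itemset <= t) / len(T)

X, Y = {"bread", "butter"}, {"milk"}
confidence = supp(X | Y) / supp(X)
lift = supp(X | Y) / (supp(X) * supp(Y))
print(supp(X | Y), confidence, lift)  # 0.4, ~0.667, ~1.111
```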

- Identification of unusual data records
- Approaches
- Unsupervised anomaly detection
- Supervised anomaly detection
- Semi-supervised anomaly detection

- Intrusion detection
- Fraud detection
- Remove anomalous data
- System health monitoring
- Event detection in sensor networks
- Misuse detection

- Unexpected bursts

- Let **Y** be a set of measurements
- Let *P_Y(y)* be a statistical model for the distribution of **Y** under 'normal' conditions
- Let **T** be a user-defined threshold
- A measurement *x* is an outlier if *P_Y(x)* < **T**
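The threshold test above can be sketched with a Gaussian as the 'normal' model; the parameters and threshold below are assumed, not prescribed by the slides.

```python
# Threshold-based outlier test: model 'normal' data with a Gaussian and
# flag x as an outlier when its density falls below a threshold T.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def is_outlier(x, mu, sigma, T):
    return gaussian_pdf(x, mu, sigma) < T

print(is_outlier(0.1, mu=0.0, sigma=1.0, T=0.01))  # False: near the mean
print(is_outlier(5.0, mu=0.0, sigma=1.0, T=0.01))  # True: five sigma away
```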

- Providing a more compact representation of the data set
- Report Generation

- Keyphrase extraction
- Document summarization
- Search engines
- Image summarization
- Video summarization: Finding important events from videos

- Let **D** = {*D_1, ..., D_k*} be a document collection of k documents
- A document *D = {t_1, ..., t_m}* consists of m textual units (words, sentences, paragraphs, etc.)
- Let **D** = {*t_1, ..., t_n*} be the complete set of all textual units from all documents, where *t_i* ∈ **D** if and only if *∃ D_j* such that *t_i ∈ D_j*

- *S ⊆* **D** constitutes a summary
- Two scoring functions:
  - *Rel(i)*: relevance of textual unit *t_i* in the summary
  - *Red(i, j)*: redundancy between two textual units *t_i, t_j*

- Scoring for a summary S:
  - S = *argmax_{S ⊆ D} s(S)*
  - *s(S) = Σ_{t_i ∈ S} Rel(i) - Σ_{t_i, t_j ∈ S, i < j} Red(i, j)*
  - subject to *Σ_{t_i ∈ S} l(i) <= K*, where
    - *l(i)* is the length of textual unit i
    - *K* is the fixed length of the summary
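A common way to approximate this objective is greedy selection: repeatedly add the unit with the best relevance-minus-redundancy gain while the length budget allows. The units, scores, and lengths below are assumed toy values.

```python
# Greedy sketch of extractive summarization: pick textual units that
# maximize relevance minus redundancy under a length budget K.
def summarize(units, rel, red, length, K):
    S = []
    while True:
        candidates = [u for u in units if u not in S
                      and sum(length[v] for v in S) + length[u] <= K]
        if not candidates:
            return S
        def gain(u):  # Rel(u) minus redundancy with already-selected units
            return rel[u] - sum(red.get(frozenset({u, v}), 0.0) for v in S)
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            return S
        S.append(best)

units = ["s1", "s2", "s3"]
rel = {"s1": 3.0, "s2": 2.5, "s3": 1.0}
red = {frozenset({"s1", "s2"}): 2.0}   # s1 and s2 overlap heavily
length = {"s1": 5, "s2": 5, "s3": 5}
summary = summarize(units, rel, red, length, K=10)
print(summary)  # ['s1', 's3']
```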

- Finding a representative subset of the entire set of textual units
- Approaches
  - **Extraction**: selecting a subset of existing words, phrases, or sentences in the original text without any modification
  - **Abstraction**: building an internal semantic representation and then using natural language generation techniques

- Approaches
  - **Generic summarization**: obtaining a generic summary
  - **Query-relevant summarization**: summary relevant to a query

- Support Vector Machines (SVM)
- Stochastic Gradient Descent (SGD)
- Nearest-Neighbours
- Naive Bayes
- Decision Trees
- Ensemble Methods (Random Forest)

- Supervised learning approach
- Binary classification algorithm
- Constructs a hyperplane ensuring the maximum separation between two classes

- A hyperplane of an n-dimensional space is a subspace of dimension *n-1*
- Examples
  - A hyperplane of a 2-dimensional space is a 1-dimensional line
  - A hyperplane of a 3-dimensional space is a 2-dimensional plane

- The goal of an SVM is to estimate a function *f: R^N → {+1, -1}*, i.e.,
  - if *x_1, ..., x_l ∈ R^N* are the *l* input data points,
  - the goal is to find *(x_1, y_1), ..., (x_l, y_l) ∈ R^N × {+1, -1}*

- If
- Any hyperplane can be written as the set of input points **x** satisfying **w** · **x** - b = 0, where
  - **w** ∈ R^N is a normal vector to the hyperplane
  - *b ∈ R*

- A decision function is given by *f(x) = sign(**w** · **x** - b)*

- If the training data are linearly separable, two parallel hyperplanes can be selected
- They separate the two classes of data, so that the distance between them is as large as possible
- The hyperplanes can be given by the equations
  - **w** · **x** - b = 1
  - **w** · **x** - b = -1

- The distance between the two hyperplanes is *2 / ||**w**||*
- The region between these two hyperplanes is called the margin
- The maximum-margin hyperplane is the hyperplane that lies halfway between them

- In order to prevent data points from falling into the margin, the following constraints are added
  - **w** · **x**_i - b >= 1, if *y_i* = 1
  - **w** · **x**_i - b <= -1, if *y_i* = -1

- Together: *y_i(**w** · **x**_i - b) >= 1*, for 1 <= i <= n
- The goal is to minimize ||**w**|| subject to *y_i(**w** · **x**_i - b) >= 1* for 1 <= i <= n
- Solving for both **w** and *b* gives our classifier *f(x) = sign(**w** · **x** - b)*
- The max-margin hyperplane is completely determined by the points that lie nearest to it, called the **support vectors**
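Once **w** and *b* are known, classifying a new point is a one-line dot product. The weight vector and bias below are assumed values, not the result of actually solving the max-margin problem.

```python
# Sketch of the SVM decision rule f(x) = sign(w . x - b) after training.
def svm_predict(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1

w, b = [1.0, -2.0], 0.5  # assumed (not trained) solution
print(svm_predict([3.0, 1.0], w, b))  # 3 - 2 - 0.5 = 0.5  -> +1
print(svm_predict([0.0, 1.0], w, b))  # 0 - 2 - 0.5 = -2.5 -> -1
```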

- Classification (Multi-class classification)
- Regression
- Anomaly detection

- Text and hypertext categorization
- Image classification
- Handwriting recognition

- A stochastic approximation of the gradient descent optimization
- Iterative method for minimizing an objective function that is written as a sum of differentiable functions.
- Finds minima or maxima by iteration

- Multi-variable generalization of the derivative.
- Gives slope of the tangent of the graph of a function
- Gradient points in the direction of the greatest rate of increase of a function
- Magnitude of gradient is the slope of the graph in that direction

- Derivatives defined on functions of single variable
- Gradient defined on functions of multiple variables
- Gradient is a vector-valued function (range is a vector)
- Derivative is a scalar-valued function

- First-order iterative optimization algorithm for finding the minimum of a function.
- Finding a local minimum involves taking steps proportional to the negative of the gradient of the function at the current point.

- Let's take the problem of minimizing an objective function *Q(w) = (1/n) Σ Q_i(w)*, 1 <= i <= n
  - the summand function *Q_i* is associated with the *i*-th observation in the data set

- Update rule: *w = w - η · ∇Q(w)*

- Choose an initial vector of parameters *w* and learning rate η
- Repeat until an approximate minimum is obtained:
  - Randomly shuffle the examples in the training set
  - *w = w - η · ∇Q_i(w)*, for i = 1, ..., n
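The loop above can be sketched for least-squares linear regression, where *Q_i(w) = (w·x_i - y_i)²* and so *∇Q_i(w) = 2(w·x_i - y_i)x_i*. The one-parameter model, learning rate, and toy data are assumptions for illustration.

```python
# SGD sketch for one-parameter least-squares regression:
# Q_i(w) = (w*x_i - y_i)^2, so grad Q_i(w) = 2*(w*x_i - y_i)*x_i.
import random

def sgd(data, w=0.0, eta=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)                      # randomly shuffle the training set
        for x, y in data:
            w = w - eta * 2 * (w * x - y) * x  # w = w - eta * grad Q_i(w)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # y = 2x exactly
w = sgd(data)
print(round(w, 3))  # converges to 2.0
```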

- Classification
- Regression

- k-NN classification: output is a class membership (an object is classified by a majority vote of its neighbors)
- k-NN regression: output is the property value for the object (the average of the values of its k nearest neighbors)
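The majority-vote rule for k-NN classification can be sketched directly; the 1-D training examples below are assumed toy data.

```python
# k-NN classification sketch: an object is assigned the majority label
# among its k nearest training examples (1-D features for brevity).
from collections import Counter

def knn_classify(x, training, k=3):
    nearest = sorted(training, key=lambda ex: abs(ex[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
print(knn_classify(1.2, training))  # "a": all three nearest are labeled a
print(knn_classify(8.5, training))  # "b": two of three nearest are labeled b
```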

- Regression
- Anomaly detection

- Collection of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumption between the features.

- Document classification (spam/non-spam)

- If A and B are events:
  - P(A), P(B) are the probabilities of observing A and B independently of each other
  - P(A|B) is the conditional probability: the likelihood of event A occurring given that B is true
  - P(B|A) is the conditional probability: the likelihood of event B occurring given that A is true
  - P(B) ≠ 0
- Bayes' theorem: *P(A|B) = (P(B|A) · P(A)) / P(B)*
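A worked instance of the theorem, with assumed numbers: a condition A with 1% prevalence, a test B that fires with probability 0.99 given A, and an overall positive rate of 5%.

```python
# Worked Bayes' theorem example (assumed probabilities).
p_a = 0.01          # P(A): prevalence of the condition
p_b_given_a = 0.99  # P(B|A): positive test given the condition
p_b = 0.05          # P(B): overall positive rate, must be non-zero

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.198: a positive test still leaves P(A|B) below 20%
```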

- Decision support tool
- Tree-like model of decisions and their possible consequences

- Classification
- Regression
- Decision Analysis: identifying strategies to reach a goal
- Operations Research

- Collection of multiple learning algorithms to obtain better predictive performance than could be obtained from one of the constituting algorithms alone.
- Random forests are obtained by building multiple decision trees at training time

- Multiclass classification
- Multilabel classification (the problem of assigning one or more labels to each instance; there is no limit on the number of classes an instance can be assigned to)
- Regression
- Anomaly detection

- Process of selecting a subset of relevant features
- Used in domains with large number of features and comparatively few sample points

- Analysis of written texts
- Analysis of DNA microarray data

- Let *X* be the original set of n features, i.e., |*X*| = n
- Let *w_i* be the weight assigned to feature *x_i ∈ X*
- Binary feature selection assigns binary weights, whereas continuous feature selection assigns weights preserving the order of relevance
- Let *J(X')* be an evaluation measure, defined as *J: X' ⊆ X → R*
- The feature selection problem may be defined in the three following ways:
  - Fix |*X'*| = *m < n*; find *X'* ⊂ *X* such that *J(X')* is maximum
  - Choose *J_0*; find *X'* ⊆ *X* such that *J(X') >= J_0*
  - Find a compromise between minimizing |*X'*| and maximizing *J(X')*
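One simple strategy for the first problem definition is greedy forward selection: repeatedly add the feature that most improves *J* until the subset reaches size *m*. The additive toy measure *J* below is an assumption; real measures need not be additive.

```python
# Greedy forward feature selection sketch: grow X' one feature at a time,
# always adding the feature that maximizes the evaluation measure J.
def forward_select(features, J, m):
    selected = []
    while len(selected) < m:
        best = max((f for f in features if f not in selected),
                   key=lambda f: J(selected + [f]))
        selected.append(best)
    return selected

# Assumed toy measure: J sums a fixed relevance score per feature.
relevance = {"x1": 0.9, "x2": 0.1, "x3": 0.5}
J = lambda subset: sum(relevance[f] for f in subset)
print(forward_select(["x1", "x2", "x3"], J, m=2))  # ['x1', 'x3']
```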

- Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine 17(3), 1996.
- Pavel Berkhin. Survey of clustering data mining techniques.
- Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, p. 207.
- Nam Nguyen and Yunsong Guo. Comparisons of sequence labeling algorithms and extensions. Proceedings of the 24th International Conference on Machine Learning, ACM, 2007.
- Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2008.
- Vijay Mahadevan et al. Anomaly detection in crowded scenes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
- Ryan McDonald. A study of global inference algorithms in multi-document summarization. European Conference on Information Retrieval, Springer, 2007.
- Luis Carlos Molina, Lluís Belanche, and Àngela Nebot. Feature selection algorithms: a survey and experimental evaluation. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM), 2002.
- Marti A. Hearst et al. Support vector machines. IEEE Intelligent Systems and their Applications 13(4), 1998, pp. 18-28.