NOTE: Article in Progress

From data to insights

1. Lifecycle of data

Lifecycle of data: data, knowledge, insights, action

Data to knowledge

  1. Data acquisition
  2. Data Extraction
  3. Data Cleaning
  4. Data Transformation
  5. ETL
  6. Data analysis modeling
  7. Data Storage
  8. Analysis
  9. Visualisation
Major steps of data analysis

Knowledge to Insights

Knowledge to Knowledge

Insights to Action

Insights to Knowledge

Action to Data

Data analysis

Data visualisation

finding known and unknown patterns from data

ETL (Extraction Transformation and Loading)

  1. Data Extraction
  2. Data Cleaning
  3. Data Transformation
  4. Loading data to information stores

2. Data Acquisition and Storage

2.1. Data acquisition

  1. Surveys
    • Manual surveys
    • Online surveys
  2. Sensors [1]
    • Temperature, pressure, humidity, rainfall
    • Acoustic, navigation
    • Proximity, presence sensors
  3. Social networks
  4. Video surveillance cameras
  5. Web
  1. https://en.wikipedia.org/wiki/List_of_sensors

2.2. Data storage formats

2.3 Types of data stores

  1. Structured data stores
    • Relational databases
    • Object-oriented databases
  2. Unstructured data stores
    • Filesystems
    • Content-management systems
    • Document collections
  3. Semi-structured data stores
    • Filesystems
    • NoSQL data stores
Unstructured vs. Structured vs. Semi-structured

NoSQL versus SQL [5]: NoSQL data stores typically impose no strict schema and are designed for horizontal scaling.

2.3.1. ACID Transactions [1]

  1. https://en.wikipedia.org/wiki/ACID

2.3.2. Types of data stores

2.3.3. NoSQL

2.3.4. Types of NoSQL stores

3. Data Extraction and Integration

3.1. Data extraction techniques

3.2. Query interfaces

3.3. Crawlers for web pages

Web crawlers: navigating the entire web using hyperlinks
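
As a minimal sketch of navigation by hyperlinks, the Python example below extracts the links of each page and visits them breadth-first. The three-page site, the `crawl` function and its `fetch` parameter are hypothetical names chosen for illustration; the pages are kept in memory so that the example runs without network access. A real crawler would also need to handle relative URLs, politeness rules and errors.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, max_pages=100):
    """Breadth-first traversal: visit a page, then queue every unseen link."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # the page could not be retrieved
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical three-page site held in memory instead of being downloaded.
PAGES = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/a">A</a> <a href="/c">C</a>',
    "/c": '<a href="/missing">dead link</a>',
}

visited = crawl("/a", PAGES.get)
print(visited)
```

The `seen` set is what keeps the crawler from looping when pages link back to each other.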

3.4. Application Programming Interface (API)

API (programming interface)

4. Pre-treatment of Data

4.1 Data Cleaning: Types of Errors

4.1.1. Syntactical errors

4.1.2. Semantic errors

4.1.3. Coverage errors

4.2. Data Cleaning: Handling Errors

4.2.1. Handling Syntactical errors

4.2.2. Handling Semantic errors

4.2.3. Handling Coverage errors

4.2.4. Administrators and handling errors

5. Data Transformation

Languages

6. ETL

6.1. ETL (Extraction Transformation and Loading)

  1. Data Extraction
  2. Data Cleaning
  3. Data Transformation
  4. Loading data to information stores
ETL (Extraction, Transformation and Loading)
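
The four steps can be sketched in Python with only the standard library. Everything specific in this example is an assumption made for illustration: the CSV content, the Fahrenheit-to-Celsius conversion used as the transformation, and the `readings` table used as the target store.

```python
import csv
import io
import sqlite3

# Hypothetical source: a small CSV export with a blank reading and stray spaces.
RAW_CSV = """city,temperature_f
Lyon, 68
Paris,
Nice,77
"""


def extract(text):
    """Extraction: read the rows out of the source format."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: trim whitespace and drop rows with a missing reading."""
    cleaned = []
    for row in rows:
        value = (row["temperature_f"] or "").strip()
        if value:
            cleaned.append({"city": row["city"].strip(), "temperature_f": value})
    return cleaned


def transform(rows):
    """Transformation: convert Fahrenheit strings to Celsius numbers."""
    return [
        (row["city"], round((float(row["temperature_f"]) - 32) * 5 / 9, 1))
        for row in rows
    ]


def load(records, connection):
    """Loading: write the records into the target store."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (city TEXT, temperature_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(clean(extract(RAW_CSV))), connection)
stored = connection.execute(
    "SELECT city, temperature_c FROM readings ORDER BY city"
).fetchall()
print(stored)
```

Keeping each step in its own function makes it possible to test or replace one stage, for example a different target store, without touching the others.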

6.2.1. Models for data analysis

6.2.2. Star Schema

6.2.3. Data Cubes

6.2.4. Snowflake Schema

6.2. ETL: From one data store to another

7. Data Analysis

7.1. Data analysis

Activities of data analysis

  1. Retrieving values
  2. Filter
  3. Compute derived values
  4. Find extremum
  5. Sort
  6. Determine range
  7. Characterize distribution
  8. Find anomalies
  9. Cluster
  10. Correlate
  11. Contextualization
  1. https://en.wikipedia.org/wiki/Data_analysis
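
Several of the activities listed above can be illustrated on a small data set with the Python standard library. The temperature readings are hypothetical, and the two-standard-deviation rule used to flag anomalies is only one possible choice among many.

```python
import statistics

# Hypothetical data set: daily temperatures in degrees Celsius.
readings = [
    {"day": "Mon", "temp": 18.0},
    {"day": "Tue", "temp": 21.0},
    {"day": "Wed", "temp": 35.0},
    {"day": "Thu", "temp": 20.0},
    {"day": "Fri", "temp": 19.0},
    {"day": "Sat", "temp": 21.0},
]

# Retrieve values
temps = [r["temp"] for r in readings]
# Filter
warm_days = [r["day"] for r in readings if r["temp"] > 20]
# Compute a derived value
mean_temp = statistics.mean(temps)
# Find the extremum
hottest = max(readings, key=lambda r: r["temp"])
# Sort
ranked = sorted(readings, key=lambda r: r["temp"])
# Determine the range
temp_range = (min(temps), max(temps))
# Characterize the distribution (population standard deviation)
spread = statistics.pstdev(temps)
# Find anomalies: readings more than two standard deviations from the mean
anomalies = [r["day"] for r in readings if abs(r["temp"] - mean_temp) > 2 * spread]

print(warm_days, hottest["day"], temp_range, anomalies)
```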

8. Data Visualization

8.1. Types of Data Visualization

  1. Time-series
  2. Ranking
  3. Part-to-whole
  4. Deviation
  5. Sort
  6. Frequency distribution
  7. Correlation
  8. Nominal comparison
  9. Geographic or geospatial
  1. https://en.wikipedia.org/wiki/Data_visualization

8.2. Data Visualization: Examples

  1. Bar chart (nominal comparison)
  2. Pie chart (part-to-whole)
  3. Histogram (frequency distribution)
  4. Scatter plot (correlation)
  5. Network
  6. Line chart (time-series)
  7. Treemap
  8. Gantt chart
  9. Heatmap

Pie Chart

Programming Language Paradigms (Bubble Chart)
Timeline of Programming Languages (using Histropedia)
Influence Graph of Programming Languages

k Predominant colours

RGB Scatter plots (Comparison)

9. Patterns

9.1. Patterns in Nature

9.2. Patterns by Humans

Pattern creation

Synonyms

Data mining trends [2], future (2007) [3], finding patterns in data [4]

Pattern Recognition

Formalization

Examples: Features

Formalization

Example

  1. https://en.wikipedia.org/wiki/Feature_vector

Formalization: Supervised learning

Formalization: Unsupervised learning

Formalization: Semi-supervised learning

10. Data Mining

Tasks in Data Mining

  1. Classification
  2. Clustering
  3. Regression
  4. Sequence Labeling
  5. Association Rules
  6. Anomaly Detection
  7. Summarization

10.1. Classification

Applications

Formal definition

Classifiers

Linear Classifiers

Classifiers

Let

Then

Confusion Matrix for an SVM classifier of handwritten digits (MNIST)
Multiclass classification
One-vs.-rest strategy for Multiclass classification
One-vs.-one strategy for Multiclass classification
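
The two strategies differ in how a multiclass problem is decomposed into binary problems. The sketch below only enumerates those binary problems for the ten digit classes; it does not train any classifier, and the variable names are chosen for illustration.

```python
from itertools import combinations

classes = list(range(10))  # the ten digit classes, 0 to 9

# One-vs.-rest: one binary problem per class (that class against all others).
one_vs_rest = [(c, [other for other in classes if other != c]) for c in classes]

# One-vs.-one: one binary problem per unordered pair of classes.
one_vs_one = list(combinations(classes, 2))

print(len(one_vs_rest), len(one_vs_one))
```

For K classes, one-vs.-rest needs K binary classifiers, while one-vs.-one needs K(K - 1)/2, each trained on the examples of two classes only.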

10.2. Clustering

Applications

Formal definition

Cluster models

10.3. Regression

Applications

Formal definition

Linear regression

10.4. Sequence Labeling

Applications

Formal definition

10.5. Association Rules

Association Rules

Applications

Formal definition

Example

10.6. Anomaly Detection

Applications

Characteristics

Formalization

10.7. Summarization

Applications

Formalization: Multidocument summarization

Extractive summarization

11. Algorithms

  1. Support Vector Machines (SVM)
  2. Stochastic Gradient Descent (SGD)
  3. Nearest-Neighbours
  4. Naive Bayes
  5. Decision Trees
  6. Ensemble Methods (Random Forest)

11.1. Support Vector Machines (SVM)

Introduction

Hyperplane

Formal definition

Normal vector

Formal definition

Data mining tasks

Applications

11.2. Stochastic Gradient Descent (SGD)

Gradient

Gradient vs Derivative

Gradient descent

Standard gradient descent method

Iterative method
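
As a minimal sketch of the iterative method, the Python function below repeatedly moves a value against the gradient of the function being minimized. The objective f(x) = (x - 3)^2, the learning rate and the number of steps are assumptions chosen for illustration. This is the standard (full-gradient) method; the stochastic variant estimates the gradient from a single randomly chosen training example at each step.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeat x <- x - learning_rate * gradient(x) for a fixed number of steps."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

With this objective each step shrinks the distance to the minimum at x = 3 by a constant factor, so the iterate approaches 3. A learning rate that is too large would make the iterates diverge instead.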

Applications

11.3. Nearest-Neighbours

k-nearest neighbors algorithm
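
A minimal sketch of k-nearest-neighbour classification, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the two-cluster training set are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points forming two well-separated groups.
training = [
    ((1.0, 1.0), "blue"),
    ((1.5, 2.0), "blue"),
    ((2.0, 1.0), "blue"),
    ((6.0, 6.0), "red"),
    ((7.0, 6.5), "red"),
    ((6.5, 7.0), "red"),
]

label_a = knn_predict(training, (1.2, 1.4))
label_b = knn_predict(training, (6.4, 6.2))
print(label_a, label_b)
```

There is no training phase: all the work happens at prediction time, which is why the cost of a query grows with the size of the training set.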

Applications

11.4. Naive Bayes classifiers

Applications

Bayes' Theorem
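
For reference, the theorem can be written as follows, where the event names A and B are chosen for illustration:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) \neq 0
```

A naive Bayes classifier applies it to a class C_k and features x_1, ..., x_n (symbols assumed here), under the assumption that the features are conditionally independent given the class:

```latex
P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```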

11.5. Decision Trees

Applications

11.6. Ensemble Methods (Random Forest)

Definition

12. Feature Selection

Definition

Applications

Formal definition [8]

Data Mining

Goals

  1. Artificial Neural Networks
  2. Deep Learning
  3. Reinforcement Learning
  4. Data Licences, Ethics and Privacy

13. Artificial Neural Networks

Artificial neural networks

Perceptron

Artificial neural networks

Perceptron: Formal definition

Perceptron: Steps

  1. Initialize the weights and the threshold.
  2. For each example (x_j, d_j) in the training set:
    • Calculate the actual output: y_j(t) = f[w(t) · x_j]
    • Update the weights: w_i(t + 1) = w_i(t) + (d_j - y_j(t)) x_j,i
  3. Repeat step 2 until the iteration error (1/s) Σ_j |d_j - y_j(t)| is less than a user-specified threshold, where s is the number of training examples.
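
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the activation f is taken to be a step function, the perceptron threshold is folded into the weights as a bias term (index 0), a learning rate parameter is added (1.0 reproduces the update rule as written), and the logical AND data set is a hypothetical example.

```python
def step(value):
    """Step activation f: output 1 when the weighted sum is non-negative."""
    return 1 if value >= 0 else 0


def predict(weights, x):
    """Output for input x; weights[0] is the bias, paired with a constant 1."""
    inputs = [1.0] + list(x)
    return step(sum(w * xi for w, xi in zip(weights, inputs)))


def train_perceptron(samples, learning_rate=1.0, error_threshold=0.01, max_epochs=100):
    # Step 1: initialize the weights (bias included) to zero.
    weights = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        total_error = 0
        # Step 2: for each example, compute the output and update the weights.
        for x, d in samples:
            y = predict(weights, x)
            inputs = [1.0] + list(x)
            for i, xi in enumerate(inputs):
                weights[i] += learning_rate * (d - y) * xi
            total_error += abs(d - y)
        # Step 3: stop once the mean error over the pass is below the threshold.
        if total_error / len(samples) < error_threshold:
            break
    return weights


# Hypothetical training set: the logical AND of two binary inputs.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = train_perceptron(AND)
predictions = [predict(weights, x) for x, _ in AND]
print(predictions)
```

The loop is guaranteed to stop with zero error only when the classes are linearly separable, as they are for AND; `max_epochs` bounds the work otherwise.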

Backpropagation

14. Deep Learning

Deep neural networks

Applications

Convolutional deep neural networks

15. Reinforcement Learning

16. Data Licences, Ethics and Privacy

Privacy

Big Data

Open Data
Linked Open data cloud
Archived data

References

  1. Data Mining course by John Samuel (2017)
  2. Piatetsky-Shapiro, Gregory. “Data Mining and Knowledge Discovery 1996 to 2005: Overcoming the Hype and Moving from ‘University’ to ‘Business’ and ‘Analytics.’” Data Mining and Knowledge Discovery, vol. 15, no. 1, July 2007, pp. 99–105. doi:10.1007/s10618-006-0058-2.
  3. Fayyad, Usama, et al. “Knowledge Discovery and Data Mining: Towards a Unifying Framework.” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 82–88.
  4. NoSQL vs. SQL