Data Mining (2018-2019): Practicals 4: John Samuel

Goals

Work on decision trees and random forests.
Work on online machine training.
Work on neural network models using Tensorflow.
Finish work on the recommender system and writing of project report.

Scoring

Every exercise has an associated difficulty level. Easy and medium exercises help you understand the fundamentals and give you ideas for the difficult exercises. You are strongly advised to finish the easy and medium exercises in order to get a good score. The difficulty scale below is marked on every exercise:

★: Easy
★★: Medium
★★★: Difficult

Guidelines

To get complete guidance from the mentors, it is highly recommended that you work on today's practical session and not on the preceding ones.
Make sure that you name your submissions correctly. Double-check your submissions.
Please check the references.
There are several ways to achieve a task. Hence there are many possible solutions. But try to make maximum use of the libraries that have been suggested to you for your exercises.

Installation

Please refer to the installation page. In this practical session, we will also use graphviz and pydotplus.

Exercise 4.1 ★

In this practical session, we start experimenting with decision trees. We will first build a Decision Tree Classifier using a very simple example. Like the classifiers we have seen before, we will first try to fit our data and then predict a class for a previously unseen value.

from sklearn import tree
data = [[0, 0], 
        [1, 1],
        [1, 0]]
result = [1, 0, 1]
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(data, result)
dtc.predict([[1, 1]])

Our next goal is to visualize the decision tree. Look at the following code and see how we have given names to the two columns of the above data. We also gave names to the result data entries, calling them class1 and class2.

from sklearn import tree
import graphviz
import pydotplus
from IPython.display import Image, display

data = [[0, 0], 
        [1, 1],
        [1, 0]]
result = [1, 0, 1]
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(data, result)

dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=['column1', 'column2'],
                                filled=True, rounded=True, 
                                class_names = ['class1', 'class2']
                                ) 
graph = graphviz.Source(dot_data)
pydot_graph = pydotplus.graph_from_dot_data(dot_data)
img = Image(pydot_graph.create_png())
display(img)
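If graphviz or pydotplus is not available, the same tree can also be inspected as plain text. The following is a minimal sketch, assuming a scikit-learn version that provides tree.export_text (0.21 or later); the query value [0, 1] is chosen here only because it does not appear in the training data.

```python
from sklearn import tree

# Same toy data as in the example above
data = [[0, 0],
        [1, 1],
        [1, 0]]
result = [1, 0, 1]

# random_state is fixed only to make the sketch reproducible
dtc = tree.DecisionTreeClassifier(random_state=0)
dtc = dtc.fit(data, result)

# Plain-text rendering of the learned rules (no graphviz needed)
rules = tree.export_text(dtc, feature_names=['column1', 'column2'])
print(rules)

# Predicted class and class probabilities for a value absent from the data
unseen = [[0, 1]]
print(dtc.predict(unseen))
print(dtc.predict_proba(unseen))
```

The printed rules should show the same split as the graphical view: each indented line is a test on one column, and each leaf gives the predicted class.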

Now, let's take a more realistic example. In the following code, we consider 13 photographs marked by a user as 'Favorite' or 'NotFavorite'. For every photograph, we have four pieces of information: color, tag, size (medium, thumbnail, etc.) and the mode in which the photograph was taken (portrait or landscape). We will build a Decision Tree Classifier with this data. We will then predict whether our user will like a photograph of nature whose predominant color is red, of thumbnail size and taken in portrait mode.

In the following code, we display two values. We predict whether the user will favorite the photograph or not. We also display the importance of each of the features: color, tag, size and mode.

from sklearn import tree
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = [
        ['green', 'nature', 'thumbnail', 'landscape'], 
        ['blue', 'architecture', 'medium', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['yellow', 'nature', 'medium', 'portrait'],
        ['green', 'nature', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'landscape'],
        ['blue', 'nature', 'thumbnail', 'portrait'],
        ['yellow', 'architecture', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'portrait'],
        ['yellow', 'nature', 'medium', 'landscape'],
        ['yellow', 'people', 'thumbnail', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['red', 'architecture', 'thumbnail','landscape']]
result = [
          'Favorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite'
          ]


#creating dataframes
dataframe = pd.DataFrame(data, columns=['color', 'tag', 'size', 'mode'])
resultframe = pd.DataFrame(result, columns=['favorite'])

#generating numerical labels
le1 = LabelEncoder()
dataframe['color'] = le1.fit_transform(dataframe['color'])

le2 = LabelEncoder()
dataframe['tag'] = le2.fit_transform(dataframe['tag'])

le3 = LabelEncoder()
dataframe['size'] = le3.fit_transform(dataframe['size'])

le4 = LabelEncoder()
dataframe['mode'] = le4.fit_transform(dataframe['mode'])

le5 = LabelEncoder()
resultframe['favorite'] = le5.fit_transform(resultframe['favorite'])

#Use of decision tree classifiers
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(dataframe, resultframe)

#prediction
prediction = dtc.predict([
    [le1.transform(['red'])[0], le2.transform(['nature'])[0],
     le3.transform(['thumbnail'])[0], le4.transform(['portrait'])[0]]])
print(le5.inverse_transform(prediction))
print(dtc.feature_importances_)

What are your observations?
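To interpret the numbers seen by the classifier, it may help to look at what a LabelEncoder produces. The sketch below uses a short, made-up list of colors (not the full data set) and shows that the distinct labels are stored in sorted order in classes_, and that the code of a label is its index in that list.

```python
from sklearn.preprocessing import LabelEncoder

# Small illustrative list of colors (an assumption, not the full data set)
colors = ['green', 'blue', 'blue', 'yellow', 'red']

le = LabelEncoder()
encoded = le.fit_transform(colors)

# Distinct labels, stored in sorted order
print(le.classes_.tolist())

# Each label is replaced by its index in classes_
print(encoded.tolist())

# Encoding a single value, and decoding it again
red_code = int(le.transform(['red'])[0])
print(red_code)
print(le.inverse_transform([red_code])[0])
```

Keep in mind that these integer codes impose an arbitrary order on the categories, and the decision tree splits on thresholds over these codes. This is worth remembering when reading the feature importances.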

Our next goal is to visualize the above decision tree. Test the code below. It's similar to the code we tested before. Take a look at the classes and the features.

from sklearn import tree
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import graphviz
import pydotplus
from IPython.display import Image, display

data = [
        ['green', 'nature', 'thumbnail', 'landscape'], 
        ['blue', 'architecture', 'medium', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['yellow', 'nature', 'medium', 'portrait'],
        ['green', 'nature', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'landscape'],
        ['blue', 'nature', 'thumbnail', 'portrait'],
        ['yellow', 'architecture', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'portrait'],
        ['yellow', 'nature', 'medium', 'landscape'],
        ['yellow', 'people', 'thumbnail', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['red', 'architecture', 'thumbnail','landscape']]
result = [
          'Favorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite'
          ]


#creating dataframes
dataframe = pd.DataFrame(data, columns=['color', 'tag', 'size', 'mode'])
resultframe = pd.DataFrame(result, columns=['favorite'])

#generating numerical labels
le1 = LabelEncoder()
dataframe['color'] = le1.fit_transform(dataframe['color'])

le2 = LabelEncoder()
dataframe['tag'] = le2.fit_transform(dataframe['tag'])

le3 = LabelEncoder()
dataframe['size'] = le3.fit_transform(dataframe['size'])

le4 = LabelEncoder()
dataframe['mode'] = le4.fit_transform(dataframe['mode'])

le5 = LabelEncoder()
resultframe['favorite'] = le5.fit_transform(resultframe['favorite'])

#Use of decision tree classifiers
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(dataframe, resultframe)

dot_data = tree.export_graphviz(dtc, out_file=None,
                     feature_names=list(dataframe.columns),
                     filled=True, rounded=True, 
                     # class names must be given in ascending order
                     # of the encoded labels
                     class_names=list(le5.classes_)
                    ) 
graph = graphviz.Source(dot_data) 
 
pydot_graph = pydotplus.graph_from_dot_data(dot_data)
img = Image(pydot_graph.create_png())
display(img)

What if we had multiple decision trees? Let's predict using a Random Forest Classifier (which can be seen as a collection of multiple decision trees). Check the predicted value as well as the importance of the different features. Note that we ask for 10 such estimators, each with a maximum depth of 2.

from sklearn import tree
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import graphviz
import pydotplus
from IPython.display import Image, display

data = [
        ['green', 'nature', 'thumbnail', 'landscape'], 
        ['blue', 'architecture', 'medium', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['yellow', 'nature', 'medium', 'portrait'],
        ['green', 'nature', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'landscape'],
        ['blue', 'nature', 'thumbnail', 'portrait'],
        ['yellow', 'architecture', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'portrait'],
        ['yellow', 'nature', 'medium', 'landscape'],
        ['yellow', 'people', 'thumbnail', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['red', 'architecture', 'thumbnail','landscape']]
result = [
          'Favorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite'
          ]


#creating dataframes
dataframe = pd.DataFrame(data, columns=['color', 'tag', 'size', 'mode'])
resultframe = pd.DataFrame(result, columns=['favorite'])

#generating numerical labels
le1 = LabelEncoder()
dataframe['color'] = le1.fit_transform(dataframe['color'])

le2 = LabelEncoder()
dataframe['tag'] = le2.fit_transform(dataframe['tag'])

le3 = LabelEncoder()
dataframe['size'] = le3.fit_transform(dataframe['size'])

le4 = LabelEncoder()
dataframe['mode'] = le4.fit_transform(dataframe['mode'])

le5 = LabelEncoder()
resultframe['favorite'] = le5.fit_transform(resultframe['favorite'])

#Use of random forest classifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=2,
                             random_state=0)
rfc = rfc.fit(dataframe, resultframe.values.ravel())

#prediction
prediction = rfc.predict([
    [le1.transform(['red'])[0], le2.transform(['nature'])[0],
     le3.transform(['thumbnail'])[0], le4.transform(['portrait'])[0]]])
print(le5.inverse_transform(prediction))
print(rfc.feature_importances_)
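To see how the forest combines its trees, one can compare the class probabilities of the forest with the average of the probabilities given by each individual estimator. This is only a sketch: for brevity it uses OrdinalEncoder instead of the four LabelEncoder objects above, and it passes the string labels directly as the target, which scikit-learn classifiers accept.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

data = [
        ['green', 'nature', 'thumbnail', 'landscape'],
        ['blue', 'architecture', 'medium', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['yellow', 'nature', 'medium', 'portrait'],
        ['green', 'nature', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'landscape'],
        ['blue', 'nature', 'thumbnail', 'portrait'],
        ['yellow', 'architecture', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'portrait'],
        ['yellow', 'nature', 'medium', 'landscape'],
        ['yellow', 'people', 'thumbnail', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['red', 'architecture', 'thumbnail', 'landscape']]
result = ['Favorite', 'NotFavorite', 'Favorite', 'Favorite', 'Favorite',
          'Favorite', 'Favorite', 'NotFavorite', 'NotFavorite', 'Favorite',
          'Favorite', 'NotFavorite', 'NotFavorite']

# Encode the four categorical columns in one step (swapped in for brevity)
encoder = OrdinalEncoder()
X = encoder.fit_transform(data)

rfc = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
rfc.fit(X, result)

query = encoder.transform([['red', 'nature', 'thumbnail', 'portrait']])

# Probabilities given by the forest as a whole
forest_proba = rfc.predict_proba(query)[0]

# Probabilities given by each tree, then averaged over the 10 trees
tree_probas = np.array([est.predict_proba(query)[0]
                        for est in rfc.estimators_])
mean_tree_proba = tree_probas.mean(axis=0)

print(rfc.classes_.tolist())
print(forest_proba)
print(mean_tree_proba)
print(rfc.predict(query)[0])
```

The two probability vectors should coincide, and the predicted class is the one with the highest averaged probability.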

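Finally, we visualize the individual estimators of a random forest. Note that the code below uses a maximum depth of 3 for each estimator.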

from sklearn import tree
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import graphviz
import pydotplus
from IPython.display import Image, display

data = [
        ['green', 'nature', 'thumbnail', 'landscape'], 
        ['blue', 'architecture', 'medium', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['yellow', 'nature', 'medium', 'portrait'],
        ['green', 'nature', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'landscape'],
        ['blue', 'nature', 'thumbnail', 'portrait'],
        ['yellow', 'architecture', 'thumbnail', 'landscape'],
        ['blue', 'people', 'medium', 'portrait'],
        ['yellow', 'nature', 'medium', 'landscape'],
        ['yellow', 'people', 'thumbnail', 'portrait'],
        ['blue', 'people', 'medium', 'landscape'],
        ['red', 'architecture', 'thumbnail','landscape']]
result = [
          'Favorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite',
          'Favorite',
          'Favorite',
          'NotFavorite',
          'NotFavorite'
          ]


#creating dataframes
dataframe = pd.DataFrame(data, columns=['color', 'tag', 'size', 'mode'])
resultframe = pd.DataFrame(result, columns=['favorite'])

#generating numerical labels
le1 = LabelEncoder()
dataframe['color'] = le1.fit_transform(dataframe['color'])

le2 = LabelEncoder()
dataframe['tag'] = le2.fit_transform(dataframe['tag'])

le3 = LabelEncoder()
dataframe['size'] = le3.fit_transform(dataframe['size'])

le4 = LabelEncoder()
dataframe['mode'] = le4.fit_transform(dataframe['mode'])

le5 = LabelEncoder()
resultframe['favorite'] = le5.fit_transform(resultframe['favorite'])

#Use of random forest classifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=3,
                             random_state=0)
rfc = rfc.fit(dataframe, resultframe.values.ravel())

for i in range(10):
    dot_data = tree.export_graphviz(rfc.estimators_[i], out_file=None,
                   feature_names=list(dataframe.columns),
                   filled=True, rounded=True,
                   # class names must be given in ascending order
                   # of the encoded labels
                   class_names=list(le5.classes_)
                   ) 
    graph = graphviz.Source(dot_data) 
    pydot_graph = pydotplus.graph_from_dot_data(dot_data)
    img = Image(pydot_graph.create_png())
    display(img)

Exercise 4.2 ★

In practical session 3, we split our data into training data and test data in order to create prediction models, and we fed the complete training data to our classifier at once. However, in real life, new training data may keep arriving. Check the following code, which uses a perceptron, and compare it with the code of exercise 3.3:

from sklearn import datasets, metrics
from sklearn.linear_model import Perceptron
import numpy as np
import matplotlib.pyplot as plot

digits = datasets.load_digits()
training_size = int(digits.images.shape[0]/2)

training_images = digits.images[0:training_size]
training_images = training_images.reshape((training_images.shape[0], -1))

training_target = digits.target[0:training_size]

classifier = Perceptron(max_iter=1000)

#training
for i in range(training_size):
    training_data = np.array(training_images[i])
    training_data = training_data.reshape(1, -1)
    classifier.partial_fit(training_data, [training_target[i]],
                           classes=np.unique(digits.target))

#prediction
predict_images = digits.images[training_size+1:]
actual_labels = digits.target[training_size+1:]
predicted_labels = classifier.predict(
    predict_images.reshape((predict_images.shape[0], -1)))

#classification report
print(metrics.classification_report(actual_labels,predicted_labels))

This approach is called online machine training (also known as incremental learning; in French, algorithme d'apprentissage incrémental). Did you get good precision?

Your next task is to modify the above program and test online training with MLPClassifier.

Try modifying (reducing and increasing) the training data size. What are your observations?
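One way to observe the effect of the amount of training data is to feed the perceptron in small batches and measure the accuracy on the test data after each batch. The sketch below is one possible variant of the program of this exercise, not a reference solution; the batch size of 100 is an arbitrary assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron

digits = datasets.load_digits()
images = digits.images.reshape((digits.images.shape[0], -1))
training_size = int(images.shape[0] / 2)

# Second half of the data is kept aside for testing
test_images = images[training_size:]
test_labels = digits.target[training_size:]

classifier = Perceptron()
all_classes = np.unique(digits.target)
batch_size = 100  # assumed value, chosen only for illustration

scores = []
for start in range(0, training_size, batch_size):
    stop = min(start + batch_size, training_size)
    # partial_fit updates the existing model instead of retraining it
    classifier.partial_fit(images[start:stop], digits.target[start:stop],
                           classes=all_classes)
    # accuracy on the test data after seeing `stop` training samples
    scores.append(classifier.score(test_images, test_labels))
    print(stop, round(scores[-1], 3))
```

The list scores then gives an idea of how the accuracy evolves as more training data arrives.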

Exercise 4.3 (Optional) ★★

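Your final exercise is to use TensorFlow. We will use a Deep Neural Network (DNN) classifier with two hidden layers. Before starting, please refer to the installation page for installing TensorFlow. Note that the code below relies on the tf.contrib module, which is only available in TensorFlow 1.x.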

Recall that we have already used the multilayer perceptron (MLP), a type of deep neural network, in our preceding practical session. We will first predict an image of a digit using DNNClassifier.

import tensorflow as tf
from sklearn import datasets
import matplotlib.pyplot as plot

digits = datasets.load_digits()
training_size = int(digits.images.shape[0]/2)

training_images = digits.images[0:training_size]
training_images = training_images.reshape((training_images.shape[0], -1))

training_target = digits.target[0:training_size]

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=[tf.contrib.layers.real_valued_column("", dtype=tf.float64)],
    # 2 hidden layers of 50 nodes each
    hidden_units=[50, 50],
    # 10 classes: 0, 1, 2...9
    n_classes=10)

#training
classifier.fit(training_images, training_target, steps=100)

#prediction
predict_images = digits.images[training_size+1:]
predict = classifier.predict(predict_images[16].reshape(1,-1))
print(list(predict))

plot.imshow(predict_images[16], cmap=plot.cm.gray_r)
plot.show()

Did it work? Let's now try to get the accuracy of our model. Will it work for our entire test data?

import tensorflow as tf
from sklearn import datasets
import matplotlib.pyplot as plot

digits = datasets.load_digits()
training_size = int(digits.images.shape[0]/2)

training_images = digits.images[0:training_size]
training_images = training_images.reshape((training_images.shape[0], -1))

training_target = digits.target[0:training_size]

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=[tf.contrib.layers.real_valued_column("", dtype=tf.float64)],
    # 2 hidden layers of 50 nodes each
    hidden_units=[50, 50],
    # 10 classes: 0, 1, 2...9
    n_classes=10)

#training
classifier.fit(training_images, training_target, steps=100)

#prediction
predict_images = digits.images[training_size+1:]
actual_labels = digits.target[training_size+1:]
evaluation = classifier.evaluate(
    x=predict_images.reshape((predict_images.shape[0], -1)),
    y=actual_labels)

print(evaluation['accuracy'])

What is the accuracy that you got? Now change the number of neurons in each layer (currently it is set to 50 each). Also try to increase the number of hidden layers. Did your accuracy improve?
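For comparison, or if the TensorFlow version at hand does not provide the classifier used above, a network with the same shape (two hidden layers of 50 neurons each) can be sketched with the MLPClassifier of scikit-learn from the preceding practical session. This is an assumption-laden stand-in, not an equivalent of the TensorFlow code: the optimizer, the number of iterations (max_iter=500) and the random seed are choices made here, so the accuracies are not directly comparable.

```python
from sklearn import datasets
from sklearn.neural_network import MLPClassifier

digits = datasets.load_digits()
images = digits.images.reshape((digits.images.shape[0], -1))
training_size = int(images.shape[0] / 2)

# Two hidden layers of 50 neurons each, as in the DNN above
classifier = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=500,
                           random_state=0)
classifier.fit(images[:training_size], digits.target[:training_size])

# Mean accuracy on the second half of the data
accuracy = classifier.score(images[training_size:],
                            digits.target[training_size:])
print(accuracy)
```

Changing hidden_layer_sizes, for example to (100, 100) or (50, 50, 50), mirrors the changes to hidden_units suggested for the TensorFlow model.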

Exercise 4.4 ★★★

Project: Image recommender system: 3 practical sessions

Recall that the goal of this project is to recommend images based on the color preferences of the user.

If required, you can refer to some example Python code in the references page to resize images, read and write JSON files.
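As an illustration of writing and reading JSON files, here is a minimal sketch using only the Python standard library. The record and all of its field names (filename, dominant_colors, size, mode) are hypothetical; what you store for each image is a decision of your project.

```python
import json
import os
import tempfile

# Hypothetical per-image record; the field names are illustrative only
image_info = {
    "filename": "image_001.jpg",
    "dominant_colors": ["blue", "green"],
    "size": "thumbnail",
    "mode": "landscape",
}

# A temporary directory is used so that the sketch leaves no files behind
path = os.path.join(tempfile.mkdtemp(), "images.json")

# Writing a list of records to a JSON file
with open(path, "w", encoding="utf-8") as f:
    json.dump([image_info], f, indent=2)

# Reading the records back
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[0]["dominant_colors"])
```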

Please prepare a 3-page Project report (French or English) detailing the following:

Goal of your project
Data sources of your training images and licence. Did you use labeled data sources? Did you ask your user to label images?
Machine learning models that you tested and used as well as their precision.
Size of your training data and test data.
Did you use online machine learning?
Information that you decided to store for each image.
Information concerning user preferences
Self-evaluation of your work.
Remarks concerning the practical sessions, exercises and scope for improvement.
Conclusion

Note: Please do not add any program (or code) in this report.

Submission

Rename your notebook as Name1_Name2_[Name3].ipynb, where Name1, Name2 are your names.
Rename your project report as Name1_Name2_[Name3].pdf, where Name1, Name2 are your names.
Submit your notebook and Project report online.
Please do not submit your images, JSON, TSV and CSV files.

References

Link
