Model Classifier
The model learns a hierarchy of if/else questions, leading to a decision.
""" Decision Tree / Classifier (Play Tennis)
Given certain values for each of the attributes, the learned decision tree
is able to give a clear answer if weather is suitable or not for tennis.
The model can make prediction only for known sample from dataset.
"""
import pathlib
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / 'data/play_tennis.csv')
# Encode labels
df_encoded = pd.DataFrame()
for col in df.columns:
    df_encoded[col] = LabelEncoder().fit_transform(df[col])
# Train data
X = df_encoded.drop(columns=["play"])
Y = df_encoded['play']
# Fitting the model
dtree_model = DecisionTreeClassifier()
dtree_model.fit(X, Y)
# Prediction
x_new = [1, 0, 1, 0]  # rainy, cool, normal, not windy -> expect 1 (yes)
x_new = pd.DataFrame([x_new], columns=X.columns)
y_pred = dtree_model.predict(x_new)[0]
print("Dataset:"); print(df, "\n")
print("Encoded:"); print(df_encoded, "\n")
print("Unknown:"); print(x_new, "\n")
print("Prediction:", y_pred)
"""
Dataset:
outlook temp humidity windy play
0 sunny hot high False no
1 sunny hot high True no
2 overcast hot high False yes
3 rainy mild high False yes
4 rainy cool normal False yes
5 rainy cool normal True no
6 overcast cool normal True yes
7 sunny mild high False no
8 sunny cool normal False yes
9 rainy mild normal False yes
10 sunny mild normal True yes
11 overcast mild high True yes
12 overcast hot normal False yes
13 rainy mild high True no
Encoded:
outlook temp humidity windy play
0 2 1 0 0 0
1 2 1 0 1 0
2 0 1 0 0 1
3 1 2 0 0 1
4 1 0 1 0 1
5 1 0 1 1 0
6 0 0 1 1 1
7 2 2 0 0 0
8 2 0 1 0 1
9 1 2 1 0 1
10 2 2 1 1 1
11 0 2 0 1 1
12 0 1 1 0 1
13 1 2 0 1 0
Unknown:
outlook temp humidity windy
0 1 0 1 0
Prediction: 1
"""
Information Gain
The attribute with the highest information gain (IG) is given the highest priority in the tree.
""" Decision Tree / Classifier (Play Tennis)
Given certain values for each of the attributes, the learned decision tree
is able to give a clear answer if weather is suitable or not for tennis.
We have 4 features (outlook, temperature, humidity, windy) and the one target (play).
Information gain is used to identify which attribute provides more information.
The attribute with the highest IG is given the higher priority in the tree.
For example, by calculating the IG for `humidity` and `wind`, we would find that
`humidity` plays a more important role, so it is consider as a better classifier.
"""
import numpy as np
import pandas as pd
import pathlib
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import graphviz
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / 'data/play_tennis.csv')
# Encode dataset (sunny=2, rainy=1 ...)
df_encoded = pd.DataFrame()
for col in df.columns:
    df_encoded[col] = LabelEncoder().fit_transform(df[col])
# Train data
X = df_encoded.drop(['play'], axis=1) # remove column labeled `play`
y = df_encoded['play']
# Fitting the model
decision_tree = DecisionTreeClassifier(criterion='entropy')
decision_tree.fit(X, y)
# Predictions
X1_new = X.iloc[2:3]  # row 2 (overcast, hot, high, not windy)
y1_pred = decision_tree.predict(X1_new)[0]
X2_new = [2, 2, 0, 0]  # row 7 (sunny, mild, high, not windy)
X2_new = pd.DataFrame([X2_new], columns=X.columns)
y2_pred = decision_tree.predict(X2_new)[0]
# Output
dot_data = tree.export_graphviz(decision_tree, out_file=None, filled=True,
                                class_names=["no", "yes"], feature_names=X.columns)
dot_graph = graphviz.Source(dot_data)
dot_graph.view()
tree_text = tree.export_text(decision_tree, feature_names=list(X.columns))
outputs = [
    ["Dataset:", df],
    ["Encoded dataset:", df_encoded],
    ["Decision tree:", tree_text],
    ["Row 2:", df.iloc[2:3]],
    ["Play prediction:", y1_pred],
    ["Row 7:", df.iloc[7:8]],
    ["Play prediction:", y2_pred],
]
for out in outputs:
    print("\n", out[0], "\n", out[1])
"""
Dataset:
outlook temp humidity windy play
0 sunny hot high False no
1 sunny hot high True no
2 overcast hot high False yes
3 rainy mild high False yes
4 rainy cool normal False yes
5 rainy cool normal True no
6 overcast cool normal True yes
7 sunny mild high False no
8 sunny cool normal False yes
9 rainy mild normal False yes
10 sunny mild normal True yes
11 overcast mild high True yes
12 overcast hot normal False yes
13 rainy mild high True no
Encoded dataset:
outlook temp humidity windy play
0 2 1 0 0 0
1 2 1 0 1 0
2 0 1 0 0 1
3 1 2 0 0 1
4 1 0 1 0 1
5 1 0 1 1 0
6 0 0 1 1 1
7 2 2 0 0 0
8 2 0 1 0 1
9 1 2 1 0 1
10 2 2 1 1 1
11 0 2 0 1 1
12 0 1 1 0 1
13 1 2 0 1 0
Decision tree:
|--- outlook <= 0.50
| |--- class: 1
|--- outlook > 0.50
| |--- humidity <= 0.50
| | |--- outlook <= 1.50
| | | |--- windy <= 0.50
| | | | |--- class: 1
| | | |--- windy > 0.50
| | | | |--- class: 0
| | |--- outlook > 1.50
| | | |--- class: 0
| |--- humidity > 0.50
| | |--- windy <= 0.50
| | | |--- class: 1
| | |--- windy > 0.50
| | | |--- temp <= 1.00
| | | | |--- class: 0
| | | |--- temp > 1.00
| | | | |--- class: 1
Row 2:
outlook temp humidity windy play
2 overcast hot high False yes
Play prediction: 1
Row 7:
outlook temp humidity windy play
7 sunny mild high False no
Play prediction: 0
"""

Pruning
We can select only the most useful properties.
""" Decision Trees / Prunning (Breast Cancer)
Stop developing the tree before the limit of perfectly train data fit.
This will lower the accuracy for train data, but it will improve
the score on test data.
The max_depth in the classifier controls the maximum depth
of the decision tree.
Insteed of looking at the whole tree, we can select
only the most useful properties.
We can see that `worst radius` used in the top split, is by far
the most important feature.
"""
import numpy as np
import pandas as pd
import pathlib
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn import tree
import graphviz
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = load_breast_cancer()
# Training and test data
X1, X2, y1, y2 = train_test_split(
    df.data, df.target, stratify=df.target, random_state=42)
# Pre-pruning
dtree = DecisionTreeClassifier(max_depth=4, random_state=0)
dtree.fit(X1, y1)
# Predictions
X_new = X2[15]
y_pred = dtree.predict(X_new.reshape(1, -1))
y_pred_target = df['target_names'][y_pred]
score = dtree.score(X2, y2)  # accuracy on the held-out test data
# Get feature importances
importances = dtree.feature_importances_
impdf = pd.DataFrame({
    "Feature": df.feature_names,
    "Importance": importances
})
impdf_sorted = impdf.sort_values(
    by="Importance", ascending=False
)
top_features = impdf_sorted["Feature"].head(5)
# Output
n = df.data.shape[1]
plt.subplots_adjust(left=0.28)
plt.barh(np.arange(n), dtree.feature_importances_, align='center')
plt.yticks(np.arange(n), df.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n)
plt.show()
dot_data = tree.export_graphviz(dtree, out_file=None, filled=True,
                                class_names=df['target_names'])
dot_graph = graphviz.Source(dot_data)
dot_graph.view()
tree_text = tree.export_text(dtree)
outputs = [
    ["Feature names:", df['feature_names']],
    ["Dataset:", df['data']],
    ["Shape:", df['data'].shape],
    ["Target names:", df['target_names']],
    ["Target malignant:", list(df['target_names']).index('malignant')],
    ["Decision Tree:", tree_text],
    ["X_new:", X_new],
    ["Prediction:", y_pred],
    ["Prediction Target:", y_pred_target],
    ["Model accuracy score:", score],
    ["Top features:", top_features],
]
for out in outputs:
    print("\n", out[0], "\n ", out[1])
"""
Feature names:
[ 'mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension' ]
Dataset:
[ [1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
[2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
[1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
...
[1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
[2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
[7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02] ]
Shape:
(569, 30)
Target names:
['malignant' 'benign']
Target malignant:
0
Decision Tree:
|--- feature_20 <= 16.80
| |--- feature_27 <= 0.14
| | |--- feature_10 <= 1.05
| | | |--- feature_14 <= 0.00
| | | | |--- class: 1
| | | |--- feature_14 > 0.00
| | | | |--- class: 1
| | |--- feature_10 > 1.05
| | | |--- class: 0
| |--- feature_27 > 0.14
| | |--- feature_21 <= 25.62
| | | |--- feature_24 <= 0.18
| | | | |--- class: 1
| | | |--- feature_24 > 0.18
| | | | |--- class: 0
| | |--- feature_21 > 25.62
| | | |--- feature_28 <= 0.27
| | | | |--- class: 1
| | | |--- feature_28 > 0.27
| | | | |--- class: 0
|--- feature_20 > 16.80
| |--- feature_11 <= 0.47
| | |--- class: 1
| |--- feature_11 > 0.47
| | |--- feature_26 <= 0.19
| | | |--- feature_21 <= 30.98
| | | | |--- class: 1
| | | |--- feature_21 > 30.98
| | | | |--- class: 0
| | |--- feature_26 > 0.19
| | | |--- class: 0
X_new:
[ 9.683e+00 1.934e+01 6.105e+01 2.857e+02 8.491e-02 5.030e-02 2.337e-02
9.615e-03 1.580e-01 6.235e-02 2.957e-01 1.363e+00 2.054e+00 1.824e+01
7.440e-03 1.123e-02 2.337e-02 9.615e-03 2.203e-02 4.154e-03 1.093e+01
2.559e+01 6.910e+01 3.642e+02 1.199e-01 9.546e-02 9.350e-02 3.846e-02
2.552e-01 7.920e-02 ]
Prediction:
[1]
Prediction Target:
['benign']
Model accuracy score:
0.951048951048951
Top features:
20 worst radius
27 worst concave points
11 texture error
21 worst texture
26 worst concavity
"""


➥ Questions