Model Classifier
The model learns a hierarchy of if/else questions, leading to a decision.
""" Decision Tree / Classifier (Play Tennis)
Given certain values for each of the attributes, the learned decision tree
is able to give a clear answer if weather is suitable or not for tennis.
The model can make prediction only for known sample from dataset.
"""
import pathlib
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / 'data/play_tennis.csv')
# Encode labels
df_encoded = pd.DataFrame()
for col in df.columns:
    df_encoded[col] = LabelEncoder().fit_transform(df[col])
# Train data
X = df_encoded.drop(columns=["play"])
Y = df_encoded['play']
# Fitting the model
dtree_model = DecisionTreeClassifier()
dtree_model.fit(X, Y)
# Prediction
x_new = [1, 0, 1, 0]  # rainy, cool, normal, not windy -> expect 1 (yes)
x_new = pd.DataFrame([x_new], columns=X.columns)
y_pred = dtree_model.predict(x_new)[0]
print("Dataset:"); print(df, "\n")
print("Encoded:"); print(df_encoded, "\n")
print("Unknown:"); print(x_new, "\n")
print("Prediction:", y_pred)
"""
Dataset:
outlook temp humidity windy play
0 sunny hot high False no
1 sunny hot high True no
2 overcast hot high False yes
3 rainy mild high False yes
4 rainy cool normal False yes
5 rainy cool normal True no
6 overcast cool normal True yes
7 sunny mild high False no
8 sunny cool normal False yes
9 rainy mild normal False yes
10 sunny mild normal True yes
11 overcast mild high True yes
12 overcast hot normal False yes
13 rainy mild high True no
Encoded:
outlook temp humidity windy play
0 2 1 0 0 0
1 2 1 0 1 0
2 0 1 0 0 1
3 1 2 0 0 1
4 1 0 1 0 1
5 1 0 1 1 0
6 0 0 1 1 1
7 2 2 0 0 0
8 2 0 1 0 1
9 1 2 1 0 1
10 2 2 1 1 1
11 0 2 0 1 1
12 0 1 1 0 1
13 1 2 0 1 0
Unknown:
outlook temp humidity windy
0 1 0 1 0
Prediction: 1
"""
Information Gain
The attribute with the highest information gain (IG) is given the highest priority in the tree.
""" Decision Tree / Classifier (Play Tennis)
Given certain values for each of the attributes, the learned decision tree
is able to give a clear answer if weather is suitable or not for tennis.
We have 4 features (outlook, temperature, humidity, windy) and the one target (play).
Information gain is used to identify which attribute provides more information.
The attribute with the highest IG is given the higher priority in the tree.
For example, by calculating the IG for `humidity` and `wind`, we would find that
`humidity` plays a more important role, so it is consider as a better classifier.
"""
import numpy as np
import pandas as pd
import pathlib
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import graphviz
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / 'data/play_tennis.csv')
# Encode dataset (sunny=2, rainy=1 ...)
df_encoded = pd.DataFrame()
for col in df.columns:
    df_encoded[col] = LabelEncoder().fit_transform(df[col])
# Train data
X = df_encoded.drop(['play'], axis=1) # remove column labeled `play`
y = df_encoded['play']
# Fitting the model
decision_tree = DecisionTreeClassifier(criterion='entropy')
decision_tree.fit(X, y)
# Predictions
X1_new = X.iloc[2:3]  # row 2 (overcast, hot, high, not windy)
y1_pred = decision_tree.predict(X1_new)[0]
X2_new = [2, 2, 0, 0]  # row 7 (sunny, mild, high, not windy)
X2_new = pd.DataFrame([X2_new], columns=X.columns)
y2_pred = decision_tree.predict(X2_new)[0]
# Output
dot_data = tree.export_graphviz(decision_tree, out_file=None, filled=True,
                                class_names=["no", "yes"], feature_names=X.columns)
dot_graph = graphviz.Source(dot_data)
dot_graph.view()
tree_text = tree.export_text(decision_tree, feature_names=list(X.columns))
outputs = [
    ["Dataset:", df],
    ["Encoded dataset:", df_encoded],
    ["Decision tree:", tree_text],
    ["Row 2:", df.iloc[2:3]],
    ["Play prediction:", y1_pred],
    ["Row 7:", df.iloc[7:8]],
    ["Play prediction:", y2_pred],
]
for out in outputs:
    print("\n", out[0], "\n", out[1])
"""
Dataset:
outlook temp humidity windy play
0 sunny hot high False no
1 sunny hot high True no
2 overcast hot high False yes
3 rainy mild high False yes
4 rainy cool normal False yes
5 rainy cool normal True no
6 overcast cool normal True yes
7 sunny mild high False no
8 sunny cool normal False yes
9 rainy mild normal False yes
10 sunny mild normal True yes
11 overcast mild high True yes
12 overcast hot normal False yes
13 rainy mild high True no
Encoded dataset:
outlook temp humidity windy play
0 2 1 0 0 0
1 2 1 0 1 0
2 0 1 0 0 1
3 1 2 0 0 1
4 1 0 1 0 1
5 1 0 1 1 0
6 0 0 1 1 1
7 2 2 0 0 0
8 2 0 1 0 1
9 1 2 1 0 1
10 2 2 1 1 1
11 0 2 0 1 1
12 0 1 1 0 1
13 1 2 0 1 0
Decision tree:
|--- outlook <= 0.50
| |--- class: 1
|--- outlook > 0.50
| |--- humidity <= 0.50
| | |--- outlook <= 1.50
| | | |--- windy <= 0.50
| | | | |--- class: 1
| | | |--- windy > 0.50
| | | | |--- class: 0
| | |--- outlook > 1.50
| | | |--- class: 0
| |--- humidity > 0.50
| | |--- windy <= 0.50
| | | |--- class: 1
| | |--- windy > 0.50
| | | |--- temp <= 1.00
| | | | |--- class: 0
| | | |--- temp > 1.00
| | | | |--- class: 1
Row 2:
outlook temp humidity windy play
2 overcast hot high False yes
Play prediction: 1
Row 7:
outlook temp humidity windy play
7 sunny mild high False no
Play prediction: 0
"""

Pruning
We can select only the most useful properties.
""" Decision Trees / Prunning (Breast Cancer)
Stop developing the tree before the limit of perfectly train data fit.
This will lower the accuracy for train data, but it will improve
the score on test data.
The max_depth in the classifier controls the maximum depth
of the decision tree.
Insteed of looking at the whole tree, we can select
only the most useful properties.
We can see that `worst radius` used in the top split, is by far
the most important feature.
"""
import numpy as np
import pandas as pd
import pathlib
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn import tree
import graphviz
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = load_breast_cancer()
# Training and test data
X1, X2, y1, y2 = train_test_split(
    df.data, df.target, stratify=df.target, random_state=42)
# Pre-pruning
dtree = DecisionTreeClassifier(max_depth=4, random_state=0)
dtree.fit(X1, y1)
# Predictions
X_new = X2[15]
y_pred = dtree.predict(X_new.reshape(1, -1))
y_pred_target = df['target_names'][y_pred]
score = dtree.score(X2, y2)  # accuracy on the held-out test data
# Get feature importances
importances = dtree.feature_importances_
impdf = pd.DataFrame({
    "Feature": df.feature_names,
    "Importance": importances
})
impdf_sorted = impdf.sort_values(
    by="Importance", ascending=False
)
top_features = impdf_sorted["Feature"].head(5)
# Output
n = df.data.shape[1]
plt.subplots_adjust(left=0.28)
plt.barh(np.arange(n), dtree.feature_importances_, align='center')
plt.yticks(np.arange(n), df.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n)
plt.show()
dot_data = tree.export_graphviz(dtree, out_file=None, filled=True,
                                class_names=df['target_names'])
dot_graph = graphviz.Source(dot_data)
dot_graph.view()
tree_text = tree.export_text(dtree)
outputs = [
    ["Feature names:", df['feature_names']],
    ["Dataset:", df['data']],
    ["Shape:", df['data'].shape],
    ["Target names:", df['target_names']],
    ["Target malignant:", list(df['target_names']).index('malignant')],
    ["Decision Tree:", tree_text],
    ["X_new:", X_new],
    ["Prediction:", y_pred],
    ["Prediction Target:", y_pred_target],
    ["Model accuracy score:", score],
    ["Top features:", top_features],
]
for out in outputs:
    print("\n", out[0], "\n ", out[1])
"""
Feature names:
[ 'mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension' ]
Dataset:
[ [1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
[2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
[1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
...
[1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
[2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
[7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02] ]
Shape:
(569, 30)
Target names:
['malignant' 'benign']
Target malignant:
0
Decision Tree:
|--- feature_20 <= 16.80
| |--- feature_27 <= 0.14
| | |--- feature_10 <= 1.05
| | | |--- feature_14 <= 0.00
| | | | |--- class: 1
| | | |--- feature_14 > 0.00
| | | | |--- class: 1
| | |--- feature_10 > 1.05
| | | |--- class: 0
| |--- feature_27 > 0.14
| | |--- feature_21 <= 25.62
| | | |--- feature_24 <= 0.18
| | | | |--- class: 1
| | | |--- feature_24 > 0.18
| | | | |--- class: 0
| | |--- feature_21 > 25.62
| | | |--- feature_28 <= 0.27
| | | | |--- class: 1
| | | |--- feature_28 > 0.27
| | | | |--- class: 0
|--- feature_20 > 16.80
| |--- feature_11 <= 0.47
| | |--- class: 1
| |--- feature_11 > 0.47
| | |--- feature_26 <= 0.19
| | | |--- feature_21 <= 30.98
| | | | |--- class: 1
| | | |--- feature_21 > 30.98
| | | | |--- class: 0
| | |--- feature_26 > 0.19
| | | |--- class: 0
X_new:
[ 9.683e+00 1.934e+01 6.105e+01 2.857e+02 8.491e-02 5.030e-02 2.337e-02
9.615e-03 1.580e-01 6.235e-02 2.957e-01 1.363e+00 2.054e+00 1.824e+01
7.440e-03 1.123e-02 2.337e-02 9.615e-03 2.203e-02 4.154e-03 1.093e+01
2.559e+01 6.910e+01 3.642e+02 1.199e-01 9.546e-02 9.350e-02 3.846e-02
2.552e-01 7.920e-02 ]
Prediction:
[1]
Prediction Target:
['benign']
Model accuracy score:
0.951048951048951
Top features:
20 worst radius
27 worst concave points
11 texture error
21 worst texture
26 worst concavity
"""


➥ Questions