Algorithms / Decision Tree
What is Entropy? What is a Probability Distribution? What is the Gini Index? What is Information Gain?
Entropy
Entropy is a measure of how disordered a collection is.
""" Decision Tree / Entropy
Entropy tells us how disordered a collection of data is.
The more impure a feature is, the higher its entropy.
A probability distribution gives the relative frequency of each unique value.
"""
import pandas as pd
import numpy as np
# Dataset
A = ['apple']*1 + ['orange']*2 + ['banana']*2
A = pd.Series(A)
B = ['apple']*5 + ['orange']*2 + ['banana']*0
B = pd.Series(B)
# Probability distribution (by hand or with pandas)
P = [2/5, 2/5, 1/5]  # distribution of A, computed by hand
PA = A.value_counts(normalize=True)
PB = B.value_counts(normalize=True)
# ------------------------------------
# Entropy (Shannon model)
EA = -1 * np.sum(PA * np.log2(PA))
EB = -1 * np.sum(PB * np.log2(PB))
assert EB < EA
# ------------------------------------
# Output
outputs = [
    ["A =", A.values],
    ["B =", B.values],
    ["PA =", PA.values],
    ["PB =", PB.values],
    ["EA =", EA],
    ["EB =", EB],
]
for v in outputs:
    print(v[0], v[1])
"""
A = ['apple' 'orange' 'orange' 'banana' 'banana']
B = ['apple' 'apple' 'apple' 'apple' 'apple' 'orange' 'orange']
PA = [0.4 0.4 0.2]
PB = [0.71428571 0.28571429]
EA = 1.5219280948873621
EB = 0.863120568566631
"""
$$ H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) $$
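As a quick cross-check (a minimal sketch, assuming SciPy is installed; the example above does not need it), `scipy.stats.entropy` with `base=2` reproduces the same values from the probability distributions:
""" Decision Tree / Entropy (cross-check with SciPy)
A minimal sketch, assuming SciPy is available; not required by the example above.
"""
import pandas as pd
from scipy.stats import entropy
A = pd.Series(['apple']*1 + ['orange']*2 + ['banana']*2)
B = pd.Series(['apple']*5 + ['orange']*2)
PA = A.value_counts(normalize=True)
PB = B.value_counts(normalize=True)
# entropy() with base=2 applies the same Shannon formula as above
print(entropy(PA, base=2))  # ~1.522, matches EA
print(entropy(PB, base=2))  # ~0.863, matches EB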
Gini index
Gini Index is a measure of impurity or diversity within a set of elements.
""" Decision Tree / Gini Index
Both entropy and Gini index can be used as impurity measures
for decision tree algorithms.
The Gini index is bounded between 0 and 1 (at most 1 - 1/k for k classes),
which makes it easy to compare Gini values across different features.
The Gini index is often preferred for its computational efficiency (no logarithm),
while entropy can be more sensitive to changes in the class probabilities.
"""
import pandas as pd
import numpy as np
# Dataset
A = ['apple']*1 + ['orange']*2 + ['banana']*2
A = pd.Series(A)
B = ['apple']*5 + ['orange']*2 + ['banana']*0
B = pd.Series(B)
# Probability distribution
PA = A.value_counts(normalize=True)
PB = B.value_counts(normalize=True)
# ------------------------------------
# Gini Index
giniA = 1 - np.sum(np.square(PA)) # Look Here
giniB = 1 - np.sum(np.square(PB))
assert giniB < giniA
# ------------------------------------
outputs = [
    ["A =", A.values],
    ["B =", B.values],
    ["PA =", PA.values],
    ["PB =", PB.values],
    ["giniA =", giniA],
    ["giniB =", giniB],
]
for v in outputs:
    print(v[0], v[1])
"""
A = ['apple' 'orange' 'orange' 'banana' 'banana']
B = ['apple' 'apple' 'apple' 'apple' 'apple' 'orange' 'orange']
PA = [0.4 0.4 0.2]
PB = [0.71428571 0.28571429]
giniA = 0.6399999999999999
giniB = 0.40816326530612246
"""
$$ \text{Gini}(X) = 1 - \sum_{x \in X} P(x)^2 $$
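To see how the two impurity measures compare, the sketch below (an added illustration, not part of the original example) evaluates both for a two-class distribution [p, 1-p]: the Gini index peaks at 0.5 and the entropy at 1.0 when the classes are perfectly mixed, and both drop to 0 for a pure set.
""" Decision Tree / Gini vs Entropy (illustrative sketch)
For a binary distribution [p, 1-p], Gini peaks at 0.5 and entropy at 1.0
when p = 0.5; both are 0 for a pure set (p = 0 or p = 1).
"""
import numpy as np
def gini(P):
    return 1 - np.sum(np.square(np.asarray(P)))
def shannon(P):
    P = np.asarray(P)
    P = P[P > 0]  # ignore zero-probability classes
    return -np.sum(P * np.log2(P))
for p in [0.1, 0.3, 0.5]:  # p=0.5 -> gini=0.500, entropy=1.000
    P = [p, 1 - p]
    print(f"p={p:.1f}  gini={gini(P):.3f}  entropy={shannon(P):.3f}")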
Information gain
Information gain measures the reduction in entropy obtained by splitting a dataset on an attribute.
""" Decision Trees / Info Gain
Information gain is a measure of the reduction in entropy.
As the entropy of an attribute increases, the information gain decreases.
We can find which attribute is the most useful or informative
and split the dataset based on that attribute.
Example for the windy attribute (8 rows with windy=False/weak, 6 with windy=True/strong):
IG = H - (8/14)*H_weak - (6/14)*H_strong
   = 0.940 - (8/14)*0.811 - (6/14)*1.00 = 0.048
Machine epsilon is the upper bound on the relative error due to floating-point rounding.
Here it is added to denominators and log arguments to avoid division by zero and log(0).
"""
import numpy as np
import pandas as pd
import pathlib
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / 'data/play_tennis.csv')
# Train data
X = df.drop(['play'], axis=1)
y = df['play']
# Total entropy
def dataset_entropy():
    E = 0
    N = df['play'].value_counts()  # yes: 9, no: 5
    for v in df['play'].unique():  # yes/no
        P = N[v]/len(df['play'])   # probability
        E += -P*np.log2(P)
    return E
# Entropy for each attribute
def attribute_entropy(attr):
    E = 0
    eps = np.finfo(float).eps   # machine epsilon for the float
    targets = df.play.unique()  # class labels: yes/no
    values = df[attr].unique()  # attribute values, e.g. hot/mild/cool
    for v in values:
        ent = 0
        for t in targets:
            num = len(df[(df[attr] == v) & (df.play == t)])  # rows with this value and class
            den = len(df[df[attr] == v])                     # rows with this value
            fraction = num/(den + eps)
            ent += -fraction*np.log2(fraction + eps)  # entropy for one attribute value
        E += -(den/len(df))*ent  # weighted sum of all entropies (abs() below flips the sign)
    return abs(E)
# -------------------------------
# Attribute names
attributes = df.keys()[:-1]
# Entropy (for each attribute)
E = {}
for k in attributes:
    E[k] = attribute_entropy(k)
# Information gain (for each attribute)
IG = {}
for k in E:
    IG[k] = dataset_entropy() - E[k]
# One-liners (equivalent dict comprehensions)
E = {k: attribute_entropy(k) for k in attributes}
IG = {k: (dataset_entropy() - E[k]) for k in E}
assert E['outlook'] < E['humidity']
assert IG['outlook'] > IG['humidity'] # Look Here
# -------------------------------
outputs = [
    ["Dataset:", df],
    ["Describe:", df.describe()],
    ["Entropy:", dataset_entropy()],
    ["Entropy for each attribute:", E],
    ["Information gains:", IG],
]
for v in outputs:
    print("\n", v[0], "\n ", v[1])
"""
Dataset:
    outlook temp humidity windy play
 0    sunny  hot     high False   no
 1    sunny  hot     high  True   no
 2 overcast  hot     high False  yes
 3    rainy mild     high False  yes
 4    rainy cool   normal False  yes
 5    rainy cool   normal  True   no
 6 overcast cool   normal  True  yes
 7    sunny mild     high False   no
 8    sunny cool   normal False  yes
 9    rainy mild   normal False  yes
10    sunny mild   normal  True  yes
11 overcast mild     high  True  yes
12 overcast  hot   normal False  yes
13    rainy mild     high  True   no
Describe:
       outlook temp humidity windy play
count       14   14       14    14   14
unique       3    3        2     2    2
top      sunny mild     high False  yes
freq         5    6        7     8    9
Entropy:
0.9402859586706311
Entropy for each attribute:
{ 'outlook': 0.6935361388961914,
'temp': 0.9110633930116756,
'humidity': 0.7884504573082889,
'windy': 0.892158928262361 }
Information gains:
{ 'outlook': 0.24674981977443977,
'temp': 0.029222565658955535,
'humidity': 0.15183550136234225,
'windy': 0.048127030408270155 }
"""
$$ \text{IG}(X, A) = \text{H}(X) - \sum_{v \in \text{Values}(A)} \frac{|X_v|}{|X|} \cdot \text{H}(X_v) $$
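As a quick numeric check of the worked example in the docstring above (the windy attribute splits the 14 rows into 8 with windy=False and 6 with windy=True), the sketch below recomputes the information gain directly from the class counts read off the dataset printout:
""" Decision Trees / Info Gain (numeric check of the worked example)
Class counts taken from the dataset above:
whole dataset: 9 yes / 5 no, windy=False: 6 yes / 2 no, windy=True: 3 yes / 3 no.
"""
import numpy as np
def H(counts):
    P = np.array(counts) / np.sum(counts)
    return -np.sum(P * np.log2(P))
H_total = H([9, 5])  # ~0.940
H_false = H([6, 2])  # ~0.811 (the "weak" branch)
H_true = H([3, 3])   # 1.000 (the "strong" branch)
IG_windy = H_total - (8/14)*H_false - (6/14)*H_true
print(round(IG_windy, 3))  # ~0.048, matching IG['windy'] above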
Algorithm
We build a decision tree recursively, always splitting on the attribute with the highest information gain.
""" Decision Trees / ID3 Algorithm
Iterative Dichotomiser 3 (ID3) is a classification algorithm that follows a
greedy approach: it builds the decision tree top-down, always splitting on the
attribute with the highest information gain.
1. Calculate the entropy of the dataset
2. For each attribute:
   Calculate the entropy for each of its categorical values
   Calculate the information gain for the attribute
3. Select the attribute with the maximum information gain and split on it
4. Repeat on each branch until the subsets are pure
The `outlook` attribute has the highest information gain (0.247), so we select it as
the root node for the first level of splitting.
"""
import numpy as np
import pandas as pd
import pathlib
# Dataset
DIR = pathlib.Path(__file__).resolve().parent
df = pd.read_csv(DIR / 'data/play_tennis.csv')
# Train data
X = df.drop(['play'], axis=1)
y = df['play']
# Total entropy (for current dataframe)
def dataset_entropy(df):
    E = 0
    N = df['play'].value_counts()  # yes: 9, no: 5
    values = df['play'].unique()
    for v in values:              # yes/no
        P = N[v]/len(df['play'])  # probability
        E += -P*np.log2(P)
    return E
# Entropy for each attribute
def attribute_entropy(df, attr):
    E = 0
    eps = np.finfo(float).eps   # machine epsilon for the float
    targets = df.play.unique()  # class labels: yes/no
    values = df[attr].unique()  # attribute values, e.g. hot/mild/cool
    for v in values:
        ent = 0
        for t in targets:
            num = len(df[(df[attr] == v) & (df.play == t)])  # rows with this value and class
            den = len(df[df[attr] == v])                     # rows with this value
            fraction = num/(den + eps)
            ent += -fraction*np.log2(fraction + eps)  # entropy for one attribute value
        E += -(den/len(df))*ent  # weighted sum of all entropies (abs() below flips the sign)
    return abs(E)
# ---------------------------------------------------
# Find attribute with maximum information gain
def find_winner(df):
    IG = {}
    attributes = df.keys()[:-1]
    # Loop over the attributes in the dataframe and compute the info gains
    for attr in attributes:
        IG[attr] = dataset_entropy(df) - attribute_entropy(df, attr)
    winner = attributes[np.argmax(list(IG.values()))]  # attribute with maximum info gain
    return winner
# ---------------------------------------------------
# Construct the decision tree (dictionary)
def buildTree(df):
    tree = {}
    # Target column
    Class = df.keys()[-1]  # play
    # Attribute with maximum info gain
    node = find_winner(df)  # e.g. outlook
    tree[node] = {}
    # Distinct values of that attribute
    values = np.unique(df[node])  # e.g. overcast/rainy/sunny
    # Loop through the attribute values
    for value in values:
        subtable = df[df[node] == value].reset_index(drop=True)
        attr_values, counts = np.unique(subtable[Class], return_counts=True)
        if len(counts) == 1:  # pure subset -> leaf
            tree[node][value] = attr_values[0]
        else:
            subtable = subtable.drop(node, axis=1)
            tree[node][value] = buildTree(subtable)  # Recursive case
    return tree
decision_tree = buildTree(df)
# Print dictionary tree (recursion in case of subtrees)
def print_tree(tree, attr=None, i=0):
    if not attr:
        attr = next(iter(tree))  # attribute at the current tree node
    for key, subval in tree[attr].items():
        if isinstance(subval, str):  # Base case: leaf
            print(i*" ", attr, "=", key, ":", subval)
            continue
        print(i*" ", attr, "=", key, ":")
        print_tree(subval, i=i+1)  # Recursive case: subtree
    return
# Predict an unknown sample (only for attribute values seen in the train dataset)
def predict(X, tree):
    key = next(iter(tree))
    val = X[key]
    subval = tree[key][val]
    if isinstance(subval, str):  # Base case: leaf
        return subval
    subval = predict(X, subval)  # Recursive case: subtree
    return subval
print(decision_tree, "\n")
print_tree(decision_tree)
# Example usage
x = {'outlook': 'sunny', 'temp': 'mild', 'humidity': 'high', 'windy': False}
y = predict(x, decision_tree)
print("\nAttributes:", x)
print("Prediction:", y)
# Example usage 2
x = {'outlook': 'rainy', 'temp': 'mild', 'humidity': 'normal', 'windy': False}
y = predict(x, decision_tree)
print("\nAttributes:", x)
print("Prediction:", y)
"""
{'outlook': {'overcast': 'yes', 'rainy': {'windy': ...
outlook = overcast: yes
outlook = rainy:
  windy = False: yes
  windy = True: no
outlook = sunny:
  humidity = high: no
  humidity = normal: yes
Attributes: {
  'outlook': 'sunny',
  'temp': 'mild',
  'humidity': 'high',
  'windy': False}
Prediction: no
Attributes: {
  'outlook': 'rainy',
  'temp': 'mild',
  'humidity': 'normal',
  'windy': False}
Prediction: yes
"""