K Neighbors

In a KNN classifier, K is the number of nearest neighbors used to vote on the label of a new point.
 
""" KNN classifier / Features and label

We provide training dataset points (features) and label (target).
Next, we train the model using KNN classifier with k=3 (nearest neighbors).
Finally, we are able now to predict the label for a new (unknown) data point.   
"""

from sklearn.neighbors import KNeighborsClassifier

# Training dataset
X = [[0,0], 
     [1,1], 
     [2,2], 
     [3,3]]    
y = [0, 1, 0, 1]

# Train the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Make predictions
x_unknown = [1,2]        
y_pred = knn.predict([x_unknown])  

print("New point: x_unknown =", x_unknown)
print("Predicted label: y_pred =", y_pred)

"""
    New point: x_unknown = [1, 2]
    Predicted label: y_pred = [0]
"""

Data Frame

Transform a dataset into a DataFrame with the pandas library.
 
"""KNN classifier / Fruits

Dataset contains heights, widths and labels (fruit name).
The algorithm teach a model to map any combination in order to make predictions.
We use Pandas library to transform a json dataset into a DataFrame.
"""

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

# Training dataset
data = {

  'height': [
    3.91, 7.09, 10.48, 9.21, 7.95, 7.62, 7.95, 4.69, 7.50, 7.11, 
    4.15, 7.29, 8.49, 7.44, 7.86, 3.93, 4.40, 5.5, 8.10, 8.69
  ], 

  'width': [
     5.76, 7.69, 7.32, 7.20, 5.90, 7.51, 5.32, 6.19, 5.99, 7.02, 
     5.60, 8.38, 6.52, 7.89, 7.60, 6.12, 5.90, 4.5, 6.15, 5.82
  ],
  
  'fruit': [
    'Mandarin', 'Apple', 'Lemon', 'Lemon', 'Lemon', 'Apple', 'Mandarin', 
    'Mandarin', 'Lemon', 'Apple', 'Mandarin', 'Apple', 'Lemon', 'Apple', 
    'Apple', 'Apple', 'Mandarin', 'Lemon', 'Lemon', 'Lemon'
  ]
} 
# Transform dataset
df = pd.DataFrame(data) 
df = df.sort_values(by=['fruit', 'width', 'height'])

X = df[['height', 'width']].values
y = df.fruit.values

# Train the model
knn = KNeighborsClassifier(n_neighbors=3) 
knn.fit(X, y)

# Make predictions
new_item  = [9, 3]
new_items = [[9, 3], [4, 5], [2, 5], [8, 9], [5, 7]]

prediction  = knn.predict([new_item])
predictions = knn.predict(new_items)

print("Dataframe(order by fruit): \n", df, "\n")
print("Prediction label for new item: \n", new_item, "\n", prediction, "\n")
print("Precition labels for new items: \n", new_items, "\n", predictions, "\n") 

"""
    DataFrame (ordered by fruit): 
        height  width     fruit
    15    3.93   6.12     Apple
    9     7.11   7.02     Apple
    5     7.62   7.51     Apple
    14    7.86   7.60     Apple
    1     7.09   7.69     Apple
    13    7.44   7.89     Apple
    11    7.29   8.38     Apple
    17    5.50   4.50     Lemon
    19    8.69   5.82     Lemon
    4     7.95   5.90     Lemon
    8     7.50   5.99     Lemon
    18    8.10   6.15     Lemon
    12    8.49   6.52     Lemon
    3     9.21   7.20     Lemon
    2    10.48   7.32     Lemon
    6     7.95   5.32  Mandarin
    10    4.15   5.60  Mandarin
    0     3.91   5.76  Mandarin
    16    4.40   5.90  Mandarin
    7     4.69   6.19  Mandarin 

    Prediction label for new item: 
     [9, 3] 
     ['Lemon'] 

    Prediction labels for new items: 
     [[9, 3], [4, 5], [2, 5], [8, 9], [5, 7]] 
     ['Lemon' 'Mandarin' 'Mandarin' 'Apple' 'Mandarin'] 
"""

Accuracy Score

Evaluate the model score on the training and test datasets (for k=3)
 
"""KNN classifier Evaluation / Fruit example

We split the dataset in two datasets: training and test
The we evaluate the model on both datasets.

The score is the difference between actual and predicted labels
1.0 means the model correctly predicted all (100%)
"""

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Training dataset
D1 = pd.DataFrame({

  'height': [
    3.91, 7.09, 10.48, 9.21, 7.95, 7.62, 7.95, 4.69, 7.50, 7.11, 
    4.15, 7.29, 8.49, 7.44, 7.86, 3.93, 4.40, 5.5, 8.10, 8.69
  ], 

  'width': [
     5.76, 7.69, 7.32, 7.20, 5.90, 7.51, 5.32, 6.19, 5.99, 7.02, 
     5.60, 8.38, 6.52, 7.89, 7.60, 6.12, 5.90, 4.5, 6.15, 5.82
  ],
  
  'fruit': [
    'Mandarin', 'Apple', 'Lemon', 'Lemon', 'Lemon', 'Apple', 'Mandarin', 
    'Mandarin', 'Lemon', 'Apple', 'Mandarin', 'Apple', 'Lemon', 'Apple', 
    'Apple', 'Apple', 'Mandarin', 'Lemon', 'Lemon', 'Lemon'
  ]
})

# Test dataset
D2 = pd.DataFrame({
    'height': [4, 4.47, 6.49, 7.51, 8.34],
    'width':  [6.5, 7.13, 7, 5.01, 4.23],
    'fruit':  ['Mandarin', 'Mandarin', 'Apple', 'Lemon', 'Lemon']
})

# Features and labels
X1 = D1[['height', 'width']].values
y1 = D1.fruit.values
X2 = D2[['height', 'width']].values
y2 = D2.fruit.values

# Train the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X1, y1)

# Evaluate the model
predictions1 = knn.predict(X1)
predictions2 = knn.predict(X2)

score1 = metrics.accuracy_score(y1, predictions1)
score2 = metrics.accuracy_score(y2, predictions2)

print("Model score with training dataset:", score1 * 100)
print("Model score with test dataset:", score2 * 100)

"""
  Model score with training dataset: 85.0
  Model score with test dataset:     100.0
"""

Score Graph

Models with k between 3 and 8 score 100% on the test set.
 
"""KNN classifier Score Graph / Fruit example

Models with k between 3 and 8 perform optimally on the test set.
In that range, the model strikes a balance between overfitting and underfitting.
"""

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import matplotlib.pyplot as plt

# Training dataset
D1 = pd.DataFrame({

  'height': [
    3.91, 7.09, 10.48, 9.21, 7.95, 7.62, 7.95, 4.69, 7.50, 7.11, 
    4.15, 7.29, 8.49, 7.44, 7.86, 3.93, 4.40, 5.5, 8.10, 8.69
  ], 

  'width': [
     5.76, 7.69, 7.32, 7.20, 5.90, 7.51, 5.32, 6.19, 5.99, 7.02, 
     5.60, 8.38, 6.52, 7.89, 7.60, 6.12, 5.90, 4.5, 6.15, 5.82
  ],
  
  'fruit': [
    'Mandarin', 'Apple', 'Lemon', 'Lemon', 'Lemon', 'Apple', 'Mandarin', 
    'Mandarin', 'Lemon', 'Apple', 'Mandarin', 'Apple', 'Lemon', 'Apple', 
    'Apple', 'Apple', 'Mandarin', 'Lemon', 'Lemon', 'Lemon'
  ]
})

# Test dataset
D2 = pd.DataFrame({
    'height': [4, 4.47, 6.49, 7.51, 8.34],
    'width': [6.5, 7.13, 7, 5.01, 4.23],
    'fruit': ['Mandarin', 'Mandarin', 'Apple', 'Lemon', 'Lemon']
})

# Features and labels
X1 = D1[['height', 'width']].values
y1 = D1.fruit.values

X2 = D2[['height', 'width']].values
y2 = D2.fruit.values

# Initialize graph params
k = []
score1 = []
score2 = []

# Evaluate the score for k = 1 .. number of training samples
for i in range(len(X1)):
    _k = i+1
    
    clf = KNeighborsClassifier(n_neighbors = _k)
    clf.fit(X1, y1)

    _score1 = metrics.accuracy_score(y1, clf.predict(X1))
    _score2 = metrics.accuracy_score(y2, clf.predict(X2))

    k.append(_k)
    score1.append(_score1 * 100)
    score2.append(_score2 * 100)
    
    print(f'k={_k} | score1: {score1[i]} | score2: {score2[i]}')

# Plot train score
plt.scatter(k, score1)                  # data points
plt.plot(k, score1, '-', label='train') # connecting line

# Plot test score
plt.scatter(k, score2)
plt.plot(k, score2, '-', label='test')

# Plot configurations
# Reverse the x-axis so model complexity increases to the right (smaller k = more complex)
plt.axis([max(k), min(k)+1, 0, 100])
plt.xlabel('number of nearest neighbours (k)', size = 13)
plt.ylabel('accuracy score', size = 13)
plt.title('Model Performance vs Complexity', size = 20)
plt.legend()

# Output
plt.show()

"""
  k=1  | score1: 100.0 | score2: 40.0
  k=2  | score1: 95.0 | score2: 60.0
  k=3  | score1: 85.0 | score2: 100.0
  k=4  | score1: 85.0 | score2: 100.0
  k=5  | score1: 85.0 | score2: 100.0
  k=6  | score1: 85.0 | score2: 100.0
  k=7  | score1: 85.0 | score2: 100.0
  k=8  | score1: 85.0 | score2: 100.0
  k=9  | score1: 85.0 | score2: 80.0
  k=10 | score1: 85.0 | score2: 60.0
  k=11 | score1: 80.0 | score2: 60.0
  k=12 | score1: 90.0 | score2: 60.0
  k=13 | score1: 65.0 | score2: 60.0
  k=14 | score1: 55.00000000000001 | score2: 60.0
  k=15 | score1: 55.00000000000001 | score2: 60.0
  k=16 | score1: 45.0 | score2: 60.0
  k=17 | score1: 50.0 | score2: 60.0
  k=18 | score1: 50.0 | score2: 60.0
  k=19 | score1: 40.0 | score2: 40.0
  k=20 | score1: 40.0 | score2: 40.0
"""

Boundaries

Decision boundaries of the KNN classifier on a graph (optimal fit for k=5).
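
A minimal sketch of how such a boundary plot can be produced with matplotlib, assuming the fruit training arrays X1 and y1 from the previous examples are still in scope (the grid range and step are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Fit the model with the optimal k found above
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X1, y1)

# Map string labels to integers so contourf can color the regions
classes = np.unique(y1)
to_num = {c: i for i, c in enumerate(classes)}

# Predict a label for every point of a grid covering the feature space
xx, yy = np.meshgrid(np.arange(2, 12, 0.05), np.arange(3, 10, 0.05))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = np.array([to_num[c] for c in knn.predict(grid)]).reshape(xx.shape)

# Color the decision regions and overlay the training points
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X1[:, 0], X1[:, 1], c=[to_num[c] for c in y1], edgecolor='k')
plt.xlabel('height')
plt.ylabel('width')
plt.title('KNN decision boundaries (k=5)')
plt.show()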



