# minte9 LearnRemember

### Create Vocabulary

In machine learning and natural language processing, a "Count Vectorizer" converts a collection of text documents into a matrix of token counts. The first step is to learn the vocabulary from the provided text data, that is, the set of unique tokens (usually words) that appear across all documents.

a = 'London Paris London'
b = 'Paris Paris London'

def create_vocabulary(texts):
    vocabulary = set()

    for t in texts:
        # Split each document into tokens (usually words)
        words = t.split()

        # Add every token to the set; duplicates are ignored automatically
        for w in words:
            vocabulary.add(w)

    return vocabulary

vocabulary = create_vocabulary([a, b])
print(vocabulary)

"""
{'Paris', 'London'}
"""

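Because a Python set is unordered, the printed order above can differ between runs. As a minimal sketch of one way to get a reproducible vocabulary, the tokens can be sorted and mapped to column indices. The helper name `build_vocabulary_index` is hypothetical, and the lowercasing step is an assumption added here to mirror CountVectorizer's default behaviour; it is not part of the code above.

```python
def build_vocabulary_index(texts):
    """Map each unique lowercase token to a column index, in alphabetical order."""
    tokens = set()
    for text in texts:
        # Lowercase and split on whitespace (a simplification of real tokenizers)
        tokens.update(text.lower().split())
    # Sorting makes the column order reproducible across runs
    return {token: i for i, token in enumerate(sorted(tokens))}

vocabulary_index = build_vocabulary_index(['London Paris London', 'Paris Paris London'])
print(vocabulary_index)  # {'london': 0, 'paris': 1}
```

A dictionary also makes the later lookup of a word's column a constant-time operation instead of a list search.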

### Count Matrix

CountVectorizer converts text data into a numerical format that ML models can work with. It builds a matrix in which each row represents a document and each column represents a unique word from the vocabulary. Each entry is the number of times that word appears in that document.

def fit_transform(texts):
    # Create a set of unique words (vocabulary)
    vocabulary = create_vocabulary(texts)

    # Convert the set to a list for indexing
    # (set order is arbitrary, so the column order can differ between runs)
    vocabulary = list(vocabulary)

    # Initialize an empty list to store the count vectors
    matrix = []

    # Iterate through each text in the input
    for t in texts:
        # Create a count vector initialized with zeros for each word
        count_vector = [0] * len(vocabulary)

        # Iterate through each word in the current text
        for word in t.split():
            # Find the index of the word in the vocabulary
            index = vocabulary.index(word)

            # Increment the count for this word in the count vector
            count_vector[index] += 1

        # One row per document
        matrix.append(count_vector)

    return matrix

# Sample text strings
a = 'London Paris London'
b = 'Paris Paris London'

# Get the frequency matrix
matrix = fit_transform([a, b])
print(matrix)

"""
[[1, 2], [2, 1]]
"""

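The same counting step can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the method used above: the function name `count_matrix` is hypothetical, and the vocabulary is sorted so that the column order is fixed (London first, then Paris), which is why the rows come out in a different column order than the output above.

```python
from collections import Counter

def count_matrix(texts):
    """Return (vocabulary, matrix): one row per text, one column per word."""
    # Count the tokens of each document once
    counts = [Counter(text.split()) for text in texts]
    # The union of all counted words, sorted for a fixed column order
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a document
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, matrix = count_matrix(['London Paris London', 'Paris Paris London'])
print(vocabulary)  # ['London', 'Paris']
print(matrix)      # [[2, 1], [1, 2]]
```

Counting each document once avoids calling `vocabulary.index(word)` for every token, which is a linear search through the list each time.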

### Scikit

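In scikit-learn, both steps are combined in the fit_transform method: fit learns the vocabulary and transform builds the count matrix. By default CountVectorizer lowercases the tokens and orders the columns alphabetically, so column 0 is london and column 1 is paris. The result is a sparse matrix, so printing it lists only the non-zero entries as (row, column) followed by the count. The example then compares the two documents using cosine similarity on their count vectors.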

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = 'London Paris London'
b = 'Paris Paris London'

cv = CountVectorizer()
matrix = cv.fit_transform([a, b])
print(matrix)

similarity_scores = cosine_similarity(matrix)
print(similarity_scores)

"""
(0, 0)    2
(0, 1)    1
(1, 0)    1
(1, 1)    2

[[1.  0.8]
 [0.8 1. ]]
"""

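To see where the 0.8 comes from, the cosine similarity can be computed by hand: it is the dot product of the two count vectors divided by the product of their lengths. The sketch below is a plain-Python illustration, not scikit-learn's implementation; the function name `cosine` is hypothetical, and it assumes both vectors are non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a_counts = [2, 1]  # 'London Paris London' -> london: 2, paris: 1
b_counts = [1, 2]  # 'Paris Paris London'  -> london: 1, paris: 2

# dot = 2*1 + 1*2 = 4, each length = sqrt(5), so the score is 4 / 5
score = cosine(a_counts, b_counts)
print(round(score, 2))  # 0.8
```

A score of 1.0 on the diagonal of the matrix above simply means each document is identical to itself.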
