Create Vocabulary
In machine learning and natural language processing, "count vectorizer" refers to the method of converting a collection of text documents into a matrix of token counts. The first step is to learn the vocabulary from the provided text data, that is, the set of all unique tokens across all documents.
a = 'London Paris London'
b = 'Paris Paris London'
def create_vocabulary(texts):
    vocabulary = set()
    for t in texts:
        # Split each document into tokens (here, whitespace-separated words)
        words = t.split()
        # Add every token to the set; duplicates are ignored automatically
        for w in words:
            vocabulary.add(w)
    return vocabulary
vocabulary = create_vocabulary([a, b])
print(vocabulary)
"""
{'Paris', 'London'}  (a set is unordered, so the order may differ between runs)
"""
Count Matrix
CountVectorizer converts the text data into a numerical format, which is essential for ML models. It creates a matrix where each row represents a document and each column represents a unique word. Each entry is the number of times that word appears in that document.
def fit_transform(texts):
    # Create a set of unique words (vocabulary)
    vocabulary = create_vocabulary(texts)
    # Sort the set into a list so that the column order is reproducible
    vocabulary = sorted(vocabulary)
    # Initialize an empty list to store the count vectors
    matrix = []
    # Iterate through each text in the input
    for t in texts:
        # Create a count vector initialized with zeros for each word
        count_vector = [0] * len(vocabulary)
        # Iterate through each word in the current text
        for word in t.split():
            # Find the index of the word in the vocabulary
            index = vocabulary.index(word)
            # Increment the count for this word in the count vector
            count_vector[index] += 1
        matrix.append(count_vector)
    return matrix

# Sample text strings
a = 'London Paris London'
b = 'Paris Paris London'
# Get the frequency matrix (columns: London, Paris)
matrix = fit_transform([a, b])
print(matrix)
"""
[[2, 1], [1, 2]]
"""
Scikit-learn
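Both steps (fit, which learns the vocabulary, and transform, which builds the count matrix) are combined in the fit_transform method of scikit-learn's CountVectorizer. The result is a sparse matrix, so printing it lists only the non-zero entries as (row, column) count pairs. CountVectorizer lowercases the text by default, so the learned vocabulary here is 'london' and 'paris'.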
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
a = 'London Paris London'
b = 'Paris Paris London'
cv = CountVectorizer()
matrix = cv.fit_transform([a, b])
print(matrix)
similarity_scores = cosine_similarity(matrix)
print(similarity_scores)
"""
(0, 0) 2
(0, 1) 1
(1, 0) 1
(1, 1) 2
[[1. 0.8]
[0.8 1. ]]
"""