Create Vocabulary
In machine learning and natural language processing, "Count Vectorizer" refers to the method of converting a collection of text documents into a matrix of token counts.
The first step is to learn the vocabulary (the unique words) from the provided text data: we build a set of all unique tokens across all documents.
a = 'London Paris London'
b = 'Paris Paris London'

def create_vocabulary(texts):
    # Collect every unique token that appears in any document
    vocabulary = set()
    for t in texts:
        words = t.split()
        for w in words:
            vocabulary.add(w)
    return vocabulary

vocabulary = create_vocabulary([a, b])
print(vocabulary)  # {'London', 'Paris'} (set print order may vary)
Count Matrix
CountVectorizer converts the text data into a numerical format, which is essential for ML models.
It creates a matrix where each row represents a document and each column represents a unique word.
The numbers in the matrix are counts of how many times each word appears in each document.
def fit_transform(texts):
    # Sort the vocabulary so the column order is deterministic
    vocabulary = sorted(create_vocabulary(texts))
    matrix = []
    for t in texts:
        count_vector = [0] * len(vocabulary)
        for word in t.split():
            index = vocabulary.index(word)
            count_vector[index] += 1
        matrix.append(count_vector)
    return matrix

a = 'London Paris London'
b = 'Paris Paris London'
matrix = fit_transform([a, b])
print(matrix)  # [[2, 1], [1, 2]] with columns ['London', 'Paris']
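
To see what this count matrix is good for, here is a minimal sketch of the cosine similarity that the scikit-learn section below computes with cosine_similarity. The cosine_sim helper is illustrative and not part of the original code.

import math

def cosine_sim(u, v):
    # Cosine similarity = dot(u, v) / (|u| * |v|)
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(cosine_sim(matrix[0], matrix[1]))  # ~0.8 for the two sample documents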
Scikit-learn
Both steps (fit and transform) are encapsulated in scikit-learn's fit_transform method.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = 'London Paris London'
b = 'Paris Paris London'

cv = CountVectorizer()
matrix = cv.fit_transform([a, b])  # returns a sparse matrix of token counts
print(matrix)

similarity_scores = cosine_similarity(matrix)
print(similarity_scores)  # [[1.  0.8]
                          #  [0.8 1. ]]
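
As a further check, the fitted vectorizer exposes the learned vocabulary, and the sparse matrix can be converted to a dense array for inspection. This is a minimal sketch assuming a recent scikit-learn version where get_feature_names_out is available.

print(cv.get_feature_names_out())  # ['london' 'paris'] (tokens are lowercased by default)
print(matrix.toarray())            # [[2 1]
                                    #  [1 2]]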