Create Vocabulary

In the context of machine learning and natural language processing, "Count Vectorizer" refers to the method of converting a collection of text documents into a matrix of token counts. The first step is to learn the vocabulary (unique words) from the provided text data: we build a set of all unique tokens across all documents.
 
a = 'London Paris London'
b = 'Paris Paris London'

def create_vocabulary(texts):
    vocabulary = set()

    for t in texts:
        # Split each document into tokens (usually words)
        words = t.split()

        # Add each unique token to the vocabulary set
        for w in words:
            vocabulary.add(w)
    return vocabulary

vocabulary = create_vocabulary([a, b])
print(vocabulary) 

"""
     {'Paris', 'London'}
"""

Count Matrix

CountVectorizer converts the text data into a numerical format, which is essential for ML models. It builds a matrix where each row represents a document and each column represents a unique word from the vocabulary. The numbers in the matrix are counts of how many times each word appears in each document.
 
def fit_transform(texts):
    # Create a set of unique words (vocabulary)
    vocabulary = create_vocabulary(texts)

    # Sort the set into a list so the column order is fixed
    vocabulary = sorted(vocabulary)

    # Initialize an empty list to store the count vectors
    matrix = []

    # Iterate through each text in the input
    for t in texts:
        # Create a count vector initialized with zeros for each word
        count_vector = [0] * len(vocabulary)

        # Iterate through each word in the current text
        for word in t.split():
            # Find the index of the word in the vocabulary
            index = vocabulary.index(word)
            
            # Increment the count for this word in the count vector
            count_vector[index] += 1
            
        matrix.append(count_vector)
    return matrix

# Sample text strings
a = 'London Paris London'
b = 'Paris Paris London'

# Get the frequency matrix
matrix = fit_transform([a, b])
print(matrix)

"""
    [[2, 1], [1, 2]]
"""

Scikit

Both steps (fit, which learns the vocabulary, and transform, which builds the count matrix) are encapsulated in scikit-learn's fit_transform method. Printing the resulting sparse matrix shows (row, column) pairs with the count stored at each position.
 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = 'London Paris London'
b = 'Paris Paris London'

cv = CountVectorizer()
matrix = cv.fit_transform([a, b])
print(matrix)

similarity_scores = cosine_similarity(matrix)
print(similarity_scores)

"""
  (0, 0)    2
  (0, 1)    1
  (1, 0)    1
  (1, 1)    2

[[1.  0.8]
 [0.8 1. ]]
"""


