
Here are the steps to use a Word2Vec model on a column within a Pandas dataframe:

Load the Word2Vec model using the gensim library.

import gensim
model = gensim.models.Word2Vec.load(model_path)

model_path is the path to a Word2Vec model that was previously saved with gensim's save() method.

Tokenize the text data in the dataframe column. You can use the nltk library for tokenization.

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead

# Tokenize each row's text into a list of word tokens
df['column'] = df['column'].apply(word_tokenize)
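Real-world columns often contain missing values (NaN or None), which a tokenizer that expects strings will typically fail on. Below is a minimal sketch of a guard. tokenize_cell is a hypothetical helper name, not part of nltk or pandas, and str.split is only a stand-in for word_tokenize so that the snippet runs without downloading any NLTK data.

```python
def tokenize_cell(value, tokenizer=str.split):
    """Tokenize one dataframe cell; return [] for anything that is not a string.

    `tokenizer` is any callable that maps a string to a list of tokens.
    str.split is a stand-in here; pass word_tokenize for NLTK tokenization.
    """
    if not isinstance(value, str):
        return []  # NaN, None, numbers, etc. produce no tokens
    return tokenizer(value)


# Toy cells, including the kinds of missing values a dataframe column may hold
cells = ["The quick brown fox", None, float("nan"), ""]
tokens = [tokenize_cell(cell) for cell in cells]
print(tokens)
```

On the dataframe this could be applied as df['column'].apply(lambda x: tokenize_cell(x, word_tokenize)), so that rows without text end up with an empty token list instead of raising an error.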

Remove stop words if needed.

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
df['column'] = df['column'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

Apply the Word2Vec model on the tokenized text data to get the embeddings.

# Index through model.wv: model[word] and model.wv.vocab no longer work in gensim 4.x
df['embeddings'] = df['column'].apply(lambda x: [model.wv[word] for word in x if word in model.wv])

This creates a new column 'embeddings' in the dataframe. Each row holds a list of Word2Vec vectors, one for each in-vocabulary token in the corresponding row of the 'column' column.

Note: Looking up a word that is not in the Word2Vec vocabulary raises a KeyError. The membership check in the list comprehension above avoids this by skipping such words. Alternatively, you can replace them with a default vector.
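The two strategies from the note (skip unknown words, or substitute a default vector) can be sketched as follows. embed_tokens is a hypothetical helper name, and toy_vectors is a plain dict standing in for model.wv so that the snippet runs without a trained model. The sketch only assumes that the lookup object supports the "in" and [] operators, which gensim's model.wv does.

```python
def embed_tokens(tokens, vectors, default=None):
    """Return one vector per token.

    Unknown tokens are skipped when `default` is None,
    otherwise they are replaced with `default`.
    """
    result = []
    for token in tokens:
        if token in vectors:
            result.append(vectors[token])
        elif default is not None:
            result.append(default)
    return result


# Toy stand-in for model.wv (assumption: 2-dimensional vectors)
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
sentence = ["cat", "unicorn", "dog"]

skipped = embed_tokens(sentence, toy_vectors)                     # drops "unicorn"
padded = embed_tokens(sentence, toy_vectors, default=[0.0, 0.0])  # zero vector for "unicorn"
print(len(skipped), len(padded))
```

With a real model, the same helper could be applied as df['column'].apply(lambda x: embed_tokens(x, model.wv, default=np.zeros(model.vector_size))), assuming numpy is imported as np. Skipping keeps only known words, while a default vector keeps each row's list aligned with its tokens.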