Here are the steps to use a Word2Vec model on a column within a Pandas dataframe:

  1. Load the Word2Vec model using the gensim library.
import gensim
model = gensim.models.Word2Vec.load(model_path)

model_path is the path to the saved Word2Vec model.

  2. Tokenize the text data in the dataframe column. You can use the nltk library for tokenization.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # recent NLTK releases may require 'punkt_tab' here instead

df['column'] = df['column'].apply(lambda x: word_tokenize(x))
  3. Remove stop words if needed.
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
df['column'] = df['column'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
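  4. Apply the Word2Vec model on the tokenized text data to get the embeddings.
# Look vectors up through model.wv; indexing the model directly and model.wv.vocab were removed in gensim 4.
df['embeddings'] = df['column'].apply(lambda x: [model.wv[word] for word in x if word in model.wv])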

This creates a new column 'embeddings' in the dataframe that holds, for each row of 'column', a list with one Word2Vec vector per in-vocabulary token.
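
To make the shape of the result concrete, here is a minimal sketch of steps 2 to 4 on a toy dataframe. It rests on several assumptions so that it runs without downloads or a saved model: `toy_vectors` is a hypothetical plain dict standing in for `model.wv` (both support `word in ...` and `...[word]`), `str.split` stands in for `word_tokenize`, and `toy_stop_words` is a hand-written set rather than the NLTK list.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model.wv: maps each known word to a 3-dimensional vector.
toy_vectors = {
    "cats": np.array([0.1, 0.2, 0.3]),
    "chase": np.array([0.4, 0.5, 0.6]),
    "mice": np.array([0.7, 0.8, 0.9]),
}
toy_stop_words = {"the", "a", "is"}  # hand-written, not the NLTK list

df = pd.DataFrame({"column": ["The cats chase mice", "A dog is asleep"]})

# Step 2: tokenize (str.split is a simplification of word_tokenize).
df["column"] = df["column"].apply(str.split)

# Step 3: drop stop words, comparing in lower case.
df["column"] = df["column"].apply(
    lambda tokens: [t for t in tokens if t.lower() not in toy_stop_words]
)

# Step 4: look up a vector for every token that is in the vocabulary.
df["embeddings"] = df["column"].apply(
    lambda tokens: [toy_vectors[t] for t in tokens if t in toy_vectors]
)

# Each cell holds a list with one vector per in-vocabulary token.
print(df["embeddings"].apply(len).tolist())  # [3, 0]
```

The second row ends up with an empty list because none of its remaining tokens are in the toy vocabulary, so any later step that averages or stacks the vectors has to allow for empty rows.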

Note: Looking up a word that is not in the Word2Vec vocabulary raises a KeyError. The membership check in step 4 avoids this by skipping such words; alternatively, you can replace them with a default vector.
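
Both ways of handling unknown words can be sketched as one small helper. The function name `embed_tokens` and its parameters are hypothetical; `vectors` is assumed to be anything that supports `word in vectors` and `vectors[word]`, which `model.wv` does, and a plain dict is used as a stand-in so the example runs without a saved model.

```python
import numpy as np

def embed_tokens(tokens, vectors, default=None):
    """Return one vector per token.

    Unknown tokens are skipped when default is None,
    otherwise they are replaced by the default vector.
    """
    result = []
    for token in tokens:
        if token in vectors:
            result.append(vectors[token])
        elif default is not None:
            result.append(default)
    return result

# Hypothetical stand-in for model.wv with 2-dimensional vectors.
toy_vectors = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
tokens = ["good", "zzz", "movie"]

skipped = embed_tokens(tokens, toy_vectors)                       # skip unknown words
padded = embed_tokens(tokens, toy_vectors, default=np.zeros(2))   # use a default vector

print(len(skipped), len(padded))  # 2 3
```

With a real model, this could be applied as df['column'].apply(lambda x: embed_tokens(x, model.wv, default=np.zeros(model.vector_size))). A zero vector is only one possible default; whether it suits your task depends on what you do with the embeddings afterwards.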