Here are the steps to use a Word2Vec model on a column within a Pandas dataframe:
Load the saved Word2Vec model with the gensim library:
import gensim
model = gensim.models.Word2Vec.load(model_path)
Here, model_path is the path to the saved Word2Vec model.
Tokenize the text in the column with the nltk library:
import nltk
from nltk.tokenize import word_tokenize
# Tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')
df['column'] = df['column'].apply(word_tokenize)
Remove stop words from the tokenized text:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['column'] = df['column'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
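Look up the Word2Vec embedding for each remaining token:
# Vectors live on model.wv; model[word] and model.wv.vocab were removed in gensim 4.
df['embeddings'] = df['column'].apply(lambda x: [model.wv[word] for word in x if word in model.wv])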
This creates a new column 'embeddings' in the dataframe that holds, for each row, a list of Word2Vec vectors, one for each in-vocabulary token in the 'column' column.
Note: Looking up a word that is not in the Word2Vec vocabulary raises a KeyError. The membership check in the embedding step avoids this by skipping such words; alternatively, you can replace them with a default vector.
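If you would rather keep a placeholder for unknown words than skip them, the following is a minimal sketch of the default-vector option. The helper name embed_tokens, the all-zeros fallback, and the toy dictionary standing in for model.wv are assumptions made for illustration; they are not part of gensim.

```python
import numpy as np

def embed_tokens(tokens, vectors, dim, default=None):
    """Return one vector per token, using `default` for unknown tokens.

    `vectors` only needs to support `token in vectors` and `vectors[token]`,
    which a plain dict does (and, in gensim, model.wv does as well).
    """
    if default is None:
        # Assumed fallback: an all-zeros vector of the embedding size.
        default = np.zeros(dim, dtype=np.float32)
    return [vectors[t] if t in vectors else default for t in tokens]

# Toy stand-in for model.wv so the sketch runs without a trained model.
toy_vectors = {
    "cat": np.array([1.0, 0.0], dtype=np.float32),
    "dog": np.array([0.0, 1.0], dtype=np.float32),
}

# "unicorn" is not in the toy vocabulary, so it gets the default vector.
result = embed_tokens(["cat", "unicorn", "dog"], toy_vectors, dim=2)
```

With a real model, the same helper could probably be applied as df['column'].apply(lambda x: embed_tokens(x, model.wv, model.vector_size)), assuming the loaded model exposes vector_size. Unlike skipping, this keeps the number of vectors equal to the number of tokens in each row.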