from nltk.corpus import reuters
from nltk import sent_tokenize
import pandas as pd
sentences = reuters.sents()
Use sent_tokenize to create a list of sentences:
sentences = [sent_tokenize(" ".join(sentence)) for sentence in sentences]
Create a DataFrame from the flattened sentences list:
df = pd.DataFrame({'text': [item for sublist in sentences for item in sublist]})
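Use reuters.categories() on each file ID to add a column of the corresponding categories for each sentence. Categories are assigned per document, and a document can have several, so they are joined into one comma-separated string and repeated for every sentence of that document:
df['category'] = [", ".join(reuters.categories(fid)) for fid in reuters.fileids() for sent in reuters.sents(fid) for _ in sent_tokenize(" ".join(sent))]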
This will result in a DataFrame with two columns: text, containing the sentences, and category, containing the corresponding categories.
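The same result can probably be reached in a single pass over the file IDs, which keeps each sentence next to the categories of its own document. The sketch below is only an illustration: the function name sentences_with_categories is made up here, it assumes the corpus object exposes fileids(), sents(fileid) and categories(fileid) the way nltk.corpus.reuters does, and it skips the extra sent_tokenize step because sents() already returns one token list per sentence.

```python
import pandas as pd


def sentences_with_categories(corpus):
    """Return a DataFrame with one row per sentence and its document's categories.

    `corpus` is assumed to behave like nltk.corpus.reuters, i.e. to offer
    fileids(), sents(fileid) and categories(fileid). The function name is
    hypothetical and not part of NLTK.
    """
    rows = []
    for fid in corpus.fileids():
        # Categories are attached to the document, so every sentence in it shares them.
        cats = ", ".join(corpus.categories(fid))
        for tokens in corpus.sents(fid):
            # sents() yields a list of tokens per sentence; join them back into text.
            rows.append({"text": " ".join(tokens), "category": cats})
    return pd.DataFrame(rows, columns=["text", "category"])
```

With the Reuters corpus and the Punkt tokenizer data downloaded through nltk.download, calling sentences_with_categories(reuters) after "from nltk.corpus import reuters" should give the two-column DataFrame described above.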
Asked: 2021-11-09 11:00:00 +0000
Last updated: Jan 01 '22