  1. First, import the necessary modules:
from nltk.corpus import reuters
from nltk import sent_tokenize
import pandas as pd
  1. Load the Reuters Corpus:
sentences = reuters.sents()
  1. Join each token list back into a sentence string. (reuters.sents() already returns one token list per sentence, so sent_tokenize is not needed here:)
sentences = [" ".join(sentence) for sentence in sentences]
  1. Create a DataFrame from the sentences list:
df = pd.DataFrame({'text': sentences})
  1. Add a column of the corresponding categories for each sentence. (NLTK has no corpus_segment function, and reuters.fileids() yields plain fileid strings, so the categories have to be looked up per fileid; reuters.sents() walks the fileids in order, so the two lists line up:)
df['category'] = [reuters.categories(fileid)
                  for fileid in reuters.fileids()
                  for _ in reuters.sents(fileid)]
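The per-document pairing of sentences and categories can be sketched end to end with a tiny in-memory stand-in for the corpus reader (hypothetical data, so the sketch runs without downloading the Reuters corpus first):

```python
import pandas as pd

# Hypothetical mini-corpus standing in for the Reuters reader:
# fileid -> (list of tokenized sentences, list of category labels).
mini_corpus = {
    'test/1': ([['Oil', 'prices', 'rose', '.']], ['crude']),
    'test/2': ([['Grain', 'fell', '.'], ['Exports', 'slowed', '.']], ['grain']),
}

rows = []
for fileid, (sents, cats) in mini_corpus.items():
    for sent in sents:
        # Each sentence row carries the categories of its source document.
        rows.append({'text': ' '.join(sent), 'category': cats})

df = pd.DataFrame(rows)
print(df)
```

The same loop works against the real reader by swapping the dictionary lookup for reuters.sents(fileid) and reuters.categories(fileid).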

This results in a DataFrame with two columns: text, holding one sentence per row, and category, holding the list of Reuters category labels of the document each sentence came from (Reuters documents can carry more than one category).
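A common follow-up is counting sentences per category; since the category column holds a list of labels, explode it first. A minimal sketch with synthetic rows shaped like the DataFrame above (so it runs without the Reuters download):

```python
import pandas as pd

# Synthetic rows shaped like the DataFrame built above; in the real
# DataFrame the 'category' lists come from reuters.categories(fileid).
df = pd.DataFrame({
    'text': ['Oil prices rose .', 'Grain fell .', 'Crude supply tight .'],
    'category': [['crude', 'oil'], ['grain'], ['crude']],
})

# explode gives each label its own row, so labels can be counted.
counts = df.explode('category')['category'].value_counts()
print(counts)
```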