Optimizing memory usage - Pandas/Python

Question

I am currently working with a dataset that contains raw text, which I need to preprocess:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

from autocorrect import spell

for df in [train_df, test_df]:
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))

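However, including the spell function makes memory usage grow until I get a "Memory error". This does not happen when I leave that function out. Is there a way to optimize this process while keeping the spell function (the dataset contains a lot of misspelled words)?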

Answer 1
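
One thing worth trying is to correct each distinct token only once and reuse the result, and to skip tokens that are unlikely to be real words. The sketch below is only an illustration under assumptions: correct_comments, its parameters, and the max_len threshold of 20 are hypothetical names and values chosen here, not part of autocorrect or nltk. It also assumes, without having verified it on this dataset, that very long or non-alphabetic tokens are what make the spell checker expensive.

```python
from typing import Callable, Dict, Iterable, List


def correct_comments(
    comments: Iterable[str],
    correct_word: Callable[[str], str],
    tokenize: Callable[[str], List[str]] = str.split,
    max_len: int = 20,  # hypothetical cutoff; tune it for your data
) -> List[str]:
    """Correct each distinct token once and reuse the cached result."""
    cache: Dict[str, str] = {}
    cleaned = []
    for text in comments:
        out = []
        for word in tokenize(str(text)):
            if word not in cache:
                # Leave very long or non-alphabetic tokens untouched
                # instead of passing them to the corrector.
                if len(word) > max_len or not word.isalpha():
                    cache[word] = word
                else:
                    cache[word] = correct_word(word)
            out.append(cache[word])
        cleaned.append(" ".join(out))
    return cleaned


# Possible use with the objects from the question (not executed here):
# for df in [train_df, test_df]:
#     df['comment_text'] = correct_comments(
#         df['comment_text'],
#         lambda w: lemma.lemmatize(spell(w)),
#         tokenize=word_tokenize,
#     )
```

Note that tokens skipped by the filter bypass lemmatization as well when the lemmatizer is wrapped inside correct_word as in the commented usage. The cache itself holds one entry per distinct token, so it grows with the vocabulary rather than with the number of rows.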

Answer 2