How can I make text processing in a pandas DataFrame column faster for large text data?

Posted 2023-02-23

【Title】: How to make text processing in a pandas df column faster for large textual data? 【Posted】: 2021-01-25 18:17:57 【Question】:

I have a large text file of chat data (chat.txt), over 1 GB in size, in the following format:

john|12-02-1999|hello#,there#,how#,are#,you#,tom$ 
tom|12-02-1999|hey#,john$,hows#, it#, goin#
mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#
...
...
john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$
mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$

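I want to process this text and, for each user separately, total the counts of certain keywords (say 500 words: hello, nice, like, ..., dinner, no). The process also involves removing the trailing special character from every word.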

The output should look like this:

user   hello   nice   like    .....    dinner  No  
Tom    10000   500     300    .....    6000    0
John   6000    1200    200    .....    3000    5
Mary   23      9000    10000  .....    100     9000
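
The cleaning step described above (dropping the trailing special character from each word) can be sketched in plain Python. This is only an illustration under assumptions: the helper name `parse_line` is hypothetical, and it assumes the only trailing special characters are `#` and `$`, possibly surrounded by stray spaces as in the sample data.

```python
def parse_line(line):
    """Split one 'user|date|words' record and clean its words."""
    user, date, words = line.rstrip("\n").split("|", 2)
    # Strip surrounding whitespace first, then any trailing '#' or '$'.
    cleaned = [w.strip().rstrip("#$") for w in words.split(",")]
    return user, date, cleaned


user, date, words = parse_line("john|12-02-1999|hello#,there#,how#,are#,you#,tom$ ")
print(user, words)  # john ['hello', 'there', 'how', 'are', 'you', 'tom']
```

Stripping whitespace before the special characters matters here, because tokens such as `tom$ ` and ` it#` carry spaces that would otherwise end up in the counted word.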

Here is my current Python solution:

import pandas as pd
from collections import Counter

chat_data = pd.read_csv("chat.txt", sep="|", names =["user","date","words"])
user_lst = chat_data.user.unique()
user_grouped_data= pd.DataFrame(columns=["user","words"])
user_grouped_data['user']=user_lst

for i,row in user_grouped_data.iterrows():
    id = row["user"]
    temp = chat_data[chat_data["user"]==id]
    user_grouped_data.loc[i,"words"] = ",".join(temp["words"].tolist())

result = pd.DataFrame(columns=[ "user", "hello", "nice", "like","...500 other keywords...", "dinner", "no"])
result["user"]= user_lst

for i, row in result.iterrows():
    id = row["user"]
    temp = user_grouped_data[user_grouped_data["user"]==id]
    words =  temp.values.tolist()[0][1]
    word_lst = words.split(",")
    word_lst = [item[0:-1] for item in word_lst]
    t_dict = Counter(word_lst)
    keys = t_dict.keys()
    for word in keys:
        result.at[i,word]= t_dict.get(word)

result.to_csv("user_word_counts.csv")

This works for small data, but once my chat_data grows beyond 1 GB the solution becomes extremely slow and unusable.

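Is there any of the following parts that I could improve to process the data faster?

1. Grouping the text data by user
2. Cleaning the text data in each row by removing trailing special characters
3. Counting the words and assigning each count to the right column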

【Answer 1】:

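You can split the comma-separated column into lists, explode the DataFrame on that list column, groupby the name and the exploded values, then unstack or pivot_table the DataFrame into the format you want, and finally clean up the multi-index columns with droplevel(), reset_index(), and so on.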

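All of the methods below are vectorized pandas methods, so hopefully this is fast. Note: the three columns in the code below are labeled [0, 1, 2] because I read the data from the clipboard and passed header=None.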
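
As a small illustration of why the columns end up labeled 0, 1 and 2, here is a sketch that reads two of the sample lines with `header=None`. It uses `io.StringIO` as a stand-in for the real file or clipboard, and the variable names `sample` and `df_sample` are only for this example.

```python
import io

import pandas as pd

# Stand-in for chat.txt: two records in the user|date|words format.
sample = io.StringIO(
    "john|12-02-1999|hello#,there#,how#,are#,you#,tom$ \n"
    "tom|12-02-1999|hey#,john$,hows#, it#, goin#\n"
)

# header=None tells pandas there is no header row,
# so the columns get the default integer labels 0, 1, 2.
df_sample = pd.read_csv(sample, sep="|", header=None)
print(df_sample.columns.tolist())  # [0, 1, 2]
```

With the question's own `names=["user", "date", "words"]` the same code applies, with `df[0]` read as `df["user"]` and `df[2]` as `df["words"]`.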

Input:

import pandas as pd

df = pd.DataFrame({0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
 1: {0: '12-02-1999',
  1: '12-02-1999',
  2: '12-03-1999',
  3: '12-02-2000',
  4: '12-03-2000'},
 2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
  1: 'hey#,john$,hows#, it#, goin#',
  2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
  3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
  4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'}})

Code:

# Remove '#' and '$'. Whitespace is left untouched, so ' it' and 'tom ' remain distinct tokens.
df[2] = df[2].replace(r'[#$]', '', regex=True).str.split(',')
df = (df.explode(2)
      .groupby([0, 2])[2].count()
      .rename('Count')
      .reset_index()
      .set_index([0,2])
      .unstack(1)
      .fillna(0))
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]: 
2     0   goin   it   mary  and  are  been  boys  catch  catching  ...   on  \
0  john    0.0  0.0    1.0  1.0  1.0   1.0   0.0    0.0       1.0  ...  0.0   
1  mary    0.0  0.0    0.0  1.0  0.0   0.0   1.0    1.0       0.0  ...  1.0   
2   tom    1.0  1.0    0.0  0.0  0.0   0.0   0.0    0.0       0.0  ...  0.0   

2  the  there  tom  tom    up  well  with  ya'll  you  
0  0.0    1.0  0.0   1.0  1.0   1.0   1.0    0.0  2.0  
1  1.0    0.0  1.0   0.0  0.0   0.0   0.0    1.0  1.0  
2  0.0    0.0  0.0   0.0  0.0   0.0   0.0    0.0  0.0  

You can also use .pivot_table instead of .unstack(), which saves you this line of code: df.columns = df.columns.droplevel():

# Start again from the original input df (the previous snippet overwrote it)
df[2] = df[2].replace(r'[#$]', '', regex=True).str.split(',')
df = (df.explode(2)
      .groupby([0, 2])[2].count()
      .rename('Count')
      .reset_index()
      .pivot_table(index=0, columns=2, values='Count')
      .fillna(0)
      .astype(int)
      .reset_index())
df
Out[45]: 
2     0   goin   it   mary  and  are  been  boys  catch  catching  ...  on  \
0  john      0    0      1    1    1     1     0      0         1  ...   0   
1  mary      0    0      0    1    0     0     1      1         0  ...   1   
2   tom      1    1      0    0    0     0     0      0         0  ...   0   

2  the  there  tom  tom   up  well  with  ya'll  you  
0    0      1    0     1   1     1     1      0    2  
1    1      0    1     0   0     0     0      1    1  
2    0      0    0     0   0     0     0      0    0  

[3 rows x 31 columns]
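
The question only needs counts for a fixed set of keywords, so one possible variation is to filter the exploded words before counting. The following is a sketch, not part of the answer above: the `keywords` list and the small `chat` frame are made-up stand-ins, the columns are named as in the question, and `pd.crosstab` is swapped in for the groupby/unstack step.

```python
import pandas as pd

keywords = ["hello", "nice", "you", "dinner"]  # hypothetical keyword list

chat = pd.DataFrame({
    "user": ["john", "tom", "mary", "john"],
    "words": ["hello#,you#,tom$ ", "hey#,john$", "hello#,boys#", "nice#,you#"],
})

# One row per (user, word) pair.
tokens = (chat.assign(word=chat["words"].str.split(","))
              .explode("word", ignore_index=True))

# Strip whitespace, then trailing '#'/'$', and keep only the keywords.
tokens["word"] = tokens["word"].str.strip().str.rstrip("#$")
tokens = tokens[tokens["word"].isin(keywords)]

# Count per user and keyword; reindex so that users and keywords
# with no hits still appear, filled with 0.
counts = (pd.crosstab(tokens["user"], tokens["word"])
            .reindex(index=chat["user"].unique(), columns=keywords, fill_value=0))
print(counts)
```

Filtering early keeps the intermediate table limited to keyword hits, which may help with memory on a 1 GB input, although that has not been measured here.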

【Answer 2】:

If you can use scikit-learn, this is easy with CountVectorizer:

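import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Here df has the named columns user, date, words.
# regex=True is required in pandas >= 2.0, where str.replace matches literally by default.
s = df['words'].str.replace(r"#|\$|\s+", "", regex=True)
model = CountVectorizer(tokenizer=lambda x: x.split(','), token_pattern=None)

# get_feature_names_out() replaces get_feature_names() (removed in scikit-learn 1.2);
# groupby(level=0).sum() replaces sum(level=0) (removed in pandas 2.0).
df_final = (pd.DataFrame(model.fit_transform(s).toarray(),
                         columns=model.get_feature_names_out(),
                         index=df.user)
              .groupby(level=0, sort=False).sum())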

Out[279]:
      and  are  been  boys  catch  catching  fancy  flipside  goin  hello  \
user
john    1    1     1     0      0         1      0         0     0      1
tom     0    0     0     0      0         0      0         0     1      0
mary    1    0     0     1      1         0      1         1     0      1

      here  hey  how  hows  it  its  john  mary  meetin  nice  on  the  there  \
user
john     0    0    1     0   0    1     0     1       0     1   0    0      1
tom      0    1    0     1   1    0     1     0       0     0   0    0      0
mary     1    0    0     0   0    0     1     0       1     0   1    1      0

      tom  up  well  with  ya'll  you
user
john    1   1     1     1      0    2
tom     0   0     0     0      0    0
mary    1   0     0     0      1    1
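
For a very large file, calling `.toarray()` on the full line-by-word matrix may use a lot of memory. One possible variation, sketched here under assumptions, is to pass a fixed `vocabulary` to CountVectorizer so that only the keywords are counted, and to sum the lines per user with a sparse matrix product before converting anything to a dense array. The `keywords` list and the `chat` frame are made-up stand-ins, and the indicator-matrix trick is an addition of this sketch, not part of the answer above.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

keywords = ["hello", "nice", "you", "dinner"]  # hypothetical, lowercase

chat = pd.DataFrame({
    "user": ["john", "tom", "mary", "john"],
    "words": ["hello#,you#,tom$ ", "hey#,john$", "hello#,boys#", "nice#,you#"],
})

# Remove '#', '$' and whitespace, leaving a clean comma-separated string.
cleaned = chat["words"].str.replace(r"[#$\s]", "", regex=True)

# vocabulary=keywords restricts counting to the keywords, in that column order.
vectorizer = CountVectorizer(tokenizer=lambda s: s.split(","),
                             token_pattern=None,
                             vocabulary=keywords)
line_counts = vectorizer.fit_transform(cleaned)  # sparse, shape (n_lines, n_keywords)

# Sparse users-by-lines indicator matrix: entry (u, i) is 1 if line i belongs to user u.
codes, users = pd.factorize(chat["user"])
n_lines = len(codes)
indicator = sparse.csr_matrix(
    (np.ones(n_lines, dtype=np.int64), (codes, np.arange(n_lines))),
    shape=(len(users), n_lines),
)

# The product sums the line counts per user; the result is small enough to densify.
per_user = indicator @ line_counts
counts = pd.DataFrame(per_user.toarray(), index=users, columns=keywords)
print(counts)
```

Note that CountVectorizer lowercases its input by default, so the keyword list should be lowercase as well.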

【Answer 3】:

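I'm not sure how fast this approach is on a large DataFrame, but you can give it a try. First, remove the special characters and split each string into a list of words, stored in a new column: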

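from itertools import chain
from collections import Counter
# regex=True is required in pandas >= 2.0, where str.replace matches literally by default
df['lists'] = df['words'].str.replace(r"#|\$", "", regex=True).str.split(",")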

Now group by user, chain each user's lists into a single list, and count the occurrences with Counter:

df.groupby('user')['lists'].apply(chain.from_iterable)\
                           .apply(Counter)\
                           .apply(pd.Series)\
                           .fillna(0).astype(int)
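
If the whole file does not fit comfortably in memory, the same Counter idea could be applied chunk by chunk. The sketch below is one possible way to do that and not the answerer's method: the function name `count_keywords_chunked`, the `keywords` list and the default `chunksize` are assumptions, and `io.StringIO` stands in for chat.txt.

```python
import io
from collections import Counter, defaultdict

import pandas as pd


def count_keywords_chunked(source, keywords, chunksize=100_000):
    """Accumulate per-user keyword counts while reading the file in chunks."""
    keyword_set = set(keywords)
    totals = defaultdict(Counter)  # user -> Counter of keyword hits
    reader = pd.read_csv(source, sep="|", names=["user", "date", "words"],
                         dtype=str, chunksize=chunksize)
    for chunk in reader:
        chunk = chunk.dropna(subset=["words"])  # skip records without words
        for user, words in zip(chunk["user"], chunk["words"]):
            # Strip whitespace, then trailing '#'/'$', and keep only keywords.
            cleaned = (w.strip().rstrip("#$") for w in words.split(","))
            totals[user].update(w for w in cleaned if w in keyword_set)
    # One row per user, one column per keyword; missing keywords count as 0.
    rows = {user: [counts[k] for k in keywords] for user, counts in totals.items()}
    return pd.DataFrame.from_dict(rows, orient="index", columns=list(keywords))


sample = io.StringIO(
    "john|12-02-1999|hello#,there#,how#,are#,you#,tom$ \n"
    "tom|12-02-1999|hey#,john$,hows#, it#, goin#\n"
    "mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#\n"
    "john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$\n"
    "mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$\n"
)
result = count_keywords_chunked(sample, ["hello", "nice", "you", "dinner"], chunksize=2)
print(result)
```

Only one chunk and the running totals are held in memory at a time, so peak memory depends on `chunksize` and the number of users rather than on the file size. The inner Python loop is not vectorized, so whether this is faster than the pandas pipelines above would need to be measured on the real data.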
