How to extract specific substrings into new rows, using regex?
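【Title】: How to extract specific substrings into new rows, using regex?
【Posted】: 2020-03-07 09:08:57
【Question】: I have a DataFrame that contains the full chat between a user and a customer agent. I want to extract only the messages sent by the user and create new rows from them, each carrying the same ticket ID: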
import re
import pandas as pd

ticket_id = pd.DataFrame(["1", "2"]).rename(columns={0: "Ticket-ID"})
full_chat = pd.DataFrame([
    "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM, Agent bar bar 12:13 PM, User foo 12:14 PM, Agent bar 12:14 PM",
    "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, User bar bar 12:13 PM"
]).rename(columns={0: "Full-Chat"})
merge_chat = pd.merge(ticket_id, full_chat, left_index=True, right_index=True, how='outer')

def _split_row(text):
    cleaned_text = text.lower()
    # Capture the text between "user " and the HH:MM timestamp that follows it
    lines = re.findall(r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*", cleaned_text)
    for line in lines:
        print(line.split())

print(merge_chat["Full-Chat"].apply(_split_row))
I would like the result to look like this:
Ticket-ID Full-Chat
1 foo foo foo
1 foo foo
1 foo
2 bar bar bar
2 bar bar
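【Answer 1】: IIUC (if I understand correctly), first replace each chat string with the list of user messages found in it: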
merge_chat['Full-Chat'] = merge_chat['Full-Chat'].apply(lambda i: re.findall(r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*", i.lower()))
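To make it easier to see what this step leaves in each cell, here is a minimal sketch that applies the same pattern to the first sample row from the question. The names `PATTERN`, `chat`, and `user_messages` are only illustrative and do not appear in the original post.

```python
import re

# Pattern copied from the answer; PATTERN is an illustrative name.
PATTERN = r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*"

# First sample row from the question.
chat = (
    "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, "
    "User foo foo 12:13 PM, Agent bar bar 12:13 PM, "
    "User foo 12:14 PM, Agent bar 12:14 PM"
)

# findall returns only the captured group: the text between "user "
# and the next HH:MM timestamp. Agent messages never match "user".
user_messages = re.findall(PATTERN, chat.lower())
print(user_messages)  # ['foo foo foo', 'foo foo', 'foo']
```

Each cell therefore holds a Python list, which is the shape that a later "one row per list element" step expects.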
From pandas 0.25.0 onward,
merge_chat.explode(column='Full-Chat')
gives you the result.
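Putting the two steps together, a self-contained sketch might look like the following. It assumes pandas 0.25.0 or newer, builds the sample frame directly from a dict rather than via `pd.merge`, and adds a `reset_index(drop=True)` that the answer does not use, purely to get a clean 0..4 index. `PATTERN` and `result` are illustrative names.

```python
import re
import pandas as pd

# Sample data from the question, built directly instead of merging two frames.
merge_chat = pd.DataFrame({
    "Ticket-ID": ["1", "2"],
    "Full-Chat": [
        "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, "
        "User foo foo 12:13 PM, Agent bar bar 12:13 PM, "
        "User foo 12:14 PM, Agent bar 12:14 PM",
        "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, "
        "User bar bar 12:13 PM",
    ],
})

PATTERN = r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*"

# Step 1: replace each chat string with the list of user messages.
merge_chat["Full-Chat"] = merge_chat["Full-Chat"].apply(
    lambda s: re.findall(PATTERN, s.lower())
)

# Step 2: one row per list element; Ticket-ID is repeated automatically.
result = merge_chat.explode("Full-Chat").reset_index(drop=True)
print(result)
```

Without the `reset_index` call, the exploded rows keep the index of the row they came from (0, 0, 0, 1, 1).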
In versions before 0.25.0,
df = pd.DataFrame(merge_chat['Full-Chat'].tolist(), index=merge_chat['Ticket-ID']).stack()
# Drop the positional inner index level, then turn Ticket-ID back into a column
df = df.reset_index(level=1, drop=True).reset_index()
df.rename(columns={0: 'Full-Chat'}, inplace=True)
df
Ticket-ID Full-Chat
0 1 foo foo foo
1 1 foo foo
2 1 foo
3 2 bar bar bar
4 2 bar bar
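For environments without `explode`, another way to get the same shape is to repeat each ticket ID once per extracted message and flatten the lists. This is a sketch of an alternative technique, not the approach the answer uses; `PATTERN`, `matches`, and `result` are illustrative names.

```python
import re
from itertools import chain

import numpy as np
import pandas as pd

merge_chat = pd.DataFrame({
    "Ticket-ID": ["1", "2"],
    "Full-Chat": [
        "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, "
        "User foo foo 12:13 PM, Agent bar bar 12:13 PM, "
        "User foo 12:14 PM, Agent bar 12:14 PM",
        "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, "
        "User bar bar 12:13 PM",
    ],
})

PATTERN = r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*"

# A Series of lists, one list of user messages per ticket.
matches = merge_chat["Full-Chat"].apply(lambda s: re.findall(PATTERN, s.lower()))

result = pd.DataFrame({
    # Repeat each ticket ID as many times as it has user messages.
    "Ticket-ID": np.repeat(merge_chat["Ticket-ID"].values, matches.str.len()),
    # Flatten the lists of messages into one long list, in row order.
    "Full-Chat": list(chain.from_iterable(matches)),
})
print(result)
```

Because the repeat counts and the flattened list both come from `matches`, the two columns always have the same length.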
【Answer 2】: I tested this and it works:
import re
import pandas as pd

ticket_id = pd.DataFrame(["1", "2"]).rename(columns={0: "Ticket-ID"})
full_chat = pd.DataFrame(["User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM, Agent bar bar 12:13 PM, User foo 12:14 PM, Agent bar 12:14 PM", "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, User bar bar 12:13 PM"]).rename(columns={0: "Full-Chat"})
merge_chat = pd.merge(ticket_id, full_chat, left_index=True, right_index=True, how='outer')
Output_df = pd.DataFrame(columns=["Ticket-ID", "Full-Chat"])

def split_row(text, ticket_id):
    cleaned_text = text.lower()
    lines = re.findall(r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*", cleaned_text)
    return_df = pd.DataFrame(columns=["Ticket-ID", "Full-Chat"])
    for line in lines:
        # One single-row frame per user message, tagged with the ticket ID
        New_row = pd.DataFrame({'Ticket-ID': [ticket_id], 'Full-Chat': [line]})
        # pd.concat stands in for DataFrame.append, which was removed in pandas 2.0
        return_df = pd.concat([return_df, New_row])
    return return_df

for index, row in merge_chat.iterrows():
    Output_df = pd.concat([Output_df, split_row(row['Full-Chat'], row['Ticket-ID'])])
Output_df = Output_df[['Ticket-ID', 'Full-Chat']].reset_index(drop=True)
Output_df.head()
Output:
Ticket-ID Full-Chat
0 1 foo foo foo
1 1 foo foo
2 1 foo
3 2 bar bar bar
4 2 bar bar
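Growing a DataFrame one row at a time copies the accumulated data on every step, and `DataFrame.append` no longer exists in pandas 2.0 and later. A variant of the same row-by-row idea, sketched here under the assumption that plain dicts are acceptable as an intermediate form, collects the rows first and builds the frame once. `PATTERN`, `rows`, and `output_df` are illustrative names.

```python
import re
import pandas as pd

merge_chat = pd.DataFrame({
    "Ticket-ID": ["1", "2"],
    "Full-Chat": [
        "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, "
        "User foo foo 12:13 PM, Agent bar bar 12:13 PM, "
        "User foo 12:14 PM, Agent bar 12:14 PM",
        "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, "
        "User bar bar 12:13 PM",
    ],
})

PATTERN = r"\b\w*user\b\ (.*?)\ *\d\d:\d\d*"

# One dict per user message, each tagged with the ticket it came from.
rows = [
    {"Ticket-ID": ticket, "Full-Chat": message}
    for ticket, chat in zip(merge_chat["Ticket-ID"], merge_chat["Full-Chat"])
    for message in re.findall(PATTERN, chat.lower())
]

# Build the frame in a single call; columns= fixes the column order.
output_df = pd.DataFrame(rows, columns=["Ticket-ID", "Full-Chat"])
print(output_df)
```

Passing `columns=` also keeps the expected two columns when no user messages are found at all.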