Select string between two substrings to every row in dataframe in Python

Posted 2023-02-22

【Title】: Select string between two substrings to every row in dataframe in Python 【Posted】: 2018-08-14 17:06:07 【Question】:

I would like to select the string between two specific substrings (as shown below), but with a loop that goes through every row of the dataframe.

Code:

import pandas as pd

df = pd.DataFrame(['first: hello1 \nSecond this1 is1 a1 third: test1\n', 'first: hello2 \nSecond this2 is2 a2 third: test2\n', 'first: hello3 \nSecond this3 is3 a3 third: test3\n'])
df = df.rename(columns={0: "text"})

def find_between(df, start, end):
  return (df.split(start))[1].split(end)[0]

df2 = df['text'][0]
print(find_between(df2, 'first:', '\nSecond'))

[Desired output] A dataframe containing the following:

   output
0  hello1
1  hello2
2  hello3

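The find_between() function is based on "Find string between two substrings", but as the example shows, it only works on one specific variable (df2) that has already been saved as a string. I need to do this for every row (string) in the "df" dataframe.

I would greatly appreciate any help with this. Thanks!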

【Comments】:

【Answer 1】:

Why define a function? You can use str.extract:

start = 'first'
end = '\nSecond'

df.text.str.extract(r'(?<={})(.*?)(?={})'.format(start, end), expand=False)

0    : hello1 
1    : hello2 
2    : hello3 
Name: text, dtype: object

Details

(?<=   # lookbehind
first
)
(      # capture-group
.*?    # non-greedy match
)
(?=    # lookahead
\nSecond
)

Everything between the lookbehind and the lookahead is captured.

You can also chain several str.split calls, but this is not elegant:
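
The pattern above keeps the colon and the surrounding whitespace, because start is 'first' and nothing is trimmed. Below is a minimal sketch of one way to get the exact output the question asks for. It assumes the delimiters should be matched as literal text (hence re.escape) and that the result belongs in a column named output, as in the question's expected table. The lookarounds are dropped here because str.extract returns only the capture group anyway.

```python
import re

import pandas as pd

df = pd.DataFrame({"text": [
    "first: hello1 \nSecond this1 is1 a1 third: test1\n",
    "first: hello2 \nSecond this2 is2 a2 third: test2\n",
    "first: hello3 \nSecond this3 is3 a3 third: test3\n",
]})

start = "first:"
end = "\nSecond"

# re.escape protects against regex metacharacters inside the delimiters.
pattern = "{}(.*?){}".format(re.escape(start), re.escape(end))

# With a single capture group and expand=False, str.extract returns a Series;
# str.strip then removes the whitespace around each captured value.
extracted = df["text"].str.extract(pattern, expand=False).str.strip()

result = pd.DataFrame({"output": extracted})
print(result)
```

Rows in which the pattern does not match come back as NaN rather than raising an error.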

df.text.str.split(start).str[1].str.split(end).str[0]

0    : hello1 
1    : hello2 
2    : hello3 
Name: text, dtype: object
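
If you would rather keep the find_between() function from the question, the following sketch applies it to each row with Series.apply. The parameter name text and the output column are illustrative choices, not part of the original answer.

```python
import pandas as pd


def find_between(text, start, end):
    # Take what follows the first `start`, then cut at the next `end`.
    return text.split(start)[1].split(end)[0]


df = pd.DataFrame({"text": [
    "first: hello1 \nSecond this1 is1 a1 third: test1\n",
    "first: hello2 \nSecond this2 is2 a2 third: test2\n",
    "first: hello3 \nSecond this3 is3 a3 third: test3\n",
]})

# `args` forwards the two delimiters to find_between for every row;
# str.strip removes the whitespace left around each value.
df["output"] = df["text"].apply(find_between, args=("first:", "\nSecond")).str.strip()
print(df[["output"]])
```

Note that find_between() raises an IndexError for any row that does not contain start, whereas the vectorized str.split chain above yields NaN for such rows.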

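【Comments】:

I don't know about performance, but I'd say the split approach is more elegant than using a regex. That probably comes down to pure opinion, though.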
