Python: create a new column with extracted regex until \n from a dataframe

【Posted】: 2021-12-13 10:05:13 【Question】:

I have a dataframe that looks like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2':["one", "two", "three", "four"]}

I want to create:

A regex that extracts everything after Thrown: lib: up to the first \n. I will call this "group 01". So I would get the following:

data = {'c3':['this is problem type 01', 
              'this is problem type 01', 
              'this is problem type 02', 
              'this is problem type 04']}

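Then I want to create a regex that extracts everything after "group 01" (the previous regex), ignoring the \t\n between the sentences, up to the next \n. So I would get the following: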

data = {'c4':['Error executing the statement: error statement 1', 
              'Error executing the statement: error statement 3', 
              'Error executing the statement: error statement2', 
              'Error executing the statement: error statement1']}

In the end, I want my dataframe to look like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1'],
        'c3':['this is problem type 01', 
              'this is problem type 01', 
              'this is problem type 02', 
              'this is problem type 04'],
        'c4':['Error executing the statement: error statement 1', 
              'Error executing the statement: error statement 3', 
              'Error executing the statement: error statement2', 
              'Error executing the statement: error statement1'],
        'c2':["one", "two", "three", "four"]}

This is what I have so far. I am trying to extract from "Thrown: lib:" up to the first \n, but it does not work.

import pandas as pd

df = pd.DataFrame(data)
df['exception'] = df['c1'].str.extract(r'Thrown: lib: (.*(?:\r?\n.*)*)', expand=False)

【Question comments】:

【Answer 1】:

I would use the re package:

import re

# re.findall returns a list of matches; [0] keeps the first one as a plain string
data['c3'] = [re.findall("Thrown: lib: ([^\n]+)", x)[0] for x in data['c1']]
data['c4'] = [re.split("\n", x)[3].strip() for x in data['c1']]
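The first pattern extracts everything between Thrown: lib: and the first newline.
The second pattern assumes the relevant message is always the 4th token when the string is split on \n, which appears to be the case.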
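The same two steps can also be written with pandas string methods, so the new columns land directly in the dataframe. This is only a sketch under the assumptions stated above (the message is always the 4th token when splitting on \n); the two sample rows are copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [
        "Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n",
        "Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n",
    ],
    "c2": ["one", "four"],
})

# c3: everything after "Thrown: lib: " up to (not including) the first newline
df["c3"] = df["c1"].str.extract(r"Thrown: lib: ([^\n]+)", expand=False)

# c4: assumes the message is the 4th token when splitting on "\n";
# strip() removes the leading space and tab
df["c4"] = df["c1"].str.split("\n").str[3].str.strip()

print(df[["c3", "c4"]])
```

With expand=False and a single capture group, str.extract returns a Series, which is why it can be assigned to a column directly.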

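Follow-up to the question in the comments below: the pattern for data['c4'] relies on the message always being the 4th token when the string is split on "\n". If the separator of interest is "\n \t\n" instead, you can change the pattern to either of the following: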

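# Option 1: split on the separator and take what follows it
data['c4'] = [re.split("\n \t\n", x)[1].strip() for x in data['c1']]

# Option 2: capture the rest of the line after the separator
data['c4'] = [re.findall(".*?\n \t\n(.*)", x)[0].strip() for x in data['c1']]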

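The last approach is better, because if split does not find the separator you will get an IndexError. Note, however, that re.findall(...)[0] also raises an IndexError when nothing matches, so neither variant is safe for rows that lack the separator.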
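Both variants above index into a result, so a row without the separator raises IndexError. One possible way around that, sketched here with a hypothetical helper name (message_after_separator is not from the original answer), is to use re.search and return None when nothing matches:

```python
import re
from typing import Optional

# Separator "\n \t\n", optional leading whitespace, then the rest of that line
SEP_PATTERN = re.compile(r"\n \t\n\s*([^\n]*)")


def message_after_separator(text: str) -> Optional[str]:
    """Return the line after the first "\\n \\t\\n" separator, or None if absent."""
    match = SEP_PATTERN.search(text)
    return match.group(1).strip() if match else None


sample = "Level: X\n Thrown: lib: some type\n \t\n \tError: something\n"
print(message_after_separator(sample))
print(message_after_separator("no separator here"))
```

Applied to the data from the question, this would be data['c4'] = [message_after_separator(x) for x in data['c1']], leaving None for rows that do not contain the separator.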

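【Comments】:

Hi! Thanks for your answer! I got the first pattern and it works :) Also, the second one comes after "\n \t\n \t", not just after \n. Do you have any tips on how to change it to accept that? :)

【Answer 2】:

This could probably be done as a one-liner, but something like this: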

import re
import pandas as pd


data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2':["one", "two", "three", "four"]}



df = pd.DataFrame(data)

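# Group 1: the text after "Thrown: lib: "; \s also matches newlines, so strip() trims the captured tail
pattern1 = r'Thrown: lib: ([a-zA-Z\d\s]*)\n'
df['c3'] = df['c1'].str.extract(pattern1, expand=False).str.strip()

# One or more "\n<whitespace>\t" runs, then the rest of the line as the second group (column 1)
pattern2 = r'(\n\s\t){1,}(.*)\n'
df['c4'] = df['c1'].str.extract(pattern2, expand=True)[1]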

Output:

print(df.to_string())
                                                                                                                           c1     c2                       c3                                                c4
0  Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n    one  this is problem type 01  Error executing the statement: error statement 1
1  Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n    two  this is problem type 01  Error executing the statement: error statement 3
2   Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n  three  this is problem type 02   Error executing the statement: error statement2
3   Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n   four  this is problem type 04   Error executing the statement: error statement1

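【Comments】:

Hello, thanks for the update! I don't understand your regex, but I don't think it worked when I tested it. As you said, c4 is this, but it has to come after the "group 01" I described in the original question. That is, the separator \n \t\n must come after 'Thrown: lib:' for it to work; otherwise the pattern will match a \n \t\n that occurs before "group 01". (I need the first \n \t\n after 'Thrown: lib:') :) I was able to change your second pattern to do what I wanted: pattern2 = 'Thrown: lib.gack.GackContext: [^\n]+(\\n\s\\t ){1,}(.*)\\n'. Thank you :) Could you explain the second pattern? I can modify it by trial and error, but I would love to understand regex better :)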
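Since the comment asks how the second pattern works, here is one way to read it piece by piece, written with re.VERBOSE so each part can carry a comment. This is a sketch rather than the answerer's own explanation: it uses the 'Thrown: lib: ' prefix from the question's sample data (not the commenter's real prefix), and it assumes the repetition in the pattern means "one or more", written here as + (equivalent to {1,}).

```python
import re

PATTERN = re.compile(
    r"""
    Thrown:[ ]lib:[ ]   # literal anchor text; [ ] is a literal space in verbose mode
    ([^\n]+)            # group 1: everything up to the first newline (group 01)
    (?:\n\s\t)+         # one or more runs of: newline, one whitespace char, tab
    (.*)                # group 2: the rest of that line (the dot stops at a newline)
    \n                  # the newline that ends the message
    """,
    re.VERBOSE,
)

text = (
    "Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n"
    " \t\n \tError executing the statement: error statement 1\n"
)

match = PATTERN.search(text)
group01, message = match.group(1), match.group(2)
print(group01)
print(message)
```

Because the pattern starts at 'Thrown: lib: ', the separator can only match after "group 01", which is the behaviour the comment describes. With df['c1'].str.extract(PATTERN, expand=True), columns 0 and 1 would hold the two groups.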
