Lambda function to grab numbers from a specific string

Posted: 2020-10-09 01:21:21

Question:

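Is there a way to write a function that searches an entire DataFrame for a string and, when it finds that string, extracts the number that follows it?

Example DataFrame: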

1                       2                               3                       4 
6/9/2020 1 Per Page  IRES MLS  : 91 PRICE: $59,900    Beautiful Views Sold   Total Concession: $2000
6/9/2020 1 Per Page  IRES MLS : 906 PRICE: $350,000   Fast Seller!           Total Concession: $5029
6/9/2020 1 Per Page  IRES MLS : 908 PRICE: $360,000   Total Concession: $9000

I was able to write a function that checks whether the string exists in a row and returns a boolean:

#searches the dataframe for the words Total Concession and returns if True
df['Concession'] = df.apply(lambda row: row.astype(str).str.contains('Total Concession', regex=True).any(), axis=1)

Total Concession does not have a column of its own in the DataFrame; it appears in different columns. I would like to know whether there is a way to return this:

Concession
2000
5029
9000

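Comments:

From the data you shared, is there only one column? Is it safe to assume your DataFrame has just one column?
No, the DataFrame has multiple columns.
** Split them and create a new column based on that.

Answer 1: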

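The value you are looking for sits at the end of each row, so use the .join() method to concatenate all string columns into a column named text, then keep whatever follows the last $.

Code

df['text'] = df.apply(''.join, axis=1)
df['concession'] = df['text'].str.split('[$]').str[-1]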

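Alternatively, if you prefer a regular expression, you could try

# Positive lookbehind: capture everything after "Concession:"
df['concession'] = df.apply(''.join, axis=1).str.extract(r'((?<=Concession:).*$)', expand=False)
# Remove the literal $ and any surrounding whitespace
df['concession'] = df['concession'].str.replace('$', '', regex=False).str.strip()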




        1                                           2  \
0  6/9/2020   1 Per Page  IRES MLS  : 91 PRICE: $59,900   
1  6/9/2020  1 Per Page  IRES MLS : 906 PRICE: $350,000   
2  6/9/2020  1 Per Page  IRES MLS : 908 PRICE: $360,000   

                         3                        4  \
0     Beautiful Views Sold  Total Concession: $2000   
1             Fast Seller!  Total Concession: $5029   
2  Total Concession: $9000  Total Concession: $9000   

                                                text concession  
0  6/9/20201 Per Page  IRES MLS  : 91 PRICE: $59,...       2000  
1  6/9/20201 Per Page  IRES MLS : 906 PRICE: $350...       5029  
2  6/9/20201 Per Page  IRES MLS : 908 PRICE: $360...       9000  
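
The join-based approach above assumes every cell is a string and that the concession is the last thing in the row. A minimal sketch that relaxes both assumptions follows; the sample frame is rebuilt from the question's data, with a None in the last cell added here purely to illustrate a missing value.

```python
import pandas as pd

# Sample data modelled on the question; the None cell is an assumption
# added to show that missing values do not break the join.
df = pd.DataFrame({
    "1": ["6/9/2020", "6/9/2020", "6/9/2020"],
    "2": ["1 Per Page  IRES MLS  : 91 PRICE: $59,900",
          "1 Per Page  IRES MLS : 906 PRICE: $350,000",
          "1 Per Page  IRES MLS : 908 PRICE: $360,000"],
    "3": ["Beautiful Views Sold", "Fast Seller!", "Total Concession: $9000"],
    "4": ["Total Concession: $2000", "Total Concession: $5029", None],
})

# Replace missing cells with empty strings so str.join never sees None/NaN.
row_text = df.fillna("").astype(str).apply(" ".join, axis=1)

# Capture the digits (and commas) that follow "Total Concession: $",
# wherever they occur in the joined row text, then drop the commas.
df["concession"] = (
    row_text.str.extract(r"Total Concession:\s*\$([\d,]+)", expand=False)
            .str.replace(",", "", regex=False)
)

print(df["concession"].tolist())
```

Rows without any "Total Concession" text end up as NaN in the new column rather than raising an error.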

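Comments:

Sorry, my example DataFrame was inaccurate. There are separate columns, some of which contain the text I am searching for, but not consistently.
Does this help?
This definitely puts me on the right track. It splits correctly, but now I need to remove some trailing strings. That was not part of my original question, though. My actual DataFrame is too large to serve as a minimal reproducible example, so this does answer the question as asked. :)

Answer 2: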

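Here is a workaround. The caveat is that only one column per row may contain Total Concession; if you expect more than one column to contain it, you may need to rephrase your question.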

import re
def extract(text_box):

    #extract text that has Total Concession
    for entry in text_box:
        if entry is None:
            continue
        if "Total Concession" in entry:
            text = entry
    match = re.search("(?<=Total Concession:).*", text)
    res = match.group(0).strip().strip("$")

    return res

Apply the function:

df['extract'] = [extract(lst) for lst in df.to_numpy()]

          1                                            2                        3                        4 extract
0  6/9/2020    1 Per Page IRES MLS : 91 PRICE: $59,900     Beautiful Views Sold  Total Concession: $2000    2000
1  6/9/2020  1 Per Page IRES MLS : 906 PRICE: $350,000             Fast Seller!  Total Concession: $5029    5029
2  6/9/2020  1 Per Page IRES MLS : 908 PRICE: $360,000  Total Concession: $9000                     None    9000
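
The extract function above assigns text only when some entry contains "Total Concession", so a row with no such entry reaches re.search with text undefined and raises UnboundLocalError. Non-string cells such as NaN would also fail the `in` test. A sketch of a guarded variant follows; the name extract_concession and the sample rows are illustrative, not part of the original answer.

```python
import re

def extract_concession(row_values):
    """Return the digits after 'Total Concession:' in a row, or None if absent."""
    for entry in row_values:
        # Skip None, NaN, and any other non-string cell.
        if not isinstance(entry, str):
            continue
        match = re.search(r"Total Concession:\s*\$?([\d,]+)", entry)
        if match:
            # Drop thousands separators so the result is plain digits.
            return match.group(1).replace(",", "")
    # No cell in this row mentioned a concession.
    return None

# Hypothetical rows: the last one has no concession at all.
rows = [
    ["6/9/2020", "PRICE: $59,900", "Beautiful Views Sold", "Total Concession: $2000"],
    ["6/9/2020", "PRICE: $360,000", "Total Concession: $9000", None],
    ["6/9/2020", "PRICE: $350,000", "Fast Seller!", None],
]
results = [extract_concession(r) for r in rows]
print(results)
```

It can be applied the same way as the original, for example df['extract'] = [extract_concession(r) for r in df.to_numpy()].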

Comments:

Unfortunately, this gives an error: UnboundLocalError: local variable 'text' referenced before assignment

Answer 3:

Use pandas.Series.str.findall with a positive lookbehind:

'(?<= \$)[\d,]+' extracts matches from anywhere in the string.
import pandas as pd

# setup dataframe
data = {'Date': ['6/9/2020', '6/9/2020', '6/9/2020'],
        'Price': ['1 Per Page  IRES MLS  : 91 PRICE: $59,900', '1 Per Page  IRES MLS : 906 PRICE: $350,000', '1 Per Page  IRES MLS : 908 PRICE: $360,000'],
        'Description': ['Beautiful Views Sold', 'Fast Seller!', ''],
        'Total Concession': ['Total Concession: $2000', 'Total Concession: $5029', 'Total Concession: $9000']}

df = pd.DataFrame(data)

       Date                                       Price           Description         Total Concession
0  6/9/2020   1 Per Page  IRES MLS  : 91 PRICE: $59,900  Beautiful Views Sold  Total Concession: $2000
1  6/9/2020  1 Per Page  IRES MLS : 906 PRICE: $350,000          Fast Seller!  Total Concession: $5029
2  6/9/2020  1 Per Page  IRES MLS : 908 PRICE: $360,000                        Total Concession: $9000

# extract numbers from columns
for c in df.columns:
    df[f'extracted {c}'] = df[c].str.findall(r'(?<= \$)[\d,]+').explode().str.replace(',', '')

# columns with no match, like Description, will be all NaN, so drop them
df.dropna(axis=1, inplace=True, how='all')

# output
       Date                                       Price           Description         Total Concession extracted Price extracted Total Concession
0  6/9/2020   1 Per Page  IRES MLS  : 91 PRICE: $59,900  Beautiful Views Sold  Total Concession: $2000           59900                       2000
1  6/9/2020  1 Per Page  IRES MLS : 906 PRICE: $350,000          Fast Seller!  Total Concession: $5029          350000                       5029
2  6/9/2020  1 Per Page  IRES MLS : 908 PRICE: $360,000                        Total Concession: $9000          360000                       9000

# drop or rename other columns as needed

Total Concession only

'(?<=Total Concession: \$)[\d,]+' extracts only the numbers preceded by 'Total Concession: $'.
for c in df.columns:
    df[f'extracted {c}'] = df[c].str.findall(r'(?<=Total Concession: \$)[\d,]+').explode().str.replace(',', '')

df.dropna(axis=1, inplace=True, how='all')

# output
       Date                                       Price           Description         Total Concession extracted Total Concession
0  6/9/2020   1 Per Page  IRES MLS  : 91 PRICE: $59,900  Beautiful Views Sold  Total Concession: $2000                       2000
1  6/9/2020  1 Per Page  IRES MLS : 906 PRICE: $350,000          Fast Seller!  Total Concession: $5029                       5029
2  6/9/2020  1 Per Page  IRES MLS : 908 PRICE: $360,000                        Total Concession: $9000                       9000
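
To see what the two lookbehind patterns match on their own, outside pandas, here is a small sketch using the standard re module. The sample string is assembled from the question's data for illustration.

```python
import re

# One row's text, assembled from the example data for illustration.
text = "1 Per Page  IRES MLS : 906 PRICE: $350,000 Total Concession: $5029"

# Any run of digits/commas that is immediately preceded by " $".
any_dollar = re.findall(r"(?<= \$)[\d,]+", text)

# Only runs that are immediately preceded by "Total Concession: $".
only_concession = re.findall(r"(?<=Total Concession: \$)[\d,]+", text)

print(any_dollar)        # ['350,000', '5029']
print(only_concession)   # ['5029']
```

The lookbehind text is checked but not included in the match, which is why only the digits and commas come back.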

Robust example

# setup dataframe
data = {'Date': ['6/9/2020', '6/9/2020', '6/9/2020'],
        'Price': ['1 Per Page  IRES MLS  : 91 PRICE: $59,900', '1 Per Page  IRES MLS : 906 PRICE: $350,000', '1 Per Page  IRES MLS : 908 PRICE: $360,000'],
        'Description': ['Beautiful Views Sold', 'Fast Seller!', ''],
        'Total Concession': ['Nothing to see here', 'Total Concession: $5029', 'Total Concession: $9000'],
        'Test1': ['A bunch Total Concession: $6,399 of random stuff', 'stuff1', 'stuff2']}

df = pd.DataFrame(data)

       Date                                       Price           Description         Total Concession                                             Test1
0  6/9/2020   1 Per Page  IRES MLS  : 91 PRICE: $59,900  Beautiful Views Sold      Nothing to see here  A bunch Total Concession: $6,399 of random stuff
1  6/9/2020  1 Per Page  IRES MLS : 906 PRICE: $350,000          Fast Seller!  Total Concession: $5029                                            stuff1
2  6/9/2020  1 Per Page  IRES MLS : 908 PRICE: $360,000                        Total Concession: $9000                                            stuff2


for c in df.columns:
    df[f'extracted {c}'] = df[c].str.findall(r'(?<=Total Concession: \$)[\d,]+').explode().str.replace(',', '')

df.dropna(axis=1, inplace=True, how='all')


# list of all extracted columns
extracted_columns = [x for x in df.columns if 'extracted' in x]

# sum all extracted columns
df['all concessions'] = df[extracted_columns].astype(float).sum(axis=1)

# drop the extracted columns
df.drop(columns=extracted_columns, inplace=True)

# print df
       Date                                       Price           Description         Total Concession                                             Test1  all concessions
0  6/9/2020   1 Per Page  IRES MLS  : 91 PRICE: $59,900  Beautiful Views Sold      Nothing to see here  A bunch Total Concession: $6,399 of random stuff           6399.0
1  6/9/2020  1 Per Page  IRES MLS : 906 PRICE: $350,000          Fast Seller!  Total Concession: $5029                                            stuff1           5029.0
2  6/9/2020  1 Per Page  IRES MLS : 908 PRICE: $360,000                        Total Concession: $9000                                            stuff2           9000.0
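
One caveat with the explode() step: when a single cell holds more than one match, explode() repeats that row's index label, and assigning a Series with repeated labels back to a column can raise a reindexing error. The following is a sketch of an alternative that sums the matches cell by cell instead of exploding them. The helper cell_total, the Notes column, and the sample values are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical data: the first Notes cell contains two concessions.
df = pd.DataFrame({
    "Price": ["PRICE: $59,900", "PRICE: $350,000"],
    "Notes": ["Total Concession: $1,000 and Total Concession: $500", "stuff"],
    "Total Concession": ["Nothing to see here", "Total Concession: $5029"],
})

pattern = r"(?<=Total Concession: \$)[\d,]+"

def cell_total(cell):
    """Sum every concession found in one cell; non-string cells count as 0."""
    if not isinstance(cell, str):
        return 0.0
    return sum((float(m.replace(",", "")) for m in re.findall(pattern, cell)), 0.0)

text_columns = ["Price", "Notes", "Total Concession"]

# Map each cell to its own total, then add the totals across each row.
df["all concessions"] = (
    df[text_columns].apply(lambda col: col.map(cell_total)).sum(axis=1)
)

print(df["all concessions"].tolist())
```

Because every cell maps to exactly one number, the index is never duplicated and no intermediate extracted columns need to be dropped afterwards.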

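Comments:

Unfortunately, Total Concession is not in a single column; it can be found throughout the DataFrame.
@Taylor It will find matches in any column. See the robust example for proof.
I got the error ValueError: cannot reindex from a duplicate axis. I tried df = df[~df.index.duplicated()], but it still gave me that error.
@Taylor This was based on your example. Run df.reset_index and then run my code.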
