Lambda function to grab numbers from a specific string
Posted: 2020-10-09 01:21:21
Question:
Is there a way to create a function that searches an entire dataframe for a string and, when it finds that string, extracts the numbers that follow it?
Example dataframe:
1 2 3 4
6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900 Beautiful Views Sold Total Concession: $2000
6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029
6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000
I was able to create a function that checks whether the string exists and returns a boolean:
#searches the dataframe for the words Total Concession and returns if True
df['Concession'] = df.apply(lambda row: row.astype(str).str.contains('Total Concession', regex=True).any(), axis=1)
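Total Concession does not have a column of its own in the dataframe; it appears in different columns. I would like to know whether there is a way to return this instead: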
Concession
2000
5029
9000
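Comments on the question:
Is the data you just shared a single column? Is it safe to assume that your dataframe has only one column?
No, the dataframe has multiple columns.
** Separate them and create a new column based on that.
Answer 1:
Because what you are looking for sits at the end of each row's text, use the .join() method to concatenate all the string columns into a column named text.
One-liner: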
df['text'] = df.apply(''.join, axis=1).str.split('[$]').str[-1]
Alternatively, if you want to use a regular expression, you can try:
#df['text'] = df.apply(''.join, axis=1).str.extract(r'((?<=Concession:).*$)', expand=False)  # positive lookbehind: basically anything after "Concession:"
#df['text'] = df['text'].str.replace('$', '', regex=False)  # remove the literal $ sign
1 2 \
0 6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000
3 4 \
0 Beautiful Views Sold Total Concession: $2000
1 Fast Seller! Total Concession: $5029
2 Total Concession: $9000 Total Concession: $9000
text concession
0 6/9/20201 Per Page IRES MLS : 91 PRICE: $59,... 2000
1 6/9/20201 Per Page IRES MLS : 906 PRICE: $350... 5029
2 6/9/20201 Per Page IRES MLS : 908 PRICE: $360... 9000
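The join-then-extract idea above can be sketched end to end as follows. This is a minimal illustration rather than the answerer's exact code: the column names "1" to "4" and the sample values are assumed from the question, and a capture group is used instead of splitting on the dollar sign so that the price is never picked up by mistake.

```python
import pandas as pd

# Hypothetical sample frame mirroring the question's layout
# (the column names "1".."4" are an assumption).
df = pd.DataFrame({
    "1": ["6/9/2020", "6/9/2020", "6/9/2020"],
    "2": ["1 Per Page IRES MLS : 91 PRICE: $59,900",
          "1 Per Page IRES MLS : 906 PRICE: $350,000",
          "1 Per Page IRES MLS : 908 PRICE: $360,000"],
    "3": ["Beautiful Views Sold", "Fast Seller!", ""],
    "4": ["Total Concession: $2000",
          "Total Concession: $5029",
          "Total Concession: $9000"],
})

# Join every cell of a row into one string; fillna/astype guard
# against missing or non-string cells, which ''.join would reject.
joined = df.fillna("").astype(str).apply(" ".join, axis=1)

# Capture the digits (and thousands separators) after "Total Concession: $",
# then drop the commas.
df["concession"] = (
    joined.str.extract(r"Total Concession:\s*\$([\d,]+)", expand=False)
          .str.replace(",", "", regex=False)
)
print(df["concession"].tolist())
```

Rows without a Total Concession entry would come back as NaN instead of raising, which may be easier to handle on a large, inconsistent frame.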
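Comments:
Sorry, my example dataframe was inaccurate; there are separate columns, some of which contain the text I am searching for, but not consistently.
Does this help?
This definitely puts me on the right track. It splits correctly, but now I need to remove some trailing strings. That was not part of my original question, though. My actual dataframe is too large to serve as a minimal reproducible example, so this does answer the question as asked. :)
Answer 2:
Here is a workaround; the caveat is that only one column per row can contain Total Concession. If you think more than one column can contain it, you may need to rephrase your question.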
import re
def extract(text_box):
    #extract text that has Total Concession
    for entry in text_box:
        if entry is None:
            continue
        if "Total Concession" in entry:
            text = entry
    #note: if no entry in the row contains Total Concession, text is never assigned
    match = re.search("(?<=Total Concession:).*", text)
    res = match.group(0).strip().strip("$")
    return res
Apply the function:
df['extract'] = [extract(lst) for lst in df.to_numpy()]
1 2 3 4 extract
0 6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900 Beautiful Views Sold Total Concession: $2000 2000
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029 5029
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000 None 9000
Comments:
Unfortunately this raises an error: UnboundLocalError: local variable 'text' referenced before assignment
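The UnboundLocalError reported above is what one would expect when a row has no cell containing Total Concession, because text is then never assigned before re.search uses it. A guarded variant might look like the sketch below; the function name extract_concession, the compiled PATTERN, and the sample frame are all illustrative assumptions, not part of the original answer.

```python
import re

import pandas as pd

# Compiled once; captures the digits and commas after "Total Concession:".
PATTERN = re.compile(r"Total Concession:\s*\$?([\d,]+)")


def extract_concession(row_values):
    """Return the first Total Concession amount found in a row, or None."""
    for entry in row_values:
        # Skip None, NaN and any other non-string cell.
        if not isinstance(entry, str):
            continue
        match = PATTERN.search(entry)
        if match:
            return match.group(1).replace(",", "")
    # No cell in this row mentions Total Concession.
    return None


# Hypothetical frame: the second row has no concession at all.
df = pd.DataFrame({
    "a": ["Beautiful Views Sold", "Fast Seller!", None],
    "b": ["Total Concession: $2000", "nothing here", "Total Concession: $9,000"],
})
df["extract"] = [extract_concession(row) for row in df.to_numpy()]
print(df["extract"].tolist())
```

Returning None for rows without a match keeps the list comprehension the same length as the frame, so the assignment to a new column still lines up.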
Answer 3:
Use pandas.Series.str.findall with a positive lookbehind:
'(?<= \$)[\d,]+'
This will extract matches from anywhere in the string.
import pandas as pd
# setup dataframe
data = {'Date': ['6/9/2020', '6/9/2020', '6/9/2020'],
        'Price': ['1 Per Page IRES MLS : 91 PRICE: $59,900', '1 Per Page IRES MLS : 906 PRICE: $350,000', '1 Per Page IRES MLS : 908 PRICE: $360,000'],
        'Description': ['Beautiful Views Sold', 'Fast Seller!', ''],
        'Total Concession': ['Total Concession: $2000', 'Total Concession: $5029', 'Total Concession: $9000']}
df = pd.DataFrame(data)
Date Price Description Total Concession
0 6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900 Beautiful Views Sold Total Concession: $2000
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000
# extract numbers from columns
for c in df.columns:
    df[f'extracted {c}'] = df[c].str.findall(r'(?<= \$)[\d,]+').explode().str.replace(',', '')
# columns with no match, like Description, will be all NaN, so drop them
df.dropna(axis=1, inplace=True, how='all')
# output
Date Price Description Total Concession extracted Price extracted Total Concession
0 6/9/2020 1 Per Page $59,798 IRES MLS : 91 PRICE: Beautiful Views Sold Total Concession: $2000 59798 2000
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029 350000 5029
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000 360000 9000
# drop or rename other columns as needed
Total Concession only
'(?<=Total Concession: \$)[\d,]+'
This will extract only the numbers preceded by 'Total Concession: $'.
for c in df.columns:
    df[f'extracted {c}'] = df[c].str.findall(r'(?<=Total Concession: \$)[\d,]+').explode().str.replace(',', '')
df.dropna(axis=1, inplace=True, how='all')
# output
Date Price Description Total Concession extracted Total Concession
0 6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900 Beautiful Views Sold Total Concession: $2000 2000
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029 5029
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000 9000
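To see how the two lookbehind patterns differ, it can help to run them on a single string with Python's built-in re module. The sample text below is made up for illustration; only the two patterns come from the answer.

```python
import re

# Hypothetical cell text containing both a price and a concession.
text = "PRICE: $59,900 Total Concession: $2,000"

# Any number preceded by a space and a dollar sign.
any_amount = r"(?<= \$)[\d,]+"
# Only numbers preceded by the full "Total Concession: $" label.
concession_only = r"(?<=Total Concession: \$)[\d,]+"

print(re.findall(any_amount, text))       # ['59,900', '2,000']
print(re.findall(concession_only, text))  # ['2,000']
```

Python's re module requires lookbehind patterns to have a fixed width, which both of these do, so the label must match exactly, including the single space before the dollar sign.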
Robust example
# setup dataframe
data = {'Date': ['6/9/2020', '6/9/2020', '6/9/2020'],
        'Price': ['1 Per Page IRES MLS : 91 PRICE: $59,900', '1 Per Page IRES MLS : 906 PRICE: $350,000', '1 Per Page IRES MLS : 908 PRICE: $360,000'],
        'Description': ['Beautiful Views Sold', 'Fast Seller!', ''],
        'Total Concession': ['Nothing to see here', 'Total Concession: $5029', 'Total Concession: $9000'],
        'Test1': ['A bunch Total Concession: $6,399 of random stuff', 'stuff1', 'stuff2']}
df = pd.DataFrame(data)
Date Price Description Total Concession Test1
0 6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900 Beautiful Views Sold Nothing to see here A bunch Total Concession: $6,399 of random stuff
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029 stuff1
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000 stuff2
for c in df.columns:
    df[f'extracted {c}'] = df[c].str.findall(r'(?<=Total Concession: \$)[\d,]+').explode().str.replace(',', '')
df.dropna(axis=1, inplace=True, how='all')
# list of all extracted columns
extracted_columns = [x for x in df.columns if 'extracted' in x]
# sum all extracted columns
df['all concessions'] = df[extracted_columns].astype(float).sum(axis=1)
# drop the extracted columns
df.drop(columns=extracted_columns, inplace=True)
# print df
Date Price Description Total Concession Test1 all concessions
0 6/9/2020 1 Per Page IRES MLS : 91 PRICE: $59,900 Beautiful Views Sold Nothing to see here A bunch Total Concession: $6,399 of random stuff 6399.0
1 6/9/2020 1 Per Page IRES MLS : 906 PRICE: $350,000 Fast Seller! Total Concession: $5029 stuff1 5029.0
2 6/9/2020 1 Per Page IRES MLS : 908 PRICE: $360,000 Total Concession: $9000 stuff2 9000.0
Comments:
Unfortunately Total Concession is not in a single column; it is found throughout the dataframe.
@Taylor It will find matches in any column. See the robust example for proof.
I received the error ValueError: cannot reindex from a duplicate axis. I tried df = df[~df.index.duplicated()], but it still gave me this error.
@Taylor This is based on your example. Run df.reset_index and then run my code.
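One plausible cause of the duplicate-axis error, beyond duplicate labels in the original index, is a cell that contains more than one match: explode() then emits several rows with the same index label, and assigning that Series back to a column fails. The sketch below uses a made-up Series to show the effect and one possible workaround, summing the matches per original row; whether summing is the right aggregation for the real data is an assumption.

```python
import pandas as pd

# Hypothetical column where the first cell contains two matching amounts.
s = pd.Series(["Total Concession: $100 Total Concession: $250",
               "Total Concession: $5029"])

pattern = r"(?<=Total Concession: \$)[\d,]+"

# explode() emits one row per match, so index label 0 appears twice.
exploded = s.str.findall(pattern).explode()
print(exploded.index.tolist())

# Summing the matches per original row keeps exactly one value per label,
# so the result can be assigned back to a dataframe column safely.
per_row = (
    exploded.str.replace(",", "", regex=False)
            .astype(float)
            .groupby(level=0)
            .sum()
)
print(per_row.tolist())
```

If only the first match per cell matters, pandas.Series.str.extract is another option, since it returns one value per row and preserves the index.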