How do I find a specific number pattern within a column of strings and replace that value with a text version of that ordinal number?
Posted: 2021-06-06 23:28:30
Question: Forgive me, I am new to Python. I am building a function that I can use to clean up the text of various surveys. I feel I am close to converting the numeric version of ordinal numbers into the text version, but I am not quite there. Here is the function I am trying to build (note: I tried two ways of finding the regex pattern on the nbr = line of the function, and I get the two kinds of errors explained below):
import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47, 56, 59, 134, 454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    nbr = re.findall('(\d+)[st|nd|rd|th]', words)  # words.str.findall('(\d+)[st|nd|rd|th]')
    newText = words
    for n in nbr:
        ordinal_words = num2words(n, ordinal=True)
        newText = words.replace(r'\d+[st|nd|rd|th]', ordinal_words)
    return newText

my_df['the_string_clean'] = replace_ordinal_numbers(str(my_df['the_string']))
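Errors:
When I run words.str.findall on the "nbr =" line of the function, I get the error: AttributeError: 'str' object has no attribute 'str'
When I run re.findall, I do get a DataFrame back, but the "the_string_clean" column does not reflect each row's own string. Instead, every row holds the text of the whole column: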
   record  the_string                                 the_string_clean
0      47  This is the first string                   "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th" Name: the_string, dtype: object
1      56  This is the 2nd string                     "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th" Name: the_string, dtype: object
2      59  nothing to see here                        "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th" Name: the_string, dtype: object
3     134  4th string has the date: today is the 8th  "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th" Name: the_string, dtype: object
4     454  this has a typo10th                        "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th" Name: the_string, dtype: object
Expected output: this is the output I am looking for:
record  the_string                                 the_string_clean
47      this is the first string                   this is the first string
56      this is the 2nd string                     this is the second string
59      nothing to see here                        nothing to see here
134     4th string has the date: today is the 8th  fourth string has the date: today is the eighth
454     this has a typo10th                        this has a typotenth
I hope I have been clear enough. I am new to Python, and any help would be appreciated.
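Answer 1: You can simplify your replace_ordinal_numbers function by using re.sub and calling num2words inside a lambda function as the replacement. Then run the function on the column with Series.apply (my_df['the_string'] is a Series, so each row's string is passed to the function one at a time):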
import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47, 56, 59, 134, 454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    # Match digits followed by an ordinal suffix; group(1) holds only the digits
    return re.sub(r'(\d+)(?:st|nd|rd|th)',
                  lambda m: num2words(m.group(1), ordinal=True),
                  words)

# apply calls the function once per row, passing that row's string
my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)
my_df
Output:
   record                                       the_string
0      47                         this is the first string
1      56                        this is the second string
2      59                              nothing to see here
3     134  fourth string has the date: today is the eighth
4     454                             this has a typotenth
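To look at the callable-replacement mechanism on its own, here is a minimal sketch that runs with only the standard library. The ORDINAL_WORDS table and the to_ordinal_word helper are hypothetical stand-ins for num2words(..., ordinal=True) and cover only a handful of values; they are not part of the answer above.

```python
import re

# Hypothetical stand-in for num2words(n, ordinal=True); covers only a few values.
ORDINAL_WORDS = {1: "first", 2: "second", 3: "third",
                 4: "fourth", 8: "eighth", 10: "tenth"}

def to_ordinal_word(match):
    # match.group(1) is the digit string captured by (\d+)
    n = int(match.group(1))
    # Leave the matched text unchanged when the stand-in table has no entry
    return ORDINAL_WORDS.get(n, match.group(0))

def replace_ordinal_numbers(text):
    # re.sub calls to_ordinal_word once per match and inserts its return value
    return re.sub(r'(\d+)(?:st|nd|rd|th)', to_ordinal_word, text)

print(replace_ordinal_numbers("4th string has the date: today is the 8th"))
print(replace_ordinal_numbers("this has a typo10th"))
```

Because re.sub handles every match in a single pass, no explicit loop over findall results is needed, which is what makes the one-line version in the answer possible.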
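Note that you need to use the alternation (?:st|nd|rd|th) in the regex to match one of st, nd, rd or th. The character class you were using, [st|nd|rd|th], matches a single character from the set d, h, n, r, s, t or |, so it matches any digits followed by just one of those characters.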
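The difference between the two patterns can be checked directly with the standard library. The following is a small illustrative sketch; the sample strings and variable names are made up for the demonstration.

```python
import re

# Character class: digits followed by ONE character from the set s, t, |, n, d, r, h
char_class = re.compile(r'(\d+)[st|nd|rd|th]')
# Non-capturing alternation: digits followed by a complete suffix
alternation = re.compile(r'(\d+)(?:st|nd|rd|th)')

# The class consumes only the "n" of "2nd", so a stray "d" is left behind
print(char_class.sub("second", "the 2nd string"))    # the secondd string
print(alternation.sub("second", "the 2nd string"))   # the second string

# The class also fires on text that is not an ordinal at all
print(char_class.findall("5days or 7|8"))    # ['5', '7']
print(alternation.findall("5days or 7|8"))   # []
```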