How do I find a specific number pattern within a column of strings and replace that value with a text version of that ordinal number?

Posted 2023-03-29

【Posted】: 2021-06-06 23:28:30 【Question】:

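Forgive me, I'm new to Python. I'm building a function that I can use to clean up text from various surveys. I feel I'm close to converting the numeric version of ordinal numbers into their text version, but I'm not quite there. Here is the function I'm trying to build (note that I tried two ways of finding the regex pattern on the nbr = line of the function, and I get the two kinds of errors explained below):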

import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    nbr = re.findall('(\d+)[st|nd|rd|th]', words) #words.str.findall('(\d+)[st|nd|rd|th]')
    
    newText = words
    for n in nbr:
        ordinal_words = num2words(n, ordinal=True)
        newText = words.replace(r'\d+[st|nd|rd|th]', ordinal_words)
    return newText

my_df['the_string_clean'] = replace_ordinal_numbers(str(my_df['the_string']))

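Errors: When I run words.str.findall on the nbr = line of the function, I get the error: AttributeError: 'str' object has no attribute 'str'. When I run re.findall, I do get a DataFrame back, but the the_string_clean column does not reflect each row's string. Instead, I get: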

    record  the_string                  the_string_clean
0   47      This is the first string    "0This is the first string 1This is the 2nd string 2nothing to 
                                        see here 3 4th string has the date: today is the 8th 4This has 
                                        a typo10th"
Name: the_string, dtype: object
1   56      This is the 2nd string      "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
2   59       nothing to see here        "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
3   134      4th string has the         "0This is the first string 1This is the 2nd string 2 nothing to
             date: today is the 8th     see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
4   454      this has a typo10th        "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object

Expected output: this is the output I'm expecting:

record    the_string                                 the_string_clean
47        this is the first string                   this is the first string
56        this is the 2nd string                     this is the second string
59        nothing to see here                        nothing to see here
134       4th string has the date: today is the 8th  fourth string has the date: today is the eighth
454       this has a typo10th                        this has a typotenth

I hope I've been clear enough. I'm new to Python, and any help would be appreciated.

【Answer 1】:

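You can simplify your replace_ordinal_numbers function by using re.sub and calling num2words inside a lambda function as the replacement. Then just use Series.apply to run the function on each value in the column: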

import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    # Replace each "<digits><st|nd|rd|th>" match with the ordinal word for the captured digits
    return re.sub(r'(\d+)(?:st|nd|rd|th)', lambda m: num2words(int(m.group(1)), ordinal=True), words)

my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)

my_df

Output

   record                                       the_string
0      47                         this is the first string
1      56                        this is the second string
2      59                              nothing to see here
3     134  fourth string has the date: today is the eighth
4     454                             this has a typotenth
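
As for why the original attempt did not work row by row, two things appear to be involved: str(my_df['the_string']) turns the whole column into one string before the function is ever called, and str.replace treats its first argument as literal text rather than as a regex. A minimal sketch of both effects follows; the plain list is an assumption standing in for the column so that the example runs without pandas, and the names column, whole, literal and with_regex are hypothetical.

```python
import re

# Stand-in for my_df['the_string']; a plain list is used here (an assumption)
# so the sketch does not depend on pandas.
column = ["this is the 2nd string", "nothing to see here"]

# 1) str() on the container yields ONE string describing all rows,
#    so a function called on it runs once, not once per row.
whole = str(column)
print(whole)

# 2) str.replace searches for the literal characters of its first argument,
#    so a regex pattern passed to it finds nothing to replace.
literal = column[0].replace(r'\d+(?:st|nd|rd|th)', 'second')
print(literal)

# re.sub interprets the same pattern as a regular expression.
with_regex = re.sub(r'\d+(?:st|nd|rd|th)', 'second', column[0])
print(with_regex)
```

With a real Series, str() gives the printed representation (index, values and the trailing Name/dtype line), which is the text that ended up repeated in every row of the_string_clean.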

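Note that you need to use the alternation (?:st|nd|rd|th) in the regex to match one of st, nd, rd, or th. The character class you were using, [st|nd|rd|th], instead matches exactly one character out of d, h, n, r, s, t, or |.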
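
The difference between the character class and the alternation can be checked with the standard re module alone. The sketch below is illustrative only; the sample strings and the placeholder replacement 'X' are assumptions chosen to make the consumed text visible.

```python
import re

sample = "the 2nd one"

# Character class: after the digits, exactly ONE character from the set is consumed,
# so only "2n" is replaced and the trailing "d" is left behind.
class_result = re.sub(r'\d+[st|nd|rd|th]', 'X', sample)
print(class_result)   # the Xd one

# Alternation: after the digits, one whole suffix (st, nd, rd or th) is consumed.
alt_result = re.sub(r'\d+(?:st|nd|rd|th)', 'X', sample)
print(alt_result)     # the X one

# The class also accepts a literal pipe after digits, which the alternation does not.
class_pipe = re.findall(r'\d+[st|nd|rd|th]', "3|4")
alt_pipe = re.findall(r'\d+(?:st|nd|rd|th)', "3|4")
print(class_pipe, alt_pipe)   # ['3|'] []
```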
