Extracting a person's age from unstructured text in Python

Posted 2023-02-19

【Asked】: 2019-12-15 03:15:49

【Question】:

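I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example sentences are:

"Mr Bond, 67, is an engineer in the UK"
"Amanda B. Bynes, 34, is an actress"
"Peter Parker (45) will be our next administrator"
"Mr. Dylan is 46 years old."
"Steve Jones, Age: 32,"

These are some of the patterns I have identified in the dataset. There are other patterns as well, but I have not run into them yet and am not sure how to find them. I wrote the following code, which works fairly well but is very inefficient, so running it on the whole dataset would take too much time.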

import re
import pandas as pd

#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
                   " " + clean_sec_last_name.lower().strip() + " age ",
                   last_name.lower().strip() + " age ",
                   full_name.lower().strip() + ", age ",
                   full_name.lower().strip() + ", ",
                   " " + last_name.lower() + ", ",
                   " " + last_name.lower().strip() + r" \(",
                   " " + last_name.lower().strip() + " is "]

#for each element in our search list
for element in age_search_list:
    print("Searching: ",element)

    # retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element,souptext.lower()):

        #extract the next four characters after the match
        #(use the match end: len(element) also counts the backslash in "\(")
        age_instance_start = age_biography_instance.end()
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]

        #extract what should be the age
        potential_age = age_string[:-2]

        #extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ",". ",") "," y"]

        if age_security_check in age_security_check_list:
            print("Potential age instance found for ",full_name,": ",potential_age)

            #check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ",potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ",filing_year)
                    print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
                    #Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])

            except ValueError:
                print("Problem with extracted age ",potential_age)

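I have a few questions:

-- Is there a more efficient way to extract this information?
-- Should I use a regular expression instead?
-- My text documents are very long and I have a lot of them. Can I search for all the items at once?
-- What would be a strategy for detecting other patterns in the dataset?

Some sentences extracted from the dataset:

"Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
"George F. Rubin(14)(15) Age 68 Trustee since: 1997."
"INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006"
"Mr. Lovallo, 47, was appointed Treasurer in 2011."
"Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
"Mr. Botein, age 43, has been a member of our Board since our formation."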

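【Comments】:

Do these short biographies contain any numbers other than the age?

Yes, they do. They contain financial information, which can be numbers of shares, amounts of money, and so on.

So do these other numbers have a fixed format, e.g. does currency always carry a dollar or pound sign?

Yes, these are SEC filings, so they are formatted. The only two-digit numbers that are not ages should be percentages.

So your strategy should be to remove all the other specifically formatted numbers from the paragraph. Then you are left with only the age. If you can provide an example of a short biography, I can give code as well.

【Answer 1】: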

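Instead of regular expressions, you can also use spaCy pattern matching. The patterns below work, but you may need to add a few extra rules to make sure you do not pick up percentages and monetary values.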

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# Each pattern is a list of token descriptions (one dict per token)
age_patterns = [
    # e.g. Steve Jones, Age: 32,
    [{"LOWER": "aged"}, {"IS_PUNCT": True, "OP": "?"}, {"LIKE_NUM": True}],
    [{"LOWER": "age"}, {"IS_PUNCT": True, "OP": "?"}, {"LIKE_NUM": True}],
    # e.g. "Peter Parker (45) will be our next administrator" OR "Amanda B. Bynes, 34, is an actress"
    [{"POS": "PROPN"}, {"IS_PUNCT": True}, {"LIKE_NUM": True}, {"IS_PUNCT": True}],
    # e.g. "Mr. Dylan is 46 years old."
    [{"LIKE_NUM": True}, {"IS_PUNCT": True, "OP": "*"}, {"LEMMA": "year"}, {"IS_PUNCT": True, "OP": "*"},
     {"LEMMA": "old"}, {"IS_ALPHA": True, "OP": "*"}, {"POS": "PROPN", "OP": "*"}, {"POS": "PROPN", "OP": "*"}]
]

text = "Mr. Dylan is 46 years old."  # sample sentence from the question
doc = nlp(text)
matcher = Matcher(nlp.vocab)
matcher.add("matching", age_patterns)
matches = matcher(doc)

schemes = []
# each match is a (match_id, start, end) tuple of token indices
for _, start, end in matches:

    # skip a leading determiner
    if doc[start].pos_ == 'DET':
        start = start + 1

    # matched string
    span = str(doc[start:end])

    # if the previous match is contained in this one, keep the longer span
    if schemes and schemes[-1] in span:
        schemes[-1] = span
    else:
        schemes.append(span)

【Comments】:

【Answer 2】:

A simple way to find a person's age in a sentence is to extract a 2-digit number:

import re

sentence = 'Steve Jones, Age: 32,'
# \d{2} between word boundaries matches a standalone two-digit number
print(re.findall(r"\b\d{2}\b", sentence)[0])

# output: 32

If you do not want numbers that are followed by a % sign, and you want the number to start at a word boundary, you can do this:

sentence = 'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation'

# two digits, not followed by "%", then one non-digit character
match = re.findall(r"\b\d{2}(?!%)[^\d]", sentence)

if match:
    print(match[0][:2])
else:
    print('no match')

# output: no match

This also works for the previous sentence.
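When a sentence holds several two-digit numbers, a lookbehind and a lookahead can reject the ones that are part of a percentage, a money amount, or a longer number. The following is only a sketch under those assumptions; the name `candidate_ages` and the exact exclusion rules are illustrative, not part of the answer above.

```python
import re

# Two digits that are not glued to a preceding word character, "$" or ".",
# and not followed by another digit, a "%" sign, or a ".5" / ",500" style tail.
TWO_DIGIT_AGE = re.compile(r"(?<![\w$.])\d{2}(?![\d%]|[.,]\d)")

def candidate_ages(sentence):
    """Return every standalone two-digit number in the sentence as an int."""
    return [int(number) for number in TWO_DIGIT_AGE.findall(sentence)]

print(candidate_ages("Steve Jones, Age: 32, and has 30% of shares of XYZ company"))
# [32]
print(candidate_ages("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))
# []
```

The function returns all candidates instead of the first one, so the caller can decide what to do when more than one number survives the filter.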

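【Comments】:

Thanks for your answer. How can I improve the pattern so that it does not pick up two digits when they are not preceded by a space, or when they are followed by a "%" sign?

Can you give an example?

@rusu_ro1: My first comment asked exactly this. The OP can have several 2-digit numbers in a paragraph! For example: Steve Jones, Age: 32, and has 30% of shares of XYZ company

Sure, here is an example: "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation".

@user1029296 check now

【Answer 3】: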

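Since your text has to be processed, not just pattern-matched, the right approach is to use one of the many NLP tools available.

What you want is Named Entity Recognition (NER), which is usually done with machine learning models. NER tries to identify a predefined set of entity types in text, for example locations, dates, organizations, and person names.

While not 100% precise, this is much more precise than simple pattern matching (especially for English), because it relies on information beyond the pattern itself, such as part-of-speech (POS) tags, dependency parsing, and so on.

Take a look at the results I got for the phrases you provided, using the Allen NLP Online Tool (with the fine-grained NER model):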

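"Mr Bond, 67, is an engineer in the UK":

"Amanda B. Bynes, 34, is an actress"

"Peter Parker (45) will be our next administrator"

"Mr. Dylan is 46 years old."

"Steve Jones, Age: 32,"

Note that the last one is wrong. As I said, it is not 100% accurate, but it is easy to use.

One big advantage of this approach: you do not have to craft a special pattern for each of the millions of possibilities.

Best of all, you can integrate it into your Python code: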

pip install allennlp

And:

from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")

Then look in the resulting dictionary for the "Date" entities.

The same goes for spaCy:

!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
# list of (entity text, entity label) pairs
[(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents]

(However, I ran into some bad predictions there, even though it is supposed to be better.)

For more information, read this interesting article on Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
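Because the models above report ages under the date label, a small pattern-based pass over the date entities can pick out the ones that look like ages. The sketch below assumes the entities arrive as `(text, label)` pairs with the label `"DATE"`, as in the spaCy list comprehension above; the function name `ages_from_entities` and the sample entities are made up for illustration, and the 18 to 100 range is borrowed from the question's own check.

```python
import re

# An age-like entity: an optional "age"/"aged" prefix, 1-3 digits,
# and an optional "years" / "years old" suffix; nothing else.
AGE_LIKE = re.compile(
    r"^(?:aged?\s*:?\s*)?(\d{1,3})(?:\s*-?\s*years?(?:\s*-?\s*old)?)?$",
    re.IGNORECASE,
)

def ages_from_entities(entities, low=18, high=100):
    """Keep DATE entities whose text looks like an age strictly between low and high."""
    ages = []
    for text, label in entities:
        if label != "DATE":
            continue
        match = AGE_LIKE.match(text.strip())
        if match and low < int(match.group(1)) < high:
            ages.append(int(match.group(1)))
    return ages

# Hypothetical NER output, not produced by a real model run
entities = [("Dylan", "PERSON"), ("46 years old", "DATE"), ("2011", "DATE")]
print(ages_from_entities(entities))  # [46]
```

Real dates such as "2011" fail the pattern because the whole entity text must be age-shaped, not just contain a number.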

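【Comments】:

With all due respect, none of these examples are classified correctly, because the target expression is not a date but an age. Dates also include expressions such as "01.09.2001", "Thursday the 12th" and "yesterday", which can generally be placed on a timeline. "47 years old" is clearly not the same kind of expression and should be distinguished from dates. So some (e.g. pattern-based) post-processing is needed to reclassify these DATEs as AGE.

@ongenz That is a point worth noting. It is probably due to the limits of the entity labels: the model was trained to recognize ages as dates. It has to do with granularity and is part of the trade-off: do you want better results? Fine, let's generalize more with a lot of data... But isn't it easier to pattern-match a single pattern (or perhaps 3) than 1000 different number patterns? It also depends on the corpus used, which may not contain dates at all. He could also check for the date closest to the person entity.

Yes, I would choose a simple token-based pattern matching approach over a corpus-based NER model. But since the answer had already been given, my suggestion was meant to extend it.

【Answer 4】:

This works for all the cases you provided: https://repl.it/repls/NotableAncientBackground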

import re

sentences = ["Mr Bond, 67, is an engineer in the UK",
    "Amanda B. Bynes, 34, is an actress",
    "Peter Parker (45) will be our next administrator",
    "Mr. Dylan is 46 years old.",
    "Steve Jones, Age:32,",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
    "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
    "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
    "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
    "Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
    "Mr. Botein, age 43, has been a member of our Board since our formation."]
for i in sentences:
    # a number right after "Age:" or "Age "
    age = re.findall(r'Age[\:\s](\d{1,3})', i)
    # a 1-3 digit number between spaces, optionally followed by a comma
    age.extend(re.findall(r' (\d{1,3}),? ', i))
    # fall back to a number in parentheses
    if len(age) == 0:
        age = re.findall(r'\((\d{1,3})\)', i)
    print(i + " --- AGE: " + str(set(age)))

Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
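On the question of searching for everything at once: the separate `findall` calls can be folded into one compiled pattern with named alternatives, so each document is scanned a single time. This is a sketch, not the answer's code; the group names, the `find_ages` helper, and the "46 years old" alternative are assumptions added here, and the 18 to 100 plausibility check is taken from the question's code.

```python
import re

AGE_RE = re.compile(
    r"""
    \baged?\s*:?\s*(?P<keyword>\d{1,3})\b       # Age: 32 / age 43 / Age 68
    | ,\s*(?P<comma>\d{1,3})\s*,                # , 67,
    | \(\s*(?P<paren>\d{1,3})\s*\)              # (45)
    | \b(?P<years>\d{1,3})\s+years?\s+old\b     # 46 years old
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_ages(text, low=18, high=100):
    """Scan text once and return every plausible age, in order of appearance."""
    ages = []
    for match in AGE_RE.finditer(text):
        # exactly one named group takes part in each match
        age = int(match.group(match.lastgroup))
        if low < age < high:
            ages.append(age)
    return ages

print(find_ages("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # [68]
print(find_ages("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # []
```

The range check is what drops the footnote markers (14) and (15) here; a footnote number that happens to fall inside the range would still be reported, so the result is best treated as a list of candidates.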

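【Comments】:

【Answer 5】:

Judging from the examples you have given, here is the strategy I propose:

Step 1:

Check whether the sentence contains "Age", using the regex: (?i)(Age).*?(\d+)

This handles examples such as:

-- George F. Rubin(14)(15) Age 68 Trustee since: 1997.

-- Steve Jones, Age: 32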

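Step 2:

-- Check whether the sentence contains a "%" sign; if it does, remove the number that carries the sign

-- If the sentence does not contain "Age", write a regex to remove all 4-digit numbers. Example regex: \b\d{4}\b

-- Then see whether any digits are left in the sentence; that will be your age

Examples covered:

-- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" -- no numbers will be left

-- "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006" -- only 56 will be left

-- "Mr. Lovallo, 47, was appointed Treasurer in 2011." -- only 47 will be left

This may not be the complete answer, since you can have other patterns as well. But since you asked for a strategy and posted examples, this works for all of those cases.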
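The two steps can be sketched as a single function. The function name `extract_age` is hypothetical, and two details are assumptions on top of the description: the keyword check uses word boundaries (`\bage\b`) so that words such as "manager" do not trigger it, and percentages may carry a decimal part.

```python
import re

def extract_age(sentence):
    """Return the age found in the sentence as an int, or None if there is none."""
    # Step 1: an explicit "Age" keyword followed by the first number after it
    keyword = re.search(r"(?i)\bage\b.*?(\d+)", sentence)
    if keyword:
        return int(keyword.group(1))

    # Step 2: drop percentages, then 4-digit numbers (years), and use what is left
    cleaned = re.sub(r"\d+(?:\.\d+)?\s*%", " ", sentence)
    cleaned = re.sub(r"\b\d{4}\b", " ", cleaned)
    remaining = re.findall(r"\d+", cleaned)
    return int(remaining[0]) if remaining else None

print(extract_age("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # 68
print(extract_age("Mr. Lovallo, 47, was appointed Treasurer in 2011."))  # 47
print(extract_age("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # None
```

When several numbers survive step 2, this sketch simply takes the first one; share counts or dollar amounts would need their own removal rules before that choice is reliable.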

【Comments】:

【Answer 6】:

import re

x = ["Mr Bond, 67, is an engineer in the UK",
    "Amanda B. Bynes, 34, is an actress",
    "Peter Parker (45) will be our next administrator",
    "Mr. Dylan is 46 years old.",
    "Steve Jones, Age:32,"]

# take the first run of 1-3 digits in each sentence
[re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']

【Comments】:

I think he said there will be percentages and monetary values, and this regex will pick those up too.
