How to extract acronyms and abbreviations from a pandas DataFrame?

【Title】: How to extract acronyms and abbreviations from pandas dataframe? 【Posted】: 2022-01-21 21:38:19 【Question】:

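I have a pandas DataFrame with one column containing text data. I want to extract all unique acronyms and abbreviations from that text column.

So far I have a function that extracts all acronyms and abbreviations from a given text: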

import re

def extract_acronyms_abbreviations(text):
    # Maps each parenthesised abbreviation to the words that precede it
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the abbreviation has characters
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa

extract_acronyms_abbreviations(a)
{'FHH': 'family health history', 'NP': 'nurse practitioner'}
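To make the heuristic concrete: the function assumes one word per character of the abbreviation, so it takes the last `len(abbr)` words before the opening parenthesis. The sketch below is self-contained and uses two shortened sentences (stand-ins for the sample data, not the full strings) to show where that works cleanly and where it pulls in an extra word, because a hyphenated term such as "Data-Centric" counts as a single word.

```python
import re

def extract_acronyms_abbreviations(text):
    """Map each parenthesised abbreviation to the N words before it, N = len(abbr)."""
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        # One word per character of the abbreviation (a heuristic, not a guarantee)
        words = text[:match.start()].split()[-len(abbr):]
        eaa[abbr] = " ".join(words)
    return eaa

# Shortened example sentences, used only for illustration
clean = "It is rarely considered by a nurse practitioner (NP)."
hyphenated = "a firm proponent of Data-Centric AI (DCAI)"

clean_result = extract_acronyms_abbreviations(clean)
hyphenated_result = extract_acronyms_abbreviations(hyphenated)

print(clean_result)       # {'NP': 'nurse practitioner'}
print(hyphenated_result)  # {'DCAI': 'proponent of Data-Centric AI'}
```

This is why the expected output for DCAI contains "proponent of": four characters in the abbreviation, four whitespace-separated words taken.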

I want to apply this to the text column and extract all unique acronyms and abbreviations.

Sample data:

s = """The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People's Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). This trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI."""
k = """The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation."""
j = """The key global provider of sustainable packaging solutions has now taken a significant step towards reaching these ambitions by signing two 10-year virtual Power Purchase Agreements (VPPA) with global renewable energy developer BayWa r.e covering its operations in Europe. The agreements form the largest solar VPPA for the packaging industry in Europe, as well as the first major solar VPPA by a Finnish company."""
a = """Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."""
import pandas as pd

data = {"text": [s, k, j, a, s, k, j]}
df = pd.DataFrame(data)

Expected output:

{'MSWC': 'Multilingual Spoken Words Corpus',
 'DCAI': 'proponent of Data-Centric AI',
 'VPPA': 'virtual Power Purchase Agreements',
 'NP': 'nurse practitioner',
 'FHH': 'family health history'}

【Comments】:

【Answer 1】:

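Assuming df['text'] contains the text data you want to process:

df["acronyms"] = df["text"].apply(extract_acronyms_abbreviations)
# Creates a new column holding the dictionary returned by your function for each row.

Now merge those per-row dictionaries into one master dictionary: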

master_dict = dict()
for d in df["acronyms"].values:
    master_dict.update(d)
print(master_dict)
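Putting the pieces together, here is a minimal end-to-end sketch. It assumes the column is named `text`, uses two shortened sentences as stand-ins for the question's sample data, and drops duplicate rows before extracting; that last step is an optional optimisation added here, not part of the answer above. Because `dict.update` overwrites existing keys, each abbreviation appears once in the result, with the last definition seen winning.

```python
import re
import pandas as pd

def extract_acronyms_abbreviations(text):
    """Map each parenthesised abbreviation to the N words before it, N = len(abbr)."""
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()[-len(abbr):]
        eaa[abbr] = " ".join(words)
    return eaa

# Shortened stand-ins for the sample data, with repeated rows
a = ("Although family health history (FHH) is commonly accepted, "
     "it is rarely considered by a nurse practitioner (NP).")
j = "The company signed two 10-year virtual Power Purchase Agreements (VPPA) with a developer."
df = pd.DataFrame({"text": [a, j, a, j]})

# Apply the function to the column (a Series), not to the whole DataFrame
master_dict = {}
for d in df["text"].drop_duplicates().apply(extract_acronyms_abbreviations):
    master_dict.update(d)

print(master_dict)
```

Calling `df.apply(...)` on the whole DataFrame would pass each column as a Series to the function, which is why the column is selected first.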

【Comments】:
