How to merge multiword NER tags?

Posted 2023-03-29


【Title】: How to merge multiword NER tags? 【Posted】: 2019-10-01 08:15:18 【Question】:

I am currently using allennlp for NER tagging.

Code:

from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("...path to model...")
sentence = "Top Gun was inspired by a newspaper article."
result = predictor.predict(sentence)
lang = {}  # maps each tagged token to its NER tag
for word, tag in zip(result["words"], result["tags"]):
  if tag != "O":  # skip tokens that are not part of an entity
    lang[word] = tag

Is there a parser that can merge the output below so that it returns "Top Gun" with the tag "WORK_OF_ART"?

{'Top': 'B-WORK_OF_ART', 'Gun': 'L-WORK_OF_ART'}

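【Question comments】:

I have posted a solution below that merges multiword NER tags using convert_results; please check it and let me know.

【Answer 1】:

You can change the model path and try it with your own.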

from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.12.18.tar.gz") # change model path
sentence = "Did Uriah honestly think he could beat The Legend of Zelda in under three hours?"
result = predictor.predict(sentence)

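lang = {}

completeWord = ""

for word, tag in zip(result["words"], result["tags"]):
    if tag.startswith("B"):
        # first token of a multiword entity
        completeWord = completeWord + " " + word
    elif tag.startswith("I"):
        # inner token: keep extending the phrase
        completeWord = completeWord + " " + word
    elif tag.startswith("L"):
        # last token: store the phrase (it keeps a leading space) under its type
        completeWord = completeWord + " " + word
        lang[completeWord] = tag.split("-")[1]
        completeWord = ""
    else:
        # "O" tokens and single-token "U-" entities are stored with their raw tag
        lang[word] = tag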

print(lang)

>>> {' The Legend of Zelda': 'MISC',
 '?': 'O',
 'Did': 'O',
 'Uriah': 'U-PER',
 'beat': 'O',
 'could': 'O',
 'he': 'O',
 'honestly': 'O',
 'hours': 'O',
 'in': 'O',
 'think': 'O',
 'three': 'O',
 'under': 'O'}
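
A dict keyed by token text keeps only one entry per distinct word and also stores the `O` tokens, so it may be easier to collect just the entities, in order. The following is a minimal sketch, assuming the tags use the B-/I-/L-/U- prefixes seen in the output above. `merge_bilou` is a hypothetical helper, and the words and tags are written by hand here (the `I-MISC` tags are an assumption) rather than taken from a real model run:

```python
def merge_bilou(words, tags):
    """Return (start, end, text, label) tuples, one per complete entity.

    start and end are token indices; end is exclusive.
    """
    entities = []
    start = None  # index of the B- token of the span being built
    label = None  # entity type of the span being built
    for i, (word, tag) in enumerate(zip(words, tags)):
        if tag == "O":
            start, label = None, None
            continue
        position, ent_type = tag.split("-", 1)
        if position == "U":
            # single-token entity
            entities.append((i, i + 1, word, ent_type))
            start, label = None, None
        elif position == "B":
            start, label = i, ent_type
        elif position == "I":
            # an I- tag without a matching B- tag is malformed: drop the span
            if start is None or ent_type != label:
                start, label = None, None
        elif position == "L":
            if start is not None and ent_type == label:
                text = " ".join(words[start:i + 1])
                entities.append((start, i + 1, text, ent_type))
            start, label = None, None
    return entities


words = ["Did", "Uriah", "honestly", "think", "he", "could", "beat",
         "The", "Legend", "of", "Zelda", "in", "under", "three", "hours", "?"]
tags = ["O", "U-PER", "O", "O", "O", "O", "O",
        "B-MISC", "I-MISC", "I-MISC", "L-MISC", "O", "O", "O", "O", "O"]
entities = merge_bilou(words, tags)
print(entities)
# [(1, 2, 'Uriah', 'PER'), (7, 11, 'The Legend of Zelda', 'MISC')]
```

Keeping token offsets means the same phrase can appear more than once without one mention overwriting another. With a real predictor you would pass `result["words"]` and `result["tags"]` instead of the hand-written lists.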

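If this helps, please mark it as accepted.

【Comments】:

【Answer 2】:

This repository contains the download paths for all AllenNLP models; you can download whichever you need. Click here!

Download the pretrained AllenNLP NER model from the following path. Click here!

Install AllenNLP and allennlp-models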

pip install allennlp

pip install allennlp-models

Import the required AllenNLP modules

import allennlp

from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")

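The predict step calls AllenNLP's Predictor.predict, which takes a piece of unstructured text, finds the named entities in it, and classifies them into predefined categories such as person names, locations, and landmarks. The result contains words, tags, mask, and logits. It is used as a library (Python code).

BILOU method/schema (I believe AllenNLP uses the BILOU schema)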

|-------|-----------------------------------------|
| BEGIN | The first token of a multi-token entity |
|-------|-----------------------------------------|
| IN    | An inner token of a multi-token entity  |
|-------|-----------------------------------------|
| LAST  | The final token of a multi-token entity |
|-------|-----------------------------------------|
| UNIT  | A single-token entity                   |
|-------|-----------------------------------------|
| OUT   | A non-entity token                      |
|-------|-----------------------------------------|

Click here!
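
To make the scheme concrete, the sketch below goes in the opposite direction and turns entity spans into BILOU tags. `to_bilou` and the example spans are hypothetical and only illustrate the tag layout; AllenNLP does not require this step.

```python
def to_bilou(n_tokens, spans):
    """Build BILOU tags for n_tokens tokens.

    spans is a list of (start, end, label) token spans; end is exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "U-" + label          # single-token entity
        else:
            tags[start] = "B-" + label          # first token
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label          # inner tokens
            tags[end - 1] = "L-" + label        # last token
    return tags


words = ["Major", "Atlantic", "Coast", "cities", "are", "New", "York"]
spans = [(0, 3, "LOC"), (5, 7, "LOC")]
bilou_tags = to_bilou(len(words), spans)
for word, tag in zip(words, bilou_tags):
    print(word, tag)
```

A three-token entity gets one B, one I, and one L tag, while a two-token entity has no I tag at all, which is why a merger has to handle B followed directly by L.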

Input

Import the required packages

    import allennlp
    from allennlp.predictors.predictor import Predictor
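    # Note: this archive name refers to an SRL model; to get NER tags like the
    # ones shown below, point this at the NER model archive you downloaded.
    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")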
      

    document = """The U.S. is a country of 50 states covering a vast swath of North America, with Alaska in the northwest and Hawaii extending the nation’s presence into the Pacific Ocean. Major Atlantic Coast cities are New York, a global finance and culture center, and capital Washington, DC. Midwestern metropolis Chicago is known for influential architecture and on the west coast, Los Angeles' Hollywood is famed for filmmaking"""


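    ####### Convert Entities ##########
    def convert_results(allen_results):
        """Merge BILOU-tagged tokens into (phrase, entity_type) pairs."""
        ents = set()
        for word, tag in zip(allen_results["words"], allen_results["tags"]):
            if tag != "O":
                # "B-LOC" -> position "B", type "LOC"
                ent_position, ent_type = tag.split("-", 1)
                if ent_position == "U":
                    # single-token entity
                    ents.add((word, ent_type))
                else:
                    if ent_position == "B":
                        w = word          # start a new phrase
                    elif ent_position == "I":
                        w += " " + word   # extend the phrase
                    elif ent_position == "L":
                        w += " " + word   # final token of the phrase
                    # The phrase built so far is added at every step, so the
                    # set also contains prefixes such as "New" for "New York".
                    ents.add((w, ent_type))
        return ents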
    

    def allennlp_ner(document):
        return convert_results(predictor.predict(sentence=document))

    results = predictor.predict(sentence=document)
    
    [tuple(i) for i in zip(results["words"],results["tags"])]

    ##Output##
    [('The', 'O'),
    ('U.S.', 'U-LOC'),
    ('is', 'O'),
    ('a', 'O'),
    ('country', 'O'),
    ('of', 'O'),
    ('50', 'O'),
    ('states', 'O'),
    ('covering', 'O'),
    ('a', 'O'),
    ('vast', 'O'),
    ('swath', 'O'),
    ('of', 'O'),
    ('North', 'B-LOC'),
    ('America', 'L-LOC'),
    (',', 'O'),
    ('with', 'O'),
    ('Alaska', 'U-LOC'),
    ('in', 'O'),
    ('the', 'O'),
    ('northwest', 'O'),
    ('and', 'O'),
    ('Hawaii', 'U-LOC'),
    ('extending', 'O'),
    ('the', 'O'),
    ('nation', 'O'),
    ('’s', 'O'),
    ('presence', 'O'),
    ('into', 'O'),
    ('the', 'O'),
    ('Pacific', 'B-LOC'),
    ('Ocean', 'L-LOC'),
    ('.', 'O'),
    ('Major', 'B-LOC'),
    ('Atlantic', 'I-LOC'),
    ('Coast', 'L-LOC'),
    ('cities', 'O'),
    ('are', 'O'),
    ('New', 'B-LOC'),
    ('York', 'L-LOC'),
    (',', 'O'),
    ('a', 'O'),
    ('global', 'O'),
    ('finance', 'O'),
    ('and', 'O'),
    ('culture', 'O'),
    ('center', 'O'),
    (',', 'O'),
    ('and', 'O'),
    ('capital', 'O'),
    ('Washington', 'U-LOC'),
    (',', 'O'),
    ('DC', 'U-LOC'),
    ('.', 'O'),
    ('Midwestern', 'U-MISC'),
    ('metropolis', 'O'),
    ('Chicago', 'U-LOC'),
    ('is', 'O'),
    ('known', 'O'),
    ('for', 'O'),
    ('influential', 'O'),
    ('architecture', 'O'),
    ('and', 'O'),
    ('on', 'O'),
    ('the', 'O'),
    ('west', 'O'),
    ('coast', 'O'),
    (',', 'O'),
    ('Los', 'B-LOC'),
    ('Angeles', 'L-LOC'),
    ("'", 'O'),
    ('Hollywood', 'U-LOC'),
    ('is', 'O'),
    ('famed', 'O'),
    ('for', 'O'),
    ('filmmaking', 'O')]

    # Merging Multiword NER Tags using convert_results
    allennlp_ner(document)
    
    # The output looks like this:

    {('Alaska', 'LOC'),
    ('Chicago', 'LOC'),
    ('DC', 'LOC'),
    ('Hawaii', 'LOC'),
    ('Hollywood', 'LOC'),
    ('Los', 'LOC'),
    ('Los Angeles', 'LOC'),
    ('Major', 'LOC'),
    ('Major Atlantic', 'LOC'),
    ('Major Atlantic Coast', 'LOC'),
    ('Midwestern', 'MISC'),
    ('New', 'LOC'),
    ('New York', 'LOC'),
    ('North', 'LOC'),
    ('North America', 'LOC'),
    ('Pacific', 'LOC'),
    ('Pacific Ocean', 'LOC'),
    ('U.S.', 'LOC'),
    ('Washington', 'LOC')}
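
The output above also contains partial spans such as ('Los', 'LOC') and ('Major Atlantic', 'LOC'), because convert_results adds the phrase built so far on every B, I, and L tag. Below is a minimal sketch of a variant that adds a phrase only once the entity is complete, that is, on a `U` or `L` tag. `convert_results_complete` is a hypothetical name, and the input is a hand-shortened excerpt of the tagged tokens shown earlier, not a live model call:

```python
def convert_results_complete(allen_results):
    """Collect (phrase, entity_type) pairs, adding only complete entities."""
    ents = set()
    parts = []  # tokens of the multiword entity currently being built
    for word, tag in zip(allen_results["words"], allen_results["tags"]):
        if tag == "O":
            parts = []
            continue
        ent_position, ent_type = tag.split("-", 1)
        if ent_position == "U":
            ents.add((word, ent_type))
            parts = []
        elif ent_position == "B":
            parts = [word]
        elif ent_position == "I":
            parts.append(word)
        elif ent_position == "L":
            parts.append(word)
            ents.add((" ".join(parts), ent_type))  # entity is complete here
            parts = []
    return ents


sample = {
    "words": ["Major", "Atlantic", "Coast", "cities", "are", "New", "York",
              ",", "Los", "Angeles", "'", "Hollywood"],
    "tags": ["B-LOC", "I-LOC", "L-LOC", "O", "O", "B-LOC", "L-LOC",
             "O", "B-LOC", "L-LOC", "O", "U-LOC"],
}
merged = convert_results_complete(sample)
print(sorted(merged))
```

This sketch trusts the tag sequence: it does not check that the I and L tags carry the same entity type as the preceding B tag, so add that check if your model can emit inconsistent sequences.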

【Comments】:
