How do I extract degrees/education and year from a resume in Python using NLTK?

Posted 2023-03-12

【Title】: How do I extract degrees / education and year from a resume? in python using NLTK 【Posted】: 2020-09-21 17:11:14 【Question】:

I have tried the code below, but it does not extract the correct education and year from the resume.

import re
import spacy
from nltk.corpus import stopwords

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# Grab all general stop words
STOPWORDS = set(stopwords.words('english'))

# Education Degrees
EDUCATION = [
            'BE','B.E.', 'B.E', 'BS', 'B.S','C.A.','c.a.','B.Com','B. Com','M. Com', 'M.Com','M. Com .',
            'ME', 'M.E', 'M.E.', 'MS', 'M.S',
            'BTECH', 'B.TECH', 'M.TECH', 'MTECH',
            'PHD', 'phd', 'ph.d', 'Ph.D.','MBA','mba','graduate', 'post-graduate','5 year integrated masters','masters',
            'SSC', 'HSC', 'CBSE', 'ICSE', 'X', 'XII'
        ]

def extract_education(resume_text):
    nlp_text = nlp(resume_text)
    # Sentence Tokenizer
    nlp_text = [sent.string.strip() for sent in nlp_text.sents]
    edu = {}
    # Extract education degree
    for index, text in enumerate(nlp_text):
        #print(index, text), print('-'*50)
        for tex in text.split():
            # Replace all special symbols
            tex = re.sub(r'[?|$|.|!|,]', r'', tex)
            print(tex)
            if tex.upper() in EDUCATION and tex not in STOPWORDS:
                edu[tex] = text + nlp_text[index + 1]
                print(edu.keys())

print(extract_education(text)) #resume parsed into text

Text:

B.Tech Computer Science  -  2016, MSc Computer Science - 2018 and other text...... (focusing on degree part of resume)

The code above displays nothing:

[]    #empty list

Expected output:

[[B.Tech, 2016], [MSc, 2018]]

Can someone help me correct this code and get the year of passing for each degree? Thanks in advance!

【Comments】:

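Please include your results, with some examples of the output you expect and the output you get.

@Mandy8055: Please recheck the question; the current output, the expected output, and a sample text have been added.

Does this help? Python Implementation

Not really... As far as I know, regex won't help, because it won't pick up the degrees when I parse other resumes. So how can we generalize this and get the desired output?

【Answer 1】: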

Change .string to .text:

[sent.text.strip() for sent in nlp_text.sents]

Add return list(edu.keys()) at the end of the function to return the list of degrees.

This will give you the degree names, for example ME, CBSE.
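Degree names alone still leave out the year that the question asks for. The following is only a minimal sketch of one way to pair each degree with a year, not a general resume parser. It assumes the year appears after the degree and before the next degree, as in the sample text. It uses only the standard `re` module instead of spaCy or NLTK so that it runs without a model download. The helper `normalize`, the shortened `EDUCATION` list (with 'MSc' added, since the original list does not contain it) and the tiny `COMMON_WORDS` set (a stand-in for the NLTK stop words) are all assumptions made for illustration.

```python
import re

# Shortened degree list for illustration. 'MSc' is added because the sample
# text uses it and the original EDUCATION list has no entry for it.
EDUCATION = ['BE', 'B.E.', 'BS', 'B.S', 'ME', 'M.E', 'MS', 'M.S',
             'BTECH', 'B.TECH', 'MTECH', 'M.TECH', 'MSc',
             'PHD', 'Ph.D.', 'MBA']

# Hypothetical stand-in for the NLTK stop words: ordinary lowercase words
# that would otherwise collide with degree abbreviations.
COMMON_WORDS = {'me', 'be'}


def normalize(token):
    """Drop punctuation and upper-case, so 'B.Tech' and 'BTECH' compare equal."""
    return re.sub(r'[^A-Za-z0-9]', '', token).upper()


DEGREES = {normalize(d) for d in EDUCATION}
YEAR = re.compile(r'\b(?:19|20)\d{2}\b')  # four-digit years 1900-2099


def extract_education(resume_text):
    """Return [degree, year] pairs; year is None if no year follows the degree."""
    # First pass: record where each degree token starts and ends.
    hits = []
    for m in re.finditer(r'\S+', resume_text):
        word = m.group().strip(',;:()')
        if word not in COMMON_WORDS and normalize(word) in DEGREES:
            hits.append((m.start(), m.end(), word))

    # Second pass: look for a year between this degree and the next one.
    results = []
    for n, (start, end, degree) in enumerate(hits):
        window_end = hits[n + 1][0] if n + 1 < len(hits) else len(resume_text)
        year = YEAR.search(resume_text, end, window_end)
        results.append([degree, year.group() if year else None])
    return results


text = "B.Tech Computer Science  -  2016, MSc Computer Science - 2018 and other text"
print(extract_education(text))  # [['B.Tech', '2016'], ['MSc', '2018']]
```

Normalizing both the list entries and the tokens matters here: the original code strips the dots from each token but compares against entries that still contain dots and mixed case, so 'B.E.' or 'B.Com' can never match. As noted in the comments, a layout-based approach like this will not generalize to every resume. It only covers resumes where each year follows its degree.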

