Getting a date from JATS XML with BeautifulSoup

How can I extract the date (epub) from JATS XML using BeautifulSoup?

<pub-date pub-type="epub">
<day>12</day>
<month>09</month>
<year>2011</year>
</pub-date>

→ 2011-09-12

<pub-date pub-type="collection">
<year>2011</year>
</pub-date>

should be ignored.

Answer

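In your example, pub-type is an attribute of the pub-date element, and the value of that attribute is "epub". To navigate the document tree of a standardized format such as JATS XML, you will want lxml, either standalone or as the parser within BeautifulSoup.

Here are two functions that use lxml.etree: they locate the candidate date fields with XPath and parse a field only when its pub-type attribute is "epub". I based them on PLOS's JATS XML format, which I expect applies here too.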

import datetime
import lxml.etree as et

def parse_article_date(date_element, date_format='%Y %m %d'):
    """
    For an article date element, convert XML fields to a datetime object
    :param date_element: pub-date element whose children are day, month, and year
    :param date_format: string format used to convert to datetime object
    :return: datetime object based on XML date fields
    """
    day = ''
    month = ''
    year = ''
    for item in date_element:  # iterate direct children; getchildren() is deprecated in lxml
        if item.tag == 'day':
            day = item.text
        if item.tag == 'month':
            month = item.text
        if item.tag == 'year':
            year = item.text
    date = (year, month, day)
    string_date = ' '.join(date)
    date = datetime.datetime.strptime(string_date, date_format)

    return date

def get_article_pubdate(article_file, tag_path_elements=None, string_=False):
    """
    For a local article file, get its date of publication
    :param article_file: the xml file for a single article
    :param tag_path_elements: sequence of path components, joined with '/' to form the XPath of the pub-date elements
    :param string_: defaults to False. If True, returns a date string instead of datetime object
    :return: dict of date type mapped to datetime object for that article
    """
    pub_date = {}
    if tag_path_elements is None:
        tag_path_elements = ("/",
                             "article",
                             "front",
                             "article-meta",
                             "pub-date")

    article_tree = et.parse(article_file)
    article_root = article_tree.getroot()
    tag_location = '/'.join(tag_path_elements)
    pub_date_fields = article_root.xpath(tag_location)

    for element in pub_date_fields:
        pub_type = element.get('pub-type')
        if pub_type == 'epub':
            date = parse_article_date(element)
            pub_date[pub_type] = date

    if string_:
        for key, value in pub_date.items():
            if value:
                pub_date[key] = value.strftime('%Y-%m-%d')  # you can set this to any date format

    return pub_date
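
The pub-type check can also be pushed into the path expression itself with an attribute predicate. The following is a minimal sketch of that idea, not part of the answer above: it uses the standard library's xml.etree.ElementTree (whose limited XPath subset supports this predicate, and whose API lxml.etree largely mirrors). The function name get_epub_date and the article/front/article-meta wrapper in the sample are assumptions for illustration.

```python
import datetime
import xml.etree.ElementTree as ET

# Sample document: the two pub-date elements from the question, wrapped in
# an assumed article/front/article-meta hierarchy.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <pub-date pub-type="epub">
        <day>12</day>
        <month>09</month>
        <year>2011</year>
      </pub-date>
      <pub-date pub-type="collection">
        <year>2011</year>
      </pub-date>
    </article-meta>
  </front>
</article>"""


def get_epub_date(xml_text):
    """Return the epub publication date as 'YYYY-MM-DD', or None if absent."""
    root = ET.fromstring(xml_text)
    # The predicate keeps only pub-date elements whose pub-type is "epub",
    # so the "collection" entry is never considered.
    node = root.find(".//pub-date[@pub-type='epub']")
    if node is None:
        return None
    parts = [node.findtext(name) for name in ("year", "month", "day")]
    if not all(parts):
        # An incomplete date (e.g. year only) is treated as missing.
        return None
    year, month, day = (int(p) for p in parts)
    return datetime.date(year, month, day).isoformat()


print(get_epub_date(SAMPLE_XML))  # prints 2011-09-12
```

Returning None for a missing or incomplete date is a design choice of this sketch; raising an exception instead may suit stricter pipelines.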
