Trying to grab dates using Python, BeautifulSoup and/or Regex

【Posted】: 2020-07-04 08:30:31

【Question】:

I have this list:

tags = [```<div class="cat-info-index">
<h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a>
<br/>
</div>, <div class="cat-info-index">
<h2><a href="catalogue/lady_seated_at_a_virginal.html">A Lady Seated at a Virginal </a></h2> c. 1670–1675
<br/> Oil on canvas
<br/> 51.5 x 45.5 cm. (20 1/4 x 17 7/8 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a></div>, <div class="cat-info-index"><h2><a href="catalogue/allegory_of_faith.html">Allegory of Faith</a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 114.3 x 88.9 cm. (45 x 35 in.)
<br/>
<a href="museum_pages.html#MET" target="_blank">Metropolitan Museum of Art</a>, New York
<br/>
<a href="mailto:communications@metmuseum.org"><em> museum contact</em></a></div>, <div class="cat-info-index">
<h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
<br/> Oil on canvas
<br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
<br/> National Museum of Western Art, Tokyo</div>, <div class="cat-info-index">
<h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
<br/> Oil on canvas
<br/> c. 1670
<br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
<br/>
<a href="museumsthree.html#KAPLAN">The Leiden Collection</a>, New York</div>```]

HTML:

<div class="cat-info-index">
    <h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
    <br/> Oil on canvas
    <br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
    <br/>
    <a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
    <br/>
    <a href="mailto:info@vermeerdelft.nl">email contact</a>
    <br/>
</div>,
<div class="cat-info-index">
    <h2><a href="catalogue/lady_seated_at_a_virginal.html">A Lady Seated at a Virginal </a></h2> c. 1670–1675
    <br/> Oil on canvas
    <br/> 51.5 x 45.5 cm. (20 1/4 x 17 7/8 in.)
    <br/>
    <a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
    <br/>
    <a href="mailto:info@vermeerdelft.nl">email contact</a></div>,
<div class="cat-info-index"><h2><a href="catalogue/allegory_of_faith.html">Allegory of Faith</a></h2> c. 1670–1674
    <br/> Oil on canvas
    <br/> 114.3 x 88.9 cm. (45 x 35 in.)
    <br/>
    <a href="museum_pages.html#MET" target="_blank">Metropolitan Museum of Art</a>, New York
    <br/>
    <a href="mailto:communications@metmuseum.org"><em> museum contact</em></a></div>,
<div class="cat-info-index">
    <h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
    <br/> Oil on canvas
    <br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
    <br/> National Museum of Western Art, Tokyo
</div>,
<div class="cat-info-index">
    <h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
    <br/> Oil on canvas
    <br/> c. 1670
    <br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
    <br/>
    <a href="museumsthree.html#KAPLAN">The Leiden Collection</a>, New York
</div>

I want to grab:

['1670–1674', '1670-1675', '1655', '1670']

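I have tried regex, but I only seem to be able to match either the date ranges with a dash or the standalone years, not both. Any suggestions?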

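【Comments】:

Can you share it as formatted HTML? It is hard to read.

Also, it is not clear what exactly you are after. Can you edit your question to include the exact output you want?

We need formatted HTML, please. This is very hard to read. Help us help you. Thanks in advance.

【Answer 1】: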

Try this:

from bs4 import BeautifulSoup
html = """<div class="cat-info-index">
    <h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
    <br/> Oil on canvas
    <br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
    <br/>
    <a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
    <br/>
    <a href="mailto:info@vermeerdelft.nl">email contact</a>
    <br/>
</div>,
<div class="cat-info-index">
    <h2><a href="catalogue/lady_seated_at_a_virginal.html">A Lady Seated at a Virginal </a></h2> c. 1670–1675
    <br/> Oil on canvas
    <br/> 51.5 x 45.5 cm. (20 1/4 x 17 7/8 in.)
    <br/>
    <a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
    <br/>
    <a href="mailto:info@vermeerdelft.nl">email contact</a></div>,
<div class="cat-info-index"><h2><a href="catalogue/allegory_of_faith.html">Allegory of Faith</a></h2> c. 1670–1674
    <br/> Oil on canvas
    <br/> 114.3 x 88.9 cm. (45 x 35 in.)
    <br/>
    <a href="museum_pages.html#MET" target="_blank">Metropolitan Museum of Art</a>, New York
    <br/>
    <a href="mailto:communications@metmuseum.org"><em> museum contact</em></a></div>,
<div class="cat-info-index">
    <h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
    <br/> Oil on canvas
    <br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
    <br/> National Museum of Western Art, Tokyo
</div>,
<div class="cat-info-index">
    <h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
    <br/> Oil on canvas
    <br/> c. 1670
    <br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
    <br/>
    <a href="museumsthree.html#KAPLAN">The Leiden Collection</a>, New York
</div>"""
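# Parse the HTML before querying it.
soup = BeautifulSoup(html, 'html.parser')

# Collect the text node that directly follows each <h2>.
# Note: this yields values such as 'c. 1670–1674', and for the last
# entry it yields '(attributed to Vermeer)', since that date sits on a later line.
dates = []
for h2 in soup.find_all('h2'):
    dates.append(h2.next_sibling.strip())
print(dates)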
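
The snippet above returns the raw text after each heading, so the values still carry the "c." prefix, and the last entry has no date on that line at all. One possible refinement is sketched below: it walks the text nodes that follow each `<h2>` and keeps the first one that matches a year or a year range. The names `YEAR_RE` and `extract_dates` are made up for this illustration, the HTML is a shortened copy of the question's sample, and the pattern assumes that every date is a four-digit year, optionally followed by a hyphen or en dash and a second four-digit year.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the HTML from the question.
html = """<div class="cat-info-index">
    <h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
    <br/> Oil on canvas
    <br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
</div>
<div class="cat-info-index">
    <h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
    <br/> Oil on canvas
    <br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
</div>
<div class="cat-info-index">
    <h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
    <br/> Oil on canvas
    <br/> c. 1670
    <br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
</div>"""

# Hypothetical pattern: a four-digit year, optionally followed by
# a hyphen or en dash and a second four-digit year.
YEAR_RE = re.compile(r"\b\d{4}(?:\s*[-–]\s*\d{4})?\b")


def extract_dates(markup):
    """Return the first year or year range found after each <h2>."""
    soup = BeautifulSoup(markup, "html.parser")
    dates = []
    for h2 in soup.find_all("h2"):
        # Only look at siblings after the heading, so digits in a title
        # are never picked up by mistake.
        for sibling in h2.next_siblings:
            # Text nodes are str subclasses; tags such as <br/> are not.
            if isinstance(sibling, str):
                match = YEAR_RE.search(sibling)
                if match:
                    dates.append(match.group())
                    break
    return dates


print(extract_dates(html))
# ['1670–1674', '1655', '1670']
```

Because the character class accepts both `-` and `–`, the same pattern should cover ranges written with either kind of dash. If a page uses other date formats (for example two-digit abbreviations such as "1670–74"), the pattern would need to be adjusted.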
