Trying to grab dates using Python, BeautifulSoup and/or Regex
【Title】: Trying to grab dates using Python, BeautifulSoup and/or Regex
【Posted】: 2020-07-04 08:30:31
【Question】: I have this list:
tags = [<div class="cat-info-index">
<h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a>
<br/>
</div>, <div class="cat-info-index">
<h2><a href="catalogue/lady_seated_at_a_virginal.html">A Lady Seated at a Virginal </a></h2> c. 1670–1675
<br/> Oil on canvas
<br/> 51.5 x 45.5 cm. (20 1/4 x 17 7/8 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a></div>, <div class="cat-info-index"><h2><a href="catalogue/allegory_of_faith.html">Allegory of Faith</a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 114.3 x 88.9 cm. (45 x 35 in.)
<br/>
<a href="museum_pages.html#MET" target="_blank">Metropolitan Museum of Art</a>, New York
<br/>
<a href="mailto:communications@metmuseum.org"><em> museum contact</em></a></div>, <div class="cat-info-index">
<h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
<br/> Oil on canvas
<br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
<br/> National Museum of Western Art, Tokyo</div>, <div class="cat-info-index">
<h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
<br/> Oil on canvas
<br/> c. 1670
<br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
<br/>
<a href="museumsthree.html#KAPLAN">The Leiden Collection</a>, New York</div>]
HTML:
<div class="cat-info-index">
<h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a>
<br/>
</div>,
<div class="cat-info-index">
<h2><a href="catalogue/lady_seated_at_a_virginal.html">A Lady Seated at a Virginal </a></h2> c. 1670–1675
<br/> Oil on canvas
<br/> 51.5 x 45.5 cm. (20 1/4 x 17 7/8 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a></div>,
<div class="cat-info-index"><h2><a href="catalogue/allegory_of_faith.html">Allegory of Faith</a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 114.3 x 88.9 cm. (45 x 35 in.)
<br/>
<a href="museum_pages.html#MET" target="_blank">Metropolitan Museum of Art</a>, New York
<br/>
<a href="mailto:communications@metmuseum.org"><em> museum contact</em></a></div>,
<div class="cat-info-index">
<h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
<br/> Oil on canvas
<br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
<br/> National Museum of Western Art, Tokyo
</div>,
<div class="cat-info-index">
<h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
<br/> Oil on canvas
<br/> c. 1670
<br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
<br/>
<a href="museumsthree.html#KAPLAN">The Leiden Collection</a>, New York
</div>
I want to grab:
['1670–1674', '1670-1675', '1655', '1670']
I've tried regex, but I only seem to be able to grab either the date ranges with a dash or the standalone years, not both. Any suggestions?
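【Comments】:
Can you share it as formatted HTML? It's hard to read.
Also, it's unclear what exactly you are after. Can you edit your question to include the exact output you want?
We need formatted HTML, please. This is very hard to read. Help us help you. Thanks in advance.
【Answer 1】: Try this: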
from bs4 import BeautifulSoup
html = """<div class="cat-info-index">
<h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a>
<br/>
</div>,
<div class="cat-info-index">
<h2><a href="catalogue/lady_seated_at_a_virginal.html">A Lady Seated at a Virginal </a></h2> c. 1670–1675
<br/> Oil on canvas
<br/> 51.5 x 45.5 cm. (20 1/4 x 17 7/8 in.)
<br/>
<a href="museum_pages.html#LONDONNG" target="_blank">National Gallery</a>, London
<br/>
<a href="mailto:info@vermeerdelft.nl">email contact</a></div>,
<div class="cat-info-index"><h2><a href="catalogue/allegory_of_faith.html">Allegory of Faith</a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 114.3 x 88.9 cm. (45 x 35 in.)
<br/>
<a href="museum_pages.html#MET" target="_blank">Metropolitan Museum of Art</a>, New York
<br/>
<a href="mailto:communications@metmuseum.org"><em> museum contact</em></a></div>,
<div class="cat-info-index">
<h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
<br/> Oil on canvas
<br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
<br/> National Museum of Western Art, Tokyo
</div>,
<div class="cat-info-index">
<h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
<br/> Oil on canvas
<br/> c. 1670
<br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
<br/>
<a href="museumsthree.html#KAPLAN">The Leiden Collection</a>, New York
</div>"""
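# Parse the HTML string with Python's built-in parser
soup = BeautifulSoup(html, 'html.parser')
# Collect the text node that immediately follows each <h2>
l = []
for h2 in soup.find_all('h2'):
    l.append(h2.next_sibling.strip())
l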
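Note that `h2.next_sibling` returns the whole text node after each heading, so it yields strings such as 'c. 1670–1674' and, for the last painting, '(attributed to Vermeer)' rather than a year. One possible way to get only the years is to run a regex over the text that sits directly inside each div. The following is a minimal sketch, not a tested solution for the full page: the names `YEAR_RE`, `extract_dates` and `sample_html` are made up for illustration, the sample is a shortened copy of the HTML above, and it assumes that the first four-digit number outside the `<h2>` is the date.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# A four-digit year, optionally followed by a hyphen or en dash and a second year.
YEAR_RE = re.compile(r"\b\d{4}(?:\s*[-–]\s*\d{4})?\b")

def extract_dates(html):
    """Return the first year or year range found in each cat-info-index div."""
    soup = BeautifulSoup(html, "html.parser")
    dates = []
    for div in soup.find_all("div", class_="cat-info-index"):
        # Join only the text nodes that are direct children of the div,
        # which leaves out the painting title inside <h2> and the link texts.
        text = "".join(
            child for child in div.children if isinstance(child, NavigableString)
        )
        match = YEAR_RE.search(text)
        if match:  # divs without a recognisable year are skipped
            dates.append(match.group())
    return dates

# Shortened sample (assumption: same structure as the HTML in the question).
sample_html = """<div class="cat-info-index">
<h2><a href="catalogue/lady_standing_at_a_virginal.html">A Lady Standing at a Virginal </a></h2> c. 1670–1674
<br/> Oil on canvas
<br/> 51.7 x 45.2 cm. (20 3/8 x 17 3/4 in.)
</div>
<div class="cat-info-index">
<h2><a href="catalogue/praxedis.html">Saint Praxedis </a></h2> 1655
<br/> Oil on canvas
<br/> 101.6 x 82.6 cm. (40 x 32 1/2 in.)
</div>
<div class="cat-info-index">
<h2><a href="catalogue/baron_rolin.html">A Young Woman Seated at the Virginals </a></h2> (attributed to Vermeer)
<br/> Oil on canvas
<br/> c. 1670
<br/> 25.2 x 20 cm. (9 7/8 x 7 7/8 in.)
</div>"""

print(extract_dates(sample_html))
```

The optional group in the pattern is what lets a single expression match both a range and a standalone year. If a div ever listed a four-digit dimension before the date, this sketch would pick up the wrong number, so the pattern may need tightening for other pages.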