Beautiful Soup - Extracting info

【Title】: Beautiful Soup - Extracting info 【Posted】: 2020-01-19 18:29:44 【Question】:

I'm having some trouble extracting information from this HTML excerpt.

So far, I am using the following to extract the HTML shown below.

#//////////////////////////////
from bs4 import BeautifulSoup

with open('soup.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Collect every JSON-LD <script> tag in the page
base = soup.find_all('script', type="application/ld+json")

print(base)
#//////////////////////////////
How do I extract the URL from each entry? How do I extract the name from each entry?

This is what I get:

[<script type="application/ld+json">
    {"@context":"http://schema.org","@type":"Organization","name":"Redfin","logo":"https://ssl.cdn-redfin.com/static-images/images/redfin-logo-transparent-bg-260x66.png","url":"https://www.redfin.com"}
</script>,
<script type="application/ld+json">
    [{"@context":"http://schema.org","name":"7316 Green St, New Orleans, LA 70118","url":"/LA/New-Orleans/7316-Green-St-70118/home/79443425","address":{"@type":"PostalAddress","streetAddress":"7316 Green St","addressLocality":"New Orleans","addressRegion":"LA","postalCode":"70118","addressCountry":"US"},"numberOfRooms":"6","@type":"SingleFamilyResidence"},{"@context":"http://schema.org","@type":"Product","name":"7316 Green St, New Orleans, LA 70118","offers":{"@type":"Offer","price":"624900","priceCurrency":"USD","url":"/LA/New-Orleans/7316-Green-St-70118/home/79443425"}}]
</script>,
<script type="application/ld+json">
    [{"@context":"http://schema.org","name":"257 Cherokee St #2, New Orleans, LA 70118","url":"/LA/New-Orleans/257-Cherokee-St-70118/unit-2/home/144766248","address":{"@type":"PostalAddress","streetAddress":"257 Cherokee St #2","addressLocality":"New Orleans","addressRegion":"LA","postalCode":"70118","addressCountry":"US"},"numberOfRooms":"2","@type":"SingleFamilyResidence"},{"@context":"http://schema.org","@type":"Product","name":"257 Cherokee St #2, New Orleans, LA 70118","offers":{"@type":"Offer","price":"129500","priceCurrency":"USD","url":"/LA/New-Orleans/257-Cherokee-St-70118/unit-2/home/144766248"}}]
</script>, <script type="application/ld+json">

【Comments on the question】:

【Answer 1】:

The result you are showing is a list of dictionaries; you should iterate over it and pick out the values you need.
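
As a minimal sketch of what that iteration involves: `find_all` returns `Tag` elements rather than dictionaries, so each script's text most likely has to go through `json.loads` first, which yields a dict or a list depending on the script. The inline HTML below is a shortened, hypothetical stand-in for `soup.html`, not the real file.

```python
import json
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the contents of soup.html
html = """
<script type="application/ld+json">
{"@type": "Organization", "name": "Redfin", "url": "https://www.redfin.com"}
</script>
<script type="application/ld+json">
[{"@type": "SingleFamilyResidence", "name": "7316 Green St, New Orleans, LA 70118",
  "url": "/LA/New-Orleans/7316-Green-St-70118/home/79443425"}]
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script', type="application/ld+json")

# Each element is a bs4 Tag; its text is JSON that still has to be parsed
parsed = [json.loads(tag.string) for tag in scripts]

for data in parsed:
    # Shows whether this script held a single object or a list of objects
    print(type(data).__name__)
```

Because some scripts hold a single object and others hold a list, code that reads `name` and `url` needs to handle both shapes.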

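【Comments】:

【Answer 2】:

Use json to read each script's contents as a dictionary/JSON structure; you can then access items by key name.

You need to add:

import json

Then you can do this: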

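#//////////////////////////////
import json
from bs4 import BeautifulSoup

with open('soup.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

base = soup.find_all('script', type="application/ld+json")

for each in base:
    jsonData = json.loads(each.string)
    # A script may hold a single object or a list of objects
    items = jsonData if isinstance(jsonData, list) else [jsonData]
    for item in items:
        # Skip objects without both keys (e.g. "Product" entries have no top-level "url")
        if 'name' in item and 'url' in item:
            print('Name: %s\nURL: %s\n' % (item['name'], item['url']))
#//////////////////////////////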
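
The listing URLs in the question's output are relative paths. As a rough sketch, and assuming they resolve against `https://www.redfin.com` (the `url` of the Organization entry; the question does not confirm this base), `urllib.parse.urljoin` can turn them into absolute URLs. The helper name `collect_listings` and the shortened sample strings are hypothetical; in practice the strings would come from each script tag's text.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.redfin.com"  # assumed base for the relative paths


def collect_listings(json_texts, base_url=BASE_URL):
    """Return (name, absolute_url) pairs found in JSON-LD strings."""
    results = []
    for text in json_texts:
        data = json.loads(text)
        # Normalise: a script may hold one object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            name, url = item.get("name"), item.get("url")
            # Keep only objects that carry both a name and a top-level url
            if name and url:
                results.append((name, urljoin(base_url, url)))
    return results


# Shortened, hypothetical sample modelled on the question's output
sample = [
    '{"@type": "Organization", "name": "Redfin", "url": "https://www.redfin.com"}',
    '[{"@type": "SingleFamilyResidence", "name": "7316 Green St, New Orleans, LA 70118",'
    ' "url": "/LA/New-Orleans/7316-Green-St-70118/home/79443425"},'
    ' {"@type": "Product", "name": "7316 Green St, New Orleans, LA 70118",'
    ' "offers": {"@type": "Offer", "price": "624900",'
    ' "url": "/LA/New-Orleans/7316-Green-St-70118/home/79443425"}}]',
]

listings = collect_listings(sample)
for name, url in listings:
    print(name, url)
```

The "Product" entry is skipped here because its URL sits under `offers` rather than at the top level; reading `item["offers"]["url"]` would be one way to include it if needed.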

【Comments】:
