Why does the BeautifulSoup library ignore only one specific element?

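I'm trying to scrape per-country coronavirus case data from Worldometer. For some reason I can't target specific TR tags by class (the classes are simply missing in the Python console, although they are present in Chrome DevTools). So I target all tr elements and then filter them. Everything works, but for some strange reason China is missing from the top ten countries. China's HTML tag is no different from the others, yet I still can't get it to show up. Any ideas?

'''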

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr") 
startingIndex = None

for index,each in enumerate(all_rows,start=0):
    if "World" in each.text: # After that word "WORLD" comes TR elements of individual countries. 
        startingIndex = index
        break

top10 = all_rows[startingIndex+1:startingIndex+11] # here i select top 10 countries that i need.

for index,each in enumerate(top10,start = 1):
    droebiti_list = each.text.split("\n")
    print(f"{index}){droebiti_list[1]} - {droebiti_list[6]}") # and printing info about recovered people

'''

Answer

I can't guarantee that this code works (I'm in the wrong environment to test it), but it should work for scraping the data:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr")

for row in all_rows:  # As you said, this goes through all 'tr' elements
    ScrapedResult = []
    cells = row.find_all("td")  # Search for 'td' elements inside this row only, not the whole soup
    for cell in cells:  # Go through the td elements and collect their text
        ScrapedResult.append(cell.getText())
    print(ScrapedResult)

You just need to adapt ScrapedResult to your needs.
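As a minimal sketch of such an adaptation: assuming each ScrapedResult list holds the cell texts of one row, with the country name in column 0 and the recovered count in column 5 (indices inferred from the question's code, not checked against the live page), a hypothetical helper `pick_fields` could keep just those two fields and skip rows that are too short, such as header rows that contain no td cells. The sample rows are made up for illustration.

```python
def pick_fields(scraped_result, name_col=0, recovered_col=5):
    """Return (name, recovered) from one row's cell texts, or None if the row is too short.

    The column indices are assumptions for illustration only.
    """
    needed = max(name_col, recovered_col)
    if len(scraped_result) <= needed:
        return None
    return scraped_result[name_col].strip(), scraped_result[recovered_col].strip()


# Hypothetical sample rows standing in for ScrapedResult lists
sample_rows = [
    [],  # a header row holds th cells, so find_all("td") yields nothing
    ["Country A", "1,000", "", "10", "", "900"],
]
picked = [pick_fields(row) for row in sample_rows]
print(picked)
```

Rows for which the helper returns None can simply be filtered out before printing.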

Another answer

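In the page source (the content variable), the countries are not in the same order as in the displayed table (the order is probably changed by a JavaScript script or something similar).

So you can simply collect all the data and then re-sort it by total cases.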

import requests
from bs4 import BeautifulSoup

# Get the page source and parse it
r = requests.get("https://www.worldometers.info/coronavirus/")
contents = r.content
soup = BeautifulSoup(contents, "html.parser")
table = soup.find("tbody")
countries = table.find_all("tr")

# Here we will store the top ten countries' values, ordered by total cases (descending)
total = [0] * 10
names = [""] * 10
recovered = [""] * 10

# Compare each "new" country with the current top ten
for index, each in enumerate(countries[8:]):  # countries[8:] skips the first 8 rows of the table body
    droebiti_list = each.text.split("\n")
    cases = int(droebiti_list[2].replace(",", ""))
    for j in range(10):
        if cases > total[j]:
            # Shift the lower-ranked entries down one place to make room at position j
            for jj in reversed(range(j + 1, 10)):
                recovered[jj] = recovered[jj - 1]
                names[jj] = names[jj - 1]
                total[jj] = total[jj - 1]

            recovered[j] = droebiti_list[6]
            names[j] = droebiti_list[1]
            total[j] = cases
            break

    print(f"{index}){droebiti_list[1]} - {droebiti_list[2]}")

# Print the results
for k in range(10):
    print(names[k], '\t\t\t', recovered[k])

Interesting output:

USA       36,254
Spain     64,727
Italy     35,435
France    27,718
Germany   64,300
UK        N/A
China     77,663
Iran      45,983
Turkey    3,957
Belgium   6,707
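The same collect-then-sort idea can probably be written more compactly with Python's built-in `sorted()`. The sketch below is not the answerer's code: the helpers `parse_count` and `top_by_total` are hypothetical, the column indices (0 for name, 1 for total cases, 5 for recovered) are assumptions inferred from the code above, and the sample rows are made up so the example runs without network access.

```python
def parse_count(text):
    """Convert a string such as '36,254' to an int; return None for blanks or 'N/A'."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned.isdigit() else None


def top_by_total(rows, n=10, name_col=0, total_col=1, recovered_col=5):
    """Return the n rows with the most total cases as (name, total, recovered) tuples.

    The column indices are assumptions for illustration only.
    """
    records = []
    for cells in rows:
        if len(cells) <= max(name_col, total_col, recovered_col):
            continue  # skip header or malformed rows
        total = parse_count(cells[total_col])
        if total is None:
            continue  # skip rows without a numeric total
        records.append((cells[name_col].strip(), total, cells[recovered_col].strip()))
    # Sort by total cases, largest first, regardless of the order in the page source
    return sorted(records, key=lambda rec: rec[1], reverse=True)[:n]


# Hypothetical rows in "page source" order, which differs from the displayed ranking
rows = [
    ["Country B", "500", "", "5", "", "N/A"],
    ["Country A", "1,000", "", "10", "", "900"],
    ["Country C", "", "", "", "", ""],
]
top = top_by_total(rows, n=2)
for name, total, recovered_count in top:
    print(name, total, recovered_count)
```

With BeautifulSoup, `rows` could be built as `[[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]`. Aggregate rows such as "World" would still need to be filtered out by name before sorting.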
