Why does the BeautifulSoup library skip only one specific element?
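I'm trying to get information about coronavirus cases by country from Worldometers. For some reason I can't target the specific tr tags by class (their classes are simply missing in the Python console, although they are present in Chrome's developer tools). So I target all tr elements and then filter them. Everything works, but for some strange reason China is missing from the top ten countries. China's HTML tag is no different from the others, yet I still can't get it in there. Any ideas?'''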
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr")
startingIndex = None
for index, each in enumerate(all_rows, start=0):
    if "World" in each.text:  # The tr elements of the individual countries come after the "World" row.
        startingIndex = index
        break
top10 = all_rows[startingIndex+1:startingIndex+11]  # Here I select the top 10 countries that I need.
for index, each in enumerate(top10, start=1):
    droebiti_list = each.text.split("\n")
    print(f"{index}){droebiti_list[1]} - {droebiti_list[6]}")  # Print the number of recovered people.
'''
Answer
I can't make sure this code works (I'm in the wrong environment for that), but the following should work for scraping the data:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr")
for row in all_rows:  # As you said, this goes through all 'tr' elements
    ScrapedResult = []
    cells = row.find_all("td")  # Within each 'tr' element, search for its own 'td' elements
    for cell in cells:  # Go through the 'td' elements and collect their text
        ScrapedResult.append(cell.getText())
    print(ScrapedResult)
You just need to adapt ScrapedResult to your needs.
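As one possible way of adapting ScrapedResult, the sketch below builds one list of cell texts per row and then keeps only the columns of interest. The inline HTML is a made-up stand-in for the downloaded page (its three-column layout and the total-case figures are assumptions, and `rows_as_lists` is a hypothetical helper name), so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real table has more columns.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Total cases</th><th>Recovered</th></tr>
  <tr><td>USA</td><td>500,000</td><td>36,254</td></tr>
  <tr><td>China</td><td>82,000</td><td>77,663</td></tr>
</table>
"""

def rows_as_lists(html):
    """Return one list of stripped cell texts per <tr> that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for row in soup.find_all("tr"):
        # Search inside the current row only, not the whole document.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # the header row only has <th> cells, so it is skipped
            result.append(cells)
    return result

scraped = rows_as_lists(SAMPLE_HTML)
# Map country name (column 0) to recovered count (column 2 in this sample layout).
recovered_by_country = {cells[0]: cells[2] for cells in scraped}
print(recovered_by_country)
```

On the real page the column positions would have to be checked against the actual table before indexing into each row.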
Another answer
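In the page source (the content variable), the countries appear in a different order than in the table shown in the browser (the order may be changed by a JavaScript script or something else).
So you can simply collect all the data and then re-sort it by total cases.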
import requests
from bs4 import BeautifulSoup

# Get the page source and parse it
r = requests.get("https://www.worldometers.info/coronavirus/")
contents = r.content
soup = BeautifulSoup(contents, "html.parser")
table = soup.find("tbody")
countries = table.find_all("tr")

# Here we will store the values of the top ten countries
total = [0] * 10
names = [""] * 10
recovered = [""] * 10

# Compare each "new" country with the current top ten
for index, each in enumerate(countries[8:]):
    droebiti_list = each.text.split("\n")
    cases = int(droebiti_list[2].replace(',', ''))
    for j in range(10):
        if cases > total[j]:
            # Shift the lower-ranked entries down one place to make room at position j
            for jj in reversed(range(j + 1, 10)):
                recovered[jj] = recovered[jj-1]
                names[jj] = names[jj-1]
                total[jj] = total[jj-1]
            recovered[j] = droebiti_list[6]
            names[j] = droebiti_list[1]
            total[j] = cases
            break
    print(f"{index}){droebiti_list[1]} - {droebiti_list[2]}")

# Print the results
for k in range(10):
    print(names[k], '\t\t\t', recovered[k])
The relevant part of the output:
USA 36,254
Spain 64,727
Italy 35,435
France 27,718
Germany 64,300
UK N/A
China 77,663
Iran 45,983
Turkey 3,957
Belgium 6,707
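The re-sorting step can also be written with the standard library's `sorted` instead of a manual insertion loop. This is only a minimal sketch, assuming the rows have already been reduced to (name, total, recovered) strings; the sample rows, their total-case figures, and the helper names `to_int` and `top_n` are hypothetical and used for illustration only.

```python
def to_int(text):
    """Convert a string such as '77,663' to an int; return 0 for blanks or 'N/A'."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else 0

def top_n(rows, n=10):
    """Sort (name, total, recovered) rows by total cases, largest first, and keep n."""
    return sorted(rows, key=lambda row: to_int(row[1]), reverse=True)[:n]

# Rows in page-source order, which need not match the order shown in the browser.
rows = [
    ("China", "82,000", "77,663"),
    ("USA", "500,000", "36,254"),
    ("UK", "90,000", "N/A"),
]

for name, total, recovered in top_n(rows, n=2):
    print(name, recovered, sep="\t")
```

Sorting the full list once keeps the ranking logic in a single place, and `to_int` avoids a `ValueError` on cells that do not hold a number.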