How to scrape specific columns from table with BeautifulSoup and return as pandas dataframe

Posted 2023-02-24

Asked: 2022-01-18 05:11:10

Question:

I am trying to parse the HDI table and load the data into a Pandas DataFrame with the columns: Country, HDI_score.

I am unable to load the Nation column with the following code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
# The post does not show how `table` was selected; taking the first
# table with class "wikitable" is an assumption.
table = bsObj.find('table', class_='wikitable')

rows = []
for row in table.find_all('tr'):
    columns = row.find_all('td')

    if columns:
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        # Collect dicts and build the frame once; DataFrame.append was removed in pandas 2.0
        rows.append({'Countries': countries, 'HDI_score': hdi_score})

df = pd.DataFrame(rows, columns=['Countries', 'HDI_score'])

[Image: Result from my code]

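As a result, instead of the country names I get the values from the "rank change over 5 years" column. I have tried changing the column indexes, but it did not help.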
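
One possible explanation, offered as an assumption because the thread never confirms it: if the table marks the country cell up as a row header (`<th>`) rather than a data cell (`<td>`), then `row.find_all('td')` skips it and every later index shifts, so no index can reach the country name. The sketch below reproduces this on a small hand-written table (the markup and values are illustrative, not copied from the page) and shows that asking for both `th` and `td` cells restores the expected positions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the country sits in a <th>, the other cells in <td>.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>0</td><th scope="row">Norway</th><td>0.957</td></tr>
  <tr><td>2</td><td>+1</td><th scope="row">Ireland</th><td>0.955</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

td_only = []    # what the question's loop sees
all_cells = []  # what it would see if <th> cells were included
for row in soup.find_all("tr"):
    tds = row.find_all("td")
    if not tds:
        continue  # header row: it contains no <td> cells
    td_only.append([cell.get_text(strip=True) for cell in tds])
    all_cells.append(
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    )

print(td_only)    # index 1 is the rank change, the country is missing
print(all_cells)  # index 2 is the country, index 3 the HDI value
```

If the real page is structured this way, reading the country from `row.find_all(['th', 'td'])` (or from `row.find('th')`) would be one way to keep the original loop.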

Answer 1:

You can use pandas to grab the appropriate table: match='Rank' gives you the right table, and you then extract the columns of interest.

import pandas as pd

table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
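
The `.loc[...]` followed by `.iloc[:, :2]` above suggests that `read_html` returns this table with a two-level column header, although the answer does not say so. Under that assumption, the sketch below builds a small stand-in frame by hand (the sub-header names and the values are made up for illustration) to show what each step does: selecting top-level labels keeps every sub-column under them, `iloc` trims the result to the first two, and the final assignment flattens the header.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table: two header levels.
header = pd.MultiIndex.from_tuples([
    ("Rank", "Current"),
    ("Rank", "Change"),
    ("Nation", "Nation"),
    ("HDI", "Value"),
    ("HDI", "Growth"),
])
table = pd.DataFrame(
    [[1, "0", "Norway", 0.957, "0.20%"],
     [2, "+1", "Ireland", 0.955, "0.65%"]],
    columns=header,
)

columns = ["Nation", "HDI"]
selected = table.loc[:, columns]  # Nation plus both HDI sub-columns
trimmed = selected.iloc[:, :2]    # keep Nation and the first HDI sub-column
trimmed.columns = columns         # flatten the two-level header
print(trimmed)
```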

As per the comments: I don't think there is much point in involving bs4 if you are still using pandas. See below:

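import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
soup = bs(r.content, 'lxml')
# :-soup-contains replaces the deprecated :contains pseudo-class in newer soupsieve releases.
# StringIO is used because newer pandas releases deprecate passing literal HTML strings to read_html.
html_table = str(soup.select_one('table:has(th:-soup-contains("Rank"))'))
table = pd.read_html(StringIO(html_table))[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)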

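Comments:

Thank you for the working solution. Still, I was wondering whether it can be solved with bs..

Of course it can. It takes more effort, though, and since you are already using pandas you would end up with more code that is less efficient. I have added an edit to the answer.

Thank you for the quick reply. I really appreciate it.

Answer 2: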

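Note: upvote QHarr's answer, because in my opinion it is also the most straightforward solution using pandas.

In addition, and to answer your question: selecting the columns with BeautifulSoup alone is also possible. Just combine css selectors and stripped_strings.

Example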

import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')

pd.DataFrame(
    [list(r.stripped_strings)[-3:-1] for r in bsObj.select('table tr:has(span[data-sort-value])')],
    columns=['Countries', 'HDI_score']
)

Output

Countries	HDI_score
Norway	0.957
Ireland	0.955
Switzerland	0.955
...	...
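
To see how the selector and `stripped_strings` work together without fetching the page, here is a minimal offline sketch. The HTML is a hand-written stand-in (an assumption about the general shape of a row, not the page's real markup) in which the country name is followed by the HDI value and one trailing growth figure, which is the layout the `[-3:-1]` slice relies on.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical rows: only data rows contain a <span data-sort-value=...>.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th><th>Growth</th></tr>
  <tr>
    <td>1</td><td><span data-sort-value="0">0</span></td>
    <th scope="row"><a href="#">Norway</a></th><td>0.957</td><td>0.20%</td>
  </tr>
  <tr>
    <td>2</td><td><span data-sort-value="1">+1</span></td>
    <th scope="row"><a href="#">Ireland</a></th><td>0.955</td><td>0.65%</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# :has() keeps only rows with a sortable span, which drops the header row.
rows = soup.select("table tr:has(span[data-sort-value])")

# stripped_strings yields every non-empty text fragment in the row, in order;
# [-3:-1] takes the third-from-last and second-from-last fragments.
data = [list(r.stripped_strings)[-3:-1] for r in rows]

df = pd.DataFrame(data, columns=["Countries", "HDI_score"])
print(df)
```

Because the slice counts from the end of the row, it depends on how many cells follow the country; a table with a different number of trailing columns would need a different slice.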
