How to extract data from two tables in a page with the same class?

Posted 2023-02-23

Asked: 2019-12-07 19:10:19

Question:

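I want to get or select data from two different tables that have the same class.

I tried getting it from "soup.find_all", but formatting the data became more and more difficult.

There are many tables with the same class. I only need the values from the tables (without the labels).

URL: https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/

Table 1: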

<div class="bh_collapsible-body" style="display: none;">
    <table border="0" cellpadding="2" cellspacing="2" class="prop-list">
        <tbody>
            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rim Material</td>
                                <td class="value">Alloy</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Front Tyre Description</td>
                                <td class="value">215/55 R16</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>

            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Front Rim Description</td>
                                <td class="value">16x7.0</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rear Tyre Description</td>
                                <td class="value">215/55 R16</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>

            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rear Rim Description</td>
                                <td class="value">16x7.0</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>
</div> // I think this is an extra closing </div>

Table 2:

<div class="bh_collapsible-body" style="display: none;">
    <table border="0" cellpadding="2" cellspacing="2" class="prop-list">
        <tbody>
            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Steering</td>
                                <td class="value">Rack and Pinion</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>
</div> // I think this is an extra closing </div>

What I tried:

I tried to get the first table's content with XPath, but it returns both the value and the label.

table1 = driver.find_element_by_xpath("//*[@id='features']/div/div[5]/div[2]/div[1]/div[1]/div/div[2]/table/tbody/tr[1]/td[1]/table/tbody/tr/td[2]")

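I tried splitting the data, but without success. The URL of the page is given above in case you want to check.

Comments:

You can use xpath to get the tables as a Python list, pick a table from the list with an index such as tables_list[0] or tables_list[1], and then use xpath to get the values from that single table.

Can you explain? I don't know how to use them.

You don't have to use all those divs in the xpath. In most cases you can skip them with // to reach the expected element.

To get only the values, you have to use td[@class="value"] in the xpath.

Use xpath to get all tables (or the tables with a certain class), then use an index to pick only the table you need, and use another xpath to get the values from that table. That is simpler than trying to build a single xpath.

Answer 1: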

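Not a perfect solution, but if you are willing to dig through the data a bit, I suggest pandas' read_html function.

read_html extracts all HTML tables in a web page and converts them into a list of pandas DataFrames.

This code seems to get all 82 table elements in the page you linked: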

from io import StringIO

import pandas as pd
import requests

url = "https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/"

# A browser-like header is needed to avoid a 403 Forbidden error
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

resp = requests.get(url, headers=header)

# read_html returns one DataFrame per <table>; wrapping the text in StringIO
# avoids the deprecation warning recent pandas versions raise for literal HTML
table_dataframes = pd.read_html(StringIO(resp.text))

for i, df in enumerate(table_dataframes):
    print(f"================Table {i}=================\n")
    print(df)

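This prints all 82 tables present in the web page. The limitation is that you have to find the tables you are interested in by hand and manipulate them accordingly. Tables 71 and 74 seem to be the ones you want.

Automating this approach would take some extra logic.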
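One way to avoid hard-coding table numbers is to search the DataFrames for a label you expect to find. The sketch below is only an illustration: `find_tables_with_label` is a hypothetical helper, and the small hand-built list stands in for what read_html would return. On the real page the nested tables may produce more than one match, so the result would still need a quick look.

```python
import pandas as pd


def find_tables_with_label(dataframes, label):
    """Return the indices of the DataFrames that contain `label` in any cell."""
    matches = []
    for i, df in enumerate(dataframes):
        # Compare every cell as a string so numeric and NaN cells do not raise
        if (df.astype(str) == label).any().any():
            matches.append(i)
    return matches


# Hypothetical stand-in for the list returned by pd.read_html
table_dataframes = [
    pd.DataFrame([["Fuel Type", "Petrol - Unleaded ULP"]]),
    pd.DataFrame([["Steering", "Rack and Pinion"]]),
    pd.DataFrame([["Rim Material", "Alloy"], ["Front Rim Description", "16x7.0"]]),
]

steering_idx = find_tables_with_label(table_dataframes, "Steering")
wheels_idx = find_tables_with_label(table_dataframes, "Rim Material")
print(steering_idx, wheels_idx)
```

The returned indices can then be used as `table_dataframes[i]` in place of the manually found table numbers.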

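Answer 2:

Locating these two tables is a bit "tricky" because they contain other tables. I used the CSS selector table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)) to target the first table, and the same selector with the string "Steering" to target the second: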

from bs4 import BeautifulSoup
import requests

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'

headers = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# Select only the innermost rows (rows that contain no other rows) of both tables
rows = []
for tr in soup.select('table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:contains("Steering")):has(table) tr:not(:has(tr))'):
    rows.append([td.get_text(strip=True) for td in tr.select('td')])

for label, text in rows:
    print('{: <30}: {}'.format(label, text))

Prints:

Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 215/55 R16
Front Rim Description         : 16x7.0
Rear Tyre Description         : 215/55 R16
Rear Rim Description          : 16x7.0
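To see what each part of that selector does, here is a small offline sketch run against a cut-down copy of the markup from the question. It assumes the live page has the same nesting, and it uses :-soup-contains, which as far as I know is the current name for :contains in recent Soup Sieve versions. The "Fuel Type" table is made up to show that unrelated tables are skipped.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the nested markup from the question (third table is invented)
HTML = """
<table class="prop-list">
  <tr>
    <td class="item"><table><tr><td class="label">Rim Material</td><td class="value">Alloy</td></tr></table></td>
    <td class="item"><table><tr><td class="label">Front Tyre Description</td><td class="value">215/55 R16</td></tr></table></td>
  </tr>
</table>
<table class="prop-list">
  <tr>
    <td class="item"><table><tr><td class="label">Fuel Type</td><td class="value">Petrol</td></tr></table></td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# table:has(td:-soup-contains("Rim Material"))  -> a table with that text somewhere inside
# :has(table)                                   -> keep only the outer table, not the inner ones
# tr:not(:has(tr))                              -> keep only the innermost rows
selector = 'table:has(td:-soup-contains("Rim Material")):has(table) tr:not(:has(tr))'

rows = [[td.get_text(strip=True) for td in tr.select("td")] for tr in soup.select(selector)]

for label, text in rows:
    print('{: <30}: {}'.format(label, text))
```

Without the tr:not(:has(tr)) part, the outer row would also match and its cells would hold the label and value glued together.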

Edit: getting data from multiple URLs:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/',
        'https://www.redbook.com.au/cars/details/2019-genesis-g80-38-ultimate-auto-my19/SPOT-ITM-520697/']

for url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

    # Collect the rows inside the loop so that every page is processed
    rows = []
    for tr in soup.select('table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:contains("Steering")):has(table) tr:not(:has(tr))'):
        rows.append([td.get_text(strip=True) for td in tr.select('td')])

    print('{: <30}: {}'.format('Title', soup.h1.text))
    print('-' * (len(soup.h1.text.strip())+32))
    for label, text in rows:
        print('{: <30}: {}'.format(label, text))

    print('*' * 80)

Prints:

Title                         : 2019 Honda Civic 50 Years Edition Auto MY19
---------------------------------------------------------------------------
Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 215/55 R16
Front Rim Description         : 16x7.0
Rear Tyre Description         : 215/55 R16
Rear Rim Description          : 16x7.0
********************************************************************************
Title                         : 2019 Genesis G80 3.8 Ultimate Auto MY19
-----------------------------------------------------------------------
Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 245/40 R19
Front Rim Description         : 19x8.5
Rear Tyre Description         : 275/35 R19
Rear Rim Description          : 19x9.0
********************************************************************************

Comments:

So you are targeting the tables that contain some unique element. Can you turn this into a df, so that the results get appended when I run two URLs in a loop?

@thoris I don't have pandas installed, but inserting a list of lists into a Pandas dataframe is definitely not a problem.

Will definitely try it with two pages and let you know.

Tried this, but only one gets stored:

url = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/','https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/']
headers = {'User-Agent': 'Mozilla/5.0'}
for it in url:
    soup = BeautifulSoup(requests.get(it, headers=headers).text, 'lxml')

Tried this:

url = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/','https://www.redbook.com.au/cars/details/2019-genesis-g80-38-ultimate-auto-my19/SPOT-ITM-520697/']
headers = {'User-Agent': 'Mozilla/5.0'}
for it in url:
    soup = BeautifulSoup(requests.get(it, headers=headers).text, 'lxml')
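If only one page ends up stored, a likely cause is that soup is overwritten on every pass and the rows are only collected after the loop. A possible way to build the requested df is to create one record per page inside the loop and construct the DataFrame once at the end. The sketch below runs offline: `scraped` is a hypothetical stand-in for the title and `rows` list that the answer's loop produces for each URL, filled with values taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for (title, rows) as collected per URL in the answer's loop
scraped = [
    ("2019 Honda Civic 50 Years Edition Auto MY19",
     [["Steering", "Rack and Pinion"], ["Rim Material", "Alloy"], ["Front Rim Description", "16x7.0"]]),
    ("2019 Genesis G80 3.8 Ultimate Auto MY19",
     [["Steering", "Rack and Pinion"], ["Rim Material", "Alloy"], ["Front Rim Description", "19x8.5"]]),
]

records = []
for title, rows in scraped:
    record = {"Title": title}
    record.update(dict(rows))  # each [label, value] pair becomes one column
    records.append(record)     # append inside the loop so no page is lost

# One row per page, one column per label
df = pd.DataFrame(records)
print(df)
```

Building the DataFrame once from a list of dicts also copes with pages that lack some labels: the missing cells simply come out as NaN.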

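Answer 3:

You don't have to do it all in one xpath. You can use xpath to get all <table class=prop-list> elements, then use an index to pick a table from the list, and use another xpath to get the values from that table.

I used BeautifulSoup for this, but it should be similar with xpath.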

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'

text = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

soup = BS(text, 'html.parser')

all_tables = soup.find_all('table', {'class': 'prop-list'}) # xpath('//table[@class="prop-list"]')
#print(len(all_tables))

print("\n--- Engine ---\n")
all_labels = all_tables[3].find_all('td', {'class': 'label'}) # xpath('.//td[@class="label"]')
all_values = all_tables[3].find_all('td', {'class': 'value'}) # xpath('.//td[@class="value"]')
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

print("\n--- Fuel ---\n")
all_labels = all_tables[4].find_all('td', {'class': 'label'})
all_values = all_tables[4].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

print("\n--- Steering ---\n")
all_labels = all_tables[7].find_all('td', {'class': 'label'})
all_values = all_tables[7].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

print("\n--- Wheels ---\n")
all_labels = all_tables[8].find_all('td', {'class': 'label'})
all_values = all_tables[8].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

Result:

--- Engine ---

Engine Type: Piston
Valves/Ports per Cylinder: 4
Engine Location: Front
Compression ratio: 10.6
Engine Size (cc) (cc): 1799
Engine Code: R18Z1
Induction: Aspirated
Power: 104kW @ 6500rpm
Engine Configuration: In-line
Torque: 174Nm @ 4300rpm
Cylinders: 4
Power to Weight Ratio (W/kg): 82.6
Camshaft: OHC with VVT & Lift

--- Fuel ---

Fuel Type: Petrol - Unleaded ULP
Fuel Average Distance (km): 734
Fuel Capacity (L): 47
Fuel Maximum Distance (km): 940
RON Rating: 91
Fuel Minimum Distance (km): 540
Fuel Delivery: Multi-Point Injection
CO2 Emission Combined (g/km): 148
Method of Delivery: Electronic Sequential
CO2 Extra Urban (g/km): 117
Fuel Consumption Combined (L/100km): 6.4
CO2 Urban (g/km): 202
Fuel Consumption Extra Urban (L/100km): 5
Emission Standard: Euro 5
Fuel Consumption Urban (L/100km): 8.7

--- Steering ---

Steering: Rack and Pinion

--- Wheels ---

Rim Material: Alloy
Front Tyre Description: 215/55 R16
Front Rim Description: 16x7.0
Rear Tyre Description: 215/55 R16
Rear Rim Description: 16x7.0

I assume all pages have the same tables and that the tables have the same indices.
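For completeness, the xpath variant mentioned in the comments could look roughly like the sketch below, using lxml. It runs on a cut-down inline copy of the markup from the question rather than the live page, so the indices 0 and 1 are specific to this snippet, and `table_to_dict` is a hypothetical helper name.

```python
from lxml import html

# Cut-down copy of the markup from the question; the real page has many more tables
HTML = """
<div>
  <table class="prop-list">
    <tr><td class="item"><table><tr><td class="label">Steering</td><td class="value">Rack and Pinion</td></tr></table></td></tr>
  </table>
  <table class="prop-list">
    <tr>
      <td class="item"><table><tr><td class="label">Rim Material</td><td class="value">Alloy</td></tr></table></td>
      <td class="item"><table><tr><td class="label">Front Tyre Description</td><td class="value">215/55 R16</td></tr></table></td>
    </tr>
  </table>
</div>
"""

tree = html.fromstring(HTML)

# First xpath: every table with the shared class, returned as a Python list
all_tables = tree.xpath('//table[@class="prop-list"]')


def table_to_dict(table):
    """Pair up the label and value cells found inside a single table."""
    # The leading dot makes each xpath relative to this table only
    labels = table.xpath('.//td[@class="label"]/text()')
    values = table.xpath('.//td[@class="value"]/text()')
    return dict(zip(labels, values))


steering = table_to_dict(all_tables[0])
wheels = table_to_dict(all_tables[1])
print(steering)
print(wheels)
```

If only the values are needed, as in the question, `table.xpath('.//td[@class="value"]/text()')` on its own is enough.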
