Unable to handle empty <td> value in web scraping

Posted: 2022-01-15 22:28:13

Question:

I am scraping a wiki page, but some rows contain empty <td> elements, so I used:

for tr in table1.tbody:
list = []
for td in tr:
    try:
        if(td.text is None): list.append('NA')
        else: list.append(td.text.strip())
        
    except:
        list.append(td.strip())

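to store the elements of each row in a list. But when I print row_list, the rows that have empty <td> values, which should now have 'NA' appended, are still empty, i.e. 'NA' has not been appended to the list.

How can I fix this?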

Comments:

Welcome to SO. Please improve your question so that we can reproduce your issue; see how to create a minimal reproducible example. Thanks.

Answer 1:

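Note: the question needs improvement. Until you update it, here are just two options to fix the problem.

Option #1

Use pandas to get the tables in a quick and proper way: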

import pandas as pd
pd.concat(pd.read_html('https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches')[2:11])

Option #2

Create the list before the loop so it is not overwritten on every row, and check your indentation:

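data = []  # created once, before the loops, so it is not reset per row
for tr in table1.tbody:
    for td in tr:
        try:
            if td.text is None:
                data.append('NA')
            else:
                data.append(td.text.strip())
        except AttributeError:
            # plain strings between tags may have no .text; store them stripped
            data.append(td.strip())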
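
To see why the position of the list matters, here is a small sketch that uses plain nested lists in place of parsed rows. The `rows` data and the names `last_row_only` and `all_cells` are made up for illustration. Creating the list inside the outer loop resets it on every row, so only the last row survives:

```python
# Hypothetical stand-in for parsed table rows; '' marks an empty cell.
rows = [["4 June 2010", "", "LEO"], ["8 December 2010", "", "LEO (ISS)"]]

# List created inside the loop: it is rebuilt for every row.
for row in rows:
    last_row_only = []
    for cell in row:
        last_row_only.append(cell.strip() or "NA")

# List created once, before the loop: cells from all rows accumulate.
all_cells = []
for row in rows:
    for cell in row:
        all_cells.append(cell.strip() or "NA")

print(last_row_only)  # holds the cells of the final row only
print(all_cells)      # holds the cells of every row
```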

Answer 2:

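A few things here:

    1. Don't use list as a variable name. It is a built-in type in Python.
    2. td.text is not None. It actually contains a string (for example ' ').
    3. You are not iterating over the tr tags and td tags (at least not in the code you provided here). You need to build the lists of tr tags and td elements to use in the for loops.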
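
The second point can be checked on a tiny inline table. This is a minimal sketch that assumes `beautifulsoup4` is installed; the HTML snippet is invented for illustration and is not taken from the wiki page.

```python
from bs4 import BeautifulSoup

# Made-up two-cell row: the second <td> is empty.
html = "<table><tr><td>LEO</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td")
empty_td = cells[1]

# .text is a str for a tag, so an `is None` check never fires.
print(repr(empty_td.text))
print(empty_td.text is None)

# Testing the stripped text catches empty and whitespace-only cells.
values = [td.text.strip() or "NA" for td in cells]
print(values)
```

Because an empty string is falsy, `td.text.strip() or "NA"` is a compact way to write the same empty-cell check as an explicit `== ''` comparison.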

Try this:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table1 = soup.find_all('table')[2]


stored_list = []
for tr in table1.tbody.find_all('tr'):
    for td in tr.find_all('td'):
        if td.text.strip() == '': 
            stored_list.append('NA')
        else: 
            stored_list.append(td.text.strip())

Output:

print(stored_list)
['4 June 2010,18:45', 'F9 v1.0[7]B0003[8]', 'CCAFS,SLC-40', 'Dragon Spacecraft Qualification Unit', 'NA', 'LEO', 'SpaceX', 'Success', 'Failure[9][10](parachute)', 'First flight of Falcon 9 v1.0.[11] Used a boilerplate version of Dragon capsule which was not designed to separate from the second stage.(more details below) Attempted to recover the first stage by parachuting it into the ocean, but it burned up on reentry, before the parachutes even go to deploy.[12]', '8 December 2010,15:43[13]', 'F9 v1.0[7]B0004[8]', 'CCAFS,SLC-40', 'Dragon demo flight C1(Dragon C101)', 'NA', 'LEO (ISS)', 'NASA (COTS)\nNRO', 'Success[9]', 'Failure[9][14](parachute)', "Maiden flight of SpaceX's Dragon capsule, consisting of over 3 hours of testing thruster maneuvering and then reentry.[15] Attempted to recover the first stage by parachuting it into the ocean, but it disintegrated upon reentry, again before the parachutes were deployed.[12] (more details below) It also included two CubeSats,[16] and a wheel of Brouère cheese. Before the launch, SpaceX discovered that there was a crack in the nozzle of the 2nd stage's Merlin vacuum engine. So Elon just had them cut off the end of the nozzle with a pair of shears and launched the rocket a few days later. After SpaceX had trimmed the nozzle, NASA was notified of the change and they agreed to it.[17]", '22 May 2012,07:44[18]', 'F9 v1.0[7]B0005[8]', 'CCAFS,SLC-40', 'Dragon demo flight C2+[19](Dragon C102)', '525\xa0kg (1,157\xa0lb)[20] (excl. Dragon mass)', 'LEO (ISS)', 'NASA (COTS)', 'Success[21]', 'No attempt', 'The Dragon spacecraft demonstrated a series of tests before it was allowed to approach the International Space Station. Two days later, it became the first commercial spacecraft to board the ISS.[18] (more details below)', '8 October 2012,00:35[22]', 'F9 v1.0[7]B0006[8]', 'CCAFS,SLC-40', 'SpaceX CRS-1[23](Dragon C103)', '4,700\xa0kg (10,400\xa0lb) (excl. 
Dragon mass)', 'LEO (ISS)', 'NASA (CRS)', 'Success', 'No attempt', 'Orbcomm-OG2[24]', '172\xa0kg (379\xa0lb)[25]', 'LEO', 'Orbcomm', 'Partial failure[26]', "CRS-1 was successful, but the secondary payload was inserted into an abnormally low orbit and subsequently lost. This was due to one of the nine Merlin engines shutting down during the launch, and NASA declining a second reignition, as per ISS visiting vehicle safety rules, the primary payload owner is contractually allowed to decline a second reignition. NASA stated that this was because SpaceX could not guarantee a high enough likelihood of the second stage completing the second burn successfully which was required to avoid any risk of secondary payload's collision with the ISS.[27][28][29]", '1 March 2013,15:10', 'F9 v1.0[7]B0007[8]', 'CCAFS,SLC-40', 'SpaceX CRS-2[23](Dragon C104)', '4,877\xa0kg (10,752\xa0lb) (excl. Dragon mass)', 'LEO (ISS)', 'NASA (CRS)', 'Success', 'No attempt', 'Last launch of the original Falcon 9 v1.0 launch vehicle, first use of the unpressurized trunk section of Dragon.[30]', '29 September 2013,16:00[31]', 'F9 v1.1[7]B1003[8]', 'VAFB,SLC-4E', 'CASSIOPE[23][32]', '500\xa0kg (1,100\xa0lb)', 'Polar orbit LEO', 'MDA', 'Success[31]', 'Uncontrolled(ocean)[d]', 'First commercial mission with a private customer, first launch from Vandenberg, and demonstration flight of Falcon 9 v1.1 with an improved 13-tonne to LEO capacity.[30] After separation from the second stage carrying Canadian commercial and scientific satellites, the first stage booster performed a controlled reentry,[33] and an ocean touchdown test for the first time. 
This provided good test data, even though the booster started rolling as it neared the ocean, leading to the shutdown of the central engine as the roll depleted it of fuel, resulting in a hard impact with the ocean.[31] This was the first known attempt of a rocket engine being lit to perform a supersonic retro propulsion, and allowed SpaceX to enter a public-private partnership with NASA and its Mars entry, descent, and landing technologies research projects.[34] (more details below)', '3 December 2013,22:41[35]', 'F9 v1.1B1004', 'CCAFS,SLC-40', 'SES-8[23][36][37]', '3,170\xa0kg (6,990\xa0lb)', 'GTO', 'SES', 'Success[38]', 'No attempt[39]', 'First Geostationary transfer orbit (GTO) launch for Falcon 9,[36] and first successful reignition of the second stage.[40] SES-8 was inserted into a Super-Synchronous Transfer Orbit of 79,341\xa0km (49,300\xa0mi) in apogee with an inclination of 20.55° to the equator.']
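
The loop above collects every cell into one flat list. If one list per row is wanted instead, as the `row_list` in the question suggests, a variant along these lines might help. It is only a sketch: the inline HTML and the names `rows` and `row_list` are assumptions for illustration, and rows without any `<td>` (such as header rows) are skipped.

```python
from bs4 import BeautifulSoup

# Invented sample table: a header row, then two rows with one blank cell each.
html = """
<table><tbody>
<tr><th>Date</th><th>Payload mass</th><th>Orbit</th></tr>
<tr><td>4 June 2010</td><td></td><td>LEO</td></tr>
<tr><td>8 December 2010</td><td> </td><td>LEO (ISS)</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # Build a fresh list per row, replacing blank cells with 'NA'.
    row_list = [td.get_text(strip=True) or "NA" for td in tr.find_all("td")]
    if row_list:  # header rows contain only <th>, so they give an empty list
        rows.append(row_list)

for row_list in rows:
    print(row_list)
```

Keeping the cells grouped by row makes it easier to spot which launch a given 'NA' belongs to, and the nested list can be passed to `pandas.DataFrame` later if a table is needed.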
