Why do I get an error when trying to access the first two columns in an HTML table?
Posted: 2022-01-22 21:48:09
Question:
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Pixar_films"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, 'lxml')
table_class = "wikitable plainrowheaders sortable"
my_table = soup.find('table', {'class': table_class})

Film = []
release = []
for row in my_table.find_all('i')[0:]:
    Film_cell = row.find_all('a')[0]
    Film.append(Film_cell.text)
print(Film)

for row in my_table.find_all('td')[0:]:
    release = row.find_all('span')[:1]
    release.append(release.text)
print(release)
Output:
['Toy Story', "A Bug's Life", 'Toy Story 2', 'Monsters, Inc.',
'Finding Nemo', 'The Incredibles', 'Cars', 'Ratatouille', 'WALL-E',
'Up', 'Toy Story 3', 'Cars 2', 'Brave', 'Monsters University', 'Inside Out',
'The Good Dinosaur', 'Finding Dory', 'Cars 3', 'Coco', 'Incredibles 2',
'Toy Story 4', 'Onward', 'Soul', 'Luca', 'Turning Red', 'Lightyear']
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-223-6481bc092354> in <module>
      7 for row in my_table.find_all('td')[0:]:
      8     release = row.find_all('span')[:1]
----> 9     release.append(release.text)
     10 print(release)
AttributeError: 'list' object has no attribute 'text'
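Answer 1:
for row in my_table.find_all('td')[0:]:
    release = row.find_all('span')[:1]
    release.append(release.text)
print(release)
Three things to note about this loop:
my_table.find_all('td')[0:] is the same as my_table.find_all('td'); the [0:] slice changes nothing.
row.find_all('span')[:1] is a list, not a tag, so it has no .text attribute; you probably want row.find_all('span')[0].
release = row.find_all('span')[:1] overwrites the release list on every iteration; use a different variable for the spans.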
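The difference between slicing and indexing can be seen without any HTML at all. The sketch below uses a made-up FakeTag class as a stand-in for a parsed tag (it is not part of BeautifulSoup) and assumes only that a tag exposes a .text attribute.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; only the .text attribute is modelled."""

    def __init__(self, text):
        self.text = text


spans = [FakeTag("November 22, 1995"), FakeTag("1995-11-22")]

as_slice = spans[:1]  # slicing always yields a list, here with one element
as_item = spans[0]    # indexing yields the element itself

try:
    as_slice.text  # a list has no .text attribute, so this raises
except AttributeError as exc:
    error_message = str(exc)

print(error_message)  # 'list' object has no attribute 'text'
print(as_item.text)   # November 22, 1995
```

The message printed first is the same one shown in the traceback from the question.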
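Get the first two columns, excluding the index column.
release = []
for row in my_table.find_all('td'):
    # Keep the spans in their own variable so the release list is not overwritten.
    span = row.find_all('span')
    # Skip cells that contain no <span>.
    if span:
        release.append(span[0].text)
print(release)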
[('Toy Story', 'November 22, 1995'), ("A Bug's Life", 'November 25, 1998'), ('Toy Story 2', 'November 24, 1999'), ('Monsters, Inc.', 'November 2, 2001'), ('Finding Nemo', 'May 30, 2003'), ('The Incredibles', 'November 5, 2004'), ('Cars', 'June 9, 2006'), ('Ratatouille', 'June 29, 2007'), ('WALL-E', 'June 27, 2008'), ('Up', 'May 29, 2009'), ('Toy Story 3', 'June 18, 2010'), ('Cars 2', 'June 24, 2011'), ('Brave', 'June 22, 2012'), ('Monsters University', 'June 21, 2013'), ('Inside Out', 'June 19, 2015'), ('The Good Dinosaur', 'November 25, 2015'), ('Finding Dory', 'June 17, 2016'), ('Cars 3', 'June 16, 2017'), ('Coco', 'November 22, 2017'), ('Incredibles 2', 'June 15, 2018'), ('Toy Story 4', 'June 21, 2019'), ('Onward', 'March 6, 2020'), ('Soul', 'December 25, 2020'), ('Luca', 'June 18, 2021'), ('Turning Red[1]', 'March 11, 2022[5]'), ('Lightyear[2]', 'June 17, 2022[5]'), ('TBA', 'June 16, 2023[8]'), ('TBA', 'March 1, 2024[4]'), ('TBA', 'June 14, 2024[4]')]
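The output above pairs each film title with its release date, whereas the loop shown in this answer collects release dates only. One possible way to build such pairs is to walk the table row by row. This is only a sketch under assumptions: the inline SAMPLE_HTML is a simplified, made-up stand-in for the Wikipedia table (the real markup may differ), the helper name film_release_pairs is hypothetical, and the built-in "html.parser" is used here instead of 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real table; not copied from Wikipedia.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders sortable">
  <tr><th>Film</th><th>Release date</th><th>Director</th></tr>
  <tr><th scope="row"><i><a href="#">Toy Story</a></i></th>
      <td><span>November 22, 1995</span></td><td>John Lasseter</td></tr>
  <tr><th scope="row"><i><a href="#">A Bug's Life</a></i></th>
      <td><span>November 25, 1998</span></td><td>John Lasseter</td></tr>
  <tr><td colspan="3">Footnote row without a film or date</td></tr>
</table>
"""


def film_release_pairs(html):
    """Return (film, release date) tuples, one per row that has both values."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    pairs = []
    for row in table.find_all("tr"):
        film = row.find("i")     # find() returns None when nothing matches
        date = row.find("span")
        if film is None or date is None:
            continue             # header row or a row missing either value
        pairs.append((film.get_text(strip=True), date.get_text(strip=True)))
    return pairs


print(film_release_pairs(SAMPLE_HTML))
```

Working per row keeps each title aligned with its own date, which two independent loops over 'i' and 'td' elements cannot guarantee when some rows lack one of the two.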
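Answer 2:
The code release = row.find_all('span')[:1] produces a list, and a list has no text attribute. You need to index into it to reach the element that carries the text, i.e. release.append(release[0].text) instead of release.append(release.text).
That alone will raise an "index out of range" error, though, because many of the lists produced inside your loop are empty.
Modify the code as follows: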
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Pixar_films"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, 'lxml')
table_class = "wikitable plainrowheaders sortable"
my_table = soup.find('table', {'class': table_class})

Film = []
release = []
# Film titles are the link text inside each italic element.
for row in my_table.find_all('i')[0:]:
    Film_cell = row.find_all('a')[0]
    Film.append(Film_cell.text)
print(Film)

new_list = []
for row in my_table.find_all('td')[0:]:
    release = row.find_all('span')[:1]
    # Only cells that actually contain a <span> have text to append.
    if len(release) > 0:
        new_list.append(release[0].text)
print(new_list)
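Why the length check matters can be shown with plain lists. In this sketch the helper first_or_none and the sample cells are made up for illustration; each inner list stands in for the result of a find_all call on one cell.

```python
def first_or_none(items):
    """Return the first element of a list, or None if the list is empty."""
    return items[0] if items else None


empty = []        # what a cell with no matching child would yield
print(empty[:1])  # slicing an empty list is safe and gives []

try:
    empty[0]      # indexing an empty list is not safe
except IndexError as exc:
    print("IndexError:", exc)

texts = []
for cell in (["November 22, 1995"], [], ["November 25, 1998"]):
    first = first_or_none(cell)
    if first is not None:  # skip cells that had nothing to offer
        texts.append(first)
print(texts)
```

The guard plays the same role as the len(release) > 0 check in the code above: empty results are skipped instead of being indexed.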