IndexError: list index out of range while webscraping advertisements with beautifulsoup

Posted 2023-02-23

【Title】: IndexError: list index out of range while webscraping advertisements with beautifulsoup 【Posted】: 2019-09-08 02:30:38 【Question】:

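I am scraping a local website for apartment sale/rental advertisements.

In some cases I get an IndexError: list index out of range error.

I get the error when my scraper meets an ad that is missing some parameters. Usually these are Powierzchnia (size), Liczba pokoi (number of rooms), Pietro (floor), and Rok budowy (year built, which I don't scrape).

I think it is because of this: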

pietro = ogl.find_all('p', class_ ="list__item__details__icons__element__desc")[2].text

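When there is no element at [2], which is usually the third parameter, it throws the error that [2] is out of range.

I tried putting an if inside the for loop that checks whether such a parameter exists and continues if it does not. However, I could not get it to work.

I also tried using it like this: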

Powierzchnia = zrzut.find_all('li', class_ = "list__item__details__icons__element details--icons--element--powierzchnia")[0].text

That did not throw an error, but it gave every advertisement the same size.

The full code is below:

from bs4 import BeautifulSoup
from requests import get
import pandas as pd
import itertools
import matplotlib.pyplot as plt


headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
n_stron = 0
numer = 0
for strona in range(0,12):
    n_stron +=1
    # Build the URL and fetch the page inside the loop, where strona is defined
    link = 'https://ogloszenia.trojmiasto.pl/nieruchomosci/wi,100,dw,1d.html?' + str(strona)
    r = get(link, headers = headers)
    zupa = BeautifulSoup(r.text, 'html.parser')

    ogloszenia= zupa.find_all('div', class_="list__item")

    for ogl in ogloszenia:
        tytul = ogl.find_all('h2', class_ ="list__item__content__title")[0].text
        powierzchnia = ogl.find_all('p', class_ ="list__item__details__icons__element__desc")[0].text
        liczba_pokoi = ogl.find_all('p', class_ ="list__item__details__icons__element__desc")[1].text
        pietro = ogl.find_all('p', class_ ="list__item__details__icons__element__desc")[2].text
        lokalizacja = ogl.find_all('p', class_ = "list__item__content__subtitle")[0].text
        cena = ogl.find_all('p', class_ = "list__item__price__value")[0].text
        cena_m = ogl.find_all('p', class_ = "list__item__details__info details--info--price")[0].text

        numer += 1
        print(numer)
        print(tytul)
        print('Powierzchnia: ' + powierzchnia )
        print('Lokalizacja: ' + lokalizacja )
        print('Liczba pokoi: ' + liczba_pokoi )
        print('Pietro: ' + pietro )
        print('Cena: ' + cena )
        print('Cena za metr kwadratowy: ' + cena_m +'\n')

【Comments】:

【Answer 1】:

You can catch the IndexError exception and set the variable to None or ''

try:
    powierzchnia = ogl.find_all('p', class_ ="list__item__details__icons__element__desc")[0].text
except IndexError:
    powierzchnia = ''

You may run into the same situation with the other variables. Just repeat the same thing for each of them.
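As an alternative to try/except that is closer to the if check mentioned in the question, the list can be fetched once and its length tested before indexing. The sketch below is only an illustration: text_or_default is a hypothetical helper, and SimpleNamespace objects with made-up values stand in for the Tag objects that find_all() would return for a single ad.

```python
from types import SimpleNamespace


def text_or_default(elements, index, default=''):
    """Return elements[index].text if that position exists, else default."""
    if 0 <= index < len(elements):
        return elements[index].text
    return default


# Stand-ins (assumed sample values) for the result of
# ogl.find_all('p', class_="list__item__details__icons__element__desc")
# on an ad that lists only two of the three details.
details = [SimpleNamespace(text='54 m2'), SimpleNamespace(text='3')]

powierzchnia = text_or_default(details, 0)
liczba_pokoi = text_or_default(details, 1)
pietro = text_or_default(details, 2)  # position 2 is missing, so the default is used

print('Powierzchnia: ' + powierzchnia)
print('Liczba pokoi: ' + liczba_pokoi)
print('Pietro: ' + pietro)
```

Calling find_all once per ad and reusing the list also avoids running the same search three times.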

【Comments】:

This is exactly what I wanted! Thank you very much!

【Answer 2】:

Try:

data = ogl.find_all('p', class_ ="list__item__details__icons__element__desc")
for idx,entry in enumerate(data):
    if idx == 0:
        print('powierzchnia {}'.format(entry.text))
    elif idx == 1:
        print('liczba_pokoi {}'.format(entry.text))
    else:
        print('pietro {}'.format(entry.text))
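The same positional idea can be written so that each ad always yields all three fields, with an empty value for any that are absent. This is a rough sketch rather than part of the answer: label_details and LABELS are hypothetical names, and plain strings stand in for the entry.text values. Like the loop above, it assumes that a missing field is always the last one; if a field in the middle is absent, the remaining values shift to the wrong labels.

```python
from itertools import zip_longest

# Field names in the order the ad is assumed to list them.
LABELS = ('powierzchnia', 'liczba_pokoi', 'pietro')


def label_details(texts, labels=LABELS, missing=''):
    """Pair each label with the text at the same position.

    Labels without a matching text get the `missing` value;
    texts beyond the last label are ignored.
    """
    return dict(zip_longest(labels, texts[:len(labels)], fillvalue=missing))


# Assumed sample values: this ad has no floor information.
ad = label_details(['54 m2', '3'])
for name, value in ad.items():
    print('{} {}'.format(name, value))
```

Collecting the values in a dict per ad also makes it straightforward to build a table of all ads later.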

【Comments】:

【Answer 3】:

I would recommend two changes.

First, try isolating the repeated commands in a function.

def findDetail(ogl, tag, class_name, index):
    # class is a reserved word in Python, so the parameter is named class_name
    return ogl.find_all(tag, class_ = class_name)[index].text

Then, for cases where the index is not available, you can handle it with try-except. This is the standard way of handling errors in Python:

def findDetail(ogl, tag, class_name, index):
    try:
        return ogl.find_all(tag, class_ = class_name)[index].text
    except IndexError:
        print(f"Could not find index {index} for {tag} with {class_name}")
        return ""

Then call it:

for ogl in ogloszenia:
    tytul = findDetail(ogl, "h2", "list__item__content__title", 0)
    powierzchnia = findDetail(ogl, "p", "list__item__details__icons__element__desc", 0)

And so on. If an index cannot be found, the function prints a message and returns an empty string.

【Comments】:
