Python - Index Error - list index out of range

Posted 2023-02-25

[Title] Python - Index Error - list index out of range [Posted] 2019-12-18 11:40:10 [Question]:

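I am parsing data from a website, but I get the error "IndexError: list index out of range". When debugging, however, I get all the values. It used to work fine, and I don't understand why I am suddenly getting this error.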

str2 = cols[1].text.strip()

IndexError: list index out of range

Here is my code.

import requests
import DivisionModel
from bs4 import BeautifulSoup
from time import sleep


class DivisionParser:

    def __init__(self, zoneName, zoneUrl):
        self.zoneName = zoneName
        self.zoneUrl = zoneUrl

    def getDivision(self):

        response = requests.get(self.zoneUrl)
        soup = BeautifulSoup(response.content, 'html5lib')
        table = soup.findAll('table', id='mytable')
        rows = table[0].findAll('tr')

        division = []
        for row in rows:
            if row.text.find('T No.') == -1:
                cols = row.findAll('td')

                str1 = cols[0].text.strip()
                str2 = cols[1].text.strip()
                str3 = cols[2].text.strip()
                strurl = cols[2].findAll('a')[0].get('href')
                str4 = cols[3].text.strip()
                str5 = cols[4].text.strip()
                str6 = cols[5].text.strip()
                str7 = cols[6].text.strip()

                divisionModel = DivisionModel.DivisionModel(self.zoneName, str2, str3, strurl, str4, str5, str6, str7)
                division.append(divisionModel)
        return division

These are the values while debugging:

str1 = str '1'
str2 = str 'BHUSAWAL DIVN-ENGINEERING'
str3 = str 'DRMWBSL692019t1'
str4 = str 'Bhusawal Division - TRR/P- 44.898Tkms & 2.225Tkms on 9 Bridges total 47.123Tkms on ADEN MMR &'
str5 = str 'Open'
str6 = str '23/12/2019 15:00'
str7 = str '5'
strurl = str '/works/pdfdocs/122019/51822293/viewNitPdf_3021149.pdf'

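[Comments]:

Well, obviously len(cols) < 2. We don't have your program's input, which could explain why this happens, so you will have to look into it yourself and decide how to handle it (e.g. drop those specific rows, fix them, etc.).

@goodvibration Please go through the question again: while debugging I get all the values, every value each time, until the loop is exhausted.

So you disagree with the fact that an IndexError on the line cols[1].text.strip() implies len(cols) < 2???

@aviboy2006 Thanks, I am new here.

Have you considered that sometimes the values are not returned correctly by the website? Maybe everything works while debugging, but in a live run the server fails to handle the request and returns an empty response.

[Answer 1]: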

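I parse the site's data by checking each row for the T No. and reading all the values in its td cells. The site's developers put "No Result" in one of the td rows, which is why at run time my loop could not get the values and raised the "list index out of range" error.

Thanks, everyone, for the help.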

class DivisionParser:

    def __init__(self, zoneName, zoneUrl):
        self.zoneName = zoneName
        self.zoneUrl = zoneUrl

    def getDivision(self):
        global rows
        try:
            response = requests.get(self.zoneUrl)
            soup = BeautifulSoup(response.content, 'html5lib')
            table = soup.findAll('table', id='mytable')
            rows = table[0].findAll('tr')
        except IndexError:
            sleep(2)

        division = []
        for row in rows:
            if row.text.find('T No.') == -1:
                try:
                    cols = row.findAll('td')

                    str1 = cols[0].text.strip()
                    str2 = cols[1].text.strip()
                    str3 = cols[2].text.strip()
                    strurl = cols[2].findAll('a')[0].get('href')
                    str4 = cols[3].text.strip()
                    str5 = cols[4].text.strip()
                    str6 = cols[5].text.strip()
                    str7 = cols[6].text.strip()
                    divisionModel = DivisionModel.DivisionModel(self.zoneName, str2, str3, strurl, str4, str5, str6,
                                                                str7)
                    division.append(divisionModel)
                except IndexError:
                    # a row such as "No Result" has fewer than seven cells
                    print("No Result")
        return division
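
As an alternative to catching IndexError, the cell count could be checked before indexing. The following is only a minimal sketch: it works on plain lists of cell texts rather than BeautifulSoup tags so that it runs on its own, and the names `EXPECTED_COLS` and `pick_data_rows` are made up for illustration. The value 7 is an assumption based on the seven cells read in the question.

```python
EXPECTED_COLS = 7  # assumption: a data row has seven <td> cells, as in the question


def pick_data_rows(rows):
    """Split rows (lists of cell texts) into well-formed rows and skipped rows."""
    good, skipped = [], []
    for cells in rows:
        if len(cells) >= EXPECTED_COLS:
            # safe to index cells[0] .. cells[6] from here on
            good.append([c.strip() for c in cells[:EXPECTED_COLS]])
        else:
            # e.g. a single-cell "No Result" row
            skipped.append(cells)
    return good, skipped


rows = [
    ["1", "BHUSAWAL DIVN-ENGINEERING", "DRMWBSL692019t1",
     "Bhusawal Division", "Open", "23/12/2019 15:00", "5"],
    ["No Result"],
]
good, skipped = pick_data_rows(rows)
print(len(good), "data row(s),", len(skipped), "skipped")
```

With real BeautifulSoup rows the same test would be `len(row.findAll('td'))`. This also shows why the debugger looked fine: the first rows are well formed, and only a later short row fails.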

[Comments]:

[Answer 2]:

As a general rule, anything that comes from the cold and hostile outside world is totally unreliable. Here:

    response = requests.get(self.zoneUrl)
    soup = BeautifulSoup(response.content, 'html5lib')

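You seem to suffer from the terrible delusion that the response will always be what you expect. Hint: it won't. It is guaranteed that the response will sometimes be something different: the site may be down, or may have decided to blacklist your IP because they don't like you scraping their data, and so on.

IOW, you really want to check the response's status code and the response content. Actually, you want to be prepared for just about anything. FWIW, since you don't specify a timeout, your code may hang forever waiting for a response.

So what you actually want is something along these lines: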

try:
    response = requests.get(yoururl, timeout=some_appropriate_value)
    # cf requests doc
    response.raise_for_status()

# cf requests doc
except requests.exceptions.RequestException as e:
    # nothing else you can do here - depending on
    # the context (script? library code?),
    # you either want to re-raise the exception,
    # raise your own exception or, well, just
    # show the error message and exit.
    # Only you can decide what's the appropriate course
    print("couldn't fetch {}: {}".format(yoururl, e))
    return

if not response.headers['content-type'].startswith("text/html"):
    # idem - not what you expected, and you can't do much
    # except mentioning the fact to the caller one way
    # or another. Here I just print the error and return,
    # but if this is library code you want to raise an exception
    # instead
    print("{} returned non text/html content {}".format(yoururl, response.headers['content-type']))
    print("response content:\n\n{}\n".format(response.text))
    return

# etc...
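
For the library-code case mentioned in the comments above, the content-type check could raise instead of printing. This is a rough sketch, not a definitive implementation: `UnexpectedContentError` and `ensure_html` are hypothetical names, and the `SimpleNamespace` objects merely stand in for real responses so the example runs without network access. It also uses `.get()` on the headers, on the assumption that a server may omit the Content-Type header entirely.

```python
from types import SimpleNamespace


class UnexpectedContentError(Exception):
    """Hypothetical exception: the response is not the HTML we expected."""


def ensure_html(response):
    """Return the response text, or raise if the content type is not HTML.

    `response` only needs a `headers` mapping and a `text` attribute,
    both of which a requests.Response provides.
    """
    # .get() avoids a KeyError when no Content-Type header was sent
    content_type = response.headers.get("content-type", "")
    if not content_type.startswith("text/html"):
        raise UnexpectedContentError(
            "expected text/html, got {!r}".format(content_type)
        )
    return response.text


# Stand-ins for real responses (assumption: lowercase header keys)
ok = SimpleNamespace(headers={"content-type": "text/html; charset=utf-8"},
                     text="<html></html>")
bad = SimpleNamespace(headers={"content-type": "application/json"}, text="{}")

print(ensure_html(ok))
try:
    ensure_html(bad)
except UnexpectedContentError as exc:
    print("rejected:", exc)
```

The caller then decides what to do with the exception, which keeps that policy out of the fetching code.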

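requests has rather exhaustive documentation; I suggest you read more than the quickstart to learn and use it properly. And that is only half the job: even if you do get a 200 response with no redirect and the right content type, that does not mean the markup is what you expect, so here again you have to check what you get back from BeautifulSoup. For example, here: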

table = soup.findAll('table', id='mytable')
rows = table[0].findAll('tr')

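There is absolutely no guarantee that the markup contains any table with a matching id (nor any table at all, FWIW), so you have to either check beforehand or handle exceptions: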

tables = soup.findAll('table', id='mytable')
if not tables:
    # oops, no matching tables ?
    print("no table 'mytable' found in markup")
    print("markup:\n{}\n".format(response.text))
    return
rows = tables[0].findAll('tr')
# idem, the table might be empty, etc etc

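The funny thing with programming is that handling the nominal case is usually rather simple, but then you have to handle all the possible corner cases, and that usually takes as much code as the nominal case, or more ;-)

[Comments]:

Well, I hard-coded against the site's markup to scrape it, so I am quite sure about the values. What I am facing here is how to check whether the values were assigned to the variables, so that my code runs without errors. Could you help me with some sample code?

Your code will never be guaranteed to "run without errors". What you have to understand is that the response can be just whatever. As I already said, you first have to handle the possible corner cases at the request level: add a timeout, put a try/except clause around the requests.get call (catching only requests exceptions, of course), check the response status, content type, etc. (cf. the requests documentation). Then you have to handle errors at the soup level, starting with checking what soup.findAll() returns instead of blindly trusting the markup to be what you expect.

I edited my answer with more details, but do not blindly copy-paste my example code anyway. First, because proper error handling depends heavily on context, and only you know in which context your code is used; second, because it is only a very incomplete example of what can be done. You have to read the documentation of your libraries, test things, think about everything that could go wrong, and take the whole context into account to do the right thing.

Thanks for the suggestion, @bruno desthuilliers. I solved my problem using the try/except approach.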
