I have a loop with a GET request, and I get a 403 error after 5 iterations

【Title】: I have a Loop with a get Request that I get a 403 error after 5 loops 【Posted】: 2020-10-28 20:38:36 【Question】:

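My script loops over a GET request and concatenates the results into a pandas DataFrame to export to Excel. Everything works until the loop has run 5 times, and then the site returns a 403 error. The site somehow knows once I have requested 50k rows and responds with 403. Is there a workaround anyone can share with me? step is a variable at the end of the URL string that says how many rows to bring back. I can only fetch 10k at a time, otherwise it lags so much that it does not work. SKIP is another variable in the URL string that skips ahead a set of rows. The script is also very slow, so any tips on how to make it faster would be appreciated. Thanks.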

from selenium import webdriver
import time
import json
import pandas as pd
import requests
driver = webdriver.Chrome()
executor_url = driver.command_executor._url
session_id = driver.session_id

#put the url/website you are trying to scrape from here > this should be the url you go to when you login
driver.get(r"http://10.131.178.162:9090/xGLinear/login.html")

#waits 60 secs to give you time to login manually
time.sleep(60)

#this will copy all the cookies and login info you need from chrome and now you can start using requests
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])

res = s.get(r"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$skip=0&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top=10000")

data = json.loads(res.text)


TotalR=data['totalRows']

SKIP=10000

skip1=10000
total_count= int(TotalR/skip1)
step=10000

Count=0

df = pd.DataFrame()
try:
    while Count < total_count : 
        # interpolate the current offset (SKIP) and page size (step) into the URL
        res1= s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
        data1 = json.loads(res1.text)
        for d in data1['data']:
            dict_new = pd.DataFrame(d)
            df = pd.concat([df,dict_new])
        SKIP+=10000
        Count+=1

except:
    print(res1.status_code)
        

final=pd.DataFrame(data['data'])
final1=pd.DataFrame(final)
final2= pd.concat([df,final1])        
final2.to_excel(r'C:\Users\c\Desktop\xg.xlsx',index= False)

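【Comments】:

403 is a "Forbidden" error. You should probably wait between requests. I think the site recognizes that you are spamming many requests in a short time and blocks them (possibly some kind of DDoS protection that you triggered). 【Answer 1】: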

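There is no way around this; you have simply hit the limit.

One solution is to check the documentation and find out how often this count resets. You can then add a wait to keep a steady pace and avoid the 403 error code.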

import time
try:
    cpt = 0
    while Count < total_count : 
        # interpolate the current offset (SKIP) and page size (step) into the URL
        res1= s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
        data1 = json.loads(res1.text)
        for d in data1['data']:
            dict_new = pd.DataFrame(d)
            df = pd.concat([df,dict_new])
        SKIP+=10000
        Count+=1
        cpt += 1
        if cpt == 5:
            cpt = 0
            time.sleep(x)  # x is a placeholder: how many seconds you need to wait

except:
    print(res1.status_code)
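
If the reset interval is not documented, one alternative to a fixed wait every 5 requests is to pause briefly after every page and back off when a 403 comes back. The sketch below is only an illustration under assumptions: fetch_all_rows, fetch_page, pause and max_retries are hypothetical names that do not appear in the question, and it assumes the 403 is a temporary throttle rather than a hard quota. It also collects rows in a plain list, which may help with the speed complaint because the DataFrame can then be built once at the end.

```python
import time
from typing import Callable, Dict, List, Tuple


def fetch_all_rows(
    fetch_page: Callable[[int, int], Tuple[int, Dict]],
    total_rows: int,
    page_size: int = 10_000,
    pause: float = 1.0,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Fetch every page, pausing between requests and backing off on 403.

    fetch_page(skip, top) is a hypothetical callable that performs one
    request and returns (status_code, parsed_json_body).
    """
    rows: List[Dict] = []
    for skip in range(0, total_rows, page_size):
        delay = pause
        for _ in range(max_retries):
            status, payload = fetch_page(skip, page_size)
            if status == 200:
                rows.extend(payload["data"])
                break
            if status != 403:
                raise RuntimeError(f"unexpected status {status} at skip={skip}")
            # 403: assume we were throttled, so wait twice as long and retry
            delay *= 2
            sleep(delay)
        else:
            # the inner loop never hit `break`: every attempt returned 403
            raise RuntimeError(f"still 403 after {max_retries} tries at skip={skip}")
        # steady pace between successful pages
        sleep(pause)
    return rows
```

With the session s from the question, fetch_page could wrap the s.get(...) call and return (res.status_code, res.json()) on success, or (res.status_code, {}) when the status is not 200. Calling pd.DataFrame(rows) once on the result avoids running pd.concat for every row inside the loop.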
