I have a loop with a GET request that returns a 403 error after 5 loops
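Asked: 2020-10-28 20:38:36

Question: My script loops over a GET request and concatenates the results into a pandas DataFrame for export to Excel. Everything works until the loop has run 5 times, and then the site returns a 403 error. The site somehow knows once I have requested 50k rows and responds with 403. Is there a workaround anyone can share with me?

step is a variable at the end of the URL string that tells the site how many rows to return. I can only fetch 10k at a time, or it lags so much that it won't work. SKIP is another variable in the URL string that skips forward a set of rows. The script is also very slow; any tips on how to make it faster would be greatly appreciated. Thanks.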
from selenium import webdriver
import time
import json
import pandas as pd
import requests
driver = webdriver.Chrome()
executor_url = driver.command_executor._url
session_id = driver.session_id
#put the url/website you are trying to scrape from here > this should be the url you go to when you login
driver.get(r"http://10.131.178.162:9090/xGLinear/login.html")
#waits 60 secs to give you time to login manually
time.sleep(60)
#this will copy all the cookies and login info you need from chrome and now you can start using requests
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
res = s.get(r"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$skip=0&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top=10000")
data = json.loads(res.text)
TotalR=data['totalRows']
SKIP=10000
skip1=10000
total_count= int(TotalR/skip1)
step=10000
Count=0
df = pd.DataFrame()
try:
    while Count < total_count:
        # SKIP and step need braces so the f-string substitutes their values;
        # the second, conflicting $skip=0 parameter has been dropped.
        res1 = s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
        data1 = json.loads(res1.text)
        for d in data1['data']:
            dict_new = pd.DataFrame(d)
            df = pd.concat([df, dict_new])
        SKIP += 10000
        Count += 1
except Exception:
    # Report the status of the last response when parsing or concatenation fails
    print(res1.status_code)
final=pd.DataFrame(data['data'])
final1=pd.DataFrame(final)
final2= pd.concat([df,final1])
final2.to_excel(r'C:\Users\c\Desktop\xg.xlsx',index= False)
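Comments:
403 is a "Forbidden" error. You should probably wait between requests. I think the site recognizes that you are sending many requests in a short time and is blocking them (you may have triggered some kind of DDoS protection).

Answer 1:
There is no way around this; you have simply hit the limit.
One solution is to check the documentation and find out how often this count resets. You can then add a wait to keep a steady pace and get rid of the 403 error code.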
import time

x = 60  # placeholder value (an assumption): set it to however many seconds the site needs

try:
    cpt = 0  # number of requests since the last pause
    while Count < total_count:
        # SKIP and step need braces so the f-string substitutes their values
        res1 = s.get(f"http://10.131.178.162:9090/orders/OrderStatus?$dataAccess=ALL&$skip={SKIP}&$dateRange=ordered&$endDate=06%2F28%2F2020&$filter=%7B%22operator%22:%22AND%22,%22criteria%22:%5B%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22lineMode%22,%22value%22:%22R%22%7D,%7B%22operator%22:%22EQUALS%22,%22fieldName%22:%22creditHold%22,%22value%22:%22N%22%7D,%7B%22operator%22:%22OR%22,%22criteria%22:%5B%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22AP%22%7D,%7B%22fieldName%22:%22status%22,%22operator%22:%22EQUALS%22,%22value%22:%22SC%22%7D%5D%7D%5D%7D&$sortBy=%5B%22-key.orderlineId%22%5D&$startDate=06%2F22%2F2020&$top={step}")
        data1 = json.loads(res1.text)
        for d in data1['data']:
            dict_new = pd.DataFrame(d)
            df = pd.concat([df, dict_new])
        SKIP += 10000
        Count += 1
        cpt += 1
        if cpt == 5:
            # Pause after every 5 requests, then start counting again
            cpt = 0
            time.sleep(x)
except Exception:
    print(res1.status_code)
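The same idea can be kept separate from the long URL. What follows is only a minimal sketch under stated assumptions, not the site's documented behavior: `fetch_all_rows` and `fetch_page` are hypothetical names, `fetch_page(skip, top)` is assumed to return a `(status_code, rows)` pair, and the defaults (5 requests per pause, 60 seconds, 3 retries) are placeholders, since the real limit and reset interval are unknown. It pauses after every few requests, backs off and retries when a 403 comes back, and gathers all rows in one list.

```python
import time


def fetch_all_rows(fetch_page, total_rows, page_size=10000,
                   pause_every=5, pause_seconds=60.0,
                   max_retries=3, sleep=time.sleep):
    """Collect rows page by page while staying under an assumed request limit.

    fetch_page(skip, top) is a hypothetical callable returning
    (status_code, rows); all default values are placeholders.
    """
    rows = []
    requests_made = 0
    for skip in range(0, total_rows, page_size):
        for attempt in range(max_retries + 1):
            # Pause before a request once every `pause_every` requests
            if requests_made > 0 and requests_made % pause_every == 0:
                sleep(pause_seconds)
            status, page = fetch_page(skip, page_size)
            requests_made += 1
            if status == 200:
                rows.extend(page)
                break
            # Give up on anything other than 403, or once retries run out
            if status != 403 or attempt == max_retries:
                raise RuntimeError(
                    f"request at skip={skip} failed with status {status}")
            # On 403, wait longer after each failed attempt, then retry
            sleep(pause_seconds * 2 ** attempt)
    return rows
```

With the session from the question, `fetch_page` could wrap `s.get(...)` and return `(res.status_code, res.json()['data'])`. Calling `pd.DataFrame(rows)` once at the end, instead of `pd.concat` inside the loop, avoids copying the growing frame on every row and is usually much faster.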