Python: HTTP Error 500 when crawling a web page with urllib2

Posted 2023-03-10


[Posted]: 2015-12-03 18:24:48
[Question]:

I am using urllib, urllib2, and cookielib to crawl a website: I GET the login page and then POST the login data.

import re
import urllib
import urllib2
import cookielib

def getpage():
    codeurl = r"http://www.xxx/sign_in"
    request = urllib2.Request(codeurl)
    response = urllib2.urlopen(request)
    return response

def parsecode(response):
    """
    parse the login page to get the changed code
    """
    pattern = re.compile(r"""<meta.*?csrf-token.*?content=(.*?)\s/>""")
    code = re.findall(pattern, response.read())[0]
    return code


def Hand():
    """
    deal with cookie and header
    """
    headers = {
        "Referer": "xxx",
        "User-Agent": "xxx"
    }
    ck = cookielib.MozillaCookieJar()
    handle = urllib2.HTTPCookieProcessor(ck)
    openner = urllib2.build_opener(handle)
    head = []
    for key, value in headers.items():
        tup = (key, value)
        head.append(tup)
    openner.addheaders = head
    return openner


def postdata(code, openner):
    """
    post the data xxx.com needed
    """
    logurl = r"http://www.jianshu.com/sessions"
    sign_in = {"name": "xxx", "password": "xxx", "authenticity_token": code}
    data = urllib.urlencode(sign_in).encode("utf-8")
    x = openner.open(logurl, data)
    # NOTE: as posted, ck is local to Hand() and is not defined in this scope
    for item in ck:
        print item

However, I get this error:

Traceback (most recent call last):
  File "jianshu.py", line 80, in <module>
    postdata(code,openner)
  File "jianshu.py", line 43, in postdata
    x=openner.open(logurl,data)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
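The traceback only reports the status line. An HTTPError object is also file-like, so the body the server sent with the 500 can be read and often says more about what it rejected. The following is a minimal sketch of that idea, not a fix for this particular site; the helper names `describe_http_error` and `open_or_describe` are hypothetical, and the import fallback is only there so the same code runs on Python 2 (as in the question) and Python 3.

```python
try:  # Python 2, as used in the question
    from urllib2 import HTTPError
except ImportError:  # Python 3 equivalent location
    from urllib.error import HTTPError


def describe_http_error(err, limit=500):
    """Summarize an HTTPError: status code, message, and the start of the body."""
    # HTTPError wraps the response, so read() returns the error page bytes.
    body = err.read(limit)
    return {"code": err.code, "msg": err.msg, "body": body}


def open_or_describe(opener, url, data=None):
    """Return (response, None) on success, or (None, summary) on an HTTP error.

    `opener` is assumed to be any object with an open(url, data) method,
    such as the one returned by build_opener().
    """
    try:
        return opener.open(url, data), None
    except HTTPError as err:
        return None, describe_http_error(err)
```

With the question's code this could be called as `open_or_describe(openner, logurl, data)`, printing the returned summary when the first element is None.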


[Answer 1]:

Is it possible that you are missing a ' between the 'r' and the 'http://...' on this line:

codeurl=r"http://www.xxx/sign_in"
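Separately from the quoting question, one thing that may be worth checking is what parsecode actually captures. The question's pattern `content=(.*?)\s/>` keeps the quotation marks around the attribute value, so the token sent as authenticity_token would include them. Whether that contributes to the 500 here is not confirmed. The sketch below compares the original pattern with a stricter one; the sample HTML, the name `extract_csrf_token`, and the assumption that the `name` attribute precedes a double-quoted `content` attribute are all illustrative.

```python
import re

# Pattern from the question: the capture group includes the surrounding quotes.
ORIGINAL_RE = re.compile(r"""<meta.*?csrf-token.*?content=(.*?)\s/>""")

# Stricter sketch: capture only what is between the double quotes.
# Assumes name="csrf-token" appears before content="..." in the same tag.
TOKEN_RE = re.compile(r'<meta[^>]*name="csrf-token"[^>]*content="([^"]*)"')


def extract_csrf_token(html):
    """Return the csrf-token value without quotes, or None if it is absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


# Hypothetical login-page fragment, used only for illustration.
sample = '<meta name="csrf-token" content="abc123" />'
with_quotes = ORIGINAL_RE.findall(sample)[0]
without_quotes = extract_csrf_token(sample)
```

Returning None instead of indexing `[0]` also avoids an IndexError when the page does not contain the tag.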

