Python: Get HTTP headers from urllib2.urlopen call?

Posted 2023-02-16


[Asked]: 2010-10-25 00:15:07

[Question]:

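Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to read just the HTTP response headers without getting the page. It looks like urllib2 opens the HTTP connection and then gets the actual HTML page... or does it just start buffering the page with the urlopen call?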

import urllib2
myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers

html = page.readlines()  # stream page

[Question comments]:

[Answer 1]:

Use the response.info() method to get the headers.

From the urllib2 docs:

urllib2.urlopen(url[, data][, timeout])

...

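This function returns a file-like object with two additional methods:

geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info(): return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)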

So, following your example, try stepping through the result of response.info().headers to find what you're looking for.

Note that the major caveat to using httplib.HTTPMessage is documented in python issue 4773.
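For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea might look like the sketch below. It is not code from the answer: the helper name fetch_headers and its timeout default are assumptions made for illustration. urlopen still sends a GET here; the sketch simply never calls read(), so the body is not pulled into Python.

```python
import urllib.request


def fetch_headers(url, timeout=10):
    """Return (status, headers) for url without reading the response body.

    fetch_headers is a hypothetical helper name, not part of urllib.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.headers is an http.client.HTTPMessage. items() yields
        # (name, value) pairs; dict() keeps one value per header name.
        return response.status, dict(response.headers.items())
```

Calling fetch_headers(myurl) with the URL from the question would return the status code together with a plain dictionary of headers.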

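[Comments]:

Python 3 note: first, there is nothing like response.info().headers; do dict(response.info()) instead. Second, for the HTTP status code, use response.status.

Does this fetch only the headers, or does it just print only the headers?

Where is headers documented? Also consider using response.info().items(), which returns the key-value pairs.

Python 2 note: this is what you want: response.info().getheader('Content-Type'). Source: ***.com/questions/1653591/…

Actually, for Python 3, response.headers will do; for more info see http.client.HTTPResponse.

[Answer 2]:

What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.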

>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
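In Python 3 the httplib module is named http.client, and the session above could be wrapped roughly as follows. This is a sketch under assumptions: the function name head and its port and timeout defaults are illustrative and not taken from the answer.

```python
import http.client


def head(host, path="/", port=80, timeout=10):
    """Send a single HEAD request and return (status, reason, headers).

    head is a hypothetical helper name used only for this sketch.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        # getheaders() returns a list of (name, value) tuples.
        return res.status, res.reason, res.getheaders()
    finally:
        conn.close()
```

For example, head("www.google.com", "/index.html") would mirror the interactive session shown above.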

[Comments]:

[Answer 3]:

Actually, it appears that urllib2 can do an HTTP HEAD request.

The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.

Here's my take on it:

import urllib2

# Derive from Request class and override get_method to allow a HEAD request.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

myurl = 'http://bit.ly/doFeT'
request = HeadRequest(myurl)

try:
    response = urllib2.urlopen(request)
    response_headers = response.info()

    # This will just display all the dictionary key-value pairs.  Replace this
    # line with something useful.
    print (response_headers.dict)

except urllib2.HTTPError as e:
    # Prints the HTTP Status code of the response but only if there was a 
    # problem.
    print ("Error code: %s" % e.code)

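If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending out a HEAD request, rather than a GET.

This is the HTTP request and response from the code above, as captured by Wireshark: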

HEAD /doFeT HTTP/1.1
Accept-Encoding: identity
Host: bit.ly
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 301 Moved
Server: nginx
Date: Sun, 19 Feb 2012 13:20:56 GMT
Content-Type: text/html; charset=utf-8
Cache-control: private; max-age=90
Location: http://www.kidsidebyside.org/?p=445
MIME-Version: 1.0
Content-Length: 127
Connection: close
Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly

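However, as mentioned in one of the comments on the other question, if the URL in question involves a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming if you really want to make only HEAD requests.

The request above involves a redirect. Here is the request to the destination, as captured by Wireshark: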

GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Accept-Encoding: identity
Host: www.kidsidebyside.org
Connection: close
User-Agent: Python-urllib/2.7
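One way around the GET-on-redirect behaviour, sketched here for Python 3's urllib.request, is to stop the library from following redirects and to walk the chain by hand with a HEAD request at each hop. This is an illustration rather than the answer's method: the names NoRedirect and head_chain, the max_hops limit, and the list of status codes treated as redirects are all assumptions.

```python
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects on its own (hypothetical name)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def head_chain(url, max_hops=5, timeout=10):
    """Follow redirects by hand, sending HEAD at every hop.

    Returns a list of (url, status) pairs, one per request sent.
    """
    opener = urllib.request.build_opener(NoRedirect)
    hops = []
    for _ in range(max_hops):
        request = urllib.request.Request(url, method="HEAD")
        try:
            with opener.open(request, timeout=timeout) as response:
                hops.append((url, response.status))
                return hops
        except urllib.error.HTTPError as error:
            hops.append((url, error.code))
            location = error.headers.get("Location")
            error.close()
            if error.code not in (301, 302, 303, 307, 308) or not location:
                return hops
            # Location may be relative, so resolve it against the current URL.
            url = urllib.parse.urljoin(url, location)
    return hops
```

Because every request is built with method="HEAD", the destination is never fetched with GET, at the cost of handling the redirect logic yourself.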

An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:

import httplib2

url = "http://bit.ly/doFeT"
http_interface = httplib2.Http()

try:
    response, content = http_interface.request(url, method="HEAD")
    print ("Response status: %d - %s" % (response.status, response.reason))

    # This will just display all the dictionary key-value pairs.  Replace this
    # line with something useful.
    print (response.__dict__)

except httplib2.ServerNotFoundError as e:
    print (e.message)

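This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

Here's the first request: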

HEAD /doFeT HTTP/1.1
Host: bit.ly
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

Here's the second request, to the destination:

HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Host: www.kidsidebyside.org
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

[Comments]:

I missed it the first time I read the answer, but response.info().dict is exactly what I was looking for. This is not explained in the docs.

[Answer 4]:

urllib2.urlopen does an HTTP GET (or a POST if you supply a data argument), not an HTTP HEAD (if it did the latter, you of course couldn't call readlines or otherwise access the page body).
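The choice of method can be inspected without sending anything. The sketch below uses Python 3's urllib.request (the successor to urllib2); the URL is only a placeholder, and the method="HEAD" argument is a Python 3 addition that the urllib2 answer above does not rely on.

```python
import urllib.request

url = "http://example.com/"  # placeholder URL; no request is sent below

# Building a Request does not touch the network; get_method() only reports
# which HTTP method urlopen would use for it.
plain = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=b"key=value")
explicit_head = urllib.request.Request(url, method="HEAD")

print(plain.get_method())          # GET
print(with_data.get_method())      # POST
print(explicit_head.get_method())  # HEAD
```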

[Comments]:

[Answer 5]:

One-liner:

$ python -c "import urllib2; print urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1)).open(urllib2.Request('http://google.com'))"
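A rough Python 3 counterpart of the one-liner, assuming urllib.request in place of urllib2, is sketched below. The helper name make_debug_opener is hypothetical. With debuglevel=1 the underlying http.client connection echoes the request and the response headers to standard output; for https URLs an HTTPSHandler with the same argument would be needed as well.

```python
import urllib.request


def make_debug_opener():
    """Build an opener whose HTTP connections print the exchange to stdout.

    make_debug_opener is a hypothetical helper name for this sketch.
    """
    return urllib.request.build_opener(
        urllib.request.HTTPHandler(debuglevel=1)
    )
```

Calling make_debug_opener().open('http://google.com') would then print the exchange in the same way as the command above.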

[Comments]:

[Answer 6]:

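import urllib2

def _GetHtmlPage(self, addr):
  headers = { 'User-Agent' : self.userAgent,
              'Cookie' : self.cookies }

  # Attach the custom headers to the request.
  req = urllib2.Request(addr, headers=headers)
  response = urllib2.urlopen(req)

  # info() returns the response headers; printing it shows them all.
  print "ResponseInfo="
  print response.info()

  resultsHtml = unicode(response.read(), self.encoding)
  return resultsHtml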

[Comments]:
