How to handle redirects when crawling web pages with Python

Posted 2023-03-25

Preface: this article was compiled by the editors of 小常识网 (cha138.com). It covers how to handle redirects when crawling web pages with Python and is intended as a reference.

Reference answer A
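1. Basic principles of web crawlers
A traditional crawler starts from the URLs of one or several seed pages and collects the URLs found on them. While fetching pages, it keeps extracting new URLs from the current page and putting them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it has to filter out links unrelated to the topic according to some page-analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to some search strategy and repeats the process until a system condition is reached.
2. Basic design
As you said: first simulate a login on the Weibo login page, fetch the page, find all URLs in it, select the URLs whose text meets the requirements, simulate clicking them, and repeat the fetch until the exit condition is met.
3. Existing projects
The Google project site hosts a project called sinawler, a crawler dedicated to Sina Weibo that is used to fetch Weibo content. The site may not be reachable, for reasons you will know. However, you can search Baidu for “python编写的新浪微博爬虫（现在的登陆方法见新的一则微博）” to find reference source code, which is written in Python 2. If you write it in Python 3, you can use urllib.request to simulate a browser with cookies, which saves handling cookies by hand and makes the code shorter.
4. The deeper material on web crawlers, such as algorithm analysis and strategy systems, is a great help for raising the technical level of the code from a theoretical angle.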
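
The answer mentions using urllib.request in Python 3 to build a browser that carries cookies. A minimal sketch of that idea follows; the helper name `build_browser` and the User-Agent value are illustrative assumptions, and no request is actually sent.

```python
import http.cookiejar
import urllib.request


def build_browser(user_agent="Mozilla/5.0 (sketch)"):
    """Return (opener, jar): an opener that stores cookies between requests."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor reads Set-Cookie from responses and adds Cookie to requests
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]  # illustrative header value
    return opener, jar


opener, jar = build_browser()
# Nothing is fetched here. opener.open(url) would follow 301/302 responses through
# the default HTTPRedirectHandler and keep any received cookies in `jar`.
```

With this approach, redirects and cookies are handled by the standard library, at the cost of less control over each individual request than a hand-written client gives.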

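Handling redirects with the httplib library in Python web page analysis

1. Page handling

The figure below shows the packet-capture analysis of the actual operation in the browser; the other steps are not described here.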

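1) Start from the selected POST /main.aspx.

2) The server then replies with a 302 redirect to the /cd_chose.aspx page.

3) The capture then shows a GET of the redirect URL; the GETs of the css and js files are not covered here.

4) POST to /cd_chose.aspx.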
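
The first three steps of this flow can be sketched with http.client against a small local stand-in server. Everything named here (`DemoHandler`, `redirect_path`, `post_then_follow`, the form body `a=1`) is a hypothetical illustration, not the real site's behavior; only the paths /main.aspx and /cd_chose.aspx come from the capture.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urljoin, urlsplit


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in server: POST /main.aspx answers 302, any GET answers 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the form body
        if self.path == "/main.aspx":
            self.send_response(302)
            self.send_header("Location", "/cd_chose.aspx")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self._ok(b"posted")

    def do_GET(self):
        self._ok(b"chooser page")

    def _ok(self, body):
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


def redirect_path(request_url, location):
    """Resolve a Location header (absolute or relative) to a request path."""
    parts = urlsplit(urljoin(request_url, location))
    return parts.path + ("?" + parts.query if parts.query else "")


def post_then_follow(host, port, path, form):
    """POST the form; on a 302, GET the Location target as a plain GET."""
    conn = HTTPConnection(host, port, timeout=10)
    try:
        conn.request("POST", path, form,
                     {"Content-Type": "application/x-www-form-urlencoded"})
        resp = conn.getresponse()
        resp.read()  # read the body before sending the next request
        if resp.status != 302:
            return resp.status, path
        target = redirect_path("http://%s:%d%s" % (host, port, path),
                               resp.getheader("Location", ""))
        conn.request("GET", target)  # no body and no Content-Type on the GET
        resp = conn.getresponse()
        resp.read()
        return resp.status, target
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
result = post_then_follow("127.0.0.1", server.server_port, "/main.aspx", "a=1")
server.shutdown()
server.server_close()
print(result)
```

Unlike urllib.request, http.client does not follow redirects by itself, so the client has to read the Location header and issue the GET explicitly. The final POST to /cd_chose.aspx in step 4 would be one more `conn.request("POST", ...)` call of the same shape.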

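2. Simulating the flow in Python

2.1 Packet-capture analysis: the subsequent GET cannot be sent

I checked the capture taken in IE again.

No GET method appeared in it.

I suspected that a direct POST was required, but trying that still failed. Looking closely at the POST content, there was a GET header inside it. Since I do not know much about how IE displays headers, I did not dig further.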

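2.2 Checking the message format

HTTP headers had already been defined before the GET of the redirect page.

Comparing them with the headers that the browser actually sent successfully, I found that my Python code defined one extra header, "Content-Type", mainly because the preceding POST method needs it.

In the actual flow, earlier messages (the POST) need this header, but this GET message does not.

I removed the header.

After that, the message flow from Python was normal.

Because my HTTP basics were shaky, I was unsure which direction to take when the problem appeared and kept assuming that the redirect flow involved some other complicated handling. That cost a lot of time.

In the end it was only a problem with a single header.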
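
One way to avoid this kind of leak is to build the headers per request instead of sharing one mutable dict. The sketch below is an assumption about how that could look; `BASE_HEADERS`, its values, and `headers_for` are hypothetical names, not part of the original code.

```python
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (sketch)",  # illustrative value
    "Accept": "text/html",
}


def headers_for(method, base=BASE_HEADERS):
    """Return a copy of the base headers, adding body headers only for POST."""
    headers = dict(base)  # copy, so the shared dict is never mutated
    if method.upper() == "POST":
        headers["Content-Type"] = "application/x-www-form-urlencoded"
    return headers


get_headers = headers_for("GET")
post_headers = headers_for("POST")
print("Content-Type" in get_headers, "Content-Type" in post_headers)
```

Because each call returns a fresh dict, a Content-Type set for a form POST cannot end up on the GET that follows the redirect.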

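Finally, here are the wrapped HTTP GET and POST methods. They call the httplib library (named http.client in Python 3), which is flexible and convenient: following the front-end JS code, you can generate the special fields yourself that the server uses for authentication.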

# Python 3 version of the helpers (httplib is named http.client in Python 3).
# The post shows only the two methods; the class name and constructor below are
# assumptions, added so that self.host and self.headers exist.
import gzip
import re
import time
from copy import deepcopy
from http.client import HTTPConnection


class HttpClient(object):

    def __init__(self, host, headers=None):
        self.host = host                     # host name only, without scheme or path
        self.headers = dict(headers or {})   # request headers shared by all calls

    def http_get(self, connDefault=None, url='', bodyFlag=False, refererFresh=False, referer=''):
        """GET url and return (status, body); status is 1 if the request itself fails."""
        status, infor = 1, ''
        # Reuse the caller's connection if one is given, otherwise open a private one
        if connDefault is None:
            conn = HTTPConnection(self.host, timeout=60)
        else:
            conn = connDefault

        try:
            print('http_get -> enter to get ', url)
            start = time.time()
            conn.request('GET', url, headers=self.headers)

            print('http_get -> wait the response...')
            response = conn.getresponse()
            # Always read the body so that a shared connection can send the next request
            raw = response.read()
            end = time.time()
            print('http_get -> info:', end - start, response.status)
            print('http_get -> response headers', response.getheaders())

            # Status code
            status = response.status
            if status != 200:
                print('http_get -> http status error', status)
                infor = 'error'
            else:
                # Set-Cookie looks like: ASP.NET_SessionId=pzt0bs55tc2fjrbv0canht45; path=/; HttpOnly
                cookie = response.getheader('Set-Cookie', '')

                # Accumulate the cookie into the shared request headers
                if cookie != '':
                    # Two kinds of cookie keys are of interest
                    print('http_get -> peer Set-Cookie', cookie)
                    pattern = re.compile(r'(key=[\w=+/]+;|ASP\.NET_SessionId=[\w=+/]+;)')
                    match = pattern.search(cookie)
                    if match is not None:
                        pair = match.group(1)[:-1]  # drop the trailing ';'
                        oCookie = self.headers.get('Cookie', '')
                        if oCookie == '':
                            self.headers['Cookie'] = pair
                        else:
                            self.headers['Cookie'] = oCookie + '; ' + pair
                        print('http_get -> request Cookie', self.headers['Cookie'])

                # Update the Referer used by the following requests
                if refererFresh:
                    if referer != '':
                        self.headers['Referer'] = 'http://' + self.host + referer
                    else:
                        self.headers['Referer'] = 'http://' + self.host + url

                # The encoding is declared in the Content-Encoding header when gzip is used
                content_encoding = response.getheader('Content-Encoding', '')
                if bodyFlag:
                    if content_encoding == 'gzip':
                        infor = gzip.decompress(raw)
                    else:
                        infor = raw

        except Exception as ex:
            print('http_get -> error:', ex)
            status, infor = 1, ex
        finally:
            # Close only a connection that this call opened itself
            if connDefault is None:
                conn.close()
        return status, infor

    def http_post(self, connDefault=None, url='', PostStr=''):
        """POST a form body; return (302, Location), (200, '') or (status, message)."""
        status, response = 1, ''
        if connDefault is None:
            conn = HTTPConnection(self.host, timeout=60)
        else:
            conn = connDefault

        try:
            # Work on a copy so that Content-Type does not leak into later GET requests
            headers = deepcopy(self.headers)
            headers['Content-Type'] = 'application/x-www-form-urlencoded'
            headers['Content-Length'] = str(len(PostStr))
            start = time.time()
            conn.request('POST', url, PostStr, headers=headers)
            response = conn.getresponse()
            response.read()  # read the body so a shared connection stays usable
            end = time.time()
            print('http_post info:', end - start, response.status)

            # Redirect: return the target so that the caller can issue the GET
            if response.status == 302:
                location = response.getheader('Location', '')
                status, response = 302, location
            # Normal submission
            elif response.status == 200:
                status, response = 200, ''
            else:
                status, response = response.status, 'does not support'
        except Exception as ex:
            print('http_post -> error:', ex)
            status, response = 1, ex
        finally:
            if connDefault is None:
                conn.close()
        return status, response
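
The regular expression in http_get recognizes only two cookie names. As a possible alternative, the standard library's http.cookies module can parse a Set-Cookie value generically. This is a sketch under that assumption; `merge_set_cookie` is a hypothetical helper, not part of the original code.

```python
from http.cookies import SimpleCookie


def merge_set_cookie(cookie_header, set_cookie):
    """Merge one Set-Cookie value into an existing Cookie request header."""
    jar = SimpleCookie()
    jar.load(cookie_header)  # existing "name=value; name2=value2" pairs
    jar.load(set_cookie)     # path, HttpOnly, etc. become attributes, not cookies
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())


merged = merge_set_cookie(
    "key=abc",
    "ASP.NET_SessionId=pzt0bs55tc2fjrbv0canht45; path=/; HttpOnly",
)
print(merged)
```

A difference from the regex approach is that a cookie sent again under the same name replaces the old value instead of being appended a second time.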
