Analyzing Ajax Requests to Scrape Street-Snap Photos from Toutiao
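Project description
This project uses Toutiao (今日头条) as an example of scraping web page data by analyzing Ajax requests.
For some pages, the HTML returned by the request does not contain the content we see in the browser. That is because the content is loaded via Ajax and rendered by JavaScript, so we have to analyze the page's network requests to find where the data actually comes from.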
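To make this concrete, here is a small sketch that uses only the standard library. The sample page in RAW_HTML, the loadResults call and its endpoint are made up for illustration and are not Toutiao's actual markup. The point is only that the text a parser can see in the initial HTML contains no results, because a script is expected to fill the empty container later.

```python
from html.parser import HTMLParser

# Hypothetical page: the result container is empty and a script is
# expected to fetch the real content later (not Toutiao's real markup).
RAW_HTML = """
<html>
  <head><title>Search</title></head>
  <body>
    <div id="results"></div>
    <script>loadResults("/hypothetical/ajax/endpoint");</script>
  </body>
</html>
"""


class TextCollector(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a static HTML parser can see in the page."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Only the page title is present; the results are not in the HTML.
print(visible_text(RAW_HTML))
```

When the text you see in the browser is missing from this kind of output, the data is most likely arriving through a separate request.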
Preparation
Python 3, requests, Beautiful Soup, MongoDB, pymongo
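Scraping analysis
Before scraping, work out the logic first. Open the Toutiao home page at https://www.toutiao.com/
There is a search box in the upper right corner. We want street-snap photos, so type "街拍" (street snap) and run the search.
Now open the browser's developer tools and look at the network requests. Start with the first request, whose URL is the address of the current page: https://www.toutiao.com/search/?keyword=街拍
Refresh the page and inspect the response: the content shown on the page is not in it.
Switch to the XHR filter, and the data we need is there.
In that response, article_url is the link to each article's detail page.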
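As a minimal sketch of this step, the snippet below pulls every article_url out of an XHR response body. It assumes the body is a JSON object whose data field is a list of entries, some of which carry an article_url. The function name parse_page_index and the SAMPLE payload are made up for illustration.

```python
import json


def parse_page_index(text):
    """Yield the article_url of each entry in an index-page JSON response."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return  # not JSON: nothing to yield
    if not isinstance(payload, dict):
        return
    for item in payload.get("data") or []:
        # Some entries (for example image cells) have no article_url.
        if isinstance(item, dict) and item.get("article_url"):
            yield item["article_url"]


# Hypothetical, heavily trimmed payload in the assumed shape.
SAMPLE = json.dumps({
    "data": [
        {"cell_type": 51},
        {"article_url": "http://toutiao.com/group/6549135496977580551/"},
    ]
})

print(list(parse_page_index(SAMPLE)))
```

Writing it as a generator lets the caller fetch each detail page as soon as its link is found, and entries without a link are skipped silently.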
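Next, look at the Headers tab: it shows the request parameters we need to construct.
Open a detail page and inspect its response: the image links are found in the Doc (document) request. URLs in this form cannot be extracted with a parsing library such as Beautiful Soup, but they can be matched with regular expressions using the re module.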
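The snippet below sketches the regular-expression approach. The exact markup of the detail page is not shown here, so this is an assumption: the page is taken to embed its image list in a script as a JavaScript assignment of the form var gallery = {...}; with a sub_images list. The variable name gallery, the key sub_images, the function name parse_page_detail and the sample page are all hypothetical. Check the real response in developer tools and adjust the pattern.

```python
import json
import re

# Assumed embedding: `var gallery = {...};` inside a <script> element.
GALLERY_PATTERN = re.compile(r"var gallery = (\{.*?\});", re.S)


def parse_page_detail(html):
    """Return the image URLs embedded in a detail page, or [] if none."""
    match = GALLERY_PATTERN.search(html)
    if match is None:
        return []
    try:
        gallery = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    return [
        image["url"]
        for image in gallery.get("sub_images", [])
        if isinstance(image, dict) and "url" in image
    ]


# Hypothetical detail page in the assumed format.
DETAIL_HTML = (
    '<html><body><script>var gallery = {"count": 2, "sub_images": '
    '[{"url": "https://example.com/img/1.jpg"}, '
    '{"url": "https://example.com/img/2.jpg"}]};</script></body></html>'
)

print(parse_page_detail(DETAIL_HTML))
```

The regular expression only locates the embedded object. Handing the captured text to json.loads is more robust than trying to match each URL with a pattern of its own.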
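Workflow
1. Fetch the index page
Use requests to request the target site, get the content of the index page, and return the result.
2. Fetch the detail pages
Parse the returned result to get the detail-page links, then fetch the content of each detail page.
3. Download images and save to the database
Download the images to local disk, and save the page information and image URLs to MongoDB.
4. Loop and use multiple threads
Iterate over multiple result pages, using multiple threads to speed up the crawl.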
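Step 3 above can be sketched as follows. This is one possible layout, not a fixed design: the function names save_image and save_to_mongo, the record fields, and the choice to name files after a hash of their content are assumptions. The collection argument is anything with an insert_one() method, such as a pymongo collection.

```python
import hashlib
import os


def save_image(content, directory="."):
    """Write image bytes to <sha256>.jpg in directory and return the path.

    Naming the file after a hash of its content means the same image
    downloaded twice is stored only once.
    """
    os.makedirs(directory, exist_ok=True)
    name = hashlib.sha256(content).hexdigest() + ".jpg"
    path = os.path.join(directory, name)
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(content)
    return path


def save_to_mongo(collection, title, page_url, image_urls):
    """Insert one record describing a detail page and return it."""
    record = {"title": title, "url": page_url, "images": list(image_urls)}
    collection.insert_one(record)
    return record


# With requests and pymongo this could be wired up like so (the
# database and collection names are made up):
#   from pymongo import MongoClient
#   collection = MongoClient("localhost", 27017)["toutiao"]["jiepai"]
#   path = save_image(requests.get(image_url).content, "images")
#   save_to_mongo(collection, title, page_url, image_urls)
```

Passing the collection in as an argument keeps the database connection out of the helper functions, so they can be tested without a running MongoDB server.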
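Hands-on
Figure: first request
Figure: second request
Having analyzed the logic of the Ajax requests, we can now implement the photo download in code.
First, implement get_page_index() to load the result of a single Ajax request. The parameters that vary are offset and keyword, so we pass them in as arguments. The index-page request is implemented as follows: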
    import requests
    from urllib.parse import urlencode
    from requests.exceptions import RequestException


    def get_page_index(offset, keyword):
        # Query-string parameters of the search Ajax request
        parameters = {
            'offset': offset,
            'format': 'json',
            'keyword': keyword,
            'autoload': 'true',
            'count': '20',
            'cur_tab': '1',
            'from': 'search_tab'
        }
        url = 'https://www.toutiao.com/search_content/?' + urlencode(parameters)
        try:
            response = requests.get(url)
            # Only a 200 response carries usable data
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            print('Error requesting the index page')
            return None


    def main():
        html = get_page_index(0, '街拍')
        print(html)


    if __name__ == '__main__':
        main()
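Here we use urlencode() to build the GET parameters of the request, then fetch the resulting URL with requests. If the status code is 200, the response text is returned. Running the script prints the JSON response returned by the server: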
Abridged; ... marks omitted fields and entries.

{"count": 20, "action_label": "click_search", "return_count": 16, "has_more": 1, "page_id": "/search/", "request_id": "2018042818492217201614709945825F", "cur_tab": 1, "tab": {"tab_list": [{"tab_name": "\u7efc\u5408", "tab_id": 1, "tab_code": "news"}, ...], "cur_tab": 1}, "offset": 20, "action_label_web": "click_search", "show_tabs": 1,
 "data": [
  {"cell_type": 51, "key_info": {}, "keyinfo": {}, "display": {"status": 1, "term": "\u8857\u62cd", ..., "total_count": 446254, ...}, "tokens": ["\u8857\u62cd"], "id_str": "1914010784", "ala_src": "image", "id": 1914010784},
  {..., "media_name": "\u8003\u7814\u8def\u4e0a\u594b\u6597\u7684\u4e8c\u72d7\u5b50", ..., "image_list": [{"url": "//p3.pstatp.com/list/pgc-image/152483930041467a89bbd8e"}, {"url": "//p3.pstatp.com/list/pgc-image/15248393003721597ed0a60"}, {"url": "//p3.pstatp.com/list/pgc-image/1524839300384b0afa5f3f1"}, {"url": "//p9.pstatp.com/list/pgc-image/1524839300323a289ed6c8d"}], "datetime": "2018-04-27 22:30:33", ..., "has_gallery": true, "id": "6549135496977580551", ..., "title": "\u4e0a\u6d77\u8857\u62cd\u56fe", ..., "article_url": "http://toutiao.com/group/6549135496977580551/", ..., "gallary_image_count": 4, ..., "keyword": "\u8857\u62cd", ..., "image_url": "//p1.pstatp.com/large/pgc-image/152483930041467a89bbd8e", "has_image": true, ..., "group_id": "6549135496977580551"},
  {..., "media_name": "\u6d77\u73b2\u65f6\u5c1a", ..., "has_gallery": true, "id": "6549091673454936590", ..., "title": "\u6b66\u6c49\u8857\u62cd\uff0c\u5929\u84dd\u8272\u725b\u4ed4\u886c\u8863\u642d\u914d\u7834\u6d1e\u725b\u4ed4\u88e4\uff0c\u7b80\u6d01\u5927\u65b9\u53c8\u767e\u642d\uff01", ..., "article_url": "http://toutiao.com/group/6549091673454936590/", ..., "gallary_image_count": 9, ..., "group_id": "6549091673454936590"},
  {..., "media_name": "\u6d77\u73b2\u65f6\u5c1a", ..., "article_url": "http://toutiao.com/group/6549099779656253960/", "datetime": "2018-04-27 20:11:57", ..., "has_gallery": false, "id": "6549099779656253960", ..., "title": "\u5317\u4eac\u8857\u62cd\uff0c\u590f\u65e5\u4f11\u95f2\u7a7f\u642d\u98ce\u683c\uff0c\u4f60\u4eec\u66f4\u559c\u6b22\u54ea\u4e00\u6b3e\uff1f", ...
  ...
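To fetch more than one page, the offset argument can be stepped in increments of 20, which matches the count parameter sent with each request. The sketch below shows one way to do that with a thread pool from the standard library. The helper names page_offsets and crawl_pages are made up, the choice of ThreadPoolExecutor is an assumption rather than a requirement, and fake_fetch stands in for get_page_index so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 20  # matches the 'count' parameter of the request


def page_offsets(pages, page_size=PAGE_SIZE):
    """Offsets of the first `pages` result pages: 0, 20, 40, ..."""
    return [index * page_size for index in range(pages)]


def crawl_pages(fetch, keyword, pages, max_workers=4):
    """Call fetch(offset, keyword) for every page using a thread pool.

    The results are returned in offset order, whatever order the
    requests happen to finish in.
    """
    offsets = page_offsets(pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda offset: fetch(offset, keyword), offsets))


def fake_fetch(offset, keyword):
    """Stand-in for get_page_index(offset, keyword)."""
    return "{}@{}".format(keyword, offset)


# The keyword here is a placeholder; the real crawl would pass '街拍'.
print(crawl_pages(fake_fetch, "street-snap", 3))
```

Threads suit this job because each worker spends most of its time waiting on the network. Keep max_workers small so that the crawl does not flood the site with requests.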