Scraping Maoyan Top 100 Movie Information

Posted 2020-10-09

1. Open maoyan.com in Google Chrome and click the Top 100 board.

2. Look at how the URL changes between pages and build the page URL: url = 'http://maoyan.com/board/4?offset=' + str(offset)

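3. Open the developer tools and inspect the ranked movie entries. For each movie we want to scrape the rank (index), image URL, title, actors, release time, and score.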

4. Fetch the HTML of the first page.

import requests
from requests.exceptions import RequestException

# Send a browser-like User-Agent with every request.
user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36'
              ' (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
headers = {'User-Agent': user_agent}

def get_one_page(url):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

5. Parse the page and extract the data.

from bs4 import BeautifulSoup

def parse_one_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # Each <dd> element holds one movie entry.
    items = soup.select('dd')
    if items:
        for item in items:
            yield {
                'index': item.find('i').text,
                'image': item.find('img', class_="board-img").get('data-src'),
                'title': item.find('p').text,
                # [3:] and [5:] drop the fixed-length label at the start of each field.
                'actor': item.find('p', class_="star").text.strip()[3:],
                'time': item.find('p', class_="releasetime").text.strip()[5:],
                # The score is split into an integer part and a fraction part.
                'score': item.find('i', class_="integer").text + item.find('i', class_="fraction").text
            }
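The fixed slices `[3:]` and `[5:]` only work while the labels keep their exact length. Here is a minimal sketch of a more defensive alternative. It assumes the fields start with the labels "主演：" and "上映时间：" (inferred from the slice lengths, not confirmed in the post), and `strip_label` is a hypothetical helper name, not part of the original code.

```python
def strip_label(text, label):
    """Remove a leading label such as '主演：' if it is present."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):]
    # Leave the text unchanged when the label is missing.
    return text

# Sample strings are made up for illustration.
actor = strip_label("\n  主演：张国荣,张丰毅,巩俐\n", "主演：")
release = strip_label("上映时间：1993-01-01", "上映时间：")
print(actor)
print(release)
```

Unlike a fixed slice, this leaves a field untouched if the page ever drops or changes the label, instead of silently cutting off real data.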

6. The crawler's main function.

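def main(offset):
    # Include the 'offset=' query parameter, matching the URL from step 2.
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # get_one_page returns None on failure, so there is nothing to parse.
        return
    for item in parse_one_page(html):
        print(item)
        # write_to_file is not shown in this post.
        write_to_file(item)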
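The post calls `write_to_file` without showing it. One possible version is sketched below, assuming each record is appended as one JSON line to a text file. The default file name `result.txt` and the `path` parameter are assumptions for illustration; the author's actual implementation may differ.

```python
import json

def write_to_file(item, path="result.txt"):
    """Append one movie record to a text file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable in the file.
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Writing one JSON object per line means the file can be read back record by record with `json.loads` on each line.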

7. Run the crawler with multiple processes.

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    # Offsets 0, 10, ..., 90: one call to main per page.
    pool.map(main, [i*10 for i in range(10)])
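To see which pages the pool ends up requesting, the offsets can be expanded into full URLs. This is a small sketch that assumes ten movies per page (inferred from the `i*10` step and the Top 100 total); `page_urls` and `BASE_URL` are names introduced here for illustration.

```python
BASE_URL = 'http://maoyan.com/board/4?offset='

def page_urls(pages=10, page_size=10):
    """Build the URL of every board page from its offset."""
    offsets = [i * page_size for i in range(pages)]
    return [BASE_URL + str(offset) for offset in offsets]

urls = page_urls()
print(urls[0])
print(urls[-1])
```

Printing the first and last URL is a quick check that the offset range covers the whole board before starting any requests.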

Full code: https://github.com/huazhicai/Spider/tree/master/maoyantop
