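Simple Crawler, Part 2: Using Multiple Processes to Scrape Novel Titles, Authors, and Other Basic Information from 60,000+ Pages of Qidian (起点中文网) and Save Them to CSV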

Posted tiandi-fun


Environment and libraries: Python 3.6 + requests + bs4 + csv + multiprocessing

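About the libraries

  • requests: sends HTTP requests to the server, the way a browser would
  • bs4: parses the page and locates the specific content we need (the code below uses the lxml parser, so lxml must be installed as well)
  • csv: writes the scraped content to a CSV file
  • multiprocessing: runs the crawl in multiple processes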

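1. Prepare the URLs

A list page on Qidian (起点中文网) has a URL like https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=2. Changing the number at the end changes the page, and the home page shows that there are 61732 pages in total. The statement urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(k) for k in range(1, 61733)] builds a list of all the page links, which the worker processes use later. The upper bound is 61733 because range excludes its end value.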
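As a minimal sketch of the URL construction described above, the helper below builds the list from a page count. The function name build_urls and the constant BASE_URL are illustrative and not part of the original script; the page total is the figure quoted above.

```python
# Everything in the list-page URL except the trailing page number
BASE_URL = ('https://www.qidian.com/all?orderId=&style=1&pageSize=20'
            '&siteid=1&pubflag=0&hiddenField=0&page=')


def build_urls(last_page):
    """Return the list-page URLs for pages 1..last_page inclusive (hypothetical helper)."""
    # range() excludes its end value, hence last_page + 1
    return [BASE_URL + str(k) for k in range(1, last_page + 1)]


urls = build_urls(61732)
print(len(urls))
print(urls[-1])
```

Deriving the range bound from the page count in one place avoids an off-by-a-few mistake when the total is typed by hand.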


2. Fetch the page with requests and parse its content with bs4

    # Fetch one list page; headers carries the User-Agent defined in the complete code
    html = requests.get(url, headers=headers)
    selector = BeautifulSoup(html.text, 'lxml')
    # Each select() call returns one list per field, covering every book on the page
    names = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > h4 > a')
    writers = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > a.name')
    sign1s = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > a:nth-child(4)')
    sign2s = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > a.go-sub-type')
    types = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > span')
    traductions = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.intro')
    words = selector.select(
        'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.update > span > span')
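The selectors above collect each field into a separate page-wide list, so the lists line up only if every book has every field. As a hedged alternative sketch, the code below selects each div.book-mid-info block first and reads the fields inside it, which keeps one book's fields together. The sample HTML is a hand-written fragment that only imitates the class names used in the selectors above, not real Qidian markup, and the names text_of and parse_books are made up for this illustration. It uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the class names in the selectors; not real site markup
SAMPLE_HTML = """
<ul>
  <li><div class="book-mid-info">
    <h4><a href="#">Book A</a></h4>
    <p class="author"><a class="name" href="#">Author A</a><span>Ongoing</span></p>
    <p class="intro">Intro A</p>
  </div></li>
  <li><div class="book-mid-info">
    <h4><a href="#">Book B</a></h4>
    <p class="author"><a class="name" href="#">Author B</a></p>
    <p class="intro">Intro B</p>
  </div></li>
</ul>
"""


def text_of(parent, css):
    """Return the stripped text of the first match under parent, or '' if absent."""
    tag = parent.select_one(css)
    return tag.get_text(strip=True) if tag else ''


def parse_books(html):
    """Return one (title, author, status, intro) tuple per book block."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        (text_of(book, 'h4 > a'),
         text_of(book, 'p.author > a.name'),
         text_of(book, 'p.author > span'),
         text_of(book, 'p.intro'))
        for book in soup.select('div.book-mid-info')
    ]


for row in parse_books(SAMPLE_HTML):
    print(row)
```

In this fragment the second book has no span, so its status comes back as an empty string instead of shifting the values of later books.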

3. Save the information to a CSV file

    head = ['title', 'author', 'sign1', 'sign2', 'type', 'traduction', 'words']
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    f = open('_06_qidian.csv', 'a+', newline='', encoding='utf-8')
    csv_writer = csv.writer(f)
    csv_writer.writerow(head)
    # names, writers, ... are the lists selected in step 2
    for info in range(len(names)):
        csv_writer.writerow((names[info].get_text(), writers[info].get_text(), sign1s[info].get_text(), sign2s[info].get_text(), types[info].get_text(), traductions[info].get_text(), words[info].get_text()))
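Because the file is opened in append mode, the header row is written again on every run. One way to avoid that is to write the header only when the file is new or empty; the sketch below assumes a helper named append_rows that is not part of the original script. The file name and sample rows are placeholders.

```python
import csv
import os
import tempfile

HEAD = ['title', 'author', 'sign1', 'sign2', 'type', 'traduction', 'words']


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(HEAD)
        writer.writerows(rows)


# Demonstration with placeholder data in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
append_rows(demo_path, [('t1', 'a1', 's1', 's2', 'done', 'intro1', '10')])
append_rows(demo_path, [('t2', 'a2', 's1', 's2', 'ongoing', 'intro2', '20')])

with open(demo_path, newline='', encoding='utf-8') as f:
    lines = list(csv.reader(f))
print(lines)
```

Calling the helper twice leaves a single header row followed by both data rows.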

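4. Finally, crawl at full speed with multiple processes. The number of processes here equals the number of CPU cores reported by multiprocessing.cpu_count().

    pool = Pool(processes=multiprocessing.cpu_count())
    # Apply get_info to every URL, spreading the calls across the worker processes
    pool.map(get_info, urls)
    pool.close()
    pool.join()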
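As a self-contained sketch of the same pool pattern, the example below replaces the network call with a stand-in worker so it can run offline. The names fake_get_info and crawl are invented for this illustration, and the demo URLs are shortened placeholders. Each worker returns its result to the parent process instead of writing to a shared file, and imap yields the results in input order.

```python
import multiprocessing
from multiprocessing import Pool


def fake_get_info(url):
    """Stand-in for the real worker: returns the page number parsed from the URL."""
    return int(url.rsplit('page=', 1)[1])


def crawl(urls, processes=None):
    """Run fake_get_info over urls in a process pool and return results in order."""
    processes = processes or multiprocessing.cpu_count()
    # The with-block shuts the pool down once all results have been consumed
    with Pool(processes=processes) as pool:
        return list(pool.imap(fake_get_info, urls))


if __name__ == '__main__':
    # Shortened placeholder URLs, used only to exercise the pool
    demo_urls = ['https://www.qidian.com/all?page=' + str(k) for k in range(1, 6)]
    print(crawl(demo_urls, processes=2))
```

The if __name__ == '__main__' guard matters on platforms where child processes are started by spawning a fresh interpreter (the default on Windows and macOS), because the module is imported again in each child.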

5. Complete code

    import csv
    import multiprocessing
    from multiprocessing import Pool

    import requests
    from bs4 import BeautifulSoup

    # Defined at module level so worker processes can see it under any start method
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }


    def get_info(url):
        """Fetch one list page and return its books as a list of row tuples."""
        print(url)
        html = requests.get(url, headers=headers)
        selector = BeautifulSoup(html.text, 'lxml')
        names = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > h4 > a')
        writers = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > a.name')
        sign1s = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > a:nth-child(4)')
        sign2s = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > a.go-sub-type')
        types = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.author > span')
        traductions = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.intro')
        words = selector.select(
            'body > div.wrap > div.all-pro-wrap.box-center.cf > div.main-content-wrap.fl > div.all-book-list > div > ul > li > div.book-mid-info > p.update > span > span')
        # zip stops at the shortest list, so a short list cannot raise IndexError
        return [tuple(tag.get_text() for tag in row)
                for row in zip(names, writers, sign1s, sign2s, types, traductions, words)]


    if __name__ == '__main__':
        head = ['title', 'author', 'sign1', 'sign2', 'type', 'traduction', 'words']
        # 61732 pages in total; range excludes its end value
        urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(k) for k in range(1, 61733)]
        with open('_06_qidian.csv', 'a+', newline='', encoding='utf-8') as f:
            csv_writer = csv.writer(f)
            csv_writer.writerow(head)
            with Pool(processes=multiprocessing.cpu_count()) as pool:
                # Only the parent process writes to the file, so rows from different
                # workers cannot interleave; imap yields each page's rows in URL order
                # while the crawl is still running
                for rows in pool.imap(get_info, urls):
                    csv_writer.writerows(rows)
