Crawler: books (50 pages)

Posted by cfancy

The full spider code is shown below:

import scrapy
from bookspider.items import BooksItem


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/catalogue/page-1.html"]

    def parse(self, response):
        # response is a scrapy.http.response.html.HtmlResponse;
        # each <li> under ol.row is one book entry
        book_lis = response.xpath("//div/ol[@class='row']/li")
        for book_li in book_lis:
            book_name = book_li.xpath(".//h3/a/@title").get()
            book_price = book_li.xpath(
                ".//div[@class='product_price']/p[@class='price_color']/text()"
            ).get()
            yield BooksItem(book_name=book_name, book_price=book_price)

        # Follow the "next" pagination link until the last (50th) page
        next_url = response.xpath("//ul[@class='pager']/li[@class='next']/a/@href").get()
        if next_url:
            # The href is relative (e.g. "page-2.html"), so resolve it
            # against the current page URL before building the new Request
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

        # Note: an earlier attempt concatenated a base URL with next_url
        # ("http://books.toscrape.com" + "page-2.html") and raised errors:
        # plain string concatenation drops the "/catalogue/" path segment
        # and the separating slash, producing a malformed URL.
        # response.urljoin() resolves the relative href correctly.
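
The spider imports BooksItem from bookspider/items.py, which the post does not show. Below is a minimal sketch of what that item class plausibly looks like, with the two field names inferred from the keyword arguments used in parse():

# bookspider/items.py (sketch; the original file is not shown in the post)
import scrapy


class BooksItem(scrapy.Item):
    book_name = scrapy.Field()   # value of the title attribute on h3 > a
    book_price = scrapy.Field()  # text of p.price_color, e.g. "£51.77"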
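
With the item defined, the spider would normally be run from the project root with `scrapy crawl books -o books.json`. The snippet below is an equivalent script-based sketch using Scrapy's CrawlerProcess; the import path bookspider.spiders.books is an assumption about the project layout:

# run_books.py: script equivalent of `scrapy crawl books -o books.json`
from scrapy.crawler import CrawlerProcess

from bookspider.spiders.books import BooksSpider  # assumed module path

process = CrawlerProcess(settings={
    "FEEDS": {"books.json": {"format": "json", "encoding": "utf8"}},
})
process.crawl(BooksSpider)
process.start()  # blocks until all 50 pages have been crawled

Each yielded BooksItem becomes one JSON object in books.json, e.g. {"book_name": "A Light in the Attic", "book_price": "£51.77"}.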