Software environment:
gevent (1.2.2)
greenlet (0.4.12)
lxml (4.1.1)
pymongo (3.6.0)
pyOpenSSL (17.5.0)
requests (2.18.4)
Scrapy (1.5.0)
SQLAlchemy (1.2.0)
Twisted (17.9.0)
wheel (0.30.0)
1. Create the scrapy project
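The post does not show the project-creation command itself; a minimal sketch, assuming the project is named MyScrapy (the name referenced later in ITEM_PIPELINES):

scrapy startproject MyScrapy
cd MyScrapy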
2. Create the spider for the JD site. Enter the project directory and run:
scrapy genspider jd www.jd.com
This creates a .py file in the spiders directory named after the spider you specified: jd.py. This file is where you write the spider's request and response logic.
3. Configure jd.py
Analyze the URL pattern of JD's search page:
https://search.jd.com/Search?
Since the keyword may be Chinese, it has to be URL-encoded.
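A quick illustration of the encoding (the keyword value here is arbitrary):

from urllib.parse import urlencode

# A Chinese keyword is percent-encoded so it can be placed safely in the URL
print(urlencode({"keyword": "手机", "enc": "utf-8"}))
# keyword=%E6%89%8B%E6%9C%BA&enc=utf-8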
1. First write a start_requests method that sends the initial request and hands the result to the callback parse_index; the response passed to the callback is of type <class 'scrapy.http.response.html.HtmlResponse'>.
def start_requests(self):
    # Amazon-style example of such a search URL:
    # https://www.amazon.cn/s/ref=nb_sb_ss_i_1_6?field-keywords=macbook+pro
    # Build the qualifying URL, wrap it in scrapy.Request, and let the callback
    # parse_index handle it; the response is passed to the callback
    url = 'https://search.jd.com/Search?'
    # (When concatenating the Amazon example, field-keywords is not followed by '=')
    url += urlencode({"keyword": self.keyword, "enc": "utf-8"})
    yield scrapy.Request(url,
                         callback=self.parse_index,
                         )
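The spider methods shown here rely on a module header that the excerpt omits (imports for json, requests, urlencode and the item class), and on self.keyword, which is never defined in the snippets. A minimal sketch of how jd.py might open, assuming the project is named MyScrapy and the keyword is supplied as a spider argument:

import json
from urllib.parse import urlencode

import requests
import scrapy

from MyScrapy.items import JdItem


class JdSpider(scrapy.Spider):
    name = 'jd'
    # Kept broad so that search.jd.com and item.jd.com requests are not filtered as offsite
    allowed_domains = ['jd.com']

    # Scrapy turns -a command-line arguments into spider attributes,
    # so passing "-a keyword=..." is what makes self.keyword available in the methods.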
2. parse_index extracts all product detail-page URLs from the response, iterates over them, sends a request for each, and uses parse_detail as the callback to process the result, as sketched below.
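The original excerpt only shows parse_detail, so here is a minimal sketch of what parse_index could look like. The XPath for the result list is an assumption about JD's search-page markup, not taken from the post:

def parse_index(self, response):
    """
    Collect detail-page links from the search results and follow each one,
    handing the response to parse_detail.
    """
    # Assumed selector: each result item on the search page links to item.jd.com
    urls = response.xpath('//div[@id="J_goodsList"]//div[@class="p-img"]/a/@href').extract()
    for url in urls:
        # Listing links are protocol-relative (//item.jd.com/xxx.html), so let urljoin complete them
        yield scrapy.Request(response.urljoin(url), callback=self.parse_detail)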
def parse_detail(self, response):
    """
    Callback for parse_index: receives the detail-page response and parses it
    :param response:
    :return:
    """
    jd_url = response.url
    sku = jd_url.split('/')[-1].strip(".html")
    # The price is loaded via JSONP; its request URL can be found under the
    # script entries in the browser developer tools
    price_url = "https://p.3.cn/prices/mgets?skuIds=J_" + sku
    response_price = requests.get(price_url)
    # extraParam={"originid":"1"}  skuIds=J_3726834
    # This is the delivery-information URL, also fetched via JSONP. I have not yet
    # figured out how its parameters are generated, so a fixed value is used here;
    # if anyone knows, please let me know.
    express_url = "https://c0.3.cn/stock?skuId=3726834&area=1_72_4137_0&cat=9987,653,655&extraParam={%22originid%22:%221%22}"
    response_express = requests.get(express_url)
    response_express = json.loads(response_express.text)['stock']['serviceInfo'].split('>')[1].split('<')[0]
    title = response.xpath('//*[@class="sku-name"]/text()').extract_first().strip()
    price = json.loads(response_price.text)[0]['p']
    delivery_method = response_express
    # Save the needed fields into an Item in preparation for storage
    item = JdItem()
    item['title'] = title
    item['price'] = price
    item['delivery_method'] = delivery_method

    # Returning an Item lets the engine detect it and hand the value to the pipelines
    return item
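A small worked example of the SKU extraction above: for a detail URL like https://item.jd.com/3726834.html, split('/')[-1] gives '3726834.html' and .strip(".html") leaves '3726834'. Note that str.strip removes any of the characters '.', 'h', 't', 'm', 'l' from both ends rather than a literal suffix; that is harmless here because JD SKUs are numeric, but a suffix-safe variant (my own sketch) would be:

sku = jd_url.split('/')[-1]
if sku.endswith(".html"):
    # Remove the literal ".html" suffix instead of stripping individual characters
    sku = sku[:-len(".html")]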
4. Configure items.py
import scrapy


class JdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    delivery_method = scrapy.Field()
5. Configure pipelines.py
from pymongo import MongoClient


class MongoPipeline(object):
    """
    Pipeline used to save the data into MongoDB
    """

    def __init__(self, db, collection, host, port, user, pwd):
        """
        Save the database connection settings
        :param db: database name
        :param collection: collection (table) name
        :param host: the IP of the server
        :param port: the port of the server
        :param user: the username for login
        :param pwd: the password for login
        """
        self.db = db
        self.collection = collection
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd

    @classmethod
    def from_crawler(cls, crawler):
        """
        This classmethod is used to read the configuration from settings
        :param crawler:
        :return:
        """
        db = crawler.settings.get('DB')
        collection = crawler.settings.get('COLLECTION')
        host = crawler.settings.get('HOST')
        port = crawler.settings.get('PORT')
        user = crawler.settings.get('USER')
        pwd = crawler.settings.get('PWD')

        return cls(db, collection, host, port, user, pwd)

    def open_spider(self, spider):
        """
        Runs once when the spider starts
        :param spider:
        :return:
        """
        # Connect to the database
        self.client = MongoClient("mongodb://%s:%s@%s:%s" % (
            self.user,
            self.pwd,
            self.host,
            self.port
        ))

    def process_item(self, item, spider):
        """
        Store the data into the database
        :param item:
        :param spider:
        :return:
        """
        # Get the item data and convert it into a dict
        d = dict(item)
        # Do not save records that contain empty values
        if all(d.values()):
            # Save into MongoDB
            self.client[self.db][self.collection].save(d)
        return item

        # To discard the item so that later pipelines will not process it:
        # raise DropItem()
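For reference, here is a sketch of how the commented-out DropItem branch could be used to reject incomplete items explicitly instead of silently skipping them. DropItem comes from scrapy.exceptions (add "from scrapy.exceptions import DropItem" at the top of pipelines.py); this variant is my own, not from the original post:

def process_item(self, item, spider):
    d = dict(item)
    if not all(d.values()):
        # Discard items with empty fields; later pipelines will not process them
        raise DropItem("missing field in %r" % d)
    self.client[self.db][self.collection].save(d)
    return item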
6. Configuration file (settings.py)
# database server
DB = "jd"
COLLECTION = "goods"
HOST = "127.0.0.1"
PORT = 27017
USER = "root"
PWD = "123"

ITEM_PIPELINES = {
    'MyScrapy.pipelines.MongoPipeline': 300,
}
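With everything in place, the spider can be run from the project root. The keyword argument name matches self.keyword used in jd.py; the value here is arbitrary:

scrapy crawl jd -a keyword=手机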