使用scrapy进行股票数据爬虫

Posted 2020-10-07

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了使用scrapy进行股票数据爬虫相关的知识，希望对你有一定的参考价值。

周末了解了scrapy框架，对上次使用requests+bs4+re进行股票爬虫（http://www.cnblogs.com/wyfighting/p/7497985.html）的代码，使用scrapy进行了重写。

目录结构：

技术分享

stocks.py文件代码

 1 # -*- coding: utf-8 -*-
 2 import scrapy
 3 import re
 4  
 5  
 6 class StocksSpider(scrapy.Spider):
 7     name = "stocks"
 8     start_urls = [‘http://quote.eastmoney.com/stocklist.html‘]
 9  
10     def parse(self, response):
11         for href in response.css(‘a::attr(href)‘).extract():
12             try:
13                 stock = re.findall(r"[s][hz]\\d{6}", href)[0]
14                 url = ‘https://gupiao.baidu.com/stock/‘ + stock + ‘.html‘
15                 yield scrapy.Request(url, callback=self.parse_stock)
16             except:
17                 continue
18  
19     def parse_stock(self, response):
20         infoDict = {}
21         stockInfo = response.css(‘.stock-bets‘)
22         name = stockInfo.css(‘.bets-name‘).extract()[0]
23         keyList = stockInfo.css(‘dt‘).extract()
24         valueList = stockInfo.css(‘dd‘).extract()
25         for i in range(len(keyList)):
26             key = re.findall(r‘>.*</dt>‘, keyList[i])[0][1:-5]
27             try:
28                 val = re.findall(r‘\\d+\\.?.*</dd>‘, valueList[i])[0][0:-5]
29             except:
30                 val = ‘--‘
31             infoDict[key]=val
32  
33         infoDict.update(
34             {‘股票名称‘: re.findall(‘\\s.*\\(‘,name)[0].split()[0] + 35              re.findall(‘\\>.*\\<‘, name)[0][1:-1]})
36         yield infoDict

pipelines.py文件代码：

 1 # -*- coding: utf-8 -*-
 2  
 3 # Define your item pipelines here
 4 #
 5 # Don‘t forget to add your pipeline to the ITEM_PIPELINES setting
 6 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
 7  
 8  
 9 class BaidustocksPipeline(object):
10     def process_item(self, item, spider):
11         return item
12  
13 class BaidustocksInfoPipeline(object):
14     def open_spider(self, spider):
15         self.f = open(‘BaiduStockInfo.txt‘, ‘w‘)
16  
17     def close_spider(self, spider):
18         self.f.close()
19  
20     def process_item(self, item, spider):
21         try:
22             line = str(dict(item)) + ‘\\n‘
23             self.f.write(line)
24         except:
25             pass
26         return item

settings.py文件中被修改的区域：

1 # Configure item pipelines
2 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
3 ITEM_PIPELINES = {
4     ‘BaiduStocks.pipelines.BaidustocksInfoPipeline‘: 300,
5 }

以上是关于使用scrapy进行股票数据爬虫的主要内容，如果未能解决你的问题，请参考以下文章

Scrapy 抓取股票行情

scrapy按顺序启动多个爬虫代码片段(python3)

scrapy主动退出爬虫的代码片段(python3)

scrapy框架爬取糗事百科

使用scrapy爬取百度股票

爬取股票scrapy