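Scrapy Study Notes
These notes use the 1919 website as an example and walk through creating a Scrapy project that crawls an entire site's data.
Creating a Scrapy project
In any directory, run the command:
scrapy startproject OneNine (OneNine is the project name)
This produces the following directory layout: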
OneNine/
    scrapy.cfg            # deployment configuration file
    OneNine/              # the project's Python module; all your code goes here
        __init__.py
        items.py          # Item definitions
        pipelines.py      # pipeline definitions
        settings.py       # project settings
        spiders/          # all spiders go in this folder
            __init__.py
            ...
Next, create a spider file in which to write the crawling rules:
cd OneNine
scrapy genspider onenine onenine.com
A file named onenine.py is now generated in the spiders folder. The crawling rules go in this file.
Defining the Item
In items.py, declare the fields we want to scrape.
import scrapy


class OnenineItem(scrapy.Item):
    url = scrapy.Field()
    good_name = scrapy.Field()
    actual_price = scrapy.Field()
    details = scrapy.Field()
    year = scrapy.Field()
    month = scrapy.Field()
    plateform = scrapy.Field()
    cat_lv_one = scrapy.Field()
    cat_lv_two = scrapy.Field()
    shop_id = scrapy.Field()
    shop_name = scrapy.Field()
    shop_area = scrapy.Field()
    shop_province = scrapy.Field()
    shop_city = scrapy.Field()
    good_id = scrapy.Field()
    brand = scrapy.Field()
    size = scrapy.Field()
    percent = scrapy.Field()
    country = scrapy.Field()
    area = scrapy.Field()
    type = scrapy.Field()
    grape_type = scrapy.Field()
    num = scrapy.Field()
    name_price = scrapy.Field()
    bottle_price = scrapy.Field()
    comments = scrapy.Field()
    accumulate_sales = scrapy.Field()
    month_sales = scrapy.Field()
    month_bottle_sales = scrapy.Field()
    month_sale_amounts = scrapy.Field()
Fields declared with scrapy.Field() can later be exported directly to the file format you need, such as CSV or JSON.
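One way to get that export, sketched here as an assumption about how the project might be configured rather than as the setup used in these notes, is Scrapy's feed export settings in settings.py. The FEEDS setting is available in recent Scrapy versions. The output file name onenine.csv and the chosen column subset are illustrative.

```python
# settings.py (fragment)

# Write every scraped item to a CSV file in the directory the crawl runs from.
# The file name "onenine.csv" is an illustrative choice.
FEEDS = {
    "onenine.csv": {
        "format": "csv",
        "encoding": "utf8",
    },
}

# Optional: fix which fields are exported and in what column order.
# Only a few of the OnenineItem fields are listed here as an example.
FEED_EXPORT_FIELDS = ["url", "good_name", "actual_price"]
```

The same result can be requested for a single run from the command line with scrapy crawl onenine -o onenine.csv, where the format is inferred from the file extension.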
The spider file
The spider file is where we write the crawling rules for the site.