Crawler project case study, case 2: locating page elements, light processing, and scraping the data (with summary)

Posted 2021-03-26 jxxgg


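1. scrapy shell [URL to crawl]
It gives direct feedback on whether the element you want to locate can actually be found.
2. Once the shell is open, enter a statement such as:
response.xpath('//*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a/text()').extract()
If it returns a value, the element can be located.
yield: similar to return, except that it does not end the function. It hands back one value and resumes from the same point when the next value is requested.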
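
To make the remark about yield concrete, here is a minimal sketch in plain Python with no Scrapy involved. The function names and the sample links are invented for illustration.

```python
def first_only(values):
    # return ends the function on the first pass through the loop
    for value in values:
        return value


def one_by_one(values):
    # yield hands back one value, then resumes here on the next request
    for value in values:
        yield value


# Hypothetical links, made up for this example
links = ["/a.html", "/b.html", "/c.html"]

print(first_only(links))        # /a.html
print(list(one_by_one(links)))  # ['/a.html', '/b.html', '/c.html']
```

This is why a spider callback that loops over several links uses yield: with return, only the first request would ever be produced.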

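The overall process is as follows:
1. cd part6 (switch to the directory that will hold the project)
scrapy startproject [name1]
cd [name1]
scrapy genspider stock [address]   (stock is the name of the spider)
2. Once stock.py exists, open the target page in a browser, pick out the data you need, and click Copy XPath.
With the XPath in hand, run scrapy shell [URL to crawl] in a terminal to enter the Scrapy shell, then try a statement such as response.xpath('//*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a/text()').extract() to look up the data. Once the data comes back, write the spider in stock.py. The code is as follows: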
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
import re
from stock_spider.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    # allowed_domains takes bare domain names, without a scheme or trailing slash
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com/']

    def parse(self, response):
        # Collect every link on the start page and follow it to its detail page
        post_urls = response.xpath("//a/@href").extract()
        for post_url in post_urls:
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        stock_item = StockItem()
        # Names of the board members
        stock_item["names"] = self.get_tc(response)
        # Gender information
        stock_item["sexes"] = self.get_sex(response)
        # Age information
        stock_item["ages"] = self.get_age(response)
        # Stock code
        stock_item["codes"] = self.get_code(response)
        # Position information, cut to the number of names found
        stock_item["leaders"] = self.get_leader(response, len(stock_item["names"]))
        # File-storage logic: the item is handed on to the pipelines
        yield stock_item

    def get_tc(self, response):
        tc_names = response.xpath('//*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a/text()').extract()
        return tc_names

    def get_sex(self, response):
        # //*[@id="ml_001"]/table/tbody/tr[2]/td[1]/div/table/thead/tr[2]/td[1]
        infos = response.xpath('//*[@class="intro"]/text()').extract()
        sex_list = []
        for info in infos:
            try:
                # Inside [] a "|" is a literal character, so the class is just [男女]
                sex = re.findall("[男女]", info)[0]
                sex_list.append(sex)
            except IndexError:
                # No match in this text node: skip it
                continue
        return sex_list

    def get_age(self, response):
        infos = response.xpath('//*[@class="intro"]/text()').extract()
        age_list = []
        for info in infos:
            try:
                # \d+ matches a run of digits (a bare "d+" would match the letter d)
                age = re.findall(r"\d+", info)[0]
                age_list.append(age)
            except IndexError:
                continue
        return age_list

    def get_code(self, response):
        infos = response.xpath('/html/body/div[3]/div[1]/div[2]/div[1]/h1/a/@title').extract()
        code_list = []
        for info in infos:
            try:
                code = re.findall(r"\d+", info)[0]
                code_list.append(code)
            except IndexError:
                # "except ():" catches nothing, so IndexError is named explicitly
                continue
        return code_list

    def get_leader(self, response, length):
        tc_leaders = response.xpath('//*[@class="tl"]/text()').extract()
        tc_leaders = tc_leaders[0:length]
        return tc_leaders
--------------------------------------------------------------------------------
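
The two pieces of plain-Python logic inside the spider, resolving a relative link and pulling the first regex match out of a text fragment, can be tried without Scrapy. The following is a minimal sketch. The helper names, the sample intro string, and the path 600000.html are invented for illustration and do not come from the real site.

```python
import re
from urllib import parse


def absolute_url(page_url, href):
    # Resolve a relative href against the URL of the page it was found on
    return parse.urljoin(page_url, href)


def first_match(pattern, text):
    # Return the first regex match in text, or None when nothing matches
    found = re.findall(pattern, text)
    return found[0] if found else None


# Hypothetical sample text, invented for this example
intro = "张三，男，52岁，本科学历"

print(first_match("[男女]", intro))        # 男
print(first_match(r"\d+", intro))          # 52
print(first_match(r"\d+", "no digits"))    # None

# Why the patterns need care: "d+" matches the letter d, not digits,
# and a "|" inside [] is matched as a literal character.
print(re.findall("d+", "add 52"))          # ['dd']
print(re.findall("[男|女]", "a|b"))         # ['|']

print(absolute_url("http://pycs.greedyai.com/", "600000.html"))
# http://pycs.greedyai.com/600000.html
```

Returning None instead of indexing with [0] is one alternative to the try/except IndexError used in the spider. Either way a text node without a match is skipped rather than crashing the crawl.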
3. Once the spider is written, it is launched from main.py. The code is as follows:
--------------------------------------------------------------------------------
from scrapy.cmdline import execute
import sys
import os

# A setup used for debugging: put the directory containing this file on the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# execute(["scrapy", "crawl", "tonghuashun"])
execute(["scrapy", "crawl", "stock"])
# The first two arguments are fixed; the last one is the spider name you created
--------------------------------------------------------------------------------
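
The list passed to execute mirrors what would be typed on the command line. The sketch below illustrates that correspondence with the standard library only. The helper crawl_argv is a hypothetical name used for this example and is not part of Scrapy.

```python
import shlex


def crawl_argv(spider_name):
    # "scrapy" and "crawl" are fixed; only the spider name changes
    return ["scrapy", "crawl", spider_name]


print(crawl_argv("stock"))
# ['scrapy', 'crawl', 'stock']

# The same list is what a shell would produce from the typed command
print(crawl_argv("stock") == shlex.split("scrapy crawl stock"))
# True
```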
4. With main.py in place, put the following code in items.py. items.py declares which data is to be scraped, and it is what ties the items to stock.py.
--------------------------------------------------------------------------------
import scrapy


class StockSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class StockItem(scrapy.Item):
    # Note: the field names must match the keys used in stock.py
    names = scrapy.Field()
    sexes = scrapy.Field()
    ages = scrapy.Field()
    codes = scrapy.Field()
    leaders = scrapy.Field()
--------------------------------------------------------------------------------
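
The reason the names must match is that a Scrapy Item only accepts keys that were declared as fields; assigning to any other key raises KeyError. The sketch below imitates that behaviour with a small dict subclass so that it runs without Scrapy. FieldCheckedDict is a stand-in written for this example, not Scrapy's Item class.

```python
class FieldCheckedDict(dict):
    """Stand-in for illustration only; this is not scrapy.Item."""

    # The same field names that StockItem declares
    fields = ("names", "sexes", "ages", "codes", "leaders")

    def __setitem__(self, key, value):
        # Refuse any key that was not declared up front
        if key not in self.fields:
            raise KeyError(f"unsupported field: {key}")
        super().__setitem__(key, value)


item = FieldCheckedDict()
item["names"] = ["example name"]      # declared field: accepted

try:
    item["name"] = ["example name"]   # typo for "names": rejected
    rejected = False
except KeyError:
    rejected = True

print(rejected)  # True
```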
5. Next, write the processing methods in pipelines.py. The main role of pipelines.py is to define the classes that process the scraped data.
--------------------------------------------------------------------------------
class StockSpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class StockPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
--------------------------------------------------------------------------------
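
StockPipeline only prints the item, which holds five parallel lists. As one possible next step, a pipeline could regroup those lists into one record per person. The following is a sketch under assumptions: RowPrintingPipeline and to_rows are hypothetical names, the sample values are invented, and a plain dict stands in for the Scrapy item.

```python
def to_rows(item):
    """Regroup the parallel lists in an item into one dict per person."""
    codes = item["codes"]
    code = codes[0] if codes else None
    rows = []
    # zip stops at the shortest list, so a short list shortens the output
    for name, sex, age, leader in zip(
        item["names"], item["sexes"], item["ages"], item["leaders"]
    ):
        rows.append(
            {"code": code, "name": name, "sex": sex, "age": age, "leader": leader}
        )
    return rows


class RowPrintingPipeline:
    """Hypothetical pipeline, not part of the original project."""

    def process_item(self, item, spider):
        for row in to_rows(item):
            print(row)
        # Returning the item lets any later pipeline receive it too
        return item


# Invented sample data; a plain dict stands in for StockItem
sample = {
    "names": ["Person A", "Person B"],
    "sexes": ["男", "女"],
    "ages": ["52", "47"],
    "codes": ["600000"],
    "leaders": ["Chairman", "Director"],
}

RowPrintingPipeline().process_item(sample, spider=None)
```

One caveat: get_sex and get_age skip text nodes that have no match, so the lists can differ in length or drift out of alignment. A real pipeline would probably need to check for that before pairing the values up.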
6. Finally, uncomment ITEM_PIPELINES in settings.py and add your own class to it. The code is as follows:
--------------------------------------------------------------------------------
ITEM_PIPELINES = {
    'stock_spider.pipelines.StockSpiderPipeline': 300,
    'stock_spider.pipelines.StockPipeline': 1,
}
--------------------------------------------------------------------------------
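
The numbers in ITEM_PIPELINES set the order in which the pipelines run: items pass through the classes from the lowest value to the highest. A small sketch of the resulting order, using only the dictionary above and plain Python sorting:

```python
ITEM_PIPELINES = {
    "stock_spider.pipelines.StockSpiderPipeline": 300,
    "stock_spider.pipelines.StockPipeline": 1,
}

# Sort the class paths by their number, lowest first
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
# 1 stock_spider.pipelines.StockPipeline
# 300 stock_spider.pipelines.StockSpiderPipeline
```

With these values StockPipeline sees each item first and prints it, and StockSpiderPipeline receives whatever StockPipeline returns.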

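Summary: a few issues came up while writing the code, for example:
1. The extraction code needs exception handling, because re.findall(...)[0] raises IndexError when nothing matches.

2. stock.py has to import the item class from items.py, and all of the collected information is then stored in an instance of that class.

3. Every helper method has to take response as a parameter.

4. The title attribute is already present in the page when it is fetched, so it can be read directly with @title.

5. In most cases the value is taken with /text().

6. For the return value, stock_item is handed back with yield at the end.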
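
Points 4 and 5 come down to the difference between an attribute of an element and the text inside it. The sketch below shows that difference with the standard library's xml.etree.ElementTree rather than Scrapy's selectors, so it runs on its own. The HTML fragment is invented for illustration. In the spider the equivalents are an XPath ending in /@title and one ending in /text().

```python
import xml.etree.ElementTree as ET

# Invented fragment; the real page layout may differ
page = ET.fromstring(
    '<div><h1><a title="Example Co. (600000)" href="/600000.html">'
    "Example Co.</a></h1></div>"
)

link = page.find(".//h1/a")

# Attribute value: what an XPath ending in /@title selects
print(link.get("title"))  # Example Co. (600000)

# Text inside the element: what an XPath ending in /text() selects
print(link.text)          # Example Co.
```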
