Exporting Nested JSON with Scrapy

Posted by zh672903


How can Scrapy export JSON with a structure like the following?

[
{
    "pingPai": ["ALPINA"],
    "carTypes": [{
        "carType": ["ALPINA"],
        "carNames": {
            "carName": ["ALPINA B4",
            "ALPINA B3",
            "ALPINA D5",
            "ALPINA B7",
            "ALPINA XD3"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
},
{
    "pingPai": ["ABT"],
    "carTypes": [{
        "carType": ["ABT"],
        "carNames": {
            "carName": ["ABT A3",
            "ABT A5",
            "ABT TT"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
}
]

Core idea

Exporting nested JSON from Scrapy comes down to manipulating lists.

Background

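  • To concatenate several Python lists, use "+":
list1 = [1, 2, 3]
list2 = [5, 6, 7]
print(list1 + list2)
# Output:
# [1, 2, 3, 5, 6, 7]
  • To merge several Python dicts, use update() (other approaches exist; only update() is covered here, and it is not actually used in this article):
dic1 = {'a': '1', 'b': '2'}
dic2 = {'c': '3', 'd': '4'}
dic1.update(dic2)
print(dic1)
# Output:
# {'a': '1', 'b': '2', 'c': '3', 'd': '4'}
  • In Scrapy, the extracted result of an XPath match is always a list.
    For example, when XPath matches one row the result is "[X]", where X is the matched value.
    When XPath matches several rows the result is "[X, Y, Z, ...]".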
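
The list behaviour described in the last point can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (it supports only a subset of XPath), and the markup is a made-up fragment, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration; the real page markup may differ.
SAMPLE = """<dl id="demo">
  <dt><div><a>ALPINA</a></div></dt>
  <dd><ul>
    <li><h4><a>ALPINA B4</a></h4></li>
    <li><h4><a>ALPINA B3</a></h4></li>
  </ul></dd>
</dl>"""

root = ET.fromstring(SAMPLE)


def texts(path):
    """Return the text of every element matched by path, always as a list."""
    return [node.text for node in root.findall(path)]


one_match = texts("./dt/div/a")          # a single hit is still a list
many_matches = texts("./dd/ul/li/h4/a")  # several hits, in document order
no_match = texts("./dt/span")            # no hit gives an empty list

print(one_match)
print(many_matches)
print(no_match)
```

Because even a single hit comes back wrapped in a list, fields such as "pingPai" end up as one-element lists in the exported JSON.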

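Solution

Looking at the nested JSON above, the top-level keys are the three fields "pingPai", "carTypes", and "picUrl". Because a Scrapy item declares its fields in items.py, we only need to define these three top-level fields.
Open items.py and add the following code: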

import scrapy


class CarModelItem(scrapy.Item):
    pingPai = scrapy.Field()  # brand
    carTypes = scrapy.Field()  # car types (manufacturer lines) under the brand
    picUrl = scrapy.Field()  # brand logo image

To produce nested JSON, we only need to nest further data inside the carTypes list, that is, make carTypes a list whose elements are dicts ("{}").
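
As a rough illustration of why this is enough: once carTypes holds dicts, any JSON serializer emits nested objects. The sketch below uses a plain dict in place of the Scrapy item and the standard json module in place of Scrapy's exporter; both substitutions are assumptions made so that it runs on its own.

```python
import json

# Plain dict standing in for a CarModelItem (assumption: same three fields).
item = {
    "pingPai": ["ABT"],
    "carTypes": [],
    "picUrl": "",  # left empty here; the spider fills in the logo URL
}

# Appending a dict to the list creates the second level;
# the dict stored under "carNames" creates the third.
item["carTypes"] += [
    {"carType": ["ABT"], "carNames": {"carName": ["ABT A3", "ABT A5", "ABT TT"]}}
]

text = json.dumps(item, ensure_ascii=False, indent=4)
print(text)
```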
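For example, suppose we want to crawl the page at "https://www.autohome.com.cn/grade/carhtml/A.html". Opened in a browser, it looks like this:

[image]

In the page source, the image and the first-level node are inside "dl/dt", as shown below:

[image]

The second-level and third-level nodes are inside "dl/dd", as shown below:

[image]

This is the tricky part. The structure we need here is a list that contains further entries whose values are dicts.
Here is the code: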

import scrapy

# Assumes the default Scrapy project layout, with items.py one package above spiders/.
from ..items import CarModelItem


class GetcarmodelSpider(scrapy.Spider):
    name = "GetCarModel"
    allowed_domains = ["www.autohome.com.cn"]
    # Only "A" is enabled; uncomment the other letters to crawl every index page.
    chars = [
        "A",
        # "B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N",
        # "O", "P", "Q", "R", "S", "T", "W", "X", "Y", "Z",
    ]
    start_urls = [
        "https://www.autohome.com.cn/grade/carhtml/%s.html" % i2 for i2 in chars
    ]

    def parse(self, response):
        # Each <dl id=...> block holds one brand.
        dtArray = response.xpath("//dl[@id]")
        for dt in dtArray:
            pingPai = dt.xpath("./dt/div/a/text()").extract()
            pingPaiPicArr = dt.xpath("./dt/a/img/@src").extract()
            pingPaiPic = ""
            # There is actually only one image per brand.
            for cti in pingPaiPicArr:
                # Resolve the image src against the page URL.
                pingPaiPic = response.urljoin(cti)

            carTypesTemp = dt.xpath("./dd/div[@class='h3-tit']")
            carTypes = []
            for pp in carTypesTemp:
                print(">>>>>>>>>>>>>>>>>>>>>>", pp.xpath("./a/text()").extract())
                # "carNames" starts as an empty dict and is filled in below.
                carTypes += [
                    {"carType": pp.xpath("./a/text()").extract(), "carNames": {}}
                ]

            # Collect the model names.
            carNameArray = dt.xpath("./dd/ul[@class='rank-list-ul']")
            carNames = []
            for cn in carNameArray:
                # Build a list of dicts, so the entry from the X-th pass is carNames[X].
                carNames += [{"carName": cn.xpath("./li/h4/a/text()").extract()}]
                print(".......", [{"carName": cn.xpath("./li/h4/a/text()").extract()}])

            # Attach the i-th name group to the i-th car type.
            for i in range(len(carTypes)):
                try:
                    carTypes[i]["carNames"] = carNames[i]
                except Exception as e:
                    print(e)

            print("pingPai:", pingPai)
            print("pingPaiPic:", pingPaiPic)
            print("carTypes:", carTypes)
            print("carNames:", carNames)

            carModel = CarModelItem()
            carModel["pingPai"] = pingPai
            carModel["carTypes"] = carTypes
            carModel["picUrl"] = pingPaiPic
            yield carModel
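
The index loop that attaches carNames to carTypes relies on the two lists lining up position by position. One alternative, offered only as a sketch and not as the author's method, is to pair them with zip; the helper name attach_names is hypothetical.

```python
def attach_names(car_types, car_names):
    """Pair the i-th car type with the i-th name group.

    zip stops at the shorter list, so a type without a matching
    name group keeps its empty "carNames" placeholder.
    """
    for car_type, names in zip(car_types, car_names):
        car_type["carNames"] = names
    return car_types


car_types = [
    {"carType": ["一汽-大众奥迪"], "carNames": {}},
    {"carType": ["Audi Sport"], "carNames": {}},
]
car_names = [{"carName": ["奥迪A3", "奥迪A4L"]}]  # deliberately one group short

result = attach_names(car_types, car_names)
print(result)
```

Either way, the pairing is only correct if every "h3-tit" heading is followed by exactly one "rank-list-ul" list in the page.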

Note: how to export the items as JSON is not covered in detail here; it is widely documented online.
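
For reference, one common route is Scrapy's feed exports. The fragment below is a minimal sketch for settings.py, assuming Scrapy 2.1 or later (where the FEEDS setting is available); the output file name carmodel.json is a placeholder, not something the original specifies.

```python
# settings.py (fragment)
FEEDS = {
    "carmodel.json": {  # hypothetical output path
        "format": "json",
        "encoding": "utf8",  # write Chinese text as-is instead of \uXXXX escapes
        "indent": 4,         # pretty-print so the nesting is visible
    },
}
```

With a setting like this in place, "scrapy crawl GetCarModel" should write the yielded items to that file.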
Running the code produces the following exported JSON file:

[
{
    "pingPai": ["奥迪"],
    "carTypes": [{
        "carType": ["一汽-大众奥迪"],
        "carNames": {
            "carName": ["奥迪Q2L新能源",
            "奥迪A3",
            "奥迪A4L",
            "奥迪A6L",
            "奥迪Q2L",
            "奥迪Q3",
            "奥迪Q5L",
            "奥迪A6L新能源",
            "奥迪Q4",
            "奥迪A4",
            "奥迪A6",
            "奥迪Q5"]
        }
    },
    {
        "carType": ["Audi Sport"],
        "carNames": {
            "carName": ["奥迪RS 3",
            "奥迪RS 4",
            "奥迪RS 5",
            "奥迪RS 6",
            "奥迪RS 7",
            "奥迪R8",
            "奥迪TT RS",
            "奥迪RS Q3",
            "奥迪RSQ e-tron"]
        }
    },
    {
        "carType": ["奥迪(进口)"],
        "carNames": {
            "carName": ["奥迪e-tron",
            "奥迪A3(进口)",
            "奥迪S3",
            "奥迪A4(进口)",
            "奥迪A5",
            "奥迪S4",
            "奥迪S5",
            "奥迪A6(进口)",
            "奥迪S6",
            "奥迪A7",
            "奥迪S7",
            "奥迪A8",
            "奥迪Q7",
            "奥迪Q7新能源",
            "奥迪TT",
            "奥迪TTS",
            "奥迪A0",
            "奥迪A1",
            "奥迪S1",
            "e-tron Concept",
            "奥迪AI:ME",
            "奥迪A6新能源(进口)",
            "奥迪A7新能源",
            "奥迪Aicon",
            "奥迪e-tron GT",
            "Prologue",
            "奥迪A8新能源",
            "奥迪A9",
            "奥迪S8",
            "allroad",
            "奥迪Q2",
            "奥迪SQ2",
            "奥迪Q3(进口)",
            "奥迪Q4(进口)",
            "奥迪Q4新能源(进口)",
            "奥迪TT offroad",
            "h-tron quattro",
            "奥迪Elaine",
            "奥迪Q5(进口)",
            "奥迪Q5新能源(进口)",
            "奥迪SQ5",
            "奥迪Q8",
            "奥迪SQ7",
            "奥迪Q9",
            "e-tron Vision Gran Turismo",
            "quattro",
            "奥迪PB18",
            "奥迪R18",
            "奥迪Urban",
            "奥迪A2",
            "奥迪80",
            "奥迪A3新能源(进口)",
            "奥迪Coupe",
            "奥迪100",
            "Crosslane Coupe",
            "奥迪Cross",
            "Nanuk"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"
},
{
    "pingPai": ["阿斯顿·马丁"],
    "carTypes": [{
        "carType": ["阿斯顿·马丁"],
        "carNames": {
            "carName": ["Rapide",
            "V8 Vantage",
            "Vanquish",
            "阿斯顿·马丁DB11",
            "阿斯顿·马丁DBS",
            "Cygnet",
            "Rapide E",
            "阿斯顿·马丁DBX",
            "V12 Vantage",
            "阿斯顿·马丁DB9",
            "AM-RB 003",
            "Heritage EV",
            "Virage",
            "Vulcan",
            "阿斯顿·马丁CC100",
            "阿斯顿·马丁DB10",
            "阿斯顿·马丁DB5",
            "阿斯顿·马丁DP-100",
            "战神",
            "拉共达Taraf",
            "Ulster",
            "V12 Zagato",
            "阿斯顿·马丁DB6",
            "阿斯顿·马丁One-77"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M06/AE/B5/100x100_f40_autohomecar__wKgHEVs9u6GAPWN8AAAYsmBsCWs847.png"
},
{
    "pingPai": ["AC Schnitzer"],
    "carTypes": [{
        "carType": ["AC Schnitzer"],
        "carNames": {
            "carName": ["AC Schnitzer 3系",
            "AC Schnitzer M4",
            "AC Schnitzer 7系",
            "AC Schnitzer X6",
            "AC Schnitzer X5"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M01/B0/62/100x100_f40_autohomecar__ChcCQFs9vBKAO3YSAAAW0WOWvRc555.png"
},
{
    "pingPai": ["安凯客车"],
    "carTypes": [{
        "carType": ["安凯客车"],
        "carNames": {
            "carName": ["宝斯通"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M00/AB/C8/100x100_f40_autohomecar__ChcCSFs8riCAYVA2AAApQLgf8a0969.png"
},
{
    "pingPai": ["阿尔法·罗密欧"],
    "carTypes": [{
        "carType": ["阿尔法·罗密欧"],
        "carNames": {
            "carName": ["Giulia",
            "Stelvio",
            "MiTo",
            "Giulietta",
            "Tonale",
            "ALFA 4C",
            "Disco Volante",
            "Gloria",
            "ALFA 147",
            "ALFA 156",
            "ALFA 159",
            "ALFA 166",
            "ALFA 2uettottanta",
            "ALFA 8C",
            "ALFA GT",
            "ALFA S.Z.",
            "ALFA TZ3"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M05/B0/29/100x100_f40_autohomecar__ChcCP1s9u5qAemANAABON_GMdvI451.png"
},
{
    "pingPai": ["ALPINA"],
    "carTypes": [{
        "carType": ["ALPINA"],
        "carNames": {
            "carName": ["ALPINA B4",
            "ALPINA B3",
            "ALPINA D5",
            "ALPINA B7",
            "ALPINA XD3"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
},
{
    "pingPai": ["ABT"],
    "carTypes": [{
        "carType": ["ABT"],
        "carNames": {
            "carName": ["ABT A3",
            "ABT A5",
            "ABT TT"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
},
{
    "pingPai": ["AEV ROBOTICS"],
    "carTypes": [{
        "carType": ["AEV ROBOTICS"],
        "carNames": {
            "carName": ["Modular Vehicle System"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g3/M02/58/D3/autohomecar__ChcCRVw0TJaAM8BmAAAS-7AD7DQ372.png"
},
{
    "pingPai": ["Agile Automotive"],
    "carTypes": [{
        "carType": ["Agile Automotive"],
        "carNames": {
            "carName": ["Agile Automotive SC122",
            "Agile Automotive SCX"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M09/AF/8C/100x100_f40_autohomecar__wKgHHVs9r62AIbiYAAAvAsqdpoA594.png"
},
{
    "pingPai": ["Apollo"],
    "carTypes": [{
        "carType": ["Apollo"],
        "carNames": {
            "carName": ["Apollo N",
            "Arrow",
            "Intensa Emozione"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M06/B0/C6/100x100_f40_autohomecar__ChcCR1s90RGASBRgAACz67wh_68723.png"
},
{
    "pingPai": ["Arash"],
    "carTypes": [{
        "carType": ["Arash"],
        "carNames": {
            "carName": ["AF8 Cassini",
            "Arash AF10"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M05/AA/D4/100x100_f40_autohomecar__wKgHHFs8n1CAVhcNAAAV3xEAiDM531.png"
},
{
    "pingPai": ["ARCFOX"],
    "carTypes": [{
        "carType": ["北汽新能源"],
        "carNames": {
            "carName": ["ARCFOX-1",
            "ARCFOX ECF Concept",
            "ARCFOX-7",
            "ARCFOX-GT"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M02/AB/F7/100x100_f40_autohomecar__ChcCQFs8nA6AP-h5AABsvxhHw3E709.png"
},
{
    "pingPai": ["Aria"],
    "carTypes": [{
        "carType": ["Aria"],
        "carNames": {
            "carName": ["Aria FXE"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M0B/B0/0D/100x100_f40_autohomecar__wKgHI1s9r2iAJwIXAAAIBShzq60456.png"
},
{
    "pingPai": ["ATS"],
    "carTypes": [{
        "carType": ["ATS"],
        "carNames": {
            "carName": ["ATS GT"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M08/D7/D3/autohomecar__ChsEe1wYwKmAY2p9AAA1NP0jCHk594.png"
},
{
    "pingPai": ["Aurus"],
    "carTypes": [{
        "carType": ["Aurus"],
        "carNames": {
            "carName": ["Senat"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g27/M07/F3/E1/autohomecar__ChcCQFuN6WiAcztKAAAsLfBmU9g074.png"
},
{
    "pingPai": ["艾康尼克"],
    "carTypes": [{
        "carType": ["艾康尼克ICONIQ Motors"],
        "carNames": {
            "carName": ["MUSE",
            "艾康尼克七系"]
        }
    }],
    "picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M0A/A9/EC/100x100_f40_autohomecar__wKgHG1s8iP6ASbjTAAAOIwskkzo314.png"
},
{
    "pingPai": ["爱驰"],
    "carTypes": [{
        "carType": ["爱驰汽车"],
        "carNames": {
            "carName": ["爱驰U5",
            "爱驰U7",
            "RG Nathalie"]
        }
    }],
    "picUrl": "https://car3.autoimg.cn/cardfs/series/g29/M09/A9/9B/100x100_f40_autohomecar__wKgHG1s8fwqAOp3IAAALEeTkn6c536.png"
}
]
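
To check that an export has the intended shape, it can be read back and walked level by level. The sketch below inlines one record in the format shown above rather than reading a file, and the function name flatten is hypothetical.

```python
import json

# One record in the exported format (picUrl omitted for brevity).
EXPORTED = """
[{"pingPai": ["ABT"],
  "carTypes": [{"carType": ["ABT"],
                "carNames": {"carName": ["ABT A3", "ABT A5", "ABT TT"]}}]}]
"""


def flatten(records):
    """Yield one (brand, car type, car name) tuple per model."""
    for record in records:
        for car_type in record["carTypes"]:
            # "carNames" may be an empty dict if no name group was attached.
            for name in car_type["carNames"].get("carName", []):
                yield record["pingPai"][0], car_type["carType"][0], name


rows = list(flatten(json.loads(EXPORTED)))
for row in rows:
    print(row)
```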
