Python - Splitting Scrapy's JSON output


I'm learning NLP, and as part of that I'm using Scrapy to scrape Amazon book reviews. I've extracted the fields I want and written them out in JSON format. When this file is loaded as a DataFrame, each field comes in as one long list rather than one value per row. How can I split these lists so the DataFrame gets one row per review, instead of all the entries for each field being stored together in a single list? Code:

import scrapy


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspider'
    allowed_domains = ['amazon.co.uk']  # note: start_urls actually points at amazon.com
    start_urls = ['https://www.amazon.com/Gone-Girl-Gillian-Flynn/product-reviews/0307588378/ref=cm_cr_othr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1']

    def parse(self, response):
        users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
        titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
        dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
        found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
        rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
        content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

        # .extract() already returned plain lists above, so yield them directly
        yield {
            'users': users,
            'titles': titles,
            'dates': dates,
            'found_helpful': found_helpful,
            'rating': rating,
            'content': content
        }

Sample output:

users = ['Lauren', 'James'...'John']
dates = ['on September 28, 2017', 'on December 26, 2017'...'on November 17, 2016']
rating = ['5.0 out of 5 stars', '2.0 out of 5 stars'...'5.0 out of 5 stars']

Desired output:

index 1: [users='Lauren', dates='on September 28, 2017', rating='5.0 out of 5 stars']
index 2: [users='James', dates='on December 26, 2017', rating='2.0 out of 5 stars']
...

I understand that the pipelines associated with the spider may need to be edited to achieve this, but my Python knowledge is limited and I couldn't make sense of the Scrapy documentation. I have also tried the solutions from here and here, but I don't know enough to reconcile those answers with my own code. Any help would be much appreciated.

Answer

After re-reading your question, I'm fairly sure this is what you want:

def parse(self, response):
    users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
    titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
    dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
    helpful_votes = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
    ratings = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
    contents = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

    # zip() walks the parallel lists together, so each iteration yields one review
    for user, title, date, found_helpful, rating, content in zip(users, titles, dates, helpful_votes, ratings, contents):
        yield {
            'user': user,
            'title': title,
            'date': date,
            'found_helpful': found_helpful,
            'rating': rating,
            'content': content
        }

Or something to that effect. That's what I was trying to hint at in my first comment.
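To see concretely what the zip() pairing does, here is a minimal sketch that regroups parallel lists into one dict per review. The sample values are made up to mirror the sample output above, not real scraped data:

```python
# Parallel lists, one entry per review (illustrative values)
users = ['Lauren', 'James']
dates = ['on September 28, 2017', 'on December 26, 2017']
ratings = ['5.0 out of 5 stars', '2.0 out of 5 stars']

# zip() pairs the i-th element of each list, producing one record per review
rows = [
    {'user': u, 'date': d, 'rating': r}
    for u, d, r in zip(users, dates, ratings)
]
# rows[0] == {'user': 'Lauren', 'date': 'on September 28, 2017',
#             'rating': '5.0 out of 5 stars'}
```

One caveat: zip() stops at the shortest list, so if a field is missing for some reviews (e.g. not every review has a helpful-vote statement), the lists get out of step and fields can be paired with the wrong review. Selecting per-review containers first, as the second answer below does with `.css('.s-productthumbbox')`, avoids that misalignment.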

Another answer

Edit: I was able to come up with a solution by using the .css method instead of .xpath. The spider I used to scrape shirts from a fashion retailer:

import scrapy
from ..items import ProductItem

class SportsdirectSpider(scrapy.Spider):
    name = 'sportsdirect'
    allowed_domains = ['www.sportsdirect.com']
    start_urls = ['https://www.sportsdirect.com/mens/mens-shirts']

    def parse(self, response):
        products = response.css('.s-productthumbbox')
        for p in products:
            brand = p.css('.productdescriptionbrand::text').extract_first()
            name = p.css('.productdescriptionname::text').extract_first()
            price = p.css('.curprice::text').extract_first()
            item = ProductItem()
            item['brand'] = brand
            item['name'] = name
            item['price'] = price
            yield item

The associated items.py script:

import scrapy

class ProductItem(scrapy.Item):
    brand = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()

Creating the JSON Lines file (at the Anaconda prompt):

cd simple_crawler
scrapy crawl sportsdirect --set FEED_URI=products.jl
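As a side note, on newer Scrapy versions (2.1+) the same feed can be configured once in settings.py via the FEEDS setting instead of passing --set on every run. A sketch, with the file path as an example:

```python
# settings.py (Scrapy >= 2.1): export scraped items as JSON Lines
FEEDS = {
    'products.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}
```

With this in place, a plain `scrapy crawl sportsdirect` writes the feed automatically.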

Code to convert the resulting .jl file into a DataFrame:

import json
import pandas as pd

# Each line of the .jl feed is one JSON object, i.e. one item
contents = open('products.jl', 'r').read()
data = [json.loads(line) for line in contents.strip().split('\n')]
df2 = pd.DataFrame(data)
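Alternatively, pandas can parse JSON Lines directly with `read_json(..., lines=True)` (available since pandas 0.19), which replaces the manual split-and-parse. A sketch using an in-memory sample that mirrors the feed's format:

```python
import io
import pandas as pd

# Stand-in for open('products.jl'): one JSON object per line
sample = io.StringIO(
    '{"brand": "Pierre Cardin", "name": "Short Sleeve Shirt Mens", "price": "£6.50"}\n'
    '{"brand": "Pierre Cardin", "name": "Short Sleeve Shirt Mens", "price": "£7.00"}\n'
)

# lines=True tells pandas to treat each line as a separate record
df2 = pd.read_json(sample, lines=True)
# df2 has one row per item, with columns brand / name / price
```

On a real file, `pd.read_json('products.jl', lines=True)` does the same in one call.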

Final output:

        brand        name                        price
0   Pierre Cardin    Short Sleeve Shirt Mens     £6.50 
1   Pierre Cardin    Short Sleeve Shirt Mens     £7.00 
...
