Scrapy python csv output has blank lines between each row

【Posted】2017-09-14 07:58:48

【Question】:

I am getting unwanted blank rows between each line of scrapy output in the resulting csv output file.

I have moved from python2 to python 3, and I use Windows 10, so I am currently adapting my scrapy projects to python3.

My current (and for now only) problem is that when I write the scrapy output to a CSV file I get a blank line between each row. This has been highlighted in several posts here (it is to do with Windows) but I have been unable to get a working solution.
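
For background, the blank rows on Windows usually come from newline translation: the csv module already writes '\r\n' as its row terminator, and a file opened in text mode translates the '\n' once more, producing '\r\r\n', which spreadsheet programs render as an empty row. A minimal standalone illustration with the plain csv module (file name and values are made up):

import csv

# Opening with newline='' stops Windows from translating the csv module's
# '\r\n' row terminator into '\r\r\n' (which shows up as a blank row).
with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['plotid', 'plotprice'])
    writer.writerow(['1', '100000'])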

As it happens, I have also added some code to the pipelines.py file to ensure the csv output is in a given column order rather than some random order. Hence I can use the plain scrapy crawl charleschurch to run this code rather than scrapy crawl charleschurch -o charleschurch2017xxxx.csv.

Does anybody know how to skip / omit this blank line in the CSV output?

My pipelines.py code is below (I probably don't need the import csv line, but I suspect I may for the final answer):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

I added this line to the settings.py file (not sure of the relevance of the 300):

ITEM_PIPELINES = {'CharlesChurch.pipelines.CSVPipeline': 300}
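
(For reference, the 300 is the pipeline's order: Scrapy runs the entries in ITEM_PIPELINES from lower to higher values, conventionally chosen in the 0-1000 range, so the exact number only matters relative to any other pipelines.)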

My scrapy code is as follows:

import scrapy
from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]    
    start_urls = ["https://www.charleschurch.com/county-durham_willington/the-ridings-1111"]


    def parse(self, response):

        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
           item = CharleschurchItem()
           item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
           item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
           plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
           plotnames = [plotname.strip() for plotname in plotnames]
           plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
           plotids = [plotid.strip() for plotid in plotids]
           plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
           plotprices = [plotprice.strip() for plotprice in plotprices]
           result = zip(plotnames, plotids, plotprices)
           for plotname, plotid, plotprice in result:
               item['plotname'] = plotname
               item['plotid'] = plotid
               item['plotprice'] = plotprice
               yield item

【Comments】:

Can you try changing the line file = open('%s_items.csv' % spider.name, 'w+b') to file = open('%s_items.csv' % spider.name, 'w', newline="")?

@Jean-FrançoisFabre when I try that I get the error TypeError: write() argument must be str, not bytes

OK, then try file = open('%s_items.csv' % spider.name, 'wb', newline="")

@Jean-FrançoisFabre that gives the error ValueError: binary mode doesn't take a newline argument

【Answer 1】:

I suspect this is not ideal but I have found a workaround to this problem. In the pipelines.py file I have added some more code that essentially reads the csv file with the blank lines into a list, removing the blank lines in the process, and then writes the cleaned list to a new file.

The code I added is:

with open('%s_items.csv' % spider.name, 'r') as f:
  reader = csv.reader(f)
  original_list = list(reader)
  cleaned_list = list(filter(None,original_list))

with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
    wr = csv.writer(output_file, dialect='excel')
    for data in cleaned_list:
      wr.writerow(data)
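
If the extra _cleaned file is unwanted, one further step (hypothetical, not part of the original workaround) would be to overwrite the original file with the cleaned copy afterwards:

import os

# swap the cleaned csv in over the original
os.replace('%s_items_cleaned.csv' % spider.name, '%s_items.csv' % spider.name)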

So the whole pipelines.py file is:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

    # given I am using Windows I need to eliminate the blank lines in the csv file
    print("Starting csv blank line cleaning")
    with open('%s_items.csv' % spider.name, 'r') as f:
      reader = csv.reader(f)
      original_list = list(reader)
      cleaned_list = list(filter(None,original_list))

    with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
        wr = csv.writer(output_file, dialect='excel')
        for data in cleaned_list:
          wr.writerow(data)

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item


class CharleschurchPipeline(object):
    def process_item(self, item, spider):
        return item

Not ideal, but it solves the problem for now.
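
One caveat with this approach: list(reader) loads the entire csv into memory before filtering, which is fine for a scrape of this size but worth keeping in mind for larger outputs.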

【Discussion】:

【Answer 2】:

The b in w+b is most likely part of the problem, since it makes the file be treated as a binary file, so the line endings are written as-is.

So the first step is to remove the b. Then, by adding U, you also activate universal newline support (see: https://docs.python.org/3/glossary.html#term-universal-newlines).

So the line in question should look like this:

file = open('%s_items.csv' % spider.name, 'Uw+')

【Discussion】:

I had not thought of that, but when I try it I get the error ValueError: mode U cannot be combined with 'x', 'w', 'a', or '+'

Strange, it runs on my system. The next best solution would be 'w+', but per the other comments you will run into a different error message. Sorry.

I did, until about 6 months ago when I migrated completely to Ubuntu. I looked this snippet up in my old sources.
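
Reconciling the comments above: in Python 3, Scrapy's CsvItemExporter expects a binary file because it wraps it in its own io.TextIOWrapper internally, which is why passing newline='' to open() fails in both text and binary mode. In the Scrapy 1.x versions this question concerns, that internal wrapper did not pass newline='', which is the actual source of the doubled line endings on Windows (later Scrapy releases fixed this inside the exporter). A rough sketch of a subclass that rebuilds the wrapper with newline='' is below; it assumes the exporter exposes stream, csv_writer and encoding attributes, as the 1.x source did, and drops any csv dialect kwargs for simplicity:

import csv
import io
from scrapy.exporters import CsvItemExporter

class WindowsSafeCsvItemExporter(CsvItemExporter):
    """Sketch: recreate the exporter's text stream with newline=''
    so the csv writer's '\r\n' terminator is not translated again."""
    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        if isinstance(self.stream, io.TextIOWrapper):
            self.stream.detach()  # stop the abandoned wrapper from closing the file on GC
        self.stream = io.TextIOWrapper(
            file,
            line_buffering=False,
            write_through=True,
            encoding=self.encoding or 'utf-8',  # fall back if the exporter did not set one
            newline='',  # suppress OS newline translation
        )
        self.csv_writer = csv.writer(self.stream)

With this, spider_opened can keep the original 'w+b' open mode and simply instantiate WindowsSafeCsvItemExporter(file) in place of CsvItemExporter(file).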
