How to remove '\n' from Scrapy output in Python

Posted 2023-02-23

Asked: 2015-10-11 20:08:12

Question:

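I'm trying to write the output to CSV, but I realized that when scraping TripAdvisor I get a lot of carriage returns, so the array ends up with more than 30 entries when there are only 10 reviews, and I lose a lot of fields. Is there a way to remove the carriage returns?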

The spider:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import html2text
import unicodedata


class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]



    def parse(self, response):
        item = ScrapingTestingItem()
        sel = HtmlXPathSelector(response)
        converter = html2text.HTML2Text()
        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
##        dummy_test = [ "" for k in range(10)]

        item['reviews'] = sel.xpath('//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract()
        item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
        item['stars'] = sel.xpath('//*[@class="rating reviewItemInline"]//img/@alt').extract()
        item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
        item['location'] = sel.xpath('//*[@class="location"]/text()').extract()
        item['date'] = sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract()
        item['date'] += sel.xpath('//div[@class="col2of2"]//span[@class="ratingDate"]/text()').extract()


        startingrange = len(sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract())

        for j in range(startingrange,len(item['date'])):
            item['date'][j] = item['date'][j][9:].strip()

        for i in range(len(item['stars'])):
            item['stars'][i] = item['stars'][i][:1].strip()

        for o in range(len(item['reviews'])):
            print unicodedata.normalize('NFKD', unicode(item['reviews'][o])).encode('ascii', 'ignore')

        for y in range(len(item['subjects'])):
            item['subjects'][y] = unicodedata.normalize('NFKD', unicode(item['subjects'][y])).encode('ascii', 'ignore')

        yield item

#        print item['reviews']

        if(sites and len(sites) > 0):
            for site in sites:
                yield Request(url="http://tripadvisor.in" + site, callback=self.parse)

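Is it possible to use a regular expression inside a for loop to find and replace them? I tried replace, but it didn't do anything. Also, why does Scrapy do this?

Comments:

You could try .replace("\n", "").

I looked at that, but it won't work, because the array would still have 30 entries, some containing "". I need to remove them from the array, as in find the entry and then delete that index. I'm used to Java, so I don't know how to do this. TL;DR: it works, but the array indices are still there, so it outputs blanks.

I see, so you want to remove them from the array and keep only the 10 reviews? Can you post your array?

Figured it out: after reading about lists on python.org, I found I can use while "\n" in list: list.remove("\n")

Answer 1: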

I usually use Input and/or Output Processors together with Item Loaders to trim and clean the output; it keeps things more modular and clean:

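# Import paths as in Scrapy 1.x (Python 2), matching the unicode.strip call below.
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()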

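Then, if you use this Item Loader to load your items, you get the extracted values stripped and as strings (instead of lists). For example, if an extracted field is ["my value \n"], you get my value as the output.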
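To make the effect of these two processors concrete, here is a minimal plain-Python sketch of roughly what MapCompose(unicode.strip) followed by TakeFirst() does to an extracted list. The helpers strip_each and take_first are hypothetical stand-ins written for illustration, not Scrapy's implementation, and the sketch assumes Python 3 str where the loader above uses Python 2 unicode.

```python
def strip_each(values):
    """Input step: strip surrounding whitespace from every extracted string."""
    return [value.strip() for value in values]


def take_first(values):
    """Output step: return the first value that is neither None nor empty."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Made-up sample of what an XPath text() extraction might return.
extracted = ["\n", "my value \n", "another value"]
cleaned = strip_each(extracted)
result = take_first(cleaned)
print(result)
```

Because the output step keeps a single value per field, this pattern fits best when each review is loaded as its own item rather than as one item holding ten-element lists.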

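Comments:

I take it this would go outside the spiders folder, where the Items are. I'm new to Scrapy, so I need to try it out. I'll look into it.

@Smashed Thanks for the update. I usually keep a separate loaders.py next to items.py, but you can also put the loaders in items.py (if the Item and ItemLoader classes are short and straightforward).

@Smashed You should definitely look into it; it will help you organize your web-scraping code better and make your project more modular and clean.

Answer 2: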

A simple solution, found after reading the list documentation.

while "\n" in some_list: some_list.remove("\n")
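The loop above only removes entries that are exactly "\n". If the extracted list can also contain entries such as "\n\n" or reviews with trailing newlines (an assumption about the scraped data, not something shown in the question), a list comprehension that strips each entry and drops the empty ones may be more robust. A sketch in Python 3, with clean_entries as a hypothetical helper name:

```python
def clean_entries(entries):
    """Strip whitespace from each entry and drop entries that end up empty."""
    stripped = (entry.strip() for entry in entries)
    return [entry for entry in stripped if entry]


# Made-up sample resembling text nodes extracted from a review page.
raw_reviews = ["\n", "Great stay, lovely view.\n", "\n\n", "  Rooms were clean.", "\n"]
reviews = clean_entries(raw_reviews)
print(reviews)
```

Unlike repeated remove calls, which rescan the list each time, this builds the cleaned list in a single pass and leaves the original list untouched.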

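Comments:

Feel free to accept your own answer so that others don't spend time on the question; that's not unusual.

Yes, probably so. I'll try not to forget. :)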
