Scrapy spider does not iterate over the start_urls list


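I'm trying to build an email scraper that takes in a CSV file of URLs and returns them with email addresses, including any additional URLs/addresses that get scraped along the way. I can't get my spider to iterate over every row in the CSV file, even though the rows come back fine when I test the function I'm calling.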

Here is the code; I adapted it from here:

import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep

# Avoid getting too many logs and warnings when using Scrapy inside Jupyter Notebook.
logging.getLogger('scrapy').propagate = False

# Extract urls from file.
def get_urls():
    urls = pd.read_csv('food_urls.csv')
    url = list(urls) 
    for i in url: 
        return urls

# Test it.
# get_urls()

# Create mail spider.
class MailSpider(scrapy.Spider):

    name = 'email'

    def parse(self, response):
#       Search for links inside URLs.
        links = LxmlLinkExtractor(allow=()).extract_links(response)

#       Take in a list of URLs as input and read their source codes one by one.
        links = [str(link.url) for link in links]
        links.append(str(response.url))

#       Send links from one parse method to another.
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

#   Pass URLS to the parse_link method — this is the method we'll apply our regex findall to look for emails            
    def parse_link(self, response):

        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)

        df.to_csv(self.path, mode='a', header=False)
        df.to_csv(self.path, mode='a', header=False)

# Save emails in a CSV file
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False
def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return 

    with open(path, 'wb') as file: 
        file.close()

# Combine everything 
def get_info(root_file, path): 
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    urls_list = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=urls_list, path=path)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df

Finally, when I call df = get_info('food_urls.csv', 'food_emails.csv'), the scraper takes a long time to run.

When it finished, I ran df.head() and got this:

    email   link
0   NaN NaN
1   alyssa@therecipecritic.com  https://therecipecritic.com/food-blogger/
2   shop@therecipecritic.com    https://therecipecritic.com/terms/

So it is running, but it only scrapes the first URL in the list.

Does anyone know what I'm doing wrong?

Thanks!

Answer

I created a Python dict containing a nested list and imported it:

from Base_URLS import URL_List

Then I called:

def get_urls():
    urls = URL_List['urls']
    return urls

Worked like a charm!

Thanks to @rodrigo-nader for the help
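
A likely explanation for the original behaviour: in the question's get_urls(), list(urls) on a DataFrame gives its column labels, and the return inside the loop hands back the whole DataFrame on the first pass. Scrapy then iterates over start_urls, and iterating a DataFrame yields its column labels. With read_csv's default header handling, the first line of the file (the first URL) becomes the only column label, so only that URL is requested. If you would rather keep reading from the CSV file than move the URLs into a dict, a minimal sketch could look like the following. It assumes food_urls.csv holds one URL per line in its first column with no header row; the csv_path parameter is introduced here for illustration and is not part of the original code.

```python
import csv


def get_urls(csv_path):
    """Return every non-empty value from the first column of a header-less CSV file."""
    urls = []
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            # Blank lines come back as empty rows; skip them and empty cells.
            if row and row[0].strip():
                urls.append(row[0].strip())
    # Return a plain list only after the whole file has been read.
    return urls
```

In get_info() this would be called as urls_list = get_urls(root_file), so the spider receives a plain list of strings as start_urls. With pandas, pd.read_csv('food_urls.csv', header=None)[0].tolist() should give the same list under the same no-header assumption.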
