Scrapy spider won't iterate over the start_urls list
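I'm trying to build an email scraper that takes a CSV file of URLs and returns them along with their email addresses, including any additional URLs/addresses scraped along the way. I can't get my spider to iterate over every row of the CSV file, even though the functions I'm calling return fine when I test them.
Here is the code, which I adapted from here: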
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep

# Avoid getting too many logs and warnings when using Scrapy inside Jupyter Notebook.
logging.getLogger('scrapy').propagate = False

# Extract urls from file.
def get_urls():
    urls = pd.read_csv('food_urls.csv')
    url = list(urls)
    for i in url:
        return urls

# Test it.
# get_urls()

# Create mail spider.
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        # Search for links inside URLs.
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        # Take in a list of URLs as input and read their source codes one by one.
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        # Send links from one parse method to another.
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    # Pass URLs to the parse_link method; this is where we apply our regex findall to look for emails.
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
        df.to_csv(self.path, mode='a', header=False)

# Save emails in a CSV file
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()

# Combine everything
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    urls_list = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=urls_list, path=path)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df

Finally, when I call df = get_info('food_urls.csv', 'food_emails.csv'), the scraper takes a very long time to run.
When it finished, I ran df.head() and got this:
                        email                                       link
0                         NaN                                        NaN
1  alyssa@therecipecritic.com  https://therecipecritic.com/food-blogger/
2    shop@therecipecritic.com         https://therecipecritic.com/terms/
So it is running, but it only scrapes the first URL in the list.
Does anyone know what I'm doing wrong?
Thanks!

Answer
I created a Python dict containing a list of the URLs and imported it:
from Base_URLS import URL_List
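
The contents of Base_URLS are not shown in this answer. A minimal sketch of what such a module might contain, assuming a file named Base_URLS.py and using placeholder URLs (both the layout and the URLs are assumptions, not the author's actual data):

```python
# Base_URLS.py (hypothetical contents; the real module is not shown in the answer)
# A dict whose 'urls' key holds a plain list of URL strings.
URL_List = {
    'urls': [
        'https://example.com/',       # placeholder URL
        'https://example.org/about',  # placeholder URL
    ],
}
```

The relevant point is that URL_List['urls'] is an ordinary list of strings, which is the shape Scrapy expects for start_urls.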
Then I call:
def get_urls():
    urls = URL_List['urls']
    return urls
Works like a charm!
Thanks to @rodrigo-nader for the help.
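
A likely reason the original get_urls() failed: it returns the pandas DataFrame itself, and iterating over a DataFrame yields its column labels rather than its rows. Because pd.read_csv treats the first row as a header by default, the only "URL" Scrapy sees is the first line of the file, which matches the symptom described in the question. If you would rather keep the URLs in the CSV file than hard-code them, a minimal sketch using the standard csv module is shown here. It assumes one URL per row in the first column and no header row; read_urls_from_csv is a hypothetical helper name, not part of Scrapy or of the original code.

```python
import csv


def read_urls_from_csv(csv_path):
    """Return the first column of every non-empty row as a plain list of strings.

    Assumes one URL per row, in the first column, with no header row.
    """
    urls = []
    with open(csv_path, newline='', encoding='utf-8') as handle:
        for row in csv.reader(handle):
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls
```

The returned list can be passed straight to the existing call, for example process.crawl(MailSpider, start_urls=read_urls_from_csv('food_urls.csv'), path=path).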