Scraping email addresses from websites

I have tried multiple iterations from other posts, but nothing seems to help or meet my needs.

I have a list of URLs I want to iterate over. For each one I want to extract every related URL that contains an email address, and then store the URLs and email addresses in a CSV file.

For example, if I go to 10torr.com, the program should find every page under the main URL (e.g. 10torr.com/about) and extract all of the emails from them.

Below is a list of five example websites, which are in DataFrame format when run through my code. They are saved under the variable small_site; a minimal sketch of how such a DataFrame could be built follows the list.

    Website
    http://10torr.com/
    https://www.10000drops.com/
    https://www.11wells.com/
    https://117westspirits.com/
    https://www.onpointdistillery.com/
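
For reference, one minimal way such a DataFrame could be built (a sketch; in the original code `site` is loaded from somewhere else, so this construction is an assumption based on the post):

import pandas as pd

# Hypothetical construction of the DataFrame described above; the
# original post loads `site` from elsewhere.
site = pd.DataFrame({
    'Website': [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]
})
small_site = site.head()  # first five rows, as in the question's code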

Below is the code I am running. The spider seems to run, but there is no output in my CSV file.


import os
import pandas as pd
import re
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

small_site = site.head()  # `site` is the DataFrame of websites shown above


#%% Start Spider
class MailSpider(scrapy.Spider):

    name = 'email'

    def parse(self, response):

        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        for word in self.reject:
            if word in str(response.url):
                return

        html_text = str(response.text)
        # crude email pattern: word chars, "@", word chars, ".", word chars
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)

        df.to_csv(self.path, mode='a', header=False)


#%% Preps a CSV File
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return 

    with open(path, 'wb') as file: 
        file.close()


#%% Defines function that will extract emails and enter it into CSV
def get_info(url_list, path, reject=[]):

    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)


    print('Collecting Google urls...')
    google_urls = url_list


    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.start() 

    for i in small_site.Website.iteritems():
        print('Searching for emails...')
        process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
        ##process.start()

        print('Cleaning emails...')
        df = pd.read_csv(path, index_col=0)
        df.columns = ['email', 'link']
        df = df.drop_duplicates(subset='email')
        df = df.reset_index(drop=True)
        df.to_csv(path, mode='w', header=True)


    return df


url_list = small_site
path = 'email.csv'

df = get_info(url_list, path)

I am not sure where I am going wrong, since I do not get any error message. If you need any additional information, please ask. I have been working on this for almost a month, and at this point I feel like I am banging my head against a wall.

Most of this code was found a few weeks back in the article Web scraping to extract contact information — Part 1: Mailing Lists. However, I have not been successful in extending it to my needs. When I incorporated their Google search function to get the base URLs, it ran endlessly.

Thank you for any help you can provide.

Answer

I modified your script a bit, ran the following through the shell, and it works. Maybe it will give you a starting point.

I suggest working in the shell, since it always surfaces errors and other messages during the crawl.


import re

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class MailSpider(scrapy.Spider):

    name = 'email'
    start_urls = [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        html_text = str(response.text)
        # crude email pattern: word chars, "@", word chars, ".", word chars
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}

        for key in dic.keys():
            yield {
                'email': dic['email'],
                'link': dic['link'],
            }

Crawling through the Anaconda shell with scrapy crawl email -o test.jl produces the following output:

"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/"
"email": ["8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress", "bundle@3.2", "fetch@3.0", "bolt@2.3", "5oclock@11wells.com", "5oclock@11wells.com", "5oclock@11wells.com"], "link": "https://www.11wells.com"
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop"
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=cart"
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/home"
"email": ["8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress", "bundle@3.2", "fetch@3.0", "bolt@2.3", "5oclock@11wells.com", "5oclock@11wells.com", "5oclock@11wells.com"], "link": "https://www.11wells.com"
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/home"
"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/117%C2%B0-west-spirits-1"
...
...
...
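
Note that the pattern also matches JavaScript version strings such as bundle@3.2 and fetch@3.0, because \w+@\w+\.{1}\w+ allows digits after the dot. One possible post-filter (a sketch, not part of the original answer; the helper name is made up) is to require an alphabetic top-level domain:

import re

# Stricter pattern: the part after the final dot must be at least two
# letters, so version strings like "bundle@3.2" are rejected.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}')

def clean_emails(candidates):
    # Keep only candidates that fully match the stricter pattern.
    return [c for c in candidates if EMAIL_RE.fullmatch(c)]

print(clean_emails(['bundle@3.2', '5oclock@11wells.com']))
# ['5oclock@11wells.com']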

See the Scrapy docs for more information.
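
If you would rather run the spider from a script instead of the shell: the question's script calls process.start() before process.crawl(), so the reactor starts with nothing scheduled, which is likely why the CSV stays empty. A minimal sketch of the usual ordering (the FEEDS values here are assumptions, not from the original post):

from scrapy.crawler import CrawlerProcess

# Build the process, schedule the spider, then start the reactor.
# start() blocks until every scheduled crawl has finished, so it must
# come after crawl().
process = CrawlerProcess(settings={
    'USER_AGENT': 'Mozilla/5.0',
    # Write scraped items straight to a CSV file (Scrapy 2.1+ FEEDS setting).
    'FEEDS': {'emails.csv': {'format': 'csv'}},
})
process.crawl(MailSpider)  # the spider class defined above
process.start()            # blocks until the crawl is done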
