A program using AsyncIO and aiohttp to crawl a site and collect all URLs

Posted: 2017-06-20 13:59:13

Question:

With this asynchronous (faster) solution you can collect all the URLs from a site and write them to text files in the same directory as the program. Those text files can then be used for further crawling, or serve as a database that you compare against periodically.
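For the periodic comparison, a minimal sketch is shown below; the snapshot file name is a placeholder for whatever an earlier run produced, not something the program above writes.

# A minimal sketch, assuming two crawl result files from different runs exist.
# 'crawledsites_old.txt' is a hypothetical earlier snapshot; 'crawledsites.txt' is the latest run.
with open('crawledsites_old.txt') as f:
    old_urls = set(line.strip() for line in f if line.strip())
with open('crawledsites.txt') as f:
    new_urls = set(line.strip() for line in f if line.strip())
for url in sorted(new_urls - old_urls):
    print('new since last crawl:', url)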

I have modified the original code so that it fully works and does what I want. You can increase the number of parallel tasks, or tweak anything else you like. If you come up with a good improvement, feel free to post your modification here.
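If you would rather set the number of parallel tasks when launching the script instead of editing the source, a small sketch (not part of the original program, standard library only) could look like this:

# Sketch: read the worker count from the command line, falling back to 10 when no argument is given.
import sys
antal_arbetare = int(sys.argv[1]) if len(sys.argv) > 1 else 10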

Some of my ideas about distributed computing. Tags: html, javascript, python

#!/usr/bin/env python3

# python 3.5 async web crawler.
# https://github.com/mehmetkose/python3.5-async-crawler

# Licensed under the MIT license:
# http://www.opensource.org/licenses/mit-license
# Copyright (c) 2016 Mehmet Kose mehmet@linux.com

# Copyright (c) 2017 Drill Bit, but I am not so fussy about it.


'''
A bot that surfs a site and extracts all links from it. At the moment it does not move on to other sites.
It puts in-site pages in one text file, external sites in another text file, and leftovers such as media files in a separate file.
After the search is complete it cleans the data and removes duplicates, which are put in cleaned files. Then a backup is written and the content is rewritten to the original file for the next search. This way the database can grow over time.

This program only gives you web addresses.

You can either put content extraction into this program or write another program that reads the list this program generates and extracts the relevant content (a minimal sketch of such a follow-up script appears after this program). It will really hammer the server, so thread, I mean async, carefully. On a normal laptop, run at most four instances on different sites at the same time. It will use CPU, though not much RAM.

There are many interesting things one could do with multiple computers, a central web address database and distributed crawling. Any handheld device could help with a scraping collaboration in exchange for access to the total effort. Trade crawling for access. If you want more info on the idea, all I need is a place to cook (Fear and Loathing in Las Vegas).

It is sensitive to the root_url.
'''
import aiohttp
import asyncio
import async_timeout
from urllib.parse import urljoin, urldefrag

#calculate execution time. Hours...
import time
start = time.time()

antal_arbetare = 10  # number of parallel worker tasks (the original used 3)

#PUT IN THE SITE TO CRAWL HERE

root_url = "http://www.bbc.com"

crawled_urls, url_hub = [], [root_url, "%s/sitemap.xml" % (root_url), "%s/robots.txt" % (root_url)]

#Choose a browser user-agent string to present to the server.
#headers = {'user-agent': 'Opera/9.80 (X11; Linux x86_64; U; en) Presto/2.2.15 Version/10.10'}
#headers = {'user-agent': 'Firefox/53 (X11; Linux x86_64; U; en) sinGUlaro/5.2.12 Ver/2.8'}
headers = {'user-agent': 'Chrome (Windows 7; B; br) Presto/2.2.15 Version/10.10'}
#headers = {'user-agent': 'Chromium112AD (Windows 10; CZ) Xenialtrouble-0.2.0'}
#headers = {'user-agent': 'Firefox12 (Win XP; build-1076-1) tubular bells'}
#headers = ADD YOUR OWN

async def get_body(url):
    # Fetch one page and return a dict with either the HTML or an error.
    async with aiohttp.ClientSession() as session:
        try:
            async with async_timeout.timeout(5):
                async with session.get(url, headers=headers) as response:
                    if response.status == 200:
                        html = await response.text()
                        return {'error': '', 'html': html}
                    else:
                        return {'error': response.status, 'html': ''}
        except Exception as err:
            return {'error': err, 'html': ''}


async def handle_task(task_id, work_queue):
    while not work_queue.empty():
        queue_url = await work_queue.get()
        if queue_url not in crawled_urls:
            with open('crawledsites.txt', 'a') as a:
                a.write(queue_url + '\n')
            crawled_urls.append(queue_url)
            body = await get_body(queue_url)
            if not body['error']:
                for new_url in get_urls(body['html']):
                    if root_url in new_url and new_url not in crawled_urls:
                        work_queue.put_nowait(new_url)
                    else:
                        # Only handle links that point outside the crawled domain.
                        if root_url.split("//")[-1].split("/")[0].split('www.')[-1] not in new_url:
                            if 'file:///' in new_url:
                                with open('localfiles.txt', 'a') as wt:
                                    wt.write(new_url + '\n')
                            else:
                                # Skip ad, social-media, mail and javascript links;
                                # everything else is saved as an external URL.
                                skip_words = ('adserver', 'facebook', 'tel:', 'file:///', 'googleapis',
                                              'javascript', 'yimg.com', 'btrll.com', 'flickr.com',
                                              'tv.nu', 'klart.se', 'twitter.com', 'linkedin.com',
                                              'facebook.com', 'instagram.com', 'mailto:')
                                if not any(word in new_url for word in skip_words):
                                    with open('externaurl.txt', 'a') as ct:
                                        ct.write(new_url + '\n')
            else:
                with open('notapprovedlinks.txt', 'a') as erroro:
                    erroro.write(queue_url + '\n')



def remove_fragment(url):
    # Strip the #fragment part so the same page is not crawled twice.
    pure_url, frag = urldefrag(url)
    return pure_url

def get_urls(html):
    # Crude href extraction by string splitting; no HTML parser is used.
    new_urls = [url.split('"')[0] for url in str(html).replace("'", '"').split('href="')[1:]]
    return [urljoin(root_url, remove_fragment(new_url)) for new_url in new_urls]

if __name__ == "__main__":
    q = asyncio.Queue()
    for url in url_hub:
        q.put_nowait(url)
    loop = asyncio.get_event_loop()
    tasks = [handle_task(task_id, q) for task_id in range(antal_arbetare)]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()


#print and log the time
end = time.time()
langd = end-start
logg= open('logg.txt','a')
logg.write(str(root_url)+' - '+str(langd)+' sek - '+str(len(crawled_urls))+'\n')
logg.close()
print(str(root_url)+'   '+str(len(crawled_urls)))

#################################################################
#           AFTERBIRTH - The miracle of birth                   #
#################################################################
lines_seen = set() # holds lines already seen
outfile = open('externa-sorterade.txt', "w")
for line in open('externaurl.txt', "r"):
    if line not in lines_seen: # not a duplicate
        lines_seen.add(line)
outfile.writelines(sorted(lines_seen))
outfile.close()


outfile = open(str(time.localtime()[0])+'_'+str(time.localtime()[2])+'_'+str(time.localtime()[4])+'_externbackupp.txt', 'w')
for line in open('externa-sorterade.txt', 'r'):
    outfile.write(line)
outfile.close()

# Make a new external-URL file with the updated (deduplicated) lines
outfile = open('externaurl.txt', "w")
for line in open('externa-sorterade.txt', "r"):
    outfile.write(line)
outfile.close()

##########################################

lines_seen = set() # holds lines already seen
outfile = open('crawledsites-sorterad.txt', "w")
for line in open('crawledsites.txt', "r"):
    if line not in lines_seen: # not a duplicate
        lines_seen.add(line)
outfile.writelines(sorted(lines_seen))
outfile.close()


outfile = open(str(time.localtime()[0])+'_'+str(time.localtime()[2])+'_'+str(time.localtime()[4])+'_kravlarbackupp.txt', 'w')
for line in open('crawledsites-sorterad.txt', 'r'):
    outfile.write(line)
outfile.close()

# Rewrite crawledsites.txt with the deduplicated lines for the next run
outfile = open('crawledsites.txt', "w")
for line in open('crawledsites-sorterad.txt', "r"):
    outfile.write(line)
outfile.close()
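As mentioned in the docstring, content extraction can live in a separate program that reads the list this crawler generates. Below is a minimal sketch of such a follow-up script; it assumes crawledsites.txt sits in the same directory and pulls only each page's title as an example of "relevant content". It fetches one URL at a time to stay gentle on the server.

#!/usr/bin/env python3
# Sketch of a follow-up extractor: reads crawledsites.txt (written by the crawler above)
# and prints each page title. The file name and the title-only extraction are assumptions.
import asyncio
import re

import aiohttp
import async_timeout

async def fetch_title(session, url):
    try:
        async with async_timeout.timeout(5):
            async with session.get(url) as response:
                if response.status == 200:
                    html = await response.text()
                    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
                    if match:
                        return match.group(1).strip()
    except Exception:
        pass
    return ''

async def main():
    with open('crawledsites.txt') as f:
        urls = [line.strip() for line in f if line.strip()]
    async with aiohttp.ClientSession() as session:
        for url in urls:
            title = await fetch_title(session, url)
            print(url, '-', title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()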

Comments:

What is your question?

Answer 1:

Using an answer to create a solved status.

Comments:
