Scraping all the URL links on a website
Author QQ: 231469242
Keywords: crawler, URL scraping, Python
Test 1
url=http://db.yaozh.com/
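import re
import bs4
import requests

url = "http://db.yaozh.com/"

def getLinks(url):
    """Return every absolute http:// link found on the page at url."""
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, "lxml")
    links = []
    # Only hrefs starting with "http://" match; relative and https links are skipped.
    for link in soup.find_all("a", attrs={"href": re.compile("^http://")}):
        links.append(link.get("href"))
    return links

# Fetch the page once and reuse the result instead of requesting it twice.
list_links = getLinks(url)
print(list_links)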
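The function above returns the links of a single page only. To get closer to "all links of a site", one option is to follow same-host links breadth-first. The following is a minimal sketch under stated assumptions, not the author's method: `crawl_site`, `max_pages`, and the injected `get_links` callable are hypothetical names, and in practice `get_links` could be the `getLinks` function above.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_site(start_url, get_links, max_pages=50):
    """Breadth-first crawl that only follows links on the host of start_url.

    get_links is any callable that takes a URL and returns a list of
    absolute URLs found on that page (hypothetical interface).
    """
    host = urlparse(start_url).netloc
    queued = {start_url}        # pages already scheduled for a visit
    queue = deque([start_url])
    found = {}                  # dict used as an insertion-ordered set
    pages = 0

    while queue and pages < max_pages:
        page = queue.popleft()
        pages += 1
        for link in get_links(page):
            found[link] = None  # record every link, internal or external
            # Follow the link only if it stays on the starting host.
            if urlparse(link).netloc == host and link not in queued:
                queued.add(link)
                queue.append(link)
    return list(found)
```

The `max_pages` cap keeps the sketch from running unbounded on a large site; a real crawl would also want a delay between requests and error handling around each fetch.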
Test 2
url=http://www.yaozh.com
121 URLs in total
['http://db.yaozh.com', 'http://www.yaozh.com/zhihui', 'http://s.yaozh.com', 'http://news.yaozh.com', 'http://star.yaozh.com', 'http://club.yaozh.com/', 'http://edu.yaozh.com/', 'http://bbs.yaozh.com', 'http://www.35yao.com/', 'http://top.yaozh.com/', 'http://www.yaozh.com/', 'http://news.yaozh.com', 'http://db.yaozh.com', 'http://www.yaozh.com/zhihui', 'http://s.yaozh.com', 'http://bbs.yaozh.com', 'http://club.yaozh.com/', 'http://edu.yaozh.com/', 'http://star.yaozh.com/', 'http://top.yaozh.com/', 'http://db.yaozh.com/Zhuanti/zhuanti_20160721?source=www&name=index_slide', 'http://bbs.yaozh.com/thread-260886-1-1.html?source=www&name=index_slide', 'http://edu.yaozh.com/Course/84.html?source=www&name=index_slide', 'http://bbs.yaozh.com/thread-262781-1-1.html?source=www&name=index_slide', 'http://db.yaozh.com/pqf?source=www&position=index_newdb', 'http://db.yaozh.com/orangebook?source=www&position=index_newdb', 'http://db.yaozh.com/qxzhuce?source=www&position=index_newdb', 'http://db.yaozh.com/yfpw?source=www&name=index_newdb', 'http://db.yaozh.com/orangebook?source=www&name=long_banner', 'http://news.yaozh.com/Detail/index/id/16689.html', 'http://s.yaozh.com/Article/index?id=48?source=www&name=long_banner', 'http://bbs.yaozh.com/thread-262329-1-1.html?source=www&name=long_banner', 'http://bbs.yaozh.com/thread-216363-1-1.html', 'http://s.yaozh.com/3832', 'http://s.yaozh.com/3122', 'http://s.yaozh.com/1389', 'http://s.yaozh.com/1952', 'http://s.yaozh.com/4043', 'http://s.yaozh.com/4058', 'http://s.yaozh.com/3858', 'http://s.yaozh.com/1392', 'http://db.yaozh.com?source=www&name=long_banner', 'http://db.yaozh.com/app?source=www&name=db_left', 'http://news.yaozh.com/archive/12393.html', 'http://news.yaozh.com/archive/12275.html', 'http://news.yaozh.com/archive/12218.html', 'http://news.yaozh.com/archive/12138.html', 'http://news.yaozh.com/archive/11449.html', 'http://news.yaozh.com/archive/11340.html', 'http://news.yaozh.com/archive/11026.html', 
'http://news.yaozh.com/archive/10433.html', 'http://news.yaozh.com/archive/10219.html', 'http://news.yaozh.com/archive/10206.html', 'http://news.yaozh.com/archive/9732.html', 'http://news.yaozh.com/archive/9648.html', 'http://news.yaozh.com/archive/9373.html', 'http://news.yaozh.com/archive/9279.html', 'http://news.yaozh.com/archive/8806.html', 'http://news.yaozh.com/archive/8479.html', 'http://news.yaozh.com/archive/8480.html', 'http://news.yaozh.com/archive/7897.html', 'http://news.yaozh.com/archive/7898.html', 'http://news.yaozh.com/archive/7899.html', 'http://db.yaozh.com/zhuangrang?from=www&position=index_hotdb', 'http://db.yaozh.com/zhuce?from=www&position=index_hotdb', 'http://db.yaozh.com/yaopinzhongbiao?from=www&position=index_hotdb', 'http://db.yaozh.com/index.php?action=foreign?from=www&position=index_hotdb', 'http://db.yaozh.com/screening?from=www&position=index_hotdb', 'http://db.yaozh.com/jiegou?from=www&position=index_hotdb', 'http://db.yaozh.com/vip/custom.html?source=www&name=db_right', 'http://www.xinyaohui.com', 'http://www.cqyjs.com/index.html', 'http://www.reed-sinopharm.com/', 'http://www.bucm.edu.cn', 'http://pharmacy.scu.edu.cn/', 'http://www.cqmu.edu.cn/', 'http://www.cqacmm.com/', 'http://www.cpri.com.cn/', 'http://www.cpu.edu.cn/', 'http://pharmacy.swu.edu.cn', 'http://www.mypharma.com', 'http://www.psyzg.com/', 'http://www.viptijian.com/0592', 'http://www.7lk.com/', 'http://www.j1.com/pd-baojianpin.html', 'http://www.3156.cn/', 'http://www.sdyyzs.cn', 'http://www.jiankangle.com', 'http://www.rmttjkw.com', 'http://www.sunyet.com', 'http://www.chinayikao.com/', 'http://www.hxyjw.com/ ', 'http://www.100xhs.com/ask/ ', 'http://www.360zhengrong.com/', 'http://www.ftxk.cn/', 'http://www.jianke.com/zypd/', 'http://www.medlive.cn/', 'http://health.qqyy.com/', 'http://www.wendaifu.com/', 'http://health.cqnews.net/', 'http://www.yaofu.cn/', 'http://www.ipathology.cn/', 'http://nk.99.com.cn/', 'http://www.baxsun.cn/', 'http://www.yiliu88.com ', 
'http://www.wiki8.com/', 'http://www.39kf.com/', 'http://www.10000yao.com', 'http://www.3618med.com', 'http://www.chem960.com', 'http://www.yaojobs.com', 'http://www.hc3i.cn/', 'http://about.yaozh.com/about.html', 'http://about.yaozh.com/contact.html', 'http://about.yaozh.com/Qualification.html', 'http://about.yaozh.com/join.html', 'http://about.yaozh.com/link.html', 'http://help.yaozh.com/', 'http://bbs.yaozh.com/forum-123-1.html', 'http://bbs.yaozh.com', 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=50010802001068']
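The list above contains repeats (for example, http://news.yaozh.com and http://db.yaozh.com each appear more than once), so the total of 121 includes duplicates. A minimal sketch for dropping them while keeping first-seen order follows; `unique_links` is a hypothetical helper, and stripping surrounding whitespace is an assumption based on entries that end in a space.

```python
def unique_links(links):
    """Return links without duplicates, keeping first-seen order."""
    # Some hrefs in the output carry a trailing space, so trim before comparing.
    cleaned = (link.strip() for link in links)
    # dict preserves insertion order (Python 3.7+), so it works as an ordered set.
    return list(dict.fromkeys(cleaned))
```

Calling `unique_links(list_links)` on the scraped result would give the distinct URLs. The sketch does not merge variants that differ only by a trailing slash, since those can be different resources.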
Test 3
url=http://www.dxy.cn/
The DXY (丁香园) site yields 501 links
Test 4
url=http://www.hsmap.com
Only one link
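A result of one link may mean that most anchors on that page use relative or https:// hrefs, which the `^http://` pattern skips; this is a guess, not something the test confirms. The sketch below resolves every href against the page URL with `urllib.parse.urljoin`. It swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages; `LinkCollector` and `extract_links` are hypothetical names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Turn relative paths such as "/about" into absolute URLs.
        absolute = urljoin(self.base_url, href.strip())
        # Keep web links only; this drops mailto:, javascript:, and similar.
        if urlparse(absolute).scheme in ("http", "https"):
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return absolute http/https links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content: one relative link, one https link, one mailto link.
sample_html = (
    '<a href="/about">About</a> '
    '<a href="https://example.com/x">X</a> '
    '<a href="mailto:a@example.com">Mail</a>'
)
print(extract_links(sample_html, "http://www.example.com/index.html"))
```

With a live page, the HTML string would come from `requests.get(url).text`, and the page's own URL would be passed as `base_url`.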