Getting href URLs with Python BS4
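I was recently working through a chapter on scraping and could not make sense of the expression s_urls[0]['href']. At first I thought Python must have arrays with non-numeric indices. After a fair amount of searching I learned that this is BeautifulSoup's attribute lookup on a tag: s_urls[0] picks a tag out of the result list, and ['href'] reads that tag's href attribute.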
https://stackoverflow.com/questions/5815747/beautifulsoup-getting-href?noredirect=1&lq=1
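To see the two kinds of indexing separately, here is a minimal sketch that runs offline. The HTML snippet and the variable names are made up for illustration, and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a search-results page (not real Yelp markup).
html = """
<a class="biz-name" href="/biz/first-place">First Place</a>
<a class="biz-name" href="/biz/second-place">Second Place</a>
<a class="biz-name">No link here</a>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.select(".biz-name")      # list-like result holding Tag objects

first = links[0]                      # numeric index: picks one Tag from the list
first_href = first["href"]            # string key: reads an attribute of that Tag
missing_href = links[2].get("href")   # .get() gives None where ["href"] would raise KeyError

# Collect every href, skipping tags that have no such attribute.
hrefs = [a["href"] for a in links if a.has_attr("href")]

print(type(first).__name__, first_href, missing_href, hrefs)
```

So the string key is not an array index at all: the list is indexed by number, and the Tag it returns behaves like a dictionary of its HTML attributes.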
from bs4 import BeautifulSoup
# Thread runs a function in its own thread of execution
from threading import Thread
import urllib.request

# Location of restaurants
home_url = "https://www.yelp.com"
find_what = "Restaurants"
location = "London"

# Get all restaurants that match the search criteria
# https://www.yelp.com/search?find_desc=Restaurants&find_loc=London
search_url = "https://www.yelp.com/search?find_desc=" + find_what + "&find_loc=" + location
s_html = urllib.request.urlopen(search_url).read()
print("here")
soups_s = BeautifulSoup(s_html, "lxml")

# Get URLs of top 10 Restaurants in London.
# The slice belongs on the result list, not on the selector string.
s_urls = soups_s.select('.biz-name')[:10]
print(len(s_urls))
print(s_urls)
print(type(s_urls))               # list of tags
if s_urls:
    print(type(s_urls[0]))        # a single bs4 Tag
    print(s_urls[0])
    print(s_urls[0]['href'])      # attribute lookup on the Tag

url = []
for tag in s_urls:
    url.append(home_url + tag['href'])
# https://www.yelp.com/biz/duck-and-waffle-london-3?osq=Restaurants
print(url)

# Function that will do the actual scraping job
def scrape(ur):
    html = urllib.request.urlopen(ur).read()
    soup = BeautifulSoup(html, "lxml")
    title = soup.select('.biz-page-title')
    saddress = soup.select('.street-address')
    phone = soup.select('.biz-phone')
    if title:
        print("Title:", title[0].getText().strip())
    if saddress:
        print("Street Address:", saddress[0].getText().strip())
    if phone:
        print("Phone number:", phone[0].getText().strip())
    print("---------------------")

# Start one thread per URL to perform the scraping
threadlist = []
for u in url:
    t = Thread(target=scrape, args=(u,))
    t.start()
    threadlist.append(t)

# Wait for all threads to finish
for t in threadlist:
    t.join()