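Going deeper into a website while web scraping
【Posted】: 2019-06-29 10:40:21
【Question】:
I am trying to scrape text from a bunch of websites so that I can cross-check it against a corpus and show how many hits particular words get on those sites. Can anyone help me make my web scraper go deeper into a website automatically?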
import requests
from bs4 import BeautifulSoup

url = 'https://www.theleela.com/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/'
page = requests.get(url)  # fetch the page from the website
html = page.content
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
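I collect all the links on the page like this:
links = []
for link in soup.find_all('a'):
    a = link.get('href')
    # keep only relative links (skip missing hrefs and absolute https: URLs)
    if isinstance(a, str) and "https:" not in a:
        links.append(a)
links
This is what I get: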
['/en_us/offers/index',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/overview',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/rooms-and-suites',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/offers',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/meetings',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/celebrations',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/dining',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/Spa',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/overview',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/rooms-and-suites',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/offers',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/meetings',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/celebrations',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/dining',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/spa',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/overview',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/rooms-and-suites',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/offers',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/meetings',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/celebrations',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/dining',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/spa',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/overview',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/rooms-and-suites',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/offers',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/meetings',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/celebrations',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/dining',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/spa',
'/en_us/hotels-in-goa/the-leela-goa-hotel',
'/en_us/hotels-in-goa/the-leela-goa-hotel/overview',
'/en_us/hotels-in-goa/the-leela-goa-hotel/rooms-and-suites',
'/en_us/hotels-in-goa/the-leela-goa-hotel/offers',
'/en_us/hotels-in-goa/the-leela-goa-hotel/meetings',
'/en_us/hotels-in-goa/the-leela-goa-hotel/celebrations',
'/en_us/hotels-in-goa/the-leela-goa-hotel/dining',
'/en_us/hotels-in-goa/the-leela-goa-hotel/spa',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/overview',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/rooms-and-suites',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/offers',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/meetings',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/celebrations',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/dining',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/spa',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/overview',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/rooms-and-suites',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/offers',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/meetings',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/celebrations',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/dining',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/spa',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/overview',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/rooms-and-suites',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/offers',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/meetings',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/celebrations',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/dining',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/overview',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/rooms-and-suites',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/offers',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/meetings',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/celebrations',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/dining',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/spa',
'javascript:facebookLogin();',
'javascript:forgot_password(this);',
'/application/spring/myprofile/my-profile-edit',
'/en_us',
'/application/spring/myprofile/login',
'/the-leela/best-rates-guaranteed',
'#',
'javascript:facebookLogin();',
'/application/spring/myprofile/my-profile-edit',
'/en_us',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/signature-spa-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/holistic-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/fitness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/wellness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/salon',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/signature-spa-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/holistic-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/fitness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/wellness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/salon',
'/contentAsset/raw-data/d1e3f704-be84-4353-a95e-28629651db00/fileAsset',
'/the-leela/about-the-leela/history',
'/the-leela/about-the-leela/company-information',
'/the-leela/about-the-leela/alliances',
'/the-leela/about-the-leela/investor-relations',
'/the-leela/about-the-leela/future-openings',
'javascript:void(0);',
'/the-leela/media/media-coverage',
'/the-leela/media/press-releases',
'/the-leela/media/media-contacts',
'/the-leela/media/the-leela-magazine',
'/the-leela/media/awards',
'/the-leela/Loyalty/the-leela-discovery',
'/the-leela/Loyalty/leela-solitaire-line',
'/the-leela/Loyalty/connoisseur-club',
'/the-leela/Loyalty/the-leela-preferred-partners-membership-program',
'/the-leela/careers/opportunities',
'/the-leela/contact-us/hotels',
'/the-leela/contact-us/convention-centre',
'/the-leela/contact-us/reservations',
'/the-leela/contact-us/sales-marketing-offices',
'javascript:void(0);',
'/the-leela/others/art',
'/the-leela/others/boutique',
'/the-leela/termsConditions/legal',
'/the-leela/termsConditions/siteMap',
'/the-leela/termsConditions/privacy-policy',
'/the-leela/termsConditions/general-terms-and-conditions']
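As you can see, there are still some irrelevant links in here:
'javascript:void(0);',
'/application/spring/myprofile/login',
'/the-leela/best-rates-guaranteed',
'#',
'javascript:facebookLogin();',
'/application/spring/myprofile/my-profile-edit',
'/en_us',
I need help getting rid of these so that I can run the scraper in a loop over the output list. Any help is appreciated.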
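【Comments】:
So you are basically asking how to strip a list of the elements that don't start with '/'?
@BoboDarph Some of the elements in his unwanted list start with '/'.
You can't, because this particular example, '/the-leela/best-rates-guaranteed', looks the same as the links you do want. What you can do is remove 'javascript...' and '/en_us' by looking for a double '/' pattern.
There are also the links to profile and login above...
You have to actually define what counts as "irrelevant". In this case, the /application/spring... and /en_us links look just as valid as the rest. Nothing in programming is "automatic"; it is all things you tell the program to do. So you first have to draw a clear line between "relevant" and "irrelevant", and only then can you write a script that does it for you.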
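Once a definition of "irrelevant" has been chosen, the filtering itself is short. The following is only a sketch under assumptions: the function name `clean_links` and the contents of `EXCLUDED_PREFIXES` are illustrative and not from the question, and the excluded prefixes are just one possible definition. It resolves each href against the page URL, drops non-HTTP schemes such as 'javascript:', drops '#' self-links and duplicates, and drops paths under the excluded prefixes.

```python
from urllib.parse import urldefrag, urljoin, urlparse

BASE_URL = "https://www.theleela.com/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/"

# Assumption: one possible definition of "irrelevant" paths; adjust as needed.
EXCLUDED_PREFIXES = ("/application/", "/contentAsset/")


def clean_links(hrefs, base_url=BASE_URL, excluded_prefixes=EXCLUDED_PREFIXES):
    """Return absolute, de-duplicated, same-site URLs in first-seen order."""
    base_host = urlparse(base_url).netloc
    # Pre-seed with the page itself so that '#' self-links are dropped.
    seen = {urldefrag(base_url)[0]}
    result = []
    for href in hrefs:
        if not isinstance(href, str):
            continue  # <a> tags without an href yield None
        # Resolve relative links and strip any '#fragment' part.
        absolute = urldefrag(urljoin(base_url, href.strip()))[0]
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # drops javascript:, mailto:, tel: and similar
        if parts.netloc != base_host:
            continue  # stay on the same site
        if parts.path.startswith(excluded_prefixes):
            continue  # drops the paths defined as irrelevant above
        if absolute in seen:
            continue  # drops duplicates
        seen.add(absolute)
        result.append(absolute)
    return result
```

Calling `clean_links(links)` on the list from the question would return absolute URLs that can be passed straight to `requests.get`. Links such as '/en_us' or '/the-leela/best-rates-guaranteed' are kept by this sketch, because nothing in their form separates them from the wanted links; excluding them needs an explicit rule.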
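【Answer 1】:
I doubt there is an off-the-shelf solution that is not site-specific. From my experience with crawlers, a few things come to mind:
You can use the site's sitemap page, which is usually meant for well-behaved crawlers and contains links to all the important pages that the site owner wants you to crawl. robots.txt is also useful.
You can try downloading all the pages and checking them with the mimetypes lib and/or the Content-Type header.
You will probably need some heuristic keywords or rules, such as regular expressions, to keep your crawler from reaching or crawling certain URLs.
Finally (if this is a large project covering hundreds or thousands of websites and running for months), you could try machine learning to narrow the URLs down further.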
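The robots.txt, Content-Type, and regex suggestions above can be combined into a small breadth-first crawl loop. What follows is a minimal sketch under assumptions, not a finished crawler: `SKIP_PATTERNS`, `is_html`, `is_wanted`, and `crawl` are illustrative names, the patterns are examples only, and `fetch` is a hypothetical callable supplied by the caller so that the crawl logic stays independent of how pages are downloaded. Sitemap handling is not covered.

```python
import re
from collections import deque
from urllib import robotparser
from urllib.parse import urldefrag, urljoin, urlparse

# Assumption: example rules only; replace them with your own.
SKIP_PATTERNS = [
    re.compile(r"^/application/"),                         # login and profile pages
    re.compile(r"\.(pdf|jpe?g|png|gif)$", re.IGNORECASE),  # non-text assets
]


def is_html(content_type):
    """True if a Content-Type header value describes an HTML page."""
    return (content_type or "").split(";")[0].strip().lower() == "text/html"


def is_wanted(url, robots=None, user_agent="*"):
    """Check a URL against the regex rules and, if given, the robots.txt rules."""
    path = urlparse(url).path
    if any(pattern.search(path) for pattern in SKIP_PATTERNS):
        return False
    return robots is None or robots.can_fetch(user_agent, url)


def crawl(start_url, fetch, robots=None, max_pages=50):
    """Breadth-first crawl of a single site.

    `fetch(url)` is a hypothetical callable supplied by the caller; it must
    return a (content_type, text, hrefs) tuple for the page at `url`.
    Returns a dict that maps each crawled URL to its text.
    """
    site = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        content_type, text, hrefs = fetch(url)
        if not is_html(content_type):
            continue  # skip anything that is not an HTML page
        pages[url] = text
        for href in hrefs:
            # Resolve relative links and strip any '#fragment' part.
            link = urldefrag(urljoin(url, href))[0]
            parts = urlparse(link)
            if parts.scheme not in ("http", "https") or parts.netloc != site:
                continue  # drops javascript: links and other sites
            if link in seen:
                continue
            seen.add(link)
            if is_wanted(link, robots):
                queue.append(link)
    return pages
```

A `fetch` built from the code in the question would call `requests.get(url)`, read `response.headers.get("Content-Type")`, and return the `soup.get_text()` result together with the href values from `soup.find_all("a", href=True)`. For `robots`, a `robotparser.RobotFileParser` pointed at the site's /robots.txt and loaded with `read()` can be passed in. The `max_pages` cap is there so that a mistake in the rules cannot turn into an unbounded crawl.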