解析美丽汤后原始网页上的链接丢失

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了解析美丽汤后原始网页上的链接丢失相关的知识,希望对你有一定的参考价值。

如果我的解释看起来很简单,请原谅。我是蟒蛇和美味汤的新手。

我正在尝试从以下网站提取数据:

https://valor.militarytimes.com/award/5?page=1

我想提取与网站上24位奖牌获得者中的每一位相对应的链接。我可以从Firefox检查器看到他们的链接中都有“英雄”这个词。但是,当我使用漂亮的汤来解析网站时,这些链接不会出现。

我尝试使用标准的html解析器以及html5lib解析器,但它们都没有显示与这些奖牌收件人相对应的链接。

page = requests.get('https://valor.militarytimes.com/award/5?page=1')
soup = BeautifulSoup(page.text, "html5lib")
for idx, link in enumerate(soup.find_all('a', href = True)):
        print(link)

上述代码仅查找原始网站上的一些链接,特别是没有与奖牌获得者相对应的链接。即使运行soup.prettify()也表明这些链接不在解析文本中。

我希望有一个简单的代码可以提取本网站上24位奖牌获得者的链接。

答案

如果您想避免使用硒,可以通过一种简单的方法获取所需的数据。页面通过向URL后面的URL发送请求来加载数据,

https://valor.militarytimes.com/api/awards/5?page=1

这会发送一个json响应,然后使用javascript填充页面。您所要做的就是使用python-requests发送相同的请求,然后从json响应中获取数据。

import requests
r=requests.post('https://valor.militarytimes.com/api/awards/5?page=1')
for item in r.json()['data']:
    name=item['recipient']['name']
    url='https://valor.militarytimes.com/hero/'+str(item['recipient']['id'])
    print(name,url)

输出:

EUGENE MCCARLEY https://valor.militarytimes.com/hero/500963
TIMOTHY KEENAN https://valor.militarytimes.com/hero/500962
JOHN THOMPSON https://valor.militarytimes.com/hero/500961
WALTER BORDEN https://valor.militarytimes.com/hero/500941
WILLIAM ROSE https://valor.militarytimes.com/hero/94465
YUKITAKA MIZUTARI https://valor.militarytimes.com/hero/94175
ALBERT MARTIN https://valor.militarytimes.com/hero/92498
FRANCIS CODY https://valor.militarytimes.com/hero/500944
JAMES O'KEEFFE https://valor.militarytimes.com/hero/500943
PHILLIP FLEMING https://valor.militarytimes.com/hero/500942
JOHN WANAMAKER https://valor.militarytimes.com/hero/314466
ROBERT CHILSON https://valor.militarytimes.com/hero/102316
CHRISTOPHER NELMS https://valor.militarytimes.com/hero/89255
SAMUEL BARNETT https://valor.militarytimes.com/hero/71533
ANDREW BYERS https://valor.militarytimes.com/hero/500938
ANDREW RUSSELL https://valor.militarytimes.com/hero/500937
****** CALDWELL https://valor.militarytimes.com/hero/500935
****** WALWRATH https://valor.militarytimes.com/hero/500934
****** MADSEN https://valor.militarytimes.com/hero/500933
****** NELSON https://valor.militarytimes.com/hero/500932
WILLIAM SOUKUP https://valor.militarytimes.com/hero/500931
BENJAMIN WILSON https://valor.militarytimes.com/hero/500930
ANDREW MARCKESANO https://valor.militarytimes.com/hero/500929
WAYNE KUNZ https://valor.militarytimes.com/hero/500927

我也取了这个名字。如果只需要,您可以获取链接。

编辑

要从多个页面获取URL,请使用此代码

import requests
list_of_urls=[]
last_page=9 #replace this with your last page
for i in range(1,last_page+1):
    r=requests.post('https://valor.militarytimes.com/api/awards/5?page='.format(i))
    for item in r.json()['data']:
        url='https://valor.militarytimes.com/hero/'+str(item['recipient']['id'])
        list_of_urls.append(url)
print(list_of_urls)

输出:

['https://valor.militarytimes.com/hero/500963', 'https://valor.militarytimes.com/hero/500962', 'https://valor.militarytimes.com/hero/500961', 'https://valor.militarytimes.com/hero/500941', 'https://valor.militarytimes.com/hero/94465', 'https://valor.militarytimes.com/hero/94175', 'https://valor.militarytimes.com/hero/92498', 'https://valor.militarytimes.com/hero/500944', 'https://valor.militarytimes.com/hero/500943', 'https://valor.militarytimes.com/hero/500942', 'https://valor.militarytimes.com/hero/314466', 'https://valor.militarytimes.com/hero/102316', 'https://valor.militarytimes.com/hero/89255', 'https://valor.militarytimes.com/hero/71533', 'https://valor.militarytimes.com/hero/500938', 'https://valor.militarytimes.com/hero/500937', 'https://valor.militarytimes.com/hero/500935', 'https://valor.militarytimes.com/hero/500934', 'https://valor.militarytimes.com/hero/500933', 'https://valor.militarytimes.com/hero/500932', 'https://valor.militarytimes.com/hero/500931', 'https://valor.militarytimes.com/hero/500930', 'https://valor.militarytimes.com/hero/500929', 'https://valor.militarytimes.com/hero/500927', 'https://valor.militarytimes.com/hero/500926', 'https://valor.militarytimes.com/hero/500925', 'https://valor.militarytimes.com/hero/500924', 'https://valor.militarytimes.com/hero/500923', 'https://valor.militarytimes.com/hero/500922', 'https://valor.militarytimes.com/hero/500921', 'https://valor.militarytimes.com/hero/500920', 'https://valor.militarytimes.com/hero/500919', 'https://valor.militarytimes.com/hero/500918', 'https://valor.militarytimes.com/hero/500917', 'https://valor.militarytimes.com/hero/500916', 'https://valor.militarytimes.com/hero/500915', 'https://valor.militarytimes.com/hero/500914', 'https://valor.militarytimes.com/hero/500913', 'https://valor.militarytimes.com/hero/500912', 'https://valor.militarytimes.com/hero/500911', 'https://valor.militarytimes.com/hero/500910', 'https://valor.militarytimes.com/hero/500909', 'https://valor.militarytimes.com/hero/500908', 'https://valor.militarytimes.com/hero/500907', 'https://valor.militarytimes.com/hero/500906', 'https://valor.militarytimes.com/hero/500905', 'https://valor.militarytimes.com/hero/500904', 'https://valor.militarytimes.com/hero/500903', 'https://valor.militarytimes.com/hero/500902', 'https://valor.militarytimes.com/hero/500901', 'https://valor.militarytimes.com/hero/500900', 'https://valor.militarytimes.com/hero/500899', 'https://valor.militarytimes.com/hero/500898', 'https://valor.militarytimes.com/hero/500897', 'https://valor.militarytimes.com/hero/500896', 'https://valor.militarytimes.com/hero/500895', 'https://valor.militarytimes.com/hero/500894', 'https://valor.militarytimes.com/hero/500893', 'https://valor.militarytimes.com/hero/500892', 'https://valor.militarytimes.com/hero/500891', 'https://valor.militarytimes.com/hero/500890', 'https://valor.militarytimes.com/hero/500889', 'https://valor.militarytimes.com/hero/500888', 'https://valor.militarytimes.com/hero/29160', 'https://valor.militarytimes.com/hero/106931', 'https://valor.militarytimes.com/hero/106375', 'https://valor.militarytimes.com/hero/94936', 'https://valor.militarytimes.com/hero/94928', 'https://valor.militarytimes.com/hero/94927', 'https://valor.militarytimes.com/hero/94926', 'https://valor.militarytimes.com/hero/94923', 'https://valor.militarytimes.com/hero/94777', 'https://valor.militarytimes.com/hero/94769', 'https://valor.militarytimes.com/hero/94711', 'https://valor.militarytimes.com/hero/94644', 'https://valor.militarytimes.com/hero/94571', 'https://valor.militarytimes.com/hero/94570', 'https://valor.militarytimes.com/hero/94494', 'https://valor.militarytimes.com/hero/94468', 'https://valor.militarytimes.com/hero/94454', 'https://valor.militarytimes.com/hero/94388', 'https://valor.militarytimes.com/hero/94358', 'https://valor.militarytimes.com/hero/94279', 'https://valor.militarytimes.com/hero/94275', 'https://valor.militarytimes.com/hero/94253', 'https://valor.militarytimes.com/hero/94251', 'https://valor.militarytimes.com/hero/94223', 'https://valor.militarytimes.com/hero/94222', 'https://valor.militarytimes.com/hero/94217', 'https://valor.militarytimes.com/hero/94211', 'https://valor.militarytimes.com/hero/94210', 'https://valor.militarytimes.com/hero/94195', 'https://valor.militarytimes.com/hero/94194', 'https://valor.militarytimes.com/hero/94173', 'https://valor.militarytimes.com/hero/94168', 'https://valor.militarytimes.com/hero/94055', 'https://valor.militarytimes.com/hero/93916', 'https://valor.militarytimes.com/hero/93847', 'https://valor.militarytimes.com/hero/93780', 'https://valor.militarytimes.com/hero/93779', 'https://valor.militarytimes.com/hero/93775', 'https://valor.militarytimes.com/hero/93774', 'https://valor.militarytimes.com/hero/93733', 'https://valor.militarytimes.com/hero/93722', 'https://valor.militarytimes.com/hero/93706', 'https://valor.militarytimes.com/hero/93551', 'https://valor.militarytimes.com/hero/93435', 'https://valor.militarytimes.com/hero/93407', 'https://valor.militarytimes.com/hero/93374', 'https://valor.militarytimes.com/hero/93277', 'https://valor.militarytimes.com/hero/93243', 'https://valor.militarytimes.com/hero/93193', 'https://valor.militarytimes.com/hero/92989', 'https://valor.militarytimes.com/hero/92972', 'https://valor.militarytimes.com/hero/92958', 'https://valor.militarytimes.com/hero/93923', 'https://valor.militarytimes.com/hero/90130', 'https://valor.militarytimes.com/hero/90128', 'https://valor.militarytimes.com/hero/89704', 'https://valor.mil

以上是关于解析美丽汤后原始网页上的链接丢失的主要内容,如果未能解决你的问题,请参考以下文章

CSS3给网页穿上美丽的外衣

CSS3给网页穿上美丽的外衣

美丽的汤和桌子刮 - lxml 与 html 解析器

美丽的汤在源文件中找到标记的位置?

使用需要单击“我同意cookies”按钮的Python(美丽的汤)抓取网页?

用美丽的汤解析谷歌内部卡片