How can I access a PDF file with Python through an automatic download link?

Posted 2023-02-23

Asked: 2021-07-15 01:02:38

Question:

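I'm trying to write an automated Python script that goes to a web page like this one, finds the link at the bottom of the body text (anchor text "here"), and downloads the PDF that loads after that download link is clicked. I can retrieve the HTML of the original page and find the download link, but I can't work out how to get the link to the PDF from there. Any help would be much appreciated. Here is what I have so far: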

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open the press release and collect the href of every "here" link
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a', href=True, string=['HERE', 'here', 'Here']):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]

# Open the download link and look for a link to the PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]

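At this point, the list of links I get does not include the PDF I'm looking for. Is there a way to get it without hard-coding the link to the PDF in the script (which would defeat the purpose of what I'm trying to do)? Thanks!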

Comments:

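You can call requests.get(url, allow_redirects=True) and use r.history to inspect the redirect history. From there, retrieving your PDF file should be straightforward.

It looks like adding the download=1 parameter to the "here" link makes it download the PDF for you: https://www.murphy.senate.gov/download/green-bank-act-2021?download=1. You may have to allow redirects for this to work.

The link your script retrieves is the one the page points to via href, so the problem is the missing redirect (see the other comments).

Answer 1: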

Find the a element whose text is "here", then follow it.

import requests
from bs4 import BeautifulSoup

url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'

# Send a browser-like User-Agent with every request
user_agent = {'User-agent': 'Mozilla/5.0'}

s = requests.Session()

r = s.get(url, headers=user_agent)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.select('a'):
    if a.text == 'here':
        # First hop: the download page the "here" link points to
        href = a['href']
        r = s.get(href, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        # The download page names the real file in its Refresh header
        _, dl_url = r.headers['refresh'].split('url=', 1)
        # Second hop: the PDF itself
        r = s.get(dl_url, headers=user_agent)
        print(r.status_code, r.reason)
        print(r.headers)
        file_bytes = r.content  # here's your PDF; you can write it out to a file
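
The steps discussed in this thread can be factored into a few small helpers. The following is only a sketch using the standard library; the function names (with_query_param, refresh_target, save_pdf) are hypothetical and not part of requests or BeautifulSoup. It assumes the download page sends a Refresh header of the form "0; url=...", as the code above implies, and that a genuine PDF body starts with the bytes %PDF.

```python
import re
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return url with the query parameter key set to value (replacing any old one)."""
    parts = urlsplit(url)
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != key]
    pairs.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def refresh_target(header_value, base_url):
    """Extract the absolute URL from a Refresh header such as '0; url=/file.pdf'.

    Returns None when the header carries only a delay and no URL.
    """
    match = re.search(r"url\s*=\s*(.+)", header_value, flags=re.IGNORECASE)
    if match is None:
        return None
    target = match.group(1).strip().strip("'\"")
    # The target may be relative, so resolve it against the page that sent it.
    return urljoin(base_url, target)


def save_pdf(data, path):
    """Write data to path, refusing bodies that do not start with the PDF signature."""
    if not data.startswith(b"%PDF"):
        raise ValueError("response body does not look like a PDF")
    with open(path, "wb") as fh:
        fh.write(data)
```

In the loop above, one could write dl_url = refresh_target(r.headers['refresh'], r.url) in place of the split call, and finish with save_pdf(r.content, 'bill.pdf'), where 'bill.pdf' is just a placeholder file name. Alternatively, with_query_param(href, 'download', '1') builds the download=1 form of the link mentioned in the comments; whether the server then returns the PDF directly is something to verify against the live site.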
