Scraping a website that requires login with Beautiful Soup

Posted 2023-02-23


【Title】: Scraping website with Beautiful Soup that requires login 【Posted】: 2021-04-17 17:14:55 【Question】:

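I am trying to scrape a website that requires login, using Python and Beautiful Soup. I want to scrape this page (when you click it, it redirects you to the login page): https://www.eurekalert.org/reporter/embargoed.php

This is the login page: https://www.eurekalert.org/login.php

On the first link I gave, there are many news articles with links like this: https://www.eurekalert.org/emb_releases/2021-01/embl-ebn011121.php

So each "href" has a value like "/emb_releases/2021-01/embl-ebn011121.php".

The problem is that I cannot get the HTML of the page (the first link) from which I could extract the hrefs. The hrefs I want match the CSS selector "article.post a". Here is my code: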

from bs4 import BeautifulSoup
import requests

url = 'https://www.eurekalert.org/'
login = 'login'

headers = {'origin': url,
           'referer': url+login}

s = requests.session()

login_payload = {'login': 'xxx',
                 'password': 'xxx'}

# Each YT tutorial says that it should be .post here, but on my website the request is get, not post. I have tried both ways, its the same result
login_req = s.post(url+login, headers=headers, data=login_payload)
print(login_req) # returns 200, if i try .get it also returns 200


login_response = s.get(url+'reporter/embargoed.php')
print(login_response) # returns 200
soup = BeautifulSoup(login_response.content, 'html.parser')
print(soup) # prints HTML but not the HTML that I want

I also tried this, but I got the same result:

login_response = requests.get(url+'reporter/embargoed.php', auth = ('username', 'password'))
soup = BeautifulSoup(login_response.content, 'html.parser')
print(soup) # prints HTML but not the HTML that I want

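This is my first time trying to scrape a website that requires login, so there may be something silly in my code. What am I doing wrong? I have googled a lot and tried many different things, but I always fail.

Thank you for helping me.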


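【Answer 1】:

Go to the login page, enter your username and password, press F12, and start recording in the Network tab.

Then click the login button, right-click the login request in the Network tab and copy it as cURL, and paste it into a curl-to-Python converter (search for "curl to python") to get the equivalent requests code.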
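As a rough illustration of what such a converter does with the copied request, the sketch below turns a raw `Cookie` header value and a URL-encoded form body into the dictionaries that `requests` accepts. The helper names `cookie_header_to_dict` and `form_body_to_dict` are hypothetical, and the sample strings are shortened stand-ins for what the browser would actually show; a real converter handles many more cases.

```python
from urllib.parse import parse_qsl


def cookie_header_to_dict(header_value: str) -> dict:
    """Split a raw 'Cookie:' header value into a name -> value dict."""
    cookies = {}
    for pair in header_value.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that are not name=value pairs
            cookies[name] = value
    return cookies


def form_body_to_dict(body: str) -> dict:
    """Decode an application/x-www-form-urlencoded body into a dict."""
    return dict(parse_qsl(body, keep_blank_values=True))


# Shortened, illustrative inputs (assumed, not a full captured request).
cookies = cookie_header_to_dict("__utmc=28029352; sat_ppv=84")
data = form_body_to_dict("username=Username&password=Password")

print(cookies)
print(data)
```

The two dictionaries can then be passed to `requests` as the `cookies=` and `data=` arguments.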

The converted code will look something like this:

import requests

# Cookies copied from the browser's login request.
cookies = {
    '__utmt_8254f77d54ec9886070127029a0b81da': '1',
    '_fbp': 'fb.1.1610535613017.434450469',
    '__utmt': '1',
    '_ga': 'GA1.2.1008639424.1610535613',
    '_gid': 'GA1.2.56271763.1610535614',
    '__utma': '28029352.1008639424.1610535613.1610535864.1610535864.1',
    '__utmc': '28029352',
    '__utmz': '28029352.1610535864.1.1.utmcsr=(direct)^|utmccn=(direct)^|utmcmd=(none)',
    '__utmb': '28029352.1.10.1610535864',
    'sat_ppv': '84',
}

# Headers copied from the browser's login request.
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://www.eurekalert.org',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.eurekalert.org/login.php',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Form fields submitted by the login form; replace with your own credentials.
data = {
    'frompage': '^',
    'username': 'Username',
    'password': 'Password',
}

# Keep one session so cookies set at login are sent with later requests.
session = requests.session()


def loginToPage():
    # Perform login
    response = session.post('https://www.eurekalert.org/login.php', headers=headers, cookies=cookies, data=data)

    if response.ok:
        print('logged in successfully')
        return True
    else:
        print('failed to log in')
        return False
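Building on the login request above, here is a minimal sketch of how one session might be used both to log in and to fetch the embargoed page, and then to pull out the links the question asks for with the "article.post a" selector. The function names `extract_release_links` and `fetch_embargoed_links` are hypothetical, the form field names are taken from the captured request in this answer (the `frompage` field is left out), and the redirect check is an assumption about how the site behaves when the login fails.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.eurekalert.org/"
LOGIN_URL = urljoin(BASE_URL, "login.php")
EMBARGOED_URL = urljoin(BASE_URL, "reporter/embargoed.php")


def extract_release_links(html: str, base_url: str = BASE_URL) -> list:
    """Return absolute URLs for every link matched by 'article.post a'."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select("article.post a[href]")]


def fetch_embargoed_links(username: str, password: str) -> list:
    """Log in and scrape the embargoed page using one shared session."""
    with requests.Session() as session:
        # Assumed field names, copied from the answer's captured form data.
        payload = {"username": username, "password": password}
        session.post(LOGIN_URL, data=payload, timeout=30).raise_for_status()

        page = session.get(EMBARGOED_URL, timeout=30)
        page.raise_for_status()
        # A 200 status alone does not prove the login worked. If the final
        # URL is the login page again, the request was redirected there.
        if page.url.startswith(LOGIN_URL):
            raise RuntimeError("redirected to the login page; login likely failed")
        return extract_release_links(page.text)
```

Calling `fetch_embargoed_links("xxx", "xxx")` with real credentials would return the list of article URLs. The important design point is that the same `Session` object makes both requests, so any cookies the server sets at login are sent along with the second request.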

【Comments】:

It works for this website, but not for press.nature.com/press-releases. Can you help me with that site? ***.com/questions/65774164/…
