HTTP Error 406 on Python Web scraper copy from Python All in One for Dummies
Posted: 2021-12-02 15:41:46
【Question description】: Good afternoon,
I'm working through Python All In One for Dummies and have reached the chapter on web scraping. I'm trying to interact with the website they built specifically for that chapter, but I keep getting "HTTP Error 406" on every request. The book's initial "open a page and get a response" example had the same problem until I pointed it at Google, so I've concluded the issue is with that particular web page. Here is my code:
# get request module from URL lib
from urllib import request
# Get Beautiful Soup to help with the scraped data
from bs4 import BeautifulSoup
# sample page for practice
page_url = 'https://alansimpson.me/python/scrape_sample.html'
# open that page:
rawpage = request.urlopen(page_url)
#make a BS object from the html page
soup = BeautifulSoup(rawpage, 'html5lib')
# isolate the content block
content = soup.article
# create an empty list for dictionary items
links_list = []
#loop through all the links in the article
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass
print(links_list)
Here is the output from the console:
(base) youngdad33@penguin:~/Python/AIO Python$ /usr/bin/python3 "/home/youngdad33/Python/AIO Python/webscrapper.py"
Traceback (most recent call last):
  File "/home/youngdad33/Python/AIO Python/webscrapper.py", line 10, in <module>
    rawpage = request.urlopen(page_url)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable
I think the most important line is the "HTTP Error 406: Not Acceptable" at the bottom, and after some digging I understand it means my request headers weren't accepted.
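A quick way to see what identification urllib sends by default (a small diagnostic sketch, not from the book):
from urllib import request
# urllib identifies itself as "Python-urllib/3.x" unless you override it;
# some servers reject that default User-Agent with 406 or 403.
opener = request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.7')]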
So how do I get this working? I'm on a Chromebook running Linux (Debian), using VS Code with Anaconda 3.
Thanks!
【Comments】:
【Answer 1】: You need to inject a user agent, as follows:
# use the requests library instead of urllib
import requests
# Get Beautiful Soup to help with the scraped data
from bs4 import BeautifulSoup
# sample page for practice
page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
# open that page:
rawpage = requests.get(page_url,headers=headers)
#make a BS object from the html page
soup = BeautifulSoup(rawpage.content, 'html5lib')
# isolate the content block
content = soup.article
# create an empty list for dictionary items
links_list = []
#loop through all the links in the article
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass
print(links_list)
Output:
[{'url': 'http://www.sixthresearcher.com/python-3-reference-cheat-sheet-for-beginners/', 'img': '../datascience/python/basics/basics256.jpg', 'text': 'Basics'}, {'url': 'https://alansimpson.me/datascience/python/beginner/', 'img': '../datascience/python/beginner/beginner256.jpg', 'text': 'Beginner'}, {'url': 'https://alansimpson.me/datascience/python/justbasics/', 'img': '../datascience/python/justbasics/justbasics256.jpg', 'text': 'Just the Basics'}, {'url': 'https://alansimpson.me/datascience/python/cheatography/', 'img': '../datascience/python/cheatography/cheatography256.jpg', 'text': 'Cheatography'}, {'url': 'https://alansimpson.me/datascience/python/dataquest/', 'img': '../datascience/python/dataquest/dataquest256.jpg', 'text': 'Dataquest'}, {'url': 'https://alansimpson.me/datascience/python/essentials/', 'img': '../datascience/python/essentials/essentials256.jpg', 'text': 'Essentials'}, {'url': 'https://alansimpson.me/datascience/python/memento/', 'img': '../datascience/python/memento/memento256.jpg', 'text': 'Memento'}, {'url': 'https://alansimpson.me/datascience/python/syntax/', 'img': '../datascience/python/syntax/syntax256.jpg', 'text': 'Syntax'}, {'url': 'https://alansimpson.me/datascience/python/classes/', 'img': '../datascience/python/classes/classes256.jpg', 'text': 'Classes'}, {'url': 'https://alansimpson.me/datascience/python/dictionaries/', 'img': '../datascience/python/dictionaries/dictionaries256.jpg', 'text': 'Dictionaries'}, {'url': 'https://alansimpson.me/datascience/python/functions/', 'img': '../datascience/python/functions/functions256.jpg', 'text': 'Functions'}, {'url': 'https://alansimpson.me/datascience/python/ifwhile/', 'img': '../datascience/python/ifwhile/ifwhile256.jpg', 'text': 'If & While Loops'}, {'url': 'https://alansimpson.me/datascience/python/lists/', 'img': '../datascience/python/lists/lists256.jpg', 'text': 'Lists'}]
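For completeness, the same fix also works while keeping the book's urllib approach: wrap the URL in a Request object that carries a browser-style User-Agent header. This is a minimal sketch, not the book's code, and the header string is only an example value:
from urllib import request
from bs4 import BeautifulSoup

page_url = 'https://alansimpson.me/python/scrape_sample.html'
# any browser-style User-Agent should do; this string is only an example
req = request.Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
rawpage = request.urlopen(req)
soup = BeautifulSoup(rawpage, 'html5lib')
# the rest of the book's scraping code works unchanged from here
content = soup.article
print(len(content.find_all('a')), 'links found')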
【Discussion】:
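One small, optional addition (not part of the answer above): check the response status before parsing, so a future 403/406 fails loudly instead of the script quietly parsing an error page. A sketch under the same assumptions as the answer, with an example User-Agent value:
import requests
from bs4 import BeautifulSoup

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent': 'Mozilla/5.0'}  # example value; see the answer above
rawpage = requests.get(page_url, headers=headers, timeout=10)
rawpage.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
soup = BeautifulSoup(rawpage.content, 'html5lib')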