Web scraping results in 403 Forbidden Error
I am trying to use BeautifulSoup to look up each company's earnings on SeekingAlpha. However, the site seems to detect that a web scraper is being used, and I get "HTTP Error 403: Forbidden".
The page I am trying to scrape is: https://seekingalpha.com/symbol/AMAT/earnings
Does anyone know what can be done to get around this?
Answer
I was able to access the site content by using a proxy, found here:
Then, creating a payload with the requests module, you can scrape the site:
import requests
import re
from bs4 import BeautifulSoup as soup

# Fetch the page through a proxy (address from a free proxy list; it may no longer be live)
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', proxies={'http': '50.207.31.221:80'}).text

# Pull the revenue figures out of the raw HTML (the '$' must be escaped in the regex)
results = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', r)

# Parse the HTML and collect the quarter titles, EPS lines, and beat/miss markers
s = soup(r, 'lxml')
titles = list(map(lambda x: x.text, s.find_all('span', {'class': 'title-period'})))
epas = list(map(lambda x: x.text, s.find_all('span', {'class': 'eps'})))
deciding = list(map(lambda x: x.text, s.find_all('span', {'class': re.compile('green|red')})))

# Combine everything into one row per quarter
results = list(map(list, zip(titles, epas, results, epas)))
Output:
[[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'], [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'], [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'], [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'], [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'], [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'], [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'], [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'], [u'Q4: 11-12-15', u'EPS of $0.29 in-line ', u'Revenue of $2.37B', u'EPS of $0.29 in-line '], [u'Q3: 08-13-15', u'EPS of $0.33 in-line ', u'Revenue of $2.49B', u'EPS of $0.33 in-line '], [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'], [u'Q1: 02-11-15', u'EPS of $0.27 in-line ', u'Revenue of $2.36B', u'EPS of $0.27 in-line '], [u'Q4: 11-13-14', u'EPS of $0.27 in-line ', u'Revenue of $2.26B', u'EPS of $0.27 in-line '], [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'], [u'Q2: 05-15-14', u'EPS of $0.28 in-line ', u'Revenue of $2.35B', u'EPS of $0.28 in-line '], [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
Another answer
You should try setting User-Agent as one of the request headers. Its value can be the string of any well-known browser.
For example:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
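As a minimal sketch (assuming the standard requests API and the URL from the question), that header can be passed like this:
import requests

# Sketch: send a browser-like User-Agent so the server is less likely to respond with 403
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/63.0.3239.132 Safari/537.36')
}
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', headers=headers)
print(r.status_code)  # expect 200 instead of 403 if the header is accepted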
Another answer
For those of you using PyQuery:
from pyquery import PyQuery as pq
import requests  # PyQuery uses requests under the hood to fetch the URL when it is installed

# Load the article through a proxy (from the free proxy list below); PyQuery fetches and parses it
page = pq('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand', proxies={'http': '34.231.147.235:8080'})
print(page)
- (proxy information from https://free-proxy-list.net/)
- Make sure you use the Requests library, not urllib. Do not try to load the page with 'urlopen'.
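Putting the suggestions together, a rough sketch (the proxy address is the one quoted above and may well be dead by now) that fetches the page with Requests and then hands the HTML to PyQuery:
import requests
from pyquery import PyQuery as pq

# Sketch only: fetch with Requests (proxy plus a browser User-Agent),
# then parse the returned HTML string with PyQuery
proxies = {'http': 'http://34.231.147.235:8080',
           'https': 'http://34.231.147.235:8080'}  # example proxy, may no longer work
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
html = requests.get('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand',
                    proxies=proxies, headers=headers).text
page = pq(html)              # PyQuery can be built directly from an HTML string
print(page('title').text())  # print the page title as a quick check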