Prevent 503 Error when scraping Google Scholar
Posted: 2017-05-10 22:56:41
[Question]: I wrote the following code to scrape data from the Google Scholar security page. However, whenever I run it, I get this error:
Traceback (most recent call last):
File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 53, in <module>
getProfileFromTag(each)
File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 32, in getProfileFromTag
page = urllib.request.urlopen(url)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 504, in error
result = self._call_chain(*args)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 696, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
I assume this is because GS is blocking my requests. How can I prevent this from happening?
The code is:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import string
import csv
import time

# Declare lists to store the scraped data
name = []
urlList = []

# Open the CSV file and write its header row
outputFile = open('sample.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['Name', 'URL', 'Total Citations', 'h-index', 'i10-index'])

def getStat(url):
    # Given an author's URL, return a list of their stats.
    url = 'https://scholar.google.pl' + url
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    buttons = soup.findAll("td", {"class": "gsc_rsb_std"})
    # Collect the text of the stat cells (citations / h-index / i10-index,
    # each with an "All" and a recent-years column)
    list = [b.text for b in buttons]
    return list

def getProfileFromTag(tag):
    url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:" + tag
    while True:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, 'lxml')
        # Reuse the soup already parsed above rather than fetching the page a second time
        mydivs = soup.findAll("h3", {"class": "gsc_1usr_name"})
        for each in mydivs:
            for anchor in each.find_all('a'):
                name.append(anchor.text)
                urlList.append(anchor['href'])
                time.sleep(0.001)
        # "Następna" is the Polish "Next" button on the results page
        buttons = soup.findAll("button", {"aria-label": "Następna"})
        if not buttons:
            break
        on_click = buttons[0].get('onclick')
        url = 'http://scholar.google.pl' + on_click[17:-1]
        url = url.encode('utf-8').decode('unicode_escape')
    for i, each in enumerate(name):
        list = getStat(urlList[i])
        outputWriter.writerow([each, urlList[i], list[0], list[2], list[4]])

tags = ['security']
for each in tags:
    getProfileFromTag(each)
[Comments]:
- Please simplify your code sample (it is a bit awkward), and provide the stack trace. Print the computed URL before opening it, for debugging; I'm fairly sure you'll spot the error yourself.
- You could try setting the referer field in the request headers. This has worked for me on some sites (see the sketch after this list). en.wikipedia.org/wiki/HTTP_referer
- @LaurentLAPORTE I have already done that, but I still can't find the error.
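For illustration, a minimal sketch of the referer suggestion using urllib.request; the Referer value here is an assumption, not something given in the thread:

import urllib.request

# Hypothetical example: attach a Referer header before opening the URL.
# The header value below is illustrative only.
req = urllib.request.Request(
    'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security',
    headers={'Referer': 'https://scholar.google.pl/'}
)
page = urllib.request.urlopen(req)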
[Answer 1]:
Use requests instead, together with appropriate request headers.
import requests

url = 'https://scholar.google.pl/citations?view_op=search_authors&mauthors=label:security'

request_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    r = s.get(url, headers=request_headers)
The result you get:
Adrian Perrig /citations?user=n-Oret4AAAAJ&hl=pl
Vern Paxson /citations?user=HvwPRJ0AAAAJ&hl=pl
Frans Kaashoek /citations?user=YCoLskoAAAAJ&hl=pl
Mihir Bellare /citations?user=2pW1g5IAAAAJ&hl=pl
Xuemin Shen /citations?user=Bjl3GwoAAAAJ&hl=pl
Helen J. Wang /citations?user=qhu-DxwAAAAJ&hl=pl
Sushil Jajodia /citations?user=lOZ1vHIAAAAJ&hl=pl
Martin Abadi /citations?user=vWTI60AAAAAJ&hl=pl
Jean-Pierre Hubaux /citations?user=W7YBLlEAAAAJ&hl=pl
Ross Anderson /citations?user=WgyDcoUAAAAJ&hl=pl
Obtained with this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'lxml')  # parse the response from the request above
users = soup.findAll('h3', {'class': 'gsc_oai_name'})
for user in users:
    name = user.a.text.strip()
    link = user.a['href']
    print(name, '\t', link)
You can find the headers a browser sends by inspecting the Network tab in Chrome's developer tools.
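If Scholar still returns 503 after fixing the headers, the next thing to look at is request rate. Below is a minimal sketch of a throttled, retrying fetch; the retry count and delay are arbitrary assumptions, not part of the original answer:

import time
import requests

def polite_get(session, url, headers, retries=3, delay=5.0):
    # Retry on 503 with a simple linear backoff between attempts.
    # retries/delay are illustrative values, not tuned for Scholar.
    for attempt in range(retries):
        r = session.get(url, headers=headers)
        if r.status_code != 503:
            return r
        time.sleep(delay * (attempt + 1))
    return r

Each paginated request would then go through polite_get instead of calling s.get directly.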