503 error while google search crawling with python3 -- requests, Beautifulsoup4

【Posted】2018-03-15 20:24:14

【Question】I want to scrape just the link titles from Google search results, only about 20 pages or so. This code was working when I tried it the day before! But today it gives me a 503 error.
I searched for ways to solve this problem. Here is what I tried:

- Adding a delay (by inserting a `time.sleep(60)` line after the `requests.get` call).
- The "fake user-agent" library.

However, I still see the 503 error. Here is the file.
import requests
from bs4 import BeautifulSoup
from collections import Counter

# google, '소프트웨어 교육'
base_google1_url = "https://www.google.co.kr/search?q=%EC%86%8C%ED%94%84%ED%8A%B8%EC%9B%A8%EC%96%B4+%EA%B5%90%EC%9C%A1&safe=active&ei=rv_RWYyaKcmW0gTqsa_IDg&start="
extra_google1_url = "&sa=N&biw=958&bih=954"
# google, 'sw교육'
base_google2_url = "https://www.google.co.kr/search?q=sw%EA%B5%90%EC%9C%A1&safe=active&ei=kLzUWYONLYa30QS4r5KACA&start="
extra_google2_url = "&sa=N&biw=887&bih=950"
# book.naver, '소프트웨어 교육'
base_naver_url = "http://book.naver.com/search/search_in.nhn?query=%EC%86%8C%ED%94%84%ED%8A%B8%EC%9B%A8%EC%96%B4+%EA%B5%90%EC%9C%A1&&pattern=0&orderType=rel.desc&viewType=list&searchType=bookSearch&serviceSm=service.basic&title=&author=&publisher=&isbn=&toc=&subject=&publishStartDay=&publishEndDay=&categoryId=&qdt=1&filterType=0&filterValue=&serviceIc=service.author&buyAllow=0&ebook=0&page="

# from: https://docs.python.org/2/library/collections.html
cnt = Counter()

# bring search info
def get_html(site_name, content_num):
    _html = ""
    if site_name == 'google1':
        google1_url = base_google1_url + str(content_num) + extra_google1_url
        resp = requests.get(google1_url)
    elif site_name == 'google2':
        google2_url = base_google2_url + str(content_num) + extra_google2_url
        resp = requests.get(google2_url)
    elif site_name == 'naver':
        naver_url = base_naver_url + str(content_num)
        resp = requests.get(naver_url)
    if resp.status_code == 200:
        _html = resp.text
    return _html

def word_count(name):
    for content in name.contents:
        words = content.split()
        for word in words:
            cnt[word] += 1
    counting = cnt
    return counting

def main():
    cnt.clear()
    counting = cnt
    page_num = 0
    # bring google '소프트웨어 교육' search info~~
    while page_num < 20:
        content_num = page_num * 10
        html = get_html("google1", content_num)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all('h3')
        invalid_tag = ['b']
        for text in texts:
            for match in text.find_all(invalid_tag):
                match.replaceWithChildren()
            names = text.find_all('a')
            for name in names:
                counting = word_count(name)
        page_num += 1

    page_num = 0
    # bring google 'sw교육' search info~~
    while page_num < 20:
        content_num = page_num * 10
        html = get_html("google2", content_num)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all('h3')
        invalid_tag = ['b', 'a']
        for text in texts:
            for match in text.find_all(invalid_tag):
                match.replaceWithChildren()
            counting = word_count(text)
            print(text)
        page_num += 1

    # bring naver book search info~~
    page_num = 1
    while page_num < 40:
        html = get_html("naver", page_num)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all("dt")
        invalid_tag = ['a', 'strong', 'span', 'img']
        for text in texts:
            for match in text.find_all(invalid_tag):
                match.replaceWithChildren()
            counting = word_count(text)
        page_num += 1

    # deleting useless keywords: if you need to keep some len(k) == 1 keys,
    # instead of 'len(k) == 1 and ~' use: 'or (len(k) == 1 and ord(k) >= 33 and ord(k) < 65)'
    # https://***.com/questions/8448202/remove-more-than-one-key-from-python-dict
    del counting['소프트웨어'], counting['교육']
    for key in [k for k in counting if len(k) == 1 or type(k) == int]:
        del counting[key]
    count_20 = counting.most_common(20)
    print(count_20)

if __name__ == '__main__':
    main()
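(A side note on the tag-flattening lines above: `replaceWithChildren()` is the legacy BeautifulSoup name; current BeautifulSoup 4 releases call the same operation `unwrap()`. A minimal demonstration, with a made-up snippet of HTML:)

```python
from bs4 import BeautifulSoup

html = "<h3><a href='#'>Chuck <b>Norris</b> - Wikipedia</a></h3>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <b> tag but keep its children in place
# (unwrap() is the modern name for replaceWithChildren()).
for b in soup.find_all("b"):
    b.unwrap()

print(soup.h3.a.get_text())  # -> Chuck Norris - Wikipedia
```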
Please help me! Thank you in advance.
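(For reference, the two mitigations the question mentions, a per-request delay plus a randomized User-Agent, typically look something like the sketch below. The two user-agent strings are arbitrary examples, and `polite_get` is a hypothetical helper name, not part of any library:)

```python
import random
import time

import requests

# Any small pool of real browser user-agent strings works;
# these two are arbitrary examples.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
]

def build_headers():
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_get(url, delay=2.0):
    """Pause, then GET the page with a randomized User-Agent."""
    time.sleep(delay)
    return requests.get(url, headers=build_headers())
```

Neither trick is a guarantee: Google can still rate-limit by IP, in which case slowing down further (or switching networks) is the only non-API fix.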
【Comments】:

I get a 200 myself. Have you tried opening these URLs from a browser yourself? Your IP may be blocked by Google and you may need to enter a captcha or something similar.

【Answer 1】:

Try manually adding a User-agent to the headers (list of user-agents, or check what's your user-agent).

It might also be better to keep the links in a list(); I don't think the separate URL variables are needed. You can do it like this (example in the online IDE):
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

links = [
    'https://www.google.com/search?q=chuck norris',
    'https://www.google.com/search?q=minecraft fandom',
    'https://www.google.com/search?q=fus ro dah'
]

for url in links:
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    for titles in soup.select('.DKV0Md'):
        title = titles.text
        print(title)

    # just for separating print results
    print()
Output:
Chuck Norris - Wikipedia
Chuck Norris: Home
Chuck Norris - IMDb
Chuck Norris | Facebook
Chuck Norris (@chucknorris) | Twitter
Chuck Norris - Age, Facts & Movies - Biography
101 Best Chuck Norris Jokes - Chuck Norris Facts - Parade
Chuck Norris, Famous Veteran | Military.com
These Chuck Norris Facts Will Make You Love Him Even More ...
Official Minecraft Wiki – The Ultimate Resource for Minecraft
Official Minecraft Wiki - Minecraft Wiki - Fandom
the minecraft fandom shut down : Minecraft - Reddit
900+ Minecraft Fandom ideas in 2021 | dream team, my ...
14 Minecraft Fandom ideas | minecraft fan art, dream team, my ...
Minecraft Fandom - Minecraft Wiki Guide - IGN
Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Fus Ro Dah | Know Your Meme
Fus ro dah - Urban Dictionary
Skyrim:Unrelenting Force - The Unofficial Elder Scrolls Pages ...
Fus | Thuum.org - The Dragon Language Dictionary
60 “Fus ro dah!” (The Elder Scrolls V: Skyrim) ideas | skyrim ...
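If a request still comes back 503 despite the header, a small retry helper with exponential backoff is a common complement to the approach above. This is a generic sketch (the function name and parameters are made up for illustration), not something specific to Google or requests:

```python
import time

def fetch_with_retry(get, url, max_retries=4, base_delay=1.0):
    """Call `get(url)` until the status is not 503, doubling the
    pause between attempts. `get` can be requests.get or any
    callable returning an object with a .status_code attribute."""
    delay = base_delay
    resp = get(url)
    for _ in range(max_retries):
        if resp.status_code != 503:
            break
        time.sleep(delay)
        delay *= 2
        resp = get(url)
    return resp
```

Used with the headers above it would be called as, e.g., `fetch_with_retry(lambda u: requests.get(u, headers=headers), url)`.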
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Code to integrate:
from serpapi import GoogleSearch

links = [
    'fus ro dah',
    'minecraft lets play',
    'gordon ramsay memes',
]

for url in links:
    params = {
        "api_key": "YOUR_API_KEY",
        "engine": "google",
        "q": url,
        "google_domain": "google.com",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['organic_results']:
        title = result['title']
        print(title)
    print()
Output:
Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Fus Ro Dah | Know Your Meme
Fus ro dah - Urban Dictionary
Skyrim:Unrelenting Force - The Unofficial Elder Scrolls Pages ...
Fus | Thuum.org - The Dragon Language Dictionary
60 “Fus ro dah!” (The Elder Scrolls V: Skyrim) ideas | skyrim ...
The Fun Begins! | Let's Play Minecraft Survival Episode 1 ...
Beginning a NEW Minecraft Adventure! | Let's Play Minecraft ...
Minecraft: A New Beginning - 1.16 Survival Let's play | Ep 1 ...
An Epic New Minecraft Adventure - 1.16 Survival Let's Play ...
STARTING A NEW WORLD! - 1.16.2 Lets Play)
A New Start in Minecraft 1.16.5 (Survival Let's Play) Episode 1 ...
minecraft lets plays be like - YouTube
A NEW MINECRAFT JOURNEY!!! - Minecraft 1.16 Survival ...
Let's Play Minecraft 1.16 - Getting Started on a New World ...
Let's Play Minecraft Episode 1 - YouTube
These 29 Memes Of Gordon Ramsay Insulting People Are Too ...
51 Best Gordan Ramsey Meme ideas | ramsey, gordon ...
70 Gordon Ramsay Memes! ideas - Pinterest
50+ Iconic Gordon Ramsay Memes, Quotes, And Hilarious ...
Gordon Ramsay Memes - Pinterest
56 Gordan ramsey meme ideas | gordon ramsay funny ...
Gordon Ramsay Humor - Pinterest
The Best Chef Ramsay Memes That Capture His Endless ...
Disclaimer: I work for SerpApi.