python: Google Search Scraper with BeautifulSoup
Posted: 2012-07-15 20:24:50

Goal: pass in a search string, have it searched on Google, and scrape the URL, the title, and the short description published alongside each result.

I have the following code. At the moment it only gives me the first 10 results, which is Google's default per-page limit, and I'm not sure how to handle pagination while scraping. There is also a discrepancy between what the actual results page shows and what gets printed. Finally, I'm not sure what the best way to parse the span element is.
So far my span looks like the one below; I want to remove the <em> element and concatenate the remaining strings. What's the best way to do that?
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
Code:
from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'
    })
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class': 'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class': 'st'})
        print sSpan
    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')
My output looks like this:
http://www.crummy.com/software/BeautifulSoup/
<span class="st"><em>Beautiful Soup</em>: a library designed for screen-scraping HTML and XML.<br /></span>
http://pypi.python.org/pypi/BeautifulSoup/3.2.1
<span class="st"><span class="f">Feb 16, 2012 – </span>HTML/XML parser for quick-turnaround applications like screen-scraping.<br /></span>
http://www.beautifulsouptheatercollective.org/
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
http://lxml.de/elementsoup.html
<span class="st"><em>BeautifulSoup</em> is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. <em>BeautifulSoup</em> uses a different parsing <b>...</b><br /></span>
https://launchpad.net/beautifulsoup/
<span class="st">The discussion group is at: http://groups.google.com/group/<em>beautifulsoup</em> · Home page <b>...</b> <em>Beautiful Soup</em> 4.0 series is the current focus of development <b>...</b><br /></span>
http://www.poetry-online.org/carroll_beautiful_soup.htm
<span class="st"><em>Beautiful Soup BEAUTIFUL Soup</em>, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, <em>beautiful Soup</em>!<br /></span>
http://www.youtube.com/watch?v=hDG73IAO5M8
<span class="st"><span class="f">Jul 6, 2009 – </span>taken from the motion picture "Alice in wonderland" (1999) http://www.imdb.com/<wbr>title/tt0164993/<br /></wbr></span>
http://www.soupsong.com/
<span class="st">A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary <b>...</b><br /></span>
http://www.facebook.com/beautifulsouptc
<span class="st">To connect with The <em>Beautiful Soup</em> Theater Collective, sign up for Facebook <b>...</b> We're thrilled to announce the cast of <em>Beautiful Soup's</em> upcoming production of <b>...</b><br /></span>
http://blog.dispatched.ch/webscraping-with-python-and-beautifulsoup/
<span class="st"><span class="f">Mar 15, 2009 – </span>Recently my life has been a hype; partly due to my upcoming Python addiction. There's simply no way around it; so I should better confess it in <b>...</b><br /></span>
A Google search results page has the following structure:
<li class="g">
<div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
<h3 class="r">
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
<div class="f kv">
<div id="poS5" class="esc slp" style="display:none">
<div class="f slp">3 answers - Jan 16, 2009</div>
<span class="st">
I read this without finding the solution:
<b>...</b>
The "normal" way is to: Go to the
<em>Beautiful Soup</em>
web site,
<b>...</b>
Brian beat me too it, but since I already have
<b>...</b>
<br>
</span>
</div>
<div>
</div>
<h3 id="tbpr_6" class="tbpr" style="display:none">
</li>
Each search result is listed under an <li> element.
Comments:
Answer 1: This list comprehension will strip the tags.
>>> sSpan
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
>>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
[None]
>>> sSpan
<span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
Comments:
Any idea how to get more than 10 records out of the results? — Iterate over the 'start' parameter in the URL: num=10&hl=en&start=0
num=10&hl=en&start=10
num=10&hl=en&start=20
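The pagination advice in the comment above can be sketched as a small helper that builds one request URL per result page by advancing the start parameter (a sketch only; the parameter names follow the comment, and Google may still rate-limit or block automated requests):

```python
from urllib.parse import urlencode

def google_page_urls(query, pages=3, per_page=10):
    """Build one search URL per result page: start = page_index * per_page."""
    urls = []
    for page in range(pages):
        # urlencode also percent-escapes the query string for us
        params = urlencode({"q": query, "num": per_page,
                            "hl": "en", "start": page * per_page})
        urls.append("http://www.google.com/search?" + params)
    return urls

for url in google_page_urls("beautifulsoup"):
    print(url)
```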
Hi Chris, the solution above didn't work for me, so I edited it. But I see you've already removed the edit; I'll add my solution as a separate answer. Thanks for looking into it.
NH, if this doesn't work for you, I'd be happy to see the failing case. While you can get away with a regex to strip tags in a simple case like this, it's very bad practice (see the link below): the regex approach quickly becomes infeasible against real-world complexity. If you're already using a package as capable as BeautifulSoup to build your DOM, you might as well keep things simple and use the same tool to manipulate the DOM. Note: your original question only asked about stripping the <em> tags. If you just want the text content, you can use sSpan.text.
[***.com/questions/1732348/…
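The tag-stripping that the `sSpan.text` suggestion in the comment above performs can be reproduced without BeautifulSoup using the standard library's html.parser, which shows what the operation amounts to (a sketch; the sample string is taken from the question):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects character data and discards every tag (<em>, <b>, <br/>, ...)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(markup):
    parser = TextOnly()
    parser.feed(markup)
    return "".join(parser.parts)

snippet = '<span class="st">The <em>Beautiful Soup</em> Theater Collective <b>...</b></span>'
print(strip_tags(snippet))  # The Beautiful Soup Theater Collective ...
```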
Answer 2: I constructed a simple HTML regex, then called replace on the cleaned string to strip the dots.
import re
p = re.compile(r'<.*?>')
print p.sub('',str(sSpan)).replace('.','')
Before:
<span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>
After:
The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things,
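For completeness, here is the regex approach above as a self-contained Python 3 snippet (as the comment on the previous answer warns, this is fragile against nested or malformed markup and is shown only for illustration):

```python
import re

TAG_RE = re.compile(r'<.*?>')  # non-greedy: matches one HTML tag at a time

def clean(span_html):
    # strip the tags, then drop the literal dots, as in the answer above
    return TAG_RE.sub('', span_html).replace('.', '')

before = ('<span class="st">The <em>Beautiful Soup</em> is a collection of all '
          'the pretty places you would rather be. <b>...</b><br /></span>')
print(clean(before))
```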
Comments:
Answer 3: To get the text elements out of the span tag, you can use the .text / get_text() method that beautifulsoup provides. Bs4 does all the heavy lifting, so you don't have to worry about how to get rid of the <em> tags.

Code and full example (Google won't show more than ~400 results):
from bs4 import BeautifulSoup
import requests, lxml, urllib.parse

def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.findAll('div', class_='tF2Cxc'):
        head_text = container.find('h3', class_='LC20lb DKV0Md').text
        head_sum = container.find('div', class_='IsZvec').text
        head_link = container.a['href']
        print(head_text)
        print(head_sum)
        print(head_link)
        print()

    return soup.select_one('a#pnnext')

def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=coca cola')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com',
                                             next_page_node['href'])
        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()
Output:
Results via beautifulsoup
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola
The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.Contact Us · Careers · Coca-Cola · Coca-Cola System
https://www.coca-colacompany.com/home
Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/
Together Tastes Better | Coca-Cola®
Coca-Cola is pairing up with celebrity chefs, talented athletes and more surprise guests all summer long to bring you and your loved ones together over the love ...
https://us.coca-cola.com/
Alternatively, you can use the Google Search Engine Results API from SerpApi to achieve this. It's a paid API with a free plan; check out the Playground to test it.
Code to integrate:
import os
from serpapi import GoogleSearch

def scrape():
    params = {
        "engine": "google",
        "q": "coca cola",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for result in results["organic_results"]:
        print(f"Title: {result['title']}\nLink: {result['link']}\n")

    while 'next' in results['serpapi_pagination']:
        search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
        results = search.get_dict()
        print(f"Current page: {results['serpapi_pagination']['current']}")

        for result in results["organic_results"]:
            print(f"Title: {result['title']}\nLink: {result['link']}\n")
Output:
Results from SerpApi
Current page: 1
Title: The Coca-Cola Company: Refresh the World. Make a Difference
Link: https://www.coca-colacompany.com/home
Title: Coca-Cola
Link: https://www.coca-cola.com/
Title: Together Tastes Better | Coca-Cola®
Link: https://us.coca-cola.com/
Title: Coca-Cola - Wikipedia
Link: https://en.wikipedia.org/wiki/Coca-Cola
Title: Coca-Cola - Home | Facebook
Link: https://www.facebook.com/Coca-Cola/
Title: The Coca-Cola Company | LinkedIn
Link: https://www.linkedin.com/company/the-coca-cola-company
Title: Coca-Cola UNITED: Home
Link: https://cocacolaunited.com/
Title: World of Coca-Cola: Atlanta Museum & Tourist Attraction
Link: https://www.worldofcoca-cola.com/
Current page: 2
Title: Coca-Cola (@CocaCola) | Twitter
Link: https://twitter.com/cocacola?lang=en
Disclaimer: I work for SerpApi.
Comments: