How to download Google image search results in Python

Posted 2023-02-19

【Title】How to download Google image search results in Python 【Posted】2016-06-18 23:38:53 【Question】:

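This question has been asked many times before, but all the answers are at least a few years old and are based on the ajax.googleapis.com API, which is no longer supported.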

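Does anyone know of another way? I am trying to download about a hundred search results. Besides Python APIs, I have also tried many desktop, browser-based, and browser-plugin programs for this, and all of them failed.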

【Comments on the question】:

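Have you tried Selenium?

Selenium solved it! I used the code at simplypython.wordpress.com/2015/05/18/…, with a slight change to the scrolling code. (Jumping straight to the bottom of the page does not necessarily make a lazy-loading page load all of its images, so I made it scroll gradually.)

github.com/hardikvasa/google-images-download

【Answer 1】: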

Use Google Custom Search to achieve what you want. See @i08in's answer to "Python - Download Images from google Image search?", which has a good description, script examples, and library references.
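
As a rough illustration of the Custom Search route, the sketch below builds request URLs for the Custom Search JSON API. It is written for Python 3 and makes no network calls. The function name `image_search_urls` and the placeholder key and engine ID are assumptions for illustration; the query parameters (`key`, `cx`, `q`, `searchType`, `num`, `start`) are the ones the API documents. The API returns at most 10 results per request, so roughly a hundred results means paging with `start`.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def image_search_urls(query, api_key, engine_id, total=100, page_size=10):
    """Return one request URL per page, together covering `total` image results."""
    urls = []
    # `start` is 1-based; each request may ask for at most 10 results.
    for start in range(1, total + 1, page_size):
        params = {
            "key": api_key,          # API key from the developer console
            "cx": engine_id,         # ID of your custom search engine
            "q": query,
            "searchType": "image",   # restrict results to images
            "num": min(page_size, total - start + 1),
            "start": start,
        }
        urls.append(API_ENDPOINT + "?" + urlencode(params))
    return urls


# Placeholder credentials: replace with your own before sending any request.
urls = image_search_urls("polar bears", "YOUR_API_KEY", "YOUR_ENGINE_ID", total=25)
print(len(urls))  # prints 3
```

Each URL can then be fetched with any HTTP client. Quota and usage limits apply, as noted in the comments below.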

【Comments】:

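I'm accepting this because it definitely answers the question! I'd also like to point out that Google's API has restrictions intended to stop people from using it to, for example, collect search results automatically, as I was trying to do, so this approach may run into permission issues. @Morgan G's suggestion to use Selenium worked well for me!

【Answer 2】: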

To download any number of images from Google image search using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import json
import urllib2
import sys
import time

# adding path to geckodriver to the OS environment variable
# assuming that it is stored at the same path as this script
os.environ["PATH"] += os.pathsep + os.getcwd()
download_path = "dataset/"

def main():
    searchtext = sys.argv[1] # the search query
    num_requested = int(sys.argv[2]) # number of images to download
    number_of_scrolls = num_requested / 400 + 1 
    # number_of_scrolls * 400 images will be opened in the browser

    if not os.path.exists(download_path + searchtext.replace(" ", "_")):
        os.makedirs(download_path + searchtext.replace(" ", "_"))

    url = "https://www.google.co.in/search?q="+searchtext+"&source=lnms&tbm=isch"
    driver = webdriver.Firefox()
    driver.get(url)

    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    extensions = {"jpg", "jpeg", "png", "gif"}
    img_count = 0
    downloaded_img_count = 0

    for _ in xrange(number_of_scrolls):
        for __ in xrange(10):
            # multiple scrolls needed to show all 400 images
            driver.execute_script("window.scrollBy(0, 1000000)")
            time.sleep(0.2)
        # to load next 400 images
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("//input[@value='Show more results']").click()
        except Exception as e:
            print "Less images found:", e
            break

    # imges = driver.find_elements_by_xpath('//div[@class="rg_meta"]') # not working anymore
    imges = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta")]')
    print "Total images:", len(imges), "\n"
    for img in imges:
        img_count += 1
        img_url = json.loads(img.get_attribute('innerHTML'))["ou"]
        img_type = json.loads(img.get_attribute('innerHTML'))["ity"]
        print "Downloading image", img_count, ": ", img_url
        try:
            if img_type not in extensions:
                img_type = "jpg"
            req = urllib2.Request(img_url, headers=headers)
            raw_img = urllib2.urlopen(req).read()
            f = open(download_path+searchtext.replace(" ", "_")+"/"+str(downloaded_img_count)+"."+img_type, "wb")
            f.write(raw_img)
            f.close()
            downloaded_img_count += 1
        except Exception as e:
            print "Download failed:", e
        finally:
            print
        if downloaded_img_count >= num_requested:
            break

    print "Total downloaded: ", downloaded_img_count, "/", img_count
    driver.quit()

if __name__ == "__main__":
    main()

The full code is here.
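
The scrolling loop in the script above can be pulled out into a small helper. This is a minimal Python 3 sketch, not part of the original answer: the name `scroll_gradually` and its default values are assumptions. It relies only on `driver.execute_script`, which Selenium WebDriver objects provide, and scrolls in small steps so that lazily loaded thumbnails have a chance to appear.

```python
import time


def scroll_gradually(driver, steps=10, pixels=1000, pause=0.2):
    """Scroll the page down in `steps` increments of `pixels`, pausing between each.

    Returns the total number of pixels requested.
    """
    for _ in range(steps):
        # arguments[0] is how Selenium exposes extra execute_script arguments to JS.
        driver.execute_script("window.scrollBy(0, arguments[0]);", pixels)
        time.sleep(pause)  # give lazy-loaded images time to be requested
    return steps * pixels
```

With a real browser this would be called as `scroll_gradually(driver)` after `driver.get(url)`. Note that recent Selenium 4 releases dropped the `find_element_by_xpath` style of call used in the script above in favour of `driver.find_element(By.XPATH, ...)`, so the script as posted targets older Selenium versions.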

【Comments】:

This worked as of December 2018. I was able to download up to 1000 images.

【Answer 3】:

Make sure you install the icrawler library first, then use it.

pip install icrawler

from icrawler.builtin import GoogleImageCrawler
google_Crawler = GoogleImageCrawler(storage = {'root_dir': r'write the name of the directory you want to save to here'})
google_Crawler.crawl(keyword = 'sad human faces', max_num = 800)

【Comments】:

【Answer 4】:

A slight improvement on Ravi Hirani's answer; the simplest way is to do this:

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'D:\\projects\\data core\\helmet detection\\images'})
google_crawler.crawl(keyword='cat', max_num=100)

Source: https://pypi.org/project/icrawler/
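
When collecting images for several classes, it can help to give each keyword its own folder. The sketch below is one way to do that with the same `GoogleImageCrawler(storage={'root_dir': ...})` and `crawl(keyword=..., max_num=...)` calls shown above. The helper name `crawl_keywords` and the `crawler_cls` parameter are assumptions added for illustration; the latter only exists so the helper can be tried without icrawler installed.

```python
import os


def crawl_keywords(keywords, base_dir, max_num=100, crawler_cls=None):
    """Download images for each keyword into base_dir/<keyword_with_underscores>.

    Returns the list of folders that were used.
    """
    if crawler_cls is None:
        # Imported lazily so the function can be exercised with a stand-in class.
        from icrawler.builtin import GoogleImageCrawler
        crawler_cls = GoogleImageCrawler

    folders = []
    for keyword in keywords:
        folder = os.path.join(base_dir, keyword.replace(" ", "_"))
        # One crawler per keyword, each with its own storage directory.
        crawler = crawler_cls(storage={"root_dir": folder})
        crawler.crawl(keyword=keyword, max_num=max_num)
        folders.append(folder)
    return folders
```

A call such as `crawl_keywords(["sad human faces", "cat"], "images", max_num=100)` would then fill `images/sad_human_faces` and `images/cat`, assuming icrawler is installed and Google's result pages are still parseable by it.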

【Comments】:

【Answer 5】:

How about this?

https://github.com/hardikvasa/google-images-download

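It lets you download hundreds of images and offers a large number of filters for customizing your search.

If you want to download more than 100 images per keyword, you need to install "selenium" and "chromedriver".

If you pip-installed the library or ran the setup.py file, Selenium will already have been installed on your machine automatically. You will also need the Chrome browser on your machine. For chromedriver:

Download the correct chromedriver for your operating system.

On Windows or Mac, if chromedriver gives you trouble for some reason, download it into the current directory and run the command from there.

On Windows, however, the path to chromedriver must be given in the following format: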

C:\complete\path\to\chromedriver.exe

On Linux, if you have trouble installing the Google Chrome browser, see this CentOS or Amazon Linux guide or this Ubuntu guide.

On all operating systems, you must use the "--chromedriver" or "-cd" argument to specify the path to the chromedriver you downloaded to your machine.
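
When the library is used from Python rather than the command line, the same rule can be encoded in a small helper that builds the arguments dictionary. This is a sketch under assumptions: the helper name `build_arguments` is hypothetical, and it assumes the dictionary key `"chromedriver"` mirrors the `--chromedriver` command-line flag, which should be checked against the project's README.

```python
def build_arguments(keywords, limit, chromedriver_path=None):
    """Build an arguments dict for google_images_download.

    Raises ValueError when more than 100 images are requested without a
    chromedriver path, since the library needs Selenium for that case.
    """
    if limit > 100 and not chromedriver_path:
        raise ValueError("limit > 100 requires a chromedriver path")
    arguments = {"keywords": ",".join(keywords), "limit": limit}
    if chromedriver_path:
        # Assumed to match the --chromedriver / -cd command-line flag.
        arguments["chromedriver"] = chromedriver_path
    return arguments


# Hypothetical usage; needs the library, Chrome, and a real chromedriver path:
# from google_images_download import google_images_download
# args = build_arguments(["polar bears"], 300, r"C:\complete\path\to\chromedriver.exe")
# google_images_download.googleimagesdownload().download(args)
```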

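【Comments】:

This only allows downloading up to 100 images.

With chromedriver you can download hundreds of images using the library above... it is not limited to 100. The instructions are in the README file. :)

Is there a way to make it stop skipping images that have no image format (e.g. partycity6.scene7.com/is/image/PartyCity/…) and download them some other way instead?

【Answer 6】:

I have been using this script to download images from Google search, and I have been using them to train my classifiers. The code below can download 100 images related to the query.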

from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import cookielib
import json

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')


query = raw_input("query image")# you can change the query for the image  here
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
print url
#add the directory for your image here
DIR="Pictures"
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

soup = get_soup(url,header)


ActualImages=[]# contains the link for Large original images, type of  image
for a in soup.find_all("div",{"class":"rg_meta"}):
    link , Type =json.loads(a.text)["ou"]  ,json.loads(a.text)["ity"]
    ActualImages.append((link,Type))

print  "there are total" , len(ActualImages),"images"

if not os.path.exists(DIR):
            os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])

if not os.path.exists(DIR):
            os.mkdir(DIR)
###print images
for i , (img , Type) in enumerate( ActualImages):
    try:
        req = urllib2.Request(img, headers=header)  # header is already a dict
        raw_img = urllib2.urlopen(req).read()

        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        if len(Type)==0:
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+".jpg"), 'wb')
        else :
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+"."+Type), 'wb')


        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : "+img
        print e
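
The parsing step of the script above (collect every `rg_meta` div and read `"ou"` and `"ity"` from its JSON body) can be tried offline. The sketch below is for Python 3 and uses only the standard library's `html.parser` instead of BeautifulSoup, which is a substitution made here so the example has no dependencies. The class name `RgMetaParser` and the sample HTML are made up for illustration, and whether Google's pages still contain `rg_meta` divs is an assumption inherited from the answer.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect (url, type) pairs from <div class="rg_meta">{...json...}</div>."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # > 0 while inside an rg_meta div (tracks nested divs)
        self._chunks = []  # text pieces of the current rg_meta div
        self.images = []   # resulting list of (url, type) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "rg_meta" in classes:
            self._depth += 1

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                meta = json.loads("".join(self._chunks))
                # Fall back to "jpg" when the type is empty, as the script does.
                self.images.append((meta["ou"], meta.get("ity") or "jpg"))
                self._chunks = []


# Made-up sample markup in the shape the script expects.
sample = (
    '<div class="rg_meta notranslate">{"ou": "http://example.com/a.png", "ity": "png"}</div>'
    '<div class="other">ignored</div>'
    '<div class="rg_meta">{"ou": "http://example.com/b", "ity": ""}</div>'
)
parser = RgMetaParser()
parser.feed(sample)
print(parser.images)
```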

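【Comments】:

【Answer 7】:

I am trying this library, which can be used either as a command-line tool or as a Python library. It has many arguments for finding images that match different criteria.

These are examples taken from its documentation, using it as a Python library: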

from google_images_download import google_images_download   #importing the library

response = google_images_download.googleimagesdownload()   #class instantiation

arguments = {"keywords":"Polar bears,baloons,Beaches","limit":20,"print_urls":True}   #creating list of arguments
paths = response.download(arguments)   #passing the arguments to the function
print(paths)   #printing absolute paths of the downloaded images

Or as a command-line tool, as follows:

$ googleimagesdownload -k "car" -sk 'red,blue,white' -l 10

You can install it with pip install google_images_download

【Comments】:

【Answer 8】:

A simple way to solve this problem is to install a Python package called google_images_download

pip install google_images_download

Then use this Python code

from google_images_download import google_images_download  

response = google_images_download.googleimagesdownload()
keywords = "apple fruit"
arguments = {"keywords":keywords,"limit":20,"print_urls":True}
paths = response.download(arguments)
print(paths)

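Adjust the limit to control how many images are downloaded.

Note that some images may be corrupted and therefore cannot be opened.

Change the keywords string to get the output you need.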
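
One way to spot some of the broken downloads is to check whether each file starts with a known image signature. The sketch below, for Python 3, does this for JPEG, PNG and GIF using only the standard library. The function names are hypothetical, and the check is deliberately shallow: it catches HTML error pages or empty files saved with an image extension, but not images that are truncated after a valid header.

```python
import os

# Leading bytes ("magic numbers") of the formats handled here.
SIGNATURES = {
    "jpg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_type(path):
    """Return 'jpg', 'png' or 'gif' based on the file header, else None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for name, prefixes in SIGNATURES.items():
        if head.startswith(prefixes):
            return name
    return None


def find_suspect_files(folder):
    """Return sorted names of files in `folder` with no recognised image header."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and sniff_image_type(os.path.join(folder, name)) is None
    )
```

Running `find_suspect_files` on the download directory gives a list of files worth inspecting or deleting before training on the data.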

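【Comments】:

【Answer 9】:

You need to use the Custom Search API. There is a handy explorer here. I use urllib2. You also need to create an API key for your application from the Developer Console.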
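
For illustration, here is a minimal Python 3 sketch of reading image links out of a Custom Search JSON API response. It uses `urllib.request`, the Python 3 counterpart of the `urllib2` module mentioned above. The helper names are hypothetical; the response shape (an `items` list whose entries carry a `link` field) follows the API's documented format, and the sample response is made up.

```python
import json
from urllib.request import urlopen


def fetch_results(url):
    """Fetch one page of results; `url` must already contain key, cx and q."""
    with urlopen(url) as resp:
        return json.load(resp)


def extract_image_links(response):
    """Return the image URLs from a parsed Custom Search response."""
    # "items" is absent when a query has no results, hence the default.
    return [item["link"] for item in response.get("items", []) if "link" in item]


# Made-up response fragment for demonstration; no request is sent here.
sample_response = {
    "items": [
        {"title": "first", "link": "http://example.com/1.jpg"},
        {"title": "no link field"},
        {"title": "second", "link": "http://example.com/2.png"},
    ]
}
print(extract_image_links(sample_response))
```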

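【Comments】:

A better solution would be to wrap hardikvasa's code in an API by changing it to run from a class rather than as a standalone Python script. That way no API key is needed. API keys are all well and good, but they are just another hurdle for testing.

【Answer 10】:

I tried a lot of code, but none of it worked for me. I am posting my working code here. I hope it helps someone.

I am using Python 3.6 and icrawler.

First, you need to install icrawler on your system.

Then run the code below.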

from icrawler.examples import GoogleImageCrawler
google_crawler = GoogleImageCrawler()
google_crawler.crawl(keyword='krishna', max_num=100)

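Replace the keyword krishna with the text you want.

Note: the downloaded images need a path. At the moment I use the same directory the script is placed in. You can set a custom directory with the code below.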

google_crawler = GoogleImageCrawler('path_to_your_folder')

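【Comments】:

What is icrawler.examples?

When I wrote the code, the icrawler version was 0.1.5. pypi.org/project/icrawler/0.1.5 I have modified that line of code. Thanks for pointing it out.

@SoumyaBoral: install it with pip install icrawler==0.1.5

It should be from icrawler.builtin import GoogleImageCrawler.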