How can I find the url of an image from Google Images (or bing) using python (and preferably BS4)?


【Posted】: 2021-02-23 21:26:58

【Question】:

I tried using Inspect Element in Google Chrome to find the image link, and I see the following:

<img class="mimg rms_img" style="color: rgb(192, 54, 11);"    id="emb48403F0A" src="https://th.bing.com/th/id/OIP.uEcdCNY9nFhqWqbz4B0mFQHaEo?w=297&amp;h=185&amp;c=7&amp;o=5&amp;dpr=1.5&amp;pid=1.7" data-thhnrepbd="1" data-bm="186">

When I try to search for this element and others like it using soup.findAll("img", {"class": "mimg rms_img"}), I get nothing.

Am I doing something wrong, or is this not the best way to go about it?
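For reference, here is a minimal sketch of how these BS4 lookups behave when the `<img>` tag quoted above really is present in the parsed HTML. It assumes `bs4` is installed and uses the built-in `html.parser`. If the same calls return nothing on a downloaded page, one likely explanation (an assumption, not something confirmed in this thread) is that the tag was added by JavaScript and is not in the HTML that was fetched.

```python
from bs4 import BeautifulSoup

# The <img> tag quoted in the question, parsed as a static document.
html = (
    '<img class="mimg rms_img" style="color: rgb(192, 54, 11);" id="emb48403F0A" '
    'src="https://th.bing.com/th/id/OIP.uEcdCNY9nFhqWqbz4B0mFQHaEo'
    '?w=297&amp;h=185&amp;c=7&amp;o=5&amp;dpr=1.5&amp;pid=1.7" '
    'data-thhnrepbd="1" data-bm="186">'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute filters go in a dict (or use the class_ keyword argument).
by_attrs = soup.find_all("img", {"class": "mimg rms_img"})

# CSS selectors are only understood by select() / select_one().
by_css = soup.select("img.mimg.rms_img")

# find() treats its first argument as a tag name, so this looks for a tag
# literally named "img.mimg.rms_img" and finds nothing.
by_tag_name = soup.find("img.mimg.rms_img")

print(len(by_attrs), len(by_css), by_tag_name)
print(by_css[0]["src"])  # &amp; entities are decoded to & by the parser
```

Note that `find()` and `findAll()` never interpret CSS selector syntax, so passing `"img.mimg.rms_img"` to them returns nothing regardless of what the page contains.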

【Comments】:

Try soup.select("img.mimg.rms_img")

@JustinEzequiel It returns an empty list, and so do soup.find("img.mimg.rms_img") and soup.findAll("img.mimg.rms_img").

Post your code.

【Answer 1】:

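To extract both the thumbnail and the original-size image URLs, you need to:

    1. Find all <script> tags.
    2. Match and extract the image URLs with regex.
    3. Iterate over the matches and decode them.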
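The three steps can be sketched on a small hand-written string before applying them to a real page. Everything in this sketch is illustrative: `SAMPLE_HTML`, `extract_image_urls`, and the `example.com` URL are hypothetical, and a plain regex stands in for `soup.select('script')` so the sketch needs only the standard library.

```python
import re

# Hypothetical stand-in for a page whose image data lives in inline scripts.
# The URL uses \uXXXX escapes, as URLs embedded in script data often do.
SAMPLE_HTML = (
    '<script>AF_initDataCallback({data:[["https://example.com/a.jpg'
    '?x\\u003d1\\u0026y\\u003d2",100,200]]});</script>'
    '<script>var unrelated = 1;</script>'
)


def extract_image_urls(html):
    # Step 1: collect the body of every <script> tag.
    scripts = re.findall(r"<script[^>]*>(.*?)</script>", html, flags=re.DOTALL)

    # Step 2: match ["<url>",<number>,<number>] triples inside the scripts.
    pattern = r'\["(https?://[^"]+)",\d+,\d+\]'
    raw_urls = [url for script in scripts for url in re.findall(pattern, script)]

    # Step 3: turn \uXXXX escapes back into the characters they stand for.
    # bytes(..., "ascii") assumes the matched URL contains only ASCII.
    return [bytes(url, "ascii").decode("unicode-escape") for url in raw_urls]


print(extract_image_urls(SAMPLE_HTML))
```

The real page needs stricter patterns than this one, because thumbnails and full-resolution images share the same triple layout.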

Code and full example in the online IDE (it is straightforward; read through it slowly):

import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",  # search query
    "tbm": "isch",      # image search
    "hl": "en",         # interface language
    "ijn": "0",         # page number of the image results
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://***.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://***.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nGoogle Full Resolution Images:')  # in order
    for fixed_full_res_image in matched_google_full_resolution_images:
        # https://***.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

get_images_data()

--------------
'''
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSb48h3zks_bf6y7HnZGyGPn3s2TAHKKm_7kzxufi5nzbouJcQderHqoEoOZ4SpOuPDjfw&usqp=CAU
...

Google Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
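One plausible reason the code above has to decode twice: it runs `str()` on the list returned by `re.findall`, and the `repr` of each string inside a list doubles every backslash. An escape such as `\u003d` therefore becomes `\\u003d`, and a single `unicode-escape` pass only removes the extra backslash. The sketch below reproduces this on a made-up URL; `decode_escapes` and the `example.com` address are hypothetical.

```python
def decode_escapes(text, passes=2):
    # Each pass interprets one level of backslash escapes.
    # bytes(..., "ascii") assumes the text contains only ASCII characters.
    for _ in range(passes):
        text = bytes(text, "ascii").decode("unicode-escape")
    return text


# A URL as it might appear in the page source, with \uXXXX escapes.
page_fragment = r"https://example.com/cat.jpg?w\u003d500\u0026h\u003d300"

# str() on a list uses repr() for its items, which doubles the backslashes.
as_text = str([page_fragment])
url_in_text = as_text[2:-2]  # strip the surrounding "['" and "']"

once = decode_escapes(url_in_text, passes=1)   # back to single backslashes
twice = decode_escapes(url_in_text, passes=2)  # escapes fully decoded

print(url_in_text)
print(once)
print(twice)
```

Matching against the original string instead of `str(list)` would avoid the extra level of escaping, at the cost of restructuring the regex steps.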

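Alternatively, you can do this with the Google Images API from SerpApi. It is a paid API with a free plan.

Essentially, one of the main differences is that you don't need to dig into the page source and extract things with regex; you only need to iterate over structured JSON and pick out the data you want. Check out the playground.

Code to integrate: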

import os, json # json for pretty output
from serpapi import GoogleSearch


def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),  # read the key from the environment
      "engine": "google",
      "q": "pexels cat",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

get_google_images()

----------
'''
...
  {
    "position": 60, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
    "source": "pexels.com",
    "title": "1,000+ Best Cats Videos · 100% Free Download · Pexels Stock Videos",
    "link": "https://www.pexels.com/search/videos/cats/",
    "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
...
'''
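Once the results are a plain dictionary, pulling out the URLs is ordinary list handling. The sketch below works on a hand-copied dictionary shaped like the sample output above, so it needs neither an API key nor a network call; `sample_results` and `collect_urls` are hypothetical names, and real responses may contain additional fields.

```python
# One entry copied from the sample output above, wrapped the way the
# answer's code accesses it (results['images_results']).
sample_results = {
    "images_results": [
        {
            "position": 60,
            "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
            "source": "pexels.com",
            "title": "1,000+ Best Cats Videos · 100% Free Download · Pexels Stock Videos",
            "link": "https://www.pexels.com/search/videos/cats/",
            "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
            "is_product": False,
        }
    ]
}


def collect_urls(results, key="original"):
    # Skip entries that lack the requested key instead of raising KeyError.
    return [item[key] for item in results.get("images_results", []) if key in item]


print(collect_urls(sample_results))               # full-resolution URLs
print(collect_urls(sample_results, "thumbnail"))  # thumbnail URLs
```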

P.S. I wrote a more in-depth blog post, with GIFs and screenshots, on how to scrape Google Images.

Disclaimer: I work for SerpApi.

【Comments】:

【Answer 2】:

Here is how to do it with another search engine, DuckDuckGo:

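from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

search_query = 'what you want to find'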
num_images = 10
driver_location = '/put/location/of/your/driver/here'

# setting up the driver
ser = Service(driver_location)
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

# searching the query
driver.get(f'https://duckduckgo.com/?q={search_query}&kl=us-en&ia=web')

# going to Images Section
ba = driver.find_element(By.XPATH, "//a[@class='zcm__link  js-zci-link  js-zci-link--images']")
ba.click()

# getting the images URLs
for result in driver.find_elements(By.CSS_SELECTOR, '.js-images-link')[0:0+num_images]:
    imageURL = result.get_attribute('data-id')

    print(f'{imageURL}\n')

driver.quit()
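The search URL in the code above interpolates the query directly into the string. A small sketch of building it with `urllib.parse.urlencode` instead, so that spaces and characters such as `&` in the query are escaped; the helper name `build_duckduckgo_url` is hypothetical, while the `q`, `kl` and `ia` parameters are the ones used above.

```python
from urllib.parse import urlencode


def build_duckduckgo_url(search_query):
    # Same parameters as the URL in the answer; urlencode escapes the values.
    params = {"q": search_query, "kl": "us-en", "ia": "web"}
    return "https://duckduckgo.com/?" + urlencode(params)


print(build_duckduckgo_url("what you want to find"))
print(build_duckduckgo_url("cats & dogs"))
```

The result can be passed to `driver.get()` in place of the f-string.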

【Comments】:
