Python image scraper not working properly on Bing


【Title】python image scraper, not working properly on bing 【Posted】2020-07-25 00:18:19 【Question】:

I am trying to build an image scraper. I first tried Google, but no images were scraped, so I tried Bing. It worked, but with some problems:

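    1. The scraped image links are only a small fraction of those shown in the search engine.
    2. The scraped images come from an unknown page of the displayed preview.
    3. By default, the images are scraped with the SafeSearch filter on.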

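I want to scrape all of the images shown on bing.com/images/search (or at least a few pages of them), but it scrapes very few.

On inspection, I found that Bing stores the image links in anchors with the class 'thumb', so I scrape every link that has the thumb class, but that does not seem to be enough.

After viewing the page source, I only found thumb-class links that actually end in .jpg.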

import requests
from bs4 import BeautifulSoup
import os
import random
from urllib.parse import urljoin


url = "https://www.bing.com"

search = input("enter the search term: ")
r = requests.get(url + "/images/search", params={"q": search})

soup = BeautifulSoup(r.content,"html.parser")

li = soup.find_all("a",class_="thumb")

# getting links from thumb class  

links = [l.get("href") for l in li]


print("{0} results found with the search term: {1}".format(len(links), search))
choice = input("Do You Want To Extract The Images? Y or N ")
dir_name = "Result"

# Create the directory named "Result" if it does not exist yet
if not os.path.isdir(dir_name):
    print("[+] Creating Directory Named '{0}'".format(dir_name))
    os.mkdir(dir_name)
    
# Generate a random alphanumeric sequence to use as a file name.
# Defined before the download loop so that it exists when the loop calls it.
def generateRandomSequence():
    seq = ""
    letters = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",
               "A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z",
                ]
    for i in range(0,5):
        seq = seq + random.choice(letters) + str(random.randrange(1,1000))

    return seq

n = 1
if choice == 'Y' or choice == 'y':
    for i in links:
        # urljoin leaves absolute links unchanged and resolves relative ones against the base URL
        req = requests.get(urljoin(url, i))

        #title = links[z].split("/")[-1]
        #there were some issues with the default titles so I instead used names generated by
        #random sequence

        print("[+] Extracting Image #",n)
        with open(("{0}/" + generateRandomSequence() + ".jpg").format(dir_name),"wb") as img:
            img.write(req.content)
        n += 1
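
The commented-out lines in the loop mention that names taken from the URL caused problems, which is why random names are used. As a rough alternative, the sketch below derives a filesystem-safe name from the URL path and prefixes it with the image counter so names stay unique. The helper name filename_for, the list of accepted extensions and the ".jpg" fallback are assumptions made for illustration; they are not part of the original code.

```python
import os
import re
from urllib.parse import urlparse, unquote


def filename_for(url, index, default_ext=".jpg"):
    """Build a filesystem-safe file name from an image URL (hypothetical helper)."""
    # Use only the last path segment; the query string and fragment are ignored.
    name = os.path.basename(unquote(urlparse(url).path))
    stem, ext = os.path.splitext(name)
    # Replace anything outside letters, digits, dot, dash and underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem).strip("._") or "image"
    # Fall back to the default extension when the URL has none we recognise.
    if ext.lower() not in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        ext = default_ext
    # The numeric prefix keeps names unique even when two URLs share a basename.
    return "{:04d}_{}{}".format(index, stem, ext.lower())
```

In the download loop this could replace the random name, for example open(os.path.join(dir_name, filename_for(i, n)), "wb").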

【Comments】:

【Answer 1】:

Here is a scraper for you:

import json
import requests
from bs4 import BeautifulSoup

seartext = input("enter the search term: ")
count = input("Enter the number of images you need:")
adlt = 'off'  # can be set to 'moderate'
sear = seartext.strip()
sear = sear.replace(' ', '+')
URL = 'https://bing.com/images/search?q=' + sear + '&safeSearch=' + adlt + '&count=' + count
print(URL)
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
results = []
soup = BeautifulSoup(resp.content, "html.parser")
print(soup)
# The full-size image URL is read from the 'murl' field of each 'iusc' anchor's 'm' attribute
wow = soup.find_all('a', class_='iusc')
for i in wow:
    try:
        # Parsed with json.loads instead of eval(); this assumes the 'm' attribute is valid JSON
        print(json.loads(i['m'])['murl'])
        print()
    except (KeyError, TypeError, ValueError):
        # Skip anchors without an 'm' attribute or without usable metadata
        pass

Here, you will find Bing's query parameters.
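
The extraction step in the answer (read the 'm' attribute of each 'iusc' anchor and take its 'murl' field) can be tried offline. The sketch below is a minimal version that uses the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages. The class and function names and the sample markup are made up for illustration, and it assumes the 'm' attribute is valid JSON; Bing's real markup may differ.

```python
import json
from html.parser import HTMLParser


class IuscParser(HTMLParser):
    """Collect the 'murl' value from every <a class="iusc" m="..."> tag."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "iusc" not in classes or not attr.get("m"):
            return
        try:
            meta = json.loads(attr["m"])
        except ValueError:
            return  # skip anchors whose metadata is not valid JSON
        if isinstance(meta, dict) and "murl" in meta:
            self.image_urls.append(meta["murl"])


def extract_image_urls(html):
    """Return the 'murl' values found in the given HTML, in document order."""
    parser = IuscParser()
    parser.feed(html)
    return parser.image_urls


# Made-up sample markup, for illustration only.
SAMPLE = (
    '<a class="iusc" m="{&quot;murl&quot;:&quot;https://example.com/a.jpg&quot;}"></a>'
    '<a class="iusc" m="not json"></a>'
    '<a class="thumb" href="/x"></a>'
)
print(extract_image_urls(SAMPLE))
```

Anchors with malformed metadata are skipped rather than raising, which mirrors the try/except in the answer.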

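【Comments】:

Nice, but it is still limited to 35 results, even with the offset parameter included.

I also tried using the offset, with the same result: if the limit goes above 35, the same images are downloaded again.

I tried passing the offset as both an integer and a string, but the result was no different.

I believe the 35-result limit is due to the page being generated dynamically. Since the page is generated dynamically, you may need to use selenium.

Even if we spoof our user agent, the page uses JavaScript, so requests and BeautifulSoup have a drawback here.

But Bing does not say "this works with requests and beautifulsoup".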
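
The comments report that asking for more than 35 results returns the same images again. This does not lift that limit, but a paging loop can at least detect the repetition and stop instead of downloading duplicates. The sketch below assumes a caller-supplied function fetch_page(offset) that returns a list of image URLs for a given offset; that function, the name collect_unique and the default page size of 35 are hypothetical and say nothing about which query parameters Bing actually honours.

```python
def collect_unique(fetch_page, page_size=35, max_pages=10):
    """Page through results, keeping each URL once and stopping on a repeat-only page."""
    seen = set()
    ordered = []
    for page in range(max_pages):
        added = 0
        for image_url in fetch_page(page * page_size):
            if image_url not in seen:
                seen.add(image_url)
                ordered.append(image_url)
                added += 1
        # A page with nothing new means the server is repeating itself.
        if added == 0:
            break
    return ordered


# Fake pages standing in for real responses, for illustration only.
FAKE_PAGES = {0: ["a", "b"], 35: ["b", "c"], 70: ["a", "c"], 105: ["d"]}
print(collect_unique(lambda offset: FAKE_PAGES.get(offset, [])))
```

With the fake data the loop stops at the third page, because it contains only URLs that were already seen.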
