Is it possible to get the count of objects using Google's Vision API or Amazon's Rekognition?

Posted: 2018-11-21 17:38:48

Question:

I have been exploring AWS Rekognition and Google's Vision API to get the number of objects in an image/video, but haven't been able to find a way. That said, on Google's Vision site they do have an "insights from your images" section where the count apparently seems to be captured.

Attached is a snapshot from that URL.

Could someone advise whether Google's Vision or any other API can be used to get the count of objects in an image? Thanks.

Edit:

For example, for the image shown below, the count returned should be 10 cars. As Torry Yang suggested in his answer, the labelAnnotations count could give the required number, but that does not seem to be the case, since the count of label annotations is 18. The returned object looks something like this:

"labelAnnotations": [
  { "mid": "/m/0k4j", "description": "car", "score": 0.98658943, "topicality": 0.98658943 },
  { "mid": "/m/012f08", "description": "motor vehicle", "score": 0.9631113, "topicality": 0.9631113 },
  { "mid": "/m/07yv9", "description": "vehicle", "score": 0.9223521, "topicality": 0.9223521 },
  { "mid": "/m/01w71f", "description": "personal luxury car", "score": 0.8976857, "topicality": 0.8976857 },
  { "mid": "/m/068mqj", "description": "automotive design", "score": 0.8736646, "topicality": 0.8736646 },
  { "mid": "/m/012mq4", "description": "sports car", "score": 0.8418799, "topicality": 0.8418799 },
  { "mid": "/m/01lcwm", "description": "luxury vehicle", "score": 0.7761523, "topicality": 0.7761523 },
  { "mid": "/m/06j11d", "description": "performance car", "score": 0.76816446, "topicality": 0.76816446 },
  { "mid": "/m/03vnt4", "description": "mid size car", "score": 0.75732976, "topicality": 0.75732976 },
  { "mid": "/m/03vntj", "description": "full size car", "score": 0.6855145, "topicality": 0.6855145 },
  { "mid": "/m/0h8ls87", "description": "automotive exterior", "score": 0.66056395, "topicality": 0.66056395 },
  { "mid": "/m/014f__", "description": "supercar", "score": 0.592226, "topicality": 0.592226 },
  { "mid": "/m/02swz_", "description": "compact car", "score": 0.5807265, "topicality": 0.5807265 },
  { "mid": "/m/0h6dlrc", "description": "bmw", "score": 0.5801241, "topicality": 0.5801241 },
  { "mid": "/m/01h80k", "description": "muscle car", "score": 0.55745816, "topicality": 0.55745816 },
  { "mid": "/m/021mp2", "description": "sedan", "score": 0.5522745, "topicality": 0.5522745 },
  { "mid": "/m/0369ss", "description": "city car", "score": 0.52938646, "topicality": 0.52938646 },
  { "mid": "/m/01d1dj", "description": "coupé", "score": 0.50642073, "topicality": 0.50642073 }
]

Comments:

Like this one here

Answer 1:

With Google Cloud Vision you should be able to get a count. For example, if you want to count the number of faces with Python, you can do this:

import io

from google.cloud import vision


def detect_faces(path):
    """Detects faces in an image and prints how many were found."""
    client = vision.ImageAnnotatorClient()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    image = vision.types.Image(content=content)

    response = client.face_detection(image=image)
    faces = response.face_annotations
    print(len(faces))

Note the last line. In every supported language you should be able to count the results.

Here is a way to count per label:

import io

from google.cloud import vision


def detect_labels(path):
    """Detects labels in the file and tallies them by description."""
    client = vision.ImageAnnotatorClient()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    image = vision.types.Image(content=content)

    response = client.label_detection(image=image)
    labels = response.label_annotations

    count = {}
    for label in labels:
        if label.description in count:
            count[label.description] += 1
        else:
            count[label.description] = 1
    return count

In the second example, count will be a dictionary mapping each label to the number of times it appears in the image.
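One caveat: labelAnnotations typically lists each distinct label once per image, so the dictionary above will not tell you how many *instances* of an object are present (as the question's 18-label response for 10 cars shows). Counting instances needs a per-object response such as Vision's object localization output. Here is a minimal offline sketch of that counting step; the sample data below is hypothetical, shaped loosely like a localizedObjectAnnotations list, so it runs without any API credentials:

```python
from collections import Counter

# Hypothetical sample shaped like an object-localization response:
# one entry per detected instance, not one per label.
sample_objects = [
    {"name": "Car", "score": 0.95},
    {"name": "Car", "score": 0.92},
    {"name": "Car", "score": 0.88},
    {"name": "Person", "score": 0.90},
]


def count_object_instances(annotations, name=None):
    """Count detected instances, optionally filtered by object name."""
    counts = Counter(obj["name"] for obj in annotations)
    return counts[name] if name else dict(counts)


print(count_object_instances(sample_objects, "Car"))  # 3
print(count_object_instances(sample_objects))         # {'Car': 3, 'Person': 1}
```

The same tally would work on a real localization response after extracting each annotation's name.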

Comments:

My bad, maybe I wasn't very clear. I assumed we can only get the number of faces, but there is no way to count items like flowers, bottles, birds, etc. Please suggest if there is a way to achieve this.

The API doesn't have that feature, but you can loop through the results and create a count for each label.

That is only possible when there are faces in the image. For cases like detecting the number of cars, the number of birds, etc., as far as I understand, this won't work.

Thanks for your input, Torry. I have edited the question for better understanding. Please take a look.

Answer 2:

Neither Google Vision nor AWS Rekognition supports counting objects in photos.

https://forums.aws.amazon.com/thread.jspa?threadID=254814

However, you can count the number of faces in an image with both Vision and Rekognition.

In AWS Rekognition, you receive the response of the DetectFaces API as JSON:

HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.1
Date: Wed, 04 Jan 2017 23:37:03 GMT
x-amzn-RequestId: b1827570-d2d6-11e6-a51e-73b99a9bb0b9
Content-Length: 1355
Connection: keep-alive



{
   "FaceDetails":[
      {
         "BoundingBox":{
            "Height":0.18000000715255737,
            "Left":0.5555555820465088,
            "Top":0.33666667342185974,
            "Width":0.23999999463558197
         },
         "Confidence":100.0,
         "Landmarks":[
            {
               "Type":"eyeLeft",
               "X":0.6394737362861633,
               "Y":0.40819624066352844
            },
            {
               "Type":"eyeRight",
               "X":0.7266660928726196,
               "Y":0.41039225459098816
            },
            {
               "Type":"nose",
               "X":0.6912462115287781,
               "Y":0.44240960478782654
            },
            {
               "Type":"mouthLeft",
               "X":0.6306198239326477,
               "Y":0.46700039505958557
            },
            {
               "Type":"mouthRight",
               "X":0.7215608954429626,
               "Y":0.47114261984825134
            }
         ],
         "Pose":{
            "Pitch":4.050806522369385,
            "Roll":0.9950747489929199,
            "Yaw":13.693790435791016
         },
         "Quality":{
            "Brightness":37.60169982910156,
            "Sharpness":80.0
         }
      },
      {
         "BoundingBox":{
            "Height":0.16555555164813995,
            "Left":0.3096296191215515,
            "Top":0.7066666483879089,
            "Width":0.22074073553085327
         },
         "Confidence":99.99998474121094,
         "Landmarks":[
            {
               "Type":"eyeLeft",
               "X":0.3767718970775604,
               "Y":0.7863991856575012
            },
            {
               "Type":"eyeRight",
               "X":0.4517287313938141,
               "Y":0.7715709209442139
            },
            {
               "Type":"nose",
               "X":0.42001065611839294,
               "Y":0.8192070126533508
            },
            {
               "Type":"mouthLeft",
               "X":0.3915625810623169,
               "Y":0.8374140858650208
            },
            {
               "Type":"mouthRight",
               "X":0.46825936436653137,
               "Y":0.823401689529419
            }
         ],
         "Pose":{
            "Pitch":-16.320178985595703,
            "Roll":-15.097439765930176,
            "Yaw":-5.771541118621826
         },
         "Quality":{
            "Brightness":31.440860748291016,
            "Sharpness":60.000003814697266
         }
      }
   ],
   "OrientationCorrection":"ROTATE_0"
}

You can then use this response to count the number of bounding boxes, which will correspond to the number of faces in the image.
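That counting step can be sketched offline: once the response is parsed, the face count is simply the length of FaceDetails. The two-entry sample below is abridged from the response above to just the fields needed for counting, so it runs without calling AWS:

```python
import json

# Abridged DetectFaces-style response: only the fields needed
# for counting are kept.
response_json = """
{
  "FaceDetails": [
    {"Confidence": 100.0},
    {"Confidence": 99.99998474121094}
  ],
  "OrientationCorrection": "ROTATE_0"
}
"""

response = json.loads(response_json)
face_count = len(response["FaceDetails"])
print(face_count)  # 2
```

With the real API you would get the same dictionary shape back from a DetectFaces call and apply the same `len`.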

Additionally, if you want to count objects in photos, you can set up a custom machine learning model on AWS SageMaker to do this. For example: https://github.com/cosmincatalin/object-counting-with-mxnet-and-sagemaker
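As a side note, Rekognition's DetectLabels response later gained an Instances field carrying per-instance bounding boxes for some labels, which makes counting possible without a custom model. A sketch of the counting logic over such a response; the sample data here is hypothetical, shaped like a DetectLabels result, so no AWS call is needed:

```python
# Hypothetical DetectLabels-style response in which some labels
# carry an Instances list with one bounding box per detection.
sample_response = {
    "Labels": [
        {"Name": "Car", "Confidence": 98.6,
         "Instances": [{"BoundingBox": {}}, {"BoundingBox": {}}]},
        {"Name": "Vehicle", "Confidence": 96.3, "Instances": []},
    ]
}


def count_label_instances(response, label_name):
    """Return the number of localized instances of a given label."""
    for label in response["Labels"]:
        if label["Name"] == label_name:
            return len(label.get("Instances", []))
    return 0


print(count_label_instances(sample_response, "Car"))     # 2
print(count_label_instances(sample_response, "Person"))  # 0
```

Note that not every label carries instances ("Vehicle" above has none), so an empty Instances list does not mean the label is absent from the image.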
