An Introduction to Pose- and Action-Related Datasets
Reference: https://blog.csdn.net/qq_38522972/article/details/82953477
Pose paper roundup: https://blog.csdn.net/zziahgf/article/details/78203621
Classic projects: https://blog.csdn.net/ls83776736/article/details/87991515
Pose estimation and action recognition are fundamentally different tasks: action recognition can be viewed as person localization plus action classification, while pose estimation can be understood as detecting keypoints and assigning an identity to each keypoint (covering both single-person and multi-person settings).
Because data-collection hardware is limited, most pose data currently comes from frames cropped out of publicly available videos. As a result, 2D datasets are relatively easy to obtain and cover both indoor and outdoor scenes, whereas 3D datasets are harder to collect and so far exist only for indoor scenes.
COCO
URL: http://cocodataset.org/#download
Samples: >= 300K
Keypoints: 18
Full body, multi-person; keypoints annotated for roughly 100K people
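If you want to inspect the COCO keypoint annotations programmatically, the pycocotools package can load them. The sketch below assumes the standard person_keypoints_val2017.json annotation file has been downloaded from the URL above; the local path is a placeholder and may differ for other releases.

```python
from pycocotools.coco import COCO

# Assumed local path to the COCO 2017 keypoint annotations
ann_file = "annotations/person_keypoints_val2017.json"
coco = COCO(ann_file)

person_cat = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_cat)

# Keypoints are stored as a flat [x1, y1, v1, x2, y2, v2, ...] list,
# where v is a visibility flag (0: not labeled, 1: labeled but occluded, 2: visible).
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=person_cat, iscrowd=None)
for ann in coco.loadAnns(ann_ids):
    print(ann["num_keypoints"], ann["keypoints"][:6])
```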
LSP (Leeds Sports Pose)
URL: http://sam.johnson.io/research/lsp.html
Samples: 2K
Keypoints: 14
Full body, single person
The extended LSP dataset brings this up to 10,000 images of people performing gymnastics, athletics and parkour.
FLIC (Frames Labeled In Cinema)
URL: https://bensapp.github.io/flic-dataset.html
Samples: 20K
Keypoints: 9
Full body, single person
MPII Human Pose
Samples: 25K
Full body, single/multi-person; 40K people, 410 human activities
16 keypoints: 0 - r ankle, 1 - r knee, 2 - r hip, 3 - l hip, 4 - l knee, 5 - l ankle, 6 - pelvis, 7 - thorax, 8 - upper neck, 9 - head top, 10 - r wrist, 11 - r elbow, 12 - r shoulder, 13 - l shoulder, 14 - l elbow, 15 - l wrist
No mask annotations
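The joint ordering above is handy to keep around in code when parsing MPII-style annotations. A minimal Python lookup table, simply restating the list above (not an official API):

```python
# MPII joint index -> joint name (standard 16-joint ordering)
MPII_JOINTS = {
    0: "right ankle", 1: "right knee", 2: "right hip",
    3: "left hip", 4: "left knee", 5: "left ankle",
    6: "pelvis", 7: "thorax", 8: "upper neck", 9: "head top",
    10: "right wrist", 11: "right elbow", 12: "right shoulder",
    13: "left shoulder", 14: "left elbow", 15: "left wrist",
}
```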
In order to analyze the challenges for fine-grained human activity recognition, we build on our recently released, publicly available "MPII Human Pose" dataset [2]. The dataset was collected from YouTube videos using an established two-level hierarchy of over 800 everyday human activities. The activities at the first level of the hierarchy correspond to thematic categories, such as "Home repair", "Occupation", "Music playing", etc., while the activities at the second level correspond to individual activities, e.g. "Painting inside the house", "Hairstylist" and "Playing woodwind". In total the dataset contains 20 categories and 410 individual activities covering a wider variety of activities than other datasets, while its systematic data collection aims for a fair activity coverage. Overall the dataset contains 24,920 video snippets and each snippet is at least 41 frames long. Altogether the dataset contains over 1M frames. Each video snippet has a key frame containing at least one person with a sufficient portion of the body visible and annotated body joints. There are 40,522 annotated people in total. In addition, for a subset of key frames richer labels are available, including full 3D torso and head orientation and occlusion labels for joints and body parts.
PoseTrack
14 keypoints: 0 - r ankle, 1 - r knee, 2 - r hip, 3 - l hip, 4 - l knee, 5 - l ankle, 8 - upper neck, 9 - head top, 10 - r wrist, 11 - r elbow, 12 - r shoulder, 13 - l shoulder, 14 - l elbow, 15 - l wrist
No mask annotations; head bounding boxes are annotated
PoseTrack is a large-scale benchmark for human pose estimation and tracking in image sequences. It provides a publicly available training and validation set as well as an evaluation server for benchmarking on a held-out test set (www.posetrack.net).
In the PoseTrack benchmark each person is labeled with a head bounding box and positions of the body joints. We omit annotations of people in dense crowds and in some cases also choose to skip annotating people in upright standing poses. This is done to focus annotation efforts on the relevant people in the scene. We include ignore regions to specify which people in the image were ignored during annotation.
Each sequence included in the PoseTrack benchmark corresponds to about 5 seconds of video. The number of frames in each sequence might vary as different videos were recorded with different numbers of frames per second. For the **training** sequences we provide annotations for 30 consecutive frames centered in the middle of the sequence. For the **validation and test** sequences we annotate 30 consecutive frames and in addition annotate every 4-th frame of the sequence. The rationale for that is to evaluate both the smoothness of the estimated body trajectories as well as the ability to generate consistent tracks over a longer temporal span. Note that even though we do not label every frame in the provided sequences, we still expect the unlabeled frames to be useful for achieving better performance on the labeled frames.
The PoseTrack 2018 submission file format is based on the Microsoft COCO dataset annotation format. We decided for this step to 1) maintain compatibility to a commonly used format and commonly used tools while 2) allowing for sufficient flexibility for the different challenges. These are the 2D tracking challenge, the 3D tracking challenge as well as the dense 2D tracking challenge.
Furthermore, we require submissions as a zipped version of either one big .json file or one .json file per sequence, in order to 1) be flexible w.r.t. tools for each sequence (e.g., easy visualization of a single sequence independent of the others) and 2) avoid problems with file size and processing.
The MS COCO file format is a nested structure of dictionaries and lists. For evaluation, we only need a subset of the standard fields, however a few additional fields are required for the evaluation protocol (e.g., a confidence value for every estimated body landmark). In the following we describe the minimal, but required set of fields for a submission. Additional fields may be present, but are ignored by the evaluation script.
At top level, each .json file stores a dictionary with three elements:
* images
* annotations
* categories
`images` is a list of the images described in this file. The list must contain the information for all images referenced by a person description in the file. Each list element is a dictionary and must contain only two fields: `file_name` and `id` (unique int). The file name must refer to the original PoseTrack image as extracted from the test set, e.g., `images/test/023736_mpii_test/000000.jpg`.
`annotations` is another list of dictionaries. Each item of the list describes one detected person and is itself a dictionary. It must have at least the following fields (a minimal submission example follows the list):
* `image_id` (int, an image with a corresponding id must be in `images`),
* `track_id` (int, the track this person is performing; unique per frame),
* `keypoints` (list of floats, length three times the number of estimated keypoints, in order x, y, ? for every point. The third value per keypoint is only there for COCO format consistency and is not used.),
* `scores` (list of floats, length equal to the number of estimated keypoints; each value between 0. and 1. providing a prediction confidence for each keypoint),
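To make the field layout concrete, here is a minimal sketch in Python that assembles one annotated person in the structure described above and writes it to a .json file. The keypoint values, track id, and keypoint count are made-up placeholders; only the field names follow the PoseTrack/COCO-style format described in the text.

```python
import json

# One described image (file_name must point to an original PoseTrack frame).
images = [{"file_name": "images/test/023736_mpii_test/000000.jpg", "id": 0}]

# One detected person on that image. Keypoints are stored flat as
# [x1, y1, v1, x2, y2, v2, ...]; the third value is unused but kept for
# COCO-format consistency. "scores" holds one confidence per keypoint.
num_keypoints = 15                                  # placeholder keypoint count
annotations = [{
    "image_id": 0,
    "track_id": 0,
    "keypoints": [0.0] * (3 * num_keypoints),       # placeholder coordinates
    "scores": [0.5] * num_keypoints,                # placeholder confidences
}]

# Category entry kept minimal; extra fields are ignored by the evaluation script.
categories = [{"id": 1, "name": "person"}]

submission = {"images": images, "annotations": annotations, "categories": categories}

with open("sequence_predictions.json", "w") as f:
    json.dump(submission, f)
```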
The Human3.6M dataset contains 3.6 million 3D human poses with corresponding images, recorded from 11 subjects (6 male, 5 female; papers usually take subjects 1, 5, 6, 7, 8 for training and 9, 11 for testing) across 17 action scenarios such as discussion, eating, exercising and greeting. The data was captured with 4 digital cameras, 1 time-of-flight sensor and 10 motion cameras.
Produced by the Max Planck Institute for Informatics; for details see the paper Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision.
Paper: https://arxiv.org/abs/1705.08421
1. Key papers on single-person pose estimation
2014----Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
2014----DeepPose: Human Pose Estimation via Deep Neural Networks
2014----Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
2014----Learning Human Pose Estimation Features with Convolutional Networks
2014----MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
2015----Efficient Object Localization Using Convolutional Networks
2015----Human Pose Estimation with Iterative Error Feedback
2015----Pose-based CNN Features for Action Recognition
2016----Advancing Hand Gesture Recognition with High Resolution Electrical Impedance Tomography
2016----Chained Predictions Using Convolutional Neural Networks
2016----CPM----Convolutional Pose Machines
2016----CVPR-2016----End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation
2016----Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation
2016----PAFs----Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields (OpenPose)
2016----Stacked Hourglass----Stacked Hourglass Networks for Human Pose Estimation
2016----Structured Feature Learning for Pose Estimation
2017----Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation
2017----CVPR 2017 oral----Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
2017----Learning Feature Pyramids for Human Pose Estimation
2017----Multi-Context Attention for Human Pose Estimation
2017----Self Adversarial Training for Human Pose Estimation
2. Key papers on multi-person pose estimation
2016----Associative Embedding----End-to-End Learning for Joint Detection and Grouping
2016----DeepCut----Joint Subset Partition and Labeling for Multi Person Pose Estimation
2016----DeepCut----Joint Subset Partition and Labeling for Multi Person Pose Estimation (poster)
2016----DeeperCut----DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model
2017----G-RMI----Towards Accurate Multi-person Pose Estimation in the Wild
2017----RMPE: Regional Multi-Person Pose Estimation (AlphaPose)
2018----Cascaded Pyramid Network for Multi-Person Pose Estimation
2018----DensePose: Dense Human Pose Estimation in the Wild (worth a close read; DensePose merits further study)
2018----3D Human Pose Estimation in the Wild by Adversarial Learning
A Human Pose Capture System Based on Mediapipe and Unity
1. Project Overview
The project consists of three parts: 1) human pose estimation with Mediapipe; 2) pose visualization in Unity; 3) communication between Mediapipe and Unity, i.e., how the poses estimated by Mediapipe are streamed to Unity in real time.
2. Human Pose Estimation with Mediapipe
For pose estimation, OpenCV captures frames from the camera and Mediapipe is called on every frame to estimate the pose.
2.1 Environment Setup
Version requirement: Python >= 3.7
pip install mediapipe
pip install opencv-python
pip install opencv-contrib-python
See the official documentation:
Link: https://google.github.io/mediapipe/solutions/pose
2.2 Code Snippet
This assumes you are already familiar with basic Python syntax.
import cv2
import mediapipe as mp


def Pose_Images():
    # Parameters used when setting up the pose estimator
    mp_pose = mp.solutions.pose
    with mp_pose.Pose(
            min_detection_confidence=0.5,
            min_tracking_confidence=0.8) as pose:
        # Open the camera
        cap = cv2.VideoCapture(0)
        while True:
            # Read one frame from the camera
            hx, image = cap.read()
            if not hx:
                print('read video error')
                exit(0)
            image.flags.writeable = False
            # Convert the BGR image to RGB before processing, then run pose estimation
            results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            print(results.pose_landmarks)
            cv2.imshow('image', image)
            if cv2.waitKey(10) & 0xFF == ord('q'):  # press q to quit
                break
        cap.release()


if __name__ == '__main__':
    Pose_Images()
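If you also want to see the detected skeleton overlaid on the camera preview, Mediapipe ships drawing utilities for this. The helper below is a small sketch of how the loop above could be extended; the frame's writeable flag is reset so OpenCV can draw on it.

```python
import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_pose = mp.solutions.pose


def draw_pose(image, results):
    """Overlay the detected pose landmarks and their connections on a BGR frame."""
    image.flags.writeable = True
    if results.pose_landmarks:
        mp_drawing.draw_landmarks(
            image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
    return image
```

Calling `draw_pose(image, results)` just before `cv2.imshow` in the loop above displays the skeleton in the preview window.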
2.3 Output
If you see landmark points printed to your console, as below, the first step of pose estimation is working.
landmark
x: 0.7439931035041809
y: 3.0562074184417725
z: -0.25115278363227844
visibility: 0.00022187501599546522
landmark
x: 0.5690034627914429
y: 3.0262765884399414
z: -0.44416818022727966
visibility: 0.00034665243583731353
......
2.4 Interpreting the Output
Let's look at what these coordinates actually mean.
Every image produces one set of coordinates, and each set contains 33 landmarks (indexed 0 to 32).
Each landmark contains the following fields (a short conversion sketch follows the list):
- x and y: normalized to [0.0, 1.0] by the image width and height respectively; in other words, the real pixel x and y coordinates divided by the image width and height.
- z: the depth of the landmark, with the midpoint of the hips as the origin; the smaller the value, the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x.
- visibility: a value in [0.0, 1.0] indicating the likelihood that the landmark is visible (present and not occluded) in the image.
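As a concrete illustration of the normalization above, the helper below converts one landmark back to pixel coordinates and drops low-visibility points; the 0.5 visibility threshold is an arbitrary choice for this sketch, not a Mediapipe default.

```python
def landmark_to_pixels(landmark, image_width, image_height, min_visibility=0.5):
    """Convert a normalized Mediapipe landmark to (x_px, y_px), or None if unreliable."""
    if landmark.visibility < min_visibility:
        return None
    x_px = int(landmark.x * image_width)
    y_px = int(landmark.y * image_height)
    return x_px, y_px

# Usage (inside the loop from section 2.2):
# h, w = image.shape[:2]
# if results.pose_landmarks:
#     nose = results.pose_landmarks.landmark[0]   # index 0 is the nose in Mediapipe Pose
#     print(landmark_to_pixels(nose, w, h))
```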
2.5 Summary
We use OpenCV to open the camera and capture frames, pass the frames to Mediapipe to estimate the pose coordinates, and later feed those coordinates into Unity for visualization.
3. Pose Visualization in Unity
This assumes you already have basic Unity knowledge.
3.1 Humanoid Skeleton Animation in Unity
You can import any 3D character model into the project from the Unity Asset Store or from https://www.mixamo.com.
3.2 Mapping Mediapipe Coordinates to Unity
The conversion details will be covered in a separate article; in short, the landmark data drives the skeleton's motion. The overall implementation follows an open-source solution.
Reference:
https://github.com/digital-standard/ThreeDPoseTracker
The file VNectModel.cs implements the conversion from predicted coordinates to Unity bone transforms.
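As a rough illustration of what such a mapping involves (the actual logic lives in VNectModel.cs), the sketch below flips the y axis, since Mediapipe image coordinates grow downward while Unity's y axis points up. The scale factor and axis choices here are assumptions made for illustration, not the values used by ThreeDPoseTracker.

```python
def mediapipe_to_unity(x, y, z, scale=1.0):
    """Map a normalized Mediapipe landmark to a Unity-style (x, y, z) tuple.

    Assumption: Mediapipe y grows downward in image space while Unity y grows
    upward, so y is flipped; x and z are passed through with a uniform scale.
    """
    return x * scale, (1.0 - y) * scale, z * scale
```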
4. Passing Data from Mediapipe to Unity
Pose estimation and the Unity visualization run in different processes, and there are several ways for processes to communicate. This project uses UDP over the network, because UDP has low latency and we can tolerate occasional data loss.
4.1 Sending Data with Python
# Example snippet:
import json
import socket

# pose_data is a JSON-serializable structure holding the estimated landmarks
json_data = json.dumps(pose_data)
udp_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
dest_addr = ('127.0.0.1', 5052)          # Unity listens on this port
text = json_data.encode('utf-8')
udp_socket.sendto(text, dest_addr)
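The snippet above assumes a `pose_data` object already exists. A minimal way to build it from the Mediapipe results of section 2.2 might look like the following; the list-of-dicts layout is an assumption of this sketch, and whatever format you choose must match what the C# side parses.

```python
def build_pose_data(results):
    """Pack Mediapipe pose landmarks into a JSON-serializable list."""
    if not results.pose_landmarks:
        return []
    return [
        {"x": lm.x, "y": lm.y, "z": lm.z, "visibility": lm.visibility}
        for lm in results.pose_landmarks.landmark
    ]

# Inside the capture loop:
# pose_data = build_pose_data(results)
# udp_socket.sendto(json.dumps(pose_data).encode('utf-8'), dest_addr)
```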
4.2 Receiving Data with C#
This snippet is adapted from another author's article (it will be removed upon request in case of infringement).
Link
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using System;
using System.Text;
using System.Net;
using System.Net.Sockets;
using System.Threading;

public class UDPRecive : MonoBehaviour
{
    Thread receiveThread;
    UdpClient client;
    public int port = 5052;
    public bool startRecieving = true;
    public string data;

    // Start is called before the first frame update
    void Start()
    {
        receiveThread = new Thread(new ThreadStart(ReceiveData));
        receiveThread.IsBackground = true;
        receiveThread.Start();
    }

    // Update is called once per frame
    void Update()
    {
    }

    private void ReceiveData()
    {
        client = new UdpClient(port);
        while (startRecieving)
        {
            Debug.Log("startRecieving");
            try
            {
                IPEndPoint anyIP = new IPEndPoint(IPAddress.Any, 0);
                byte[] dataByte = client.Receive(ref anyIP);
                data = Encoding.UTF8.GetString(dataByte);
                Debug.Log(data);
            }
            catch (Exception err)
            {
                print(err.ToString());
            }
        }
    }
}
5. Results
With everything from pose estimation to 3D visualization in place, let's take a look at the result.
6. Future Work
Depending on demand, a follow-up walkthrough of the full code may be published to help readers build the project step by step.
Comments and suggestions are welcome.
7. Contact
Feel free to reach out via private message.