Learning Temporal Pose Estimation from Sparsely-Labeled Videos

Posted xiaoheizi-12345



Produced by Facebook AI.



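Multi-person pose (keypoint) detection in video currently requires dense annotation, which is expensive in both money and labor. The authors propose PoseWarper, a network that uses training videos sparsely annotated once every K frames to perform dense pose propagation and estimation. Given a labeled frame A and an unlabeled frame B, the model is trained to predict the pose in A from features extracted from B, using deformable convolutions to learn the feature warping between A and B. The advantages of the method are: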

1. At inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames.

2. We can improve the accuracy of a pose estimator by training it on an augmented dataset.

3. We can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference.
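
To make the sparse-labeling setup concrete, here is a minimal sketch in plain Python. It assumes that frames 0, K, 2K, ... carry the manual labels and that each unlabeled frame borrows its pose from the nearest labeled frame. Both choices, and the function names, are illustrative assumptions rather than details taken from the paper.

```python
def nearest_labeled_frame(t, num_frames, k):
    """Index of the labeled frame closest to frame t.

    Assumption for illustration: frames 0, k, 2k, ... are the labeled ones.
    On a tie, the earlier labeled frame wins.
    """
    labeled = range(0, num_frames, k)
    return min(labeled, key=lambda a: abs(a - t))


def propagation_plan(num_frames, k):
    """Map every unlabeled frame B to the labeled frame A it would borrow from."""
    return {
        t: nearest_labeled_frame(t, num_frames, k)
        for t in range(num_frames)
        if t % k != 0
    }


# A 10-frame clip labeled every 4th frame: frames 0, 4 and 8 are labeled.
plan = propagation_plan(num_frames=10, k=4)
for frame_b, frame_a in sorted(plan.items()):
    print(f"frame {frame_b} <- labeled frame {frame_a}")
```

A plan like this only decides which (A, B) pairs to run; the actual propagation is done by the warping mechanism described in the following sections.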

Open-source code: https://github.com/facebookresearch/PoseWarper










The PoseWarper Network

If the body-joint correspondences between frames A and B were known, this task would become trivial, as we would simply need to spatially “warp” the feature maps computed from frame B according to the set of correspondences relating frame B to frame A.
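
As a rough illustration of why known correspondences would make the task trivial, the sketch below warps a feature map of B into the coordinate frame of A with a hand-written correspondence table. The `corr` dictionary and the function name are hypothetical; in PoseWarper no such table is available, which is why the warping has to be learned.

```python
def warp_with_correspondences(feat_b, corr):
    """Warp B's feature map into A's coordinate frame.

    feat_b: 2D list (H x W) of feature values computed from frame B.
    corr:   dict mapping a location (y, x) in A to its matching location
            (yb, xb) in B. Locations missing from corr map to themselves.
    """
    h, w = len(feat_b), len(feat_b[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            yb, xb = corr.get((y, x), (y, x))
            if 0 <= yb < h and 0 <= xb < w:  # out-of-range samples stay 0
                warped[y][x] = feat_b[yb][xb]
    return warped


# Toy case: the joint sits at (1, 2) in B, and everything moved one pixel
# to the right between A and B, so A's location (y, x) matches B's (y, x + 1).
feat_b = [[0.0] * 4 for _ in range(3)]
feat_b[1][2] = 1.0
corr = {(y, x): (y, x + 1) for y in range(3) for x in range(4)}
warped = warp_with_correspondences(feat_b, corr)
```

After the warp, the response that was at (1, 2) in B shows up at (1, 1), which is where the joint is in A in this toy case.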




Backbone Network: HRNet-W48

Deformable Warping.



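The backbone CNN produces pose heatmaps fA and fB for the two frames. Their difference WAB = fA − fB is the input to a stack of 3 × 3 simple residual blocks, whose output is OAB. OAB is then fed into a series of 3 × 3 convolutional layers with different dilation rates, which yield a set of offsets o(d)(pn) at each pixel location pn. The different dilation rates capture motion cues at different spatial scales. The predicted offsets are used to spatially warp the features of B, and the five warped results (one per offset set) are summed to give gAB, which is used to predict the pose in A.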
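
The warping step can be sketched in plain Python as a 3 × 3 deformable convolution applied once per dilation rate, with the warped maps summed at the end. This is a simplified single-channel sketch under stated assumptions: the offsets are set by hand instead of being predicted from fA − fB, the kernel weights and the two dilation rates are toy values, and all function names are made up for illustration.

```python
import math

# Kernel taps of a 3 x 3 convolution, as (row, column) steps from the center.
TAPS = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]


def bilinear(feat, y, x):
    """Bilinearly sample a 2D list at a fractional location (zero outside)."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                total += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy][xx]
    return total


def deformable_conv3x3(feat, weights, offsets, dilation):
    """Tap k at pixel p samples feat at p + dilation * k + offsets[p][k]."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for k, (ky, kx) in enumerate(TAPS):
                dy, dx = offsets[y][x][k]
                sample = bilinear(feat, y + dilation * ky + dy,
                                  x + dilation * kx + dx)
                out[y][x] += weights[k] * sample
    return out


def warp_and_sum(feat_b, weights, offsets_by_dilation):
    """Warp feat_b once per dilation rate and sum the warped maps (gAB)."""
    h, w = len(feat_b), len(feat_b[0])
    g_ab = [[0.0] * w for _ in range(h)]
    for dilation, offsets in offsets_by_dilation.items():
        warped = deformable_conv3x3(feat_b, weights, offsets, dilation)
        for y in range(h):
            for x in range(w):
                g_ab[y][x] += warped[y][x]
    return g_ab


# Toy input: B's heatmap peaks at (2, 3). Hand-set offsets stand in for the
# learned predictor and make every tap look one pixel to the right.
H = W = 5
f_b = [[0.0] * W for _ in range(H)]
f_b[2][3] = 1.0
center_only = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # toy weights
look_right = [[[(0.0, 1.0)] * 9 for _ in range(W)] for _ in range(H)]
g_ab = warp_and_sum(f_b, center_only, {1: look_right, 2: look_right})
```

With center-only weights the dilation rate has no visible effect, which keeps the toy result easy to check by hand: the peak moves from (2, 3) to (2, 2) and is counted once per dilation rate. With learned weights on all nine taps, each dilation rate would sample a different neighborhood size, which is how the different spatial scales mentioned above come in.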

Loss Function: computes a mean squared error between the predicted and the ground-truth heatmaps. The ground-truth heatmap is generated by applying a 2D Gaussian around the location of each joint.
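
A minimal sketch of this loss for a single joint, in plain Python. The Gaussian width `sigma=1.0`, the unnormalized peak value of 1, and the 5 × 5 map size are assumptions picked for illustration, not values from the paper.

```python
import math


def gaussian_heatmap(h, w, joint_yx, sigma=1.0):
    """Ground-truth heatmap: an unnormalized 2D Gaussian centered on the joint."""
    cy, cx = joint_yx
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
         for x in range(w)]
        for y in range(h)
    ]


def mse(pred, target):
    """Mean squared error over all heatmap pixels."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


target = gaussian_heatmap(5, 5, joint_yx=(2, 2))
good_pred = gaussian_heatmap(5, 5, joint_yx=(2, 2))
bad_pred = gaussian_heatmap(5, 5, joint_yx=(0, 4))

loss_good = mse(good_pred, target)
loss_bad = mse(bad_pred, target)
```

A prediction centered on the right joint location gives zero loss, and the loss grows as the predicted peak moves away from the labeled location.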

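Pose Annotation Propagation: the feature maps of A and B have the same size and match the ground-truth heatmap yA of A, so the network can be applied in the reverse direction for propagation: we can predict the offsets for warping ground-truth heatmap yA to an unlabeled Frame B, from the feature difference WBA = fB − fA, and then apply them to warp yA onto B.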
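
The reversed direction can be illustrated with a toy example. Here `toy_offset` is a hypothetical stand-in for the learned offset predictor: it reads a single global shift off the peaks of fA and fB, whereas the real network predicts per-pixel offsets from the feature difference. Everything in this sketch is an assumption made for illustration.

```python
def argmax2d(m):
    """(row, column) of the largest value in a 2D list."""
    cells = ((y, x) for y in range(len(m)) for x in range(len(m[0])))
    return max(cells, key=lambda p: m[p[0]][p[1]])


def toy_offset(f_a, f_b):
    """Stand-in for the learned predictor: one global (dy, dx) shift.

    Tells each location in B where to look in A's heatmap.
    """
    (ya, xa), (yb, xb) = argmax2d(f_a), argmax2d(f_b)
    return (ya - yb, xa - xb)


def propagate(y_a, offset):
    """Warp A's ground-truth heatmap onto B: y_b[p] = y_a[p + offset]."""
    dy, dx = offset
    h, w = len(y_a), len(y_a[0])
    y_b = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if 0 <= y + dy < h and 0 <= x + dx < w:
                y_b[y][x] = y_a[y + dy][x + dx]
    return y_b


def one_hot(h, w, yx):
    m = [[0.0] * w for _ in range(h)]
    m[yx[0]][yx[1]] = 1.0
    return m


# The joint is at (2, 1) in labeled frame A and has moved to (2, 3) in B.
f_a = one_hot(5, 5, (2, 1))   # backbone response on A (toy)
f_b = one_hot(5, 5, (2, 3))   # backbone response on B (toy)
y_a = one_hot(5, 5, (2, 1))   # manual annotation for A

y_b = propagate(y_a, toy_offset(f_a, f_b))
joint_in_b = argmax2d(y_b)
```

The propagated heatmap peaks at the joint's location in B, which is the pseudo-label an unlabeled frame would receive in this toy setting.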

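Temporal Pose Aggregation at Inference Time: the deformable warping mechanism is used at inference time to aggregate pose information from nearby video frames, which improves pose detection accuracy. For the frame at time t, information is aggregated from the frames at time t + δ, with δ ∈ {−3, −2, −1, 0, 1, 2, 3}. This makes the method more robust to occlusions, motion blur, and video defocus.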
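
A small sketch of the aggregation step, assuming the heatmaps of the neighboring frames have already been warped into frame t's coordinates. Skipping neighbors that fall outside the video and using an unweighted sum are assumptions made here for simplicity.

```python
DELTAS = (-3, -2, -1, 0, 1, 2, 3)


def neighbor_indices(t, num_frames):
    """Frames t + delta that actually exist in the video."""
    return [t + d for d in DELTAS if 0 <= t + d < num_frames]


def aggregate(warped_heatmaps):
    """Pixel-wise sum of heatmaps already warped into frame t's coordinates."""
    h, w = len(warped_heatmaps[0]), len(warped_heatmaps[0][0])
    return [
        [sum(m[y][x] for m in warped_heatmaps) for x in range(w)]
        for y in range(h)
    ]


def argmax2d(m):
    """(row, column) of the largest value in a 2D list."""
    cells = ((y, x) for y in range(len(m)) for x in range(len(m[0])))
    return max(cells, key=lambda p: m[p[0]][p[1]])


# Toy case: the joint is occluded in frame t = 1 (blank heatmap), but every
# neighboring frame, once warped to frame t, votes for location (1, 1).
blank = [[0.0] * 3 for _ in range(3)]
peak = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

indices = neighbor_indices(1, num_frames=10)
warped = [blank if i == 1 else peak for i in indices]
aggregated = aggregate(warped)
joint = argmax2d(aggregated)
```

Even though frame t itself gives no response, the neighbors still recover the joint location, which is the kind of robustness to occlusion described above.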


