Convolutional Pose Machines

Preprint
Abstract
Pose Machines provide a powerful modular framework for articulated pose estimation. Their sequential prediction framework allows rich implicit spatial models to be learned, but it currently relies on manually designed features to represent image and spatial context. In this work, we incorporate a convolutional network architecture into the pose machine framework, allowing representations for both image and spatial context to be learned directly from data. The contribution of this paper is a systematic approach to composing convolutional networks with large receptive fields for pose estimation tasks. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing backpropagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks.
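To make the idea of intermediate supervision concrete, the following is a minimal sketch of an objective in which every stage of a sequential predictor is penalized against the same target, so that each stage receives its own gradient signal. It assumes that each stage outputs one 2-D belief map per body part and that the per-stage loss is a squared error; the abstract specifies neither, and the names `stage_loss` and `intermediate_supervision_loss` are hypothetical.

```python
from typing import List

# Assumption: a belief map is a 2-D grid of scores, indexed [row][col].
BeliefMap = List[List[float]]


def stage_loss(pred: List[BeliefMap], target: List[BeliefMap]) -> float:
    """Squared-error loss for one stage, summed over parts and pixels."""
    total = 0.0
    for pred_map, target_map in zip(pred, target):
        for pred_row, target_row in zip(pred_map, target_map):
            for p, t in zip(pred_row, target_row):
                total += (p - t) ** 2
    return total


def intermediate_supervision_loss(
    stage_preds: List[List[BeliefMap]], target: List[BeliefMap]
) -> float:
    """Sum the per-stage losses so that every stage is supervised directly.

    stage_preds[t] holds the belief maps (one per part) predicted at stage t.
    """
    return sum(stage_loss(pred, target) for pred in stage_preds)


# Toy example: one part, a 2x2 map, two stages.
target = [[[0.0, 1.0], [0.0, 0.0]]]
stage_1 = [[[0.5, 0.5], [0.0, 0.0]]]  # coarse early prediction
stage_2 = [[[0.0, 1.0], [0.0, 0.0]]]  # refined later prediction
total = intermediate_supervision_loss([stage_1, stage_2], target)
```

Because the total is a sum over stages, the gradient reaching an early stage includes a term from that stage's own loss and does not depend solely on signal propagated back from the final output.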