Learning deformable action templates from cluttered videos

Abstract
In this paper, we present a Deformable Action Template (DAT) model that is learnable from cluttered real-world videos with weak supervision. In our generative model, an action template is a sequence of image templates, each of which consists of a set of shape and motion primitives (Gabor wavelets and optical-flow patches) at selected orientations and locations. These primitives are allowed to perturb their locations and orientations slightly to account for spatial deformations. We use a shared pursuit algorithm to automatically discover the best set of primitives and weights by maximizing the likelihood over one or more aligned training examples. Since it is extremely hard to accurately label human actions in real-world videos, we use a three-step semi-supervised learning procedure. 1) For each human action class, a template is initialized from a labeled (one bounding box per frame) training video. 2) The template is used to detect actions in other training videos of the same class by a dynamic space-time warping algorithm, which searches for the best match between the template and the target video in the 5D space (x, y, scale, t_template, t_target) using dynamic programming. 3) The template is updated by the shared pursuit algorithm over all aligned videos. Steps 2 and 3 iterate several times to arrive at an optimal action template. We tested our algorithm on a cluttered action dataset (the CMU dataset) and achieved favorable performance. Our classification performance on the KTH dataset is also comparable to the state of the art.
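The temporal alignment in step 2 can be illustrated with a minimal sketch of a dynamic-programming warping over (t_template, t_target), assuming the spatial part of the search (over x, y, scale) has already been collapsed into a per-frame-pair score matrix. The function and variable names below (dstw_align, score) are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def dstw_align(score):
    """Simplified temporal alignment by dynamic programming.

    score[i, j] is assumed to be the best spatial match score of template
    frame i inside target frame j, already maximized over (x, y, scale).
    Returns the total score of the best monotonic warping path and the
    path itself as a list of (i, j) pairs.
    """
    n_template, n_target = score.shape
    NEG = -np.inf
    dp = np.full((n_template, n_target), NEG)
    back = np.zeros((n_template, n_target, 2), dtype=int)

    dp[0, :] = score[0, :]  # the template may start at any target frame
    for i in range(1, n_template):
        for j in range(1, n_target):
            # allowed moves: advance both clocks, or stretch/compress time
            candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            prev = max(candidates, key=lambda c: dp[c])
            if dp[prev] > NEG:
                dp[i, j] = dp[prev] + score[i, j]
                back[i, j] = prev

    j_end = int(np.argmax(dp[-1]))  # best ending target frame
    path, i, j = [], n_template - 1, j_end
    while True:
        path.append((i, j))
        if i == 0:
            break
        i, j = back[i, j]
    return dp[n_template - 1, j_end], path[::-1]


if __name__ == "__main__":
    # toy example: random match scores for a 5-frame template
    # against a 20-frame target clip
    rng = np.random.default_rng(0)
    score = rng.random((5, 20))
    total, path = dstw_align(score)
    print("alignment score:", total)
    print("warping path:", path)
```

In the paper's full formulation the spatial search is part of the same optimization rather than precomputed, so this sketch only conveys the temporal dynamic-programming structure.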