Unsupervised Learning from Narrated Instruction Videos
- 1 June 2016
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 4575-4583
- https://doi.org/10.1109/cvpr.2016.495
Abstract
We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are threefold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering problems, one in text and one in video, applied one after the other and linked by joint constraints to obtain a single coherent sequence of steps in both modalities. Second, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains about 800,000 frames for five different tasks that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings. Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.
This publication has 18 references indexed in Scilit:
- Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments. Published by Association for Computational Linguistics (ACL), 2015
- What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision. Published by Association for Computational Linguistics (ACL), 2015
- A Hierarchical Bayesian Model for Unsupervised Induction of Script Knowledge. Published by Association for Computational Linguistics (ACL), 2014
- Finding Actors and Actions in Movies. Published by Institute of Electrical and Electronics Engineers (IEEE), 2013
- Action Recognition with Improved Trajectories. Published by Institute of Electrical and Electronics Engineers (IEEE), 2013
- Poselet Key-Framing: A Model for Human Activity Recognition. Published by Institute of Electrical and Electronics Engineers (IEEE), 2013
- Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. International Journal of Computer Vision, 2008
- WordNet. Communications of the ACM, 1995
- On the Complexity of Multiple Sequence Alignment. Journal of Computational Biology, 1994
- CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 1988