POLO: Learning Explicit Cross-Modality Fusion for Temporal Action Localization

Abstract
Temporal action localization aims to discover action instances in untrimmed videos, where RGB and optical flow are two widely used feature modalities: RGB chiefly captures appearance, while flow mainly depicts motion. Given RGB and flow features, previous methods adopt either the early fusion or the late fusion paradigm to mine the complementarity between them. By concatenating raw RGB and flow features, early fusion achieves complementarity implicitly through the network, but it partly discards the particularity of each modality. Late fusion maintains two independent branches to explore the particularity of each modality, but it fuses only the localization results, which is insufficient to mine the complementarity. In this work, we propose exPlicit crOss-modaLity fusiOn (POLO) to effectively exploit the complementarity between the two modalities while thoroughly exploring the particularity of each. POLO performs cross-modality fusion by estimating an attention weight from the RGB modality and applying it to the flow modality (and vice versa), bridging the complementarity of one modality to supplement the other. Guided by these attention weights, POLO learns from RGB and flow features independently and explores the particularity of each modality. Extensive experiments on two benchmarks demonstrate the favorable performance of POLO.
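The cross-modality fusion described above can be sketched as follows. This is an illustrative reconstruction only, not the authors' implementation: the abstract does not specify how the attention weight is estimated, so the sigmoid-gated linear projection, the elementwise application, and all tensor shapes here are assumptions.

```python
import numpy as np

def sigmoid(x):
    """Numerically plain logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def cross_modality_fusion(rgb, flow, w_rgb, w_flow):
    """Hypothetical POLO-style explicit cross-modality fusion.

    rgb, flow : (T, D) per-snippet feature sequences for each modality.
    w_rgb, w_flow : (D, D) learnable projections (assumed form) used to
        estimate an attention weight from one modality.

    The attention weight estimated from RGB modulates the flow features,
    and vice versa, so each branch is supplemented by the other modality
    while still processing its own features independently.
    """
    attn_from_rgb = sigmoid(rgb @ w_rgb)    # (T, D), estimated from RGB
    attn_from_flow = sigmoid(flow @ w_flow) # (T, D), estimated from flow
    fused_flow = flow * attn_from_rgb       # RGB attention applied to flow
    fused_rgb = rgb * attn_from_flow        # flow attention applied to RGB
    return fused_rgb, fused_flow

# Toy usage: fused outputs keep the per-modality shape, so each branch
# can continue to learn from its own (now attention-modulated) features.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 4))   # 8 snippets, 4-dim RGB features
flow = rng.standard_normal((8, 4))  # matching flow features
w1 = rng.standard_normal((4, 4))
w2 = rng.standard_normal((4, 4))
fused_rgb, fused_flow = cross_modality_fusion(rgb, flow, w1, w2)
```

Because the gate lies in (0, 1), each fused feature is an attenuated copy of its own modality, scaled by evidence from the other modality, which matches the abstract's claim that fusion supplements rather than replaces each branch.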
