APSIPA Transactions on Signal and Information Processing

Journal Information
ISSN / EISSN : 2048-7703 / 2048-7703
Published by: Cambridge University Press (CUP) (10.1017)
Total articles ≅ 179
Current Coverage
Archived in

Latest articles in this journal

Liang-Yao Wang, Sau-Gee Chen,
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.14

Many approaches have been proposed in the literature to enhance the robustness of Convolutional Neural Network (CNN)-based architectures against image distortions. Attempts to combat various types of distortions can be made by combining multiple expert networks, each trained by a certain type of distorted images, which however lead to a large model with high complexity. In this paper, we propose a CNN-based architecture with a pre-processing unit in which only undistorted data are used for training. The pre-processing unit employs discrete cosine transform (DCT) and discrete wavelets transform (DWT) to remove high-frequency components while capturing prominent high-frequency features in the undistorted data by means of random selection. We further utilize the singular value decomposition (SVD) to extract features before feeding the preprocessed data into the CNN for training. During testing, distorted images directly enter the CNN for classification without having to go through the hybrid module. Five different types of distortions are produced in the SVHN dataset and the CIFAR-10/100 datasets. Experimental results show that the proposed DCT-DWT-SVD module built upon the CNN architecture provides a classifier robust to input image distortions, outperforming the state-of-the-art approaches in terms of accuracy under different types of distortions.
Yu-Jen Wei, Tsu-Tsai Wei, , Po-Chyi Su
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.13

The development of colorization algorithms through deep learning has become the current research trend. These algorithms colorize grayscale images automatically and quickly, but the colors produced are usually subdued and have low saturation. This research addresses this issue of existing algorithms by presenting a two-stage convolutional neural network (CNN) structure with the first and second stages being a chroma map generation network and a refinement network, respectively. To begin, we convert the color space of an image from RGB to HSV to predict its low-resolution chroma components and therefore reduce the computational complexity. Following that, the first-stage output is zoomed in and its detail is enhanced with a pyramidal CNN, resulting in a colorized image. Experiments show that, while using fewer parameters, our methodology produces results with more realistic color and higher saturation than existing methods.
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.12

Immersive audio has received significant attention in the past decade. The emergence of a few groundbreaking systems and events (Dolby Atmos, MPEG-H, VR/AR, AI) contributes to reshaping the landscape of this field, accelerating the mass market adoption of immersive audio. This review serves as a quick recap of some immersive audio background, end to end workflow, covering audio capture, compression, and rendering. The technical aspects of object audio and ambisonic will be explored, as well as other related topics such as binauralization, virtual surround, and upmix. Industry trends and applications are also discussed where user experience ultimately decides the future direction of the immersive audio technologies.
, Chaoran Liu, , Hiroshi Ishiguro
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.11

Automatic emotion recognition has become an important trend in the fields of human–computer natural interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication, which has a considerable impact on emotion recognition, it is rarely considered in the study of emotion recognition. An important reason is the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network which takes the extracted data as input to predict the speakers’ emotional state. We also propose a graph attention-based fusion method that combines our model with the models using other modalities, to provide complementary information in the emotion classification task and effectively fuse multimodal cues. The combined model utilizes audio signals, text information, and skeletal data. The performance of the model significantly outperforms the bimodal model and other fusion strategies, proving the effectiveness of the method.
, Detlev Marpe
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.10

AOM Video 1 (AV1) and Versatile Video Coding (VVC) are the outcome of two recent independent video coding technology developments. Although VVC is the successor of High Efficiency Video Coding (HEVC) in the lineage of international video coding standards jointly developed by ITU-T and ISO/IEC within an open and public standardization process, AV1 is a video coding scheme that was developed by the industry consortium Alliance for Open Media (AOM) and that has its technological roots in Google's proprietary VP9 codec. This paper presents a compression efficiency evaluation for the AV1, VVC, and HEVC video coding schemes in a typical video compression application requiring random access. The latter is an important property, without which essential functionalities in digital video broadcasting or streaming could not be provided. For the evaluation, we employed a controlled experimental environment that basically follows the guidelines specified in the Common Test Conditions of the Joint Video Experts Team. As representatives of the corresponding video coding schemes, we selected their freely available reference software implementations. Depending on the application-specific frequency of random access points, the experimental results show averaged bit-rate savings of about 10–15% for AV1 and 36–37% for the VVC reference encoder implementation (VTM), both relative to the HEVC reference encoder implementation (HM) and by using a test set of video sequences with different characteristics regarding content and resolution. A direct comparison between VTM and AV1 reveals averaged bit-rate savings of about 25–29% for VTM, while the averaged encoding and decoding run times of VTM relative to those of AV1 are around 300% and 270%, respectively.
AprilPyone Maungmaung,
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.9

In this paper, we propose a novel method for protecting convolutional neural network models with a secret key set so that unauthorized users without the correct key set cannot access trained models. The method enables us to protect not only from copyright infringement but also the functionality of a model from unauthorized access without any noticeable overhead. We introduce three block-wise transformations with a secret key set to generate learnable transformed images: pixel shuffling, negative/positive transformation, and format-preserving Feistel-based encryption. Protected models are trained by using transformed images. The results of experiments with the CIFAR and ImageNet datasets show that the performance of a protected model was close to that of non-protected models when the key set was correct, while the accuracy severely dropped when an incorrect key set was given. The protected model was also demonstrated to be robust against various attacks. Compared with the state-of-the-art model protection with passports, the proposed method does not have any additional layers in the network, and therefore, there is no overhead during training and inference processes.
, Hiroshi Hashimoto, Koichi Takahashi, Akinori F. Ebihara, Jianquan Liu, Akihiro Hayasaka, Yusuke Morishita, Kazuyuki Sakurai
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.8

Biometric recognition technologies have become more important in the modern society due to their convenience with the recent informatization and the dissemination of network services. Among such technologies, face recognition is one of the most convenient and practical because it enables authentication from a distance without requiring any authentication operations manually. As far as we know, face recognition is susceptible to the changes in the appearance of faces due to aging, the surrounding lighting, and posture. There were a number of technical challenges that need to be resolved. Recently, remarkable progress has been made thanks to the advent of deep learning methods. In this position paper, we provide an overview of face recognition technology and introduce its related applications, including face presentation attack detection, gaze estimation, person re-identification and image data mining. We also discuss the research challenges that still need to be addressed and resolved.
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.4

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.
, Takeshi Mori, Satoshi Kobashikawa, Tomoki Toda
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.7

This paper presents a novel speech emotion recognition scheme that leverages the individuality of emotion perception. Most conventional methods simply poll multiple listeners and directly model the majority decision as the perceived emotion. However, emotion perception varies with the listener, which forces the conventional methods with their single models to create complex mixtures of emotion perception criteria. In order to mitigate this problem, we propose a majority-voted emotion recognition framework that constructs listener-dependent (LD) emotion recognition models. The LD model can estimate not only listener-wise perceived emotion, but also majority decision by averaging the outputs of the multiple LD models. Three LD models, fine-tuning, auxiliary input, and sub-layer weighting, are introduced, all of which are inspired by successful domain-adaptation frameworks in various speech processing tasks. Experiments on two emotional speech datasets demonstrate that the proposed approach outperforms the conventional emotion recognition frameworks in not only majority-voted but also listener-wise perceived emotion recognition.
APSIPA Transactions on Signal and Information Processing, Volume 10; https://doi.org/10.1017/atsip.2021.6

While deceptive behaviors are a natural part of human life, it is well known that human is generally bad at detecting deception. In this study, we present an automatic deception detection framework by comprehensively integrating prior domain knowledge in deceptive behavior understanding. Specifically, we compute acoustics, textual information, implicatures with non-verbal behaviors, and conversational temporal dynamics for improving automatic deception detection in dialogs. The proposed model reaches start-of-the-art performance on the Daily Deceptive Dialogues corpus of Mandarin (DDDM) database, 80.61% unweighted accuracy recall in deception recognition. In the further analyses, we reveal that (i) the deceivers’ deception behaviors can be observed from the interrogators’ behaviors in the conversational temporal dynamics features and (ii) some of the acoustic features (e.g. loudness and MFCC) and textual features are significant and effective indicators to detect deception behaviors.
Back to Top Top