Empirical Comparison of Automatic Image Annotation Systems

Abstract
The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and such systems cannot give satisfactory results when the user's high-level concepts cannot be expressed by low-level features. In an attempt to bridge this semantic gap, recent approaches have started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be feasible for large image databases. To overcome this limitation, several approaches that can annotate images in a semi-supervised or unsupervised way have emerged. In this paper, we outline and compare four different algorithms. The first one is simple and assumes that image annotation can be viewed as the task of translating from a vocabulary of fixed image regions to a vocabulary of words. The second approach uses a set of annotated images as a training set and learns the joint distribution of regions and words. The third and fourth approaches are based on segmenting the images into homogeneous regions. Both of these approaches rely on a clustering algorithm to learn the association between visual features and keywords. The clustering task is not trivial, as it involves clustering a very high-dimensional and sparse feature space. To address this, the third approach uses semi-supervised constrained clustering, while the fourth relies on an algorithm that performs simultaneous clustering and feature discrimination. These four algorithms were implemented and tested on a data set of 6000 images using four-fold cross-validation.
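
To make the first, translation-style approach concrete, the following is a minimal sketch, not the implementation evaluated in the paper: it assumes each image's segmented regions have already been quantized into discrete cluster indices ("blobs"), learns blob-to-keyword co-occurrence statistics from annotated training images, and annotates a new image by accumulating those statistics over its blobs. All function names and the toy data are hypothetical.

```python
from collections import defaultdict

def train_cooccurrence(training_images):
    """Estimate P(word | blob) from blob/keyword co-occurrence counts.

    training_images: iterable of (blob_ids, keywords) pairs, where blob_ids
    lists the cluster indices of an image's regions and keywords lists its
    annotation words. Region segmentation and clustering are assumed done.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for blob_ids, keywords in training_images:
        for b in blob_ids:
            for w in keywords:
                counts[b][w] += 1.0
    # Normalize each blob's counts into a conditional word distribution.
    model = {}
    for b, word_counts in counts.items():
        total = sum(word_counts.values())
        model[b] = {w: c / total for w, c in word_counts.items()}
    return model

def annotate(model, blob_ids, top_k=5):
    """Score keywords for a new image by summing P(word | blob) over its blobs."""
    scores = defaultdict(float)
    for b in blob_ids:
        for w, p in model.get(b, {}).items():
            scores[w] += p
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage: two annotated training images, then annotation of an unseen one.
train = [([3, 7], ["sky", "grass"]), ([7, 12], ["grass", "cow"])]
model = train_cooccurrence(train)
print(annotate(model, [7, 12]))  # e.g. ['grass', 'cow', 'sky']
```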
