Monocular depth perception from optical flow by space-time signal processing

Abstract
A theory of monocular depth determination is presented. The effect of finite temporal resolution is incorporated by generalizing the Marr-Hildreth edge detection operator $-\nabla^{2}G(r)$, where $\nabla^{2}$ is the Laplacian and $G(r)$ is a two-dimensional Gaussian. The constraint that the edge detection operator in space-time should produce zero-crossings at the same place in different channels, i.e. at different resolutions of the Gaussian, leads to the conclusion that the Marr-Hildreth operator should be replaced by $-\square^{2}G(r,t)$, where $\square^{2}$ is the d'Alembertian $\nabla^{2}-(1/u^{2})(\partial^{2}/\partial t^{2})$ and $G(r,t)$ is a Gaussian in space-time. To ensure that the locations of the zero-crossings are independent of the channel width, $G(r,t)$ has to be isotropic in the sense that the velocity $u$ appearing in the definition of the d'Alembertian must also be used to relate the scales of length and time in $G$. However, the new operator $-\square^{2}G(r,t)$ produces two types of zero-crossing for each isolated edge feature in the image $I(r,t)$. One of these, termed the 'static edge', corresponds to the position of the image edge at time $t$ as defined by $\nabla^{2}I(r,t) = 0$; the other, called a 'depth zero', depends only on the relative motion of the observer and object and is usually found only in the periphery of the field of view. When an edge feature is itself in the periphery of the visual field and these zeros coincide, there is an additional cross-over effect. It is shown how these zero-crossings may be used to infer the depth of an object when the observer and object are in relative motion. If an edge feature is near the centre of the image (i.e. near the focus of expansion), the spatial and temporal slopes of the zero-crossings at the static edge may be used to infer the depth, but, if the edge feature is in the periphery of the image, the cross-over effect enables the depth to be obtained immediately. While the former utilizes sharp spatial and temporal resolution to give detailed three-dimensional information, the cross-over effect relies on longer integration times to give a direct measure of the time-to-contact. We propose that both mechanisms could be used to extract depth information in computer vision systems, and we speculate on how our theory could be used to model depth perception in early visual processing in humans, where there is evidence both of monocular perception of the environment in depth and of looming detection in the periphery of the field of view. In addition, it is shown how a number of previous models are included in our theory, in particular the directional sensor proposed by Marr & Ullman and a method of depth determination proposed by Prazdny.
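
As a concrete illustration, a minimal sketch of the operator follows, assuming the isotropic space-time Gaussian takes the separable form implied by the scaling described above; the channel-width symbol $\sigma$ and the omission of the normalization constant are assumptions of this sketch rather than details taken from the paper:

$$
G(r,t) \;\propto\; \exp\!\left(-\frac{|r|^{2} + u^{2}t^{2}}{2\sigma^{2}}\right),
\qquad
\square^{2} \;=\; \nabla^{2} - \frac{1}{u^{2}}\frac{\partial^{2}}{\partial t^{2}},
$$

so that both the static edges and the depth zeros lie where the filtered image sequence satisfies

$$
-\square^{2}\,(G * I)(r,t) \;=\; 0,
$$

with $*$ denoting convolution over both space and time. Tying the temporal width of $G$ to its spatial width through the velocity $u$ is what keeps the zero-crossing locations independent of the channel width.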
