AN INTRODUCTION TO IMAGE MOSAICING

Sevket Gumustekin
July 1999
sevgum@pitt.edu (inactive)
sevketgumustekin@iyte.edu.tr (since Feb 2000)

Registration and mosaicing of images have been in practice since long before the age of digital computers. Shortly after the photographic process was developed in 1839, the use of photographs was demonstrated for topographical mapping [1]. Images acquired from hill-tops or balloons were manually pieced together. After the development of airplane technology (1903), aerophotography became an exciting new field. The limited flying heights of the early airplanes and the need for large photo-maps forced imaging experts to construct mosaic images from overlapping photographs. This was initially done by manually mosaicing [2] images which were acquired by calibrated equipment. The need for mosaicing continued to increase later in history as satellites started sending pictures back to earth. Improvements in computer technology became a natural motivation to develop computational techniques and to solve related problems.

The construction of mosaic images and the use of such images in several computer vision/graphics applications have been active areas of research in recent years. There have been a variety of new additions to the classic applications mentioned above that primarily aim to enhance image resolution and field of view. Image-based rendering [3] has become a major focus of attention, combining two complementary fields: computer vision and computer graphics [4]. In computer graphics applications (e.g. [5]) images of the real world have been traditionally used as environment maps. These images are used as static background of synthetic scenes and mapped as shadows onto synthetic objects for a realistic look with computations which are much more efficient than ray tracing [6,7]. In early applications such environment maps were single images captured by fish-eye lenses or a sequence of images captured by wide-angle rectilinear lenses used as faces of a cube [5]. Mosaicing images on smooth surfaces (e.g. cylindrical [8,9,10] or spherical [11,12,13]) allows an unlimited resolution while avoiding discontinuities that can result from images that are acquired separately. Such immersive environments (with or without synthetic objects) give users an improved sense of presence in a virtual scene. A combination of such scenes used as nodes [8,13] allows users to navigate through a remote environment. Computer vision methods can be used to generate intermediate views [14,9] between the nodes. As a reverse problem, the 3D structure of scenes can be reconstructed from multiple nodes [15,16,13,17,18]. Among other major applications of image mosaicing in computer vision are image stabilization [19,20], resolution enhancement [21,22], and video processing [23] (e.g. video compression [24] and video indexing [25,26]).

The problem of image mosaicing is a combination of three problems:

Correcting geometric deformations using image data and/or camera models.
Image registration using image data and/or camera models.
Eliminating seams from image mosaics.

1 Geometric Corrections

A geometric transformation is a mapping that relocates image points. Transformations can be global or local in nature. Global transformations are usually defined by a single equation which is applied to the whole image. Local transformations are applied to a part of the image and are harder to express concisely.

Figure 1: Common geometric transformations.

Some of the most common global transformations are affine, perspective and polynomial transformations. The first three cases in Fig 1 are typical examples of affine transformations. The remaining two are the common cases where perspective and polynomial transformations are used, respectively.

1.1 Projective Models for Camera Motion

Having p = (x,y)^T as old and p' = (x',y')^T as the new coordinates of a pixel, a 2D affine transformation can be written as:

The vector t is the translation component of the above equation. The matrix A controls scaling, rotation and shear effects:

The affine transformation can be represented by a single matrix multiplication in homogeneous coordinates:
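
In a notation consistent with the definitions above, the affine transformation, the matrix A, and the homogeneous form can be written as follows. The entry names a_{ij}, t_x and t_y are assumptions made here for illustration.

```latex
p' = A\,p + t, \qquad
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad
t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}

\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
```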

Introducing a scaling parameter W, the transformation matrix A can be modified to handle perspective corrections:

New image coordinates can be found as x' = X'/W, y' = Y'/W .
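
As a minimal sketch of the division by W, the following Python function maps a point through a 3x3 matrix in homogeneous coordinates. The function name apply_projective and the two example matrices are hypothetical and chosen only for illustration.

```python
def apply_projective(M, x, y):
    """Map (x, y) through a 3x3 matrix M (nested lists) in homogeneous coordinates."""
    X = M[0][0] * x + M[0][1] * y + M[0][2]
    Y = M[1][0] * x + M[1][1] * y + M[1][2]
    W = M[2][0] * x + M[2][1] * y + M[2][2]
    if W == 0:
        raise ZeroDivisionError("point maps to infinity (W = 0)")
    # New image coordinates: x' = X'/W, y' = Y'/W
    return X / W, Y / W


# Affine case: the last row is (0, 0, 1), so W is always 1.
AFFINE = [[2.0, 0.0, 1.0],
          [0.0, 2.0, -1.0],
          [0.0, 0.0, 1.0]]

# Perspective case: a non-zero entry in the last row makes W depend on x.
PERSPECTIVE = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.5, 0.0, 1.0]]

if __name__ == "__main__":
    print(apply_projective(AFFINE, 1.0, 1.0))
    print(apply_projective(PERSPECTIVE, 2.0, 4.0))
```

When the last row is (0, 0, 1) the division has no effect, which is why the affine transformation is a special case of the perspective one.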

Alternatively, perspective transformations are often represented by the following equations known as homographies:
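
Assuming the entries of the modified matrix are named a_{11}, ..., a_{32} with the last entry normalized to 1 (so that a_{13} and a_{23} play the role of the translation t), the perspective form and the equivalent homography equations can be written as:

```latex
\begin{pmatrix} X' \\ Y' \\ W \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}

x' = \frac{a_{11} x + a_{12} y + a_{13}}{a_{31} x + a_{32} y + 1}, \qquad
y' = \frac{a_{21} x + a_{22} y + a_{23}}{a_{31} x + a_{32} y + 1}
```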

Eight unknown parameters can be solved without any 3D information using only correspondences of image points ¹. The point correspondences can be obtained by feature-based methods (e.g. based on corner detection [27]). Note that the transformation found for corresponding images is globally valid for all image points only when there is no motion parallax between frames (e.g. the case of a planar scene or a camera rotating around its center of projection). The motion parameters can also be found iteratively (e.g. [10,11,28,21]).
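
The direct solution from point correspondences can be sketched as follows. Each correspondence contributes two linear equations in the eight unknowns, so four correspondences give an 8x8 system. This is an illustrative sketch only: the function names are hypothetical, the last matrix entry is assumed to be normalized to 1, and a plain Gaussian elimination stands in for whatever solver a real implementation would use.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: correspondences are degenerate")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def project(H, x, y):
    """Apply a 3x3 homography H to the point (x, y)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def homography_from_points(src, dst):
    """Estimate the 8 parameters from exactly four (x, y) -> (u, v) pairs."""
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("this sketch expects exactly four correspondences")
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u * (a31 x + a32 y + 1) = a11 x + a12 y + a13, and likewise for v
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]
```

With more than four correspondences the system becomes overdetermined and would be solved in a least squares sense instead; the solver raises an error when the four points are degenerate (e.g. three of them collinear).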

Figure 2: (a) Perspective (b) Stereographic (c) Equi-distant projections of a world point P.

The 8-parameter homography accurately models a perspective transformation between different views for the case of a camera rotating around a nodal point. Such a perspective transformation is shown in Fig 2.a which displays a cross-section of the viewing sphere (i.e. a sphere of unit size with its center coinciding with the center of projection).

Fig 2 also illustrates some of the projective transformations that are alternative to the perspective transformation. Each of these projective transformations has distinctive features. Perspective transformations preserve lines whereas the stereographic transformations preserve circular shapes [29]. Stereographic transformations are capable of mapping a full field of view of the viewing sphere onto the projection plane. For the equi-distant projection (which can be viewed as flattening a spherical surface [30]) mapping a full field of view is no longer an asymptotic case. The distance between the point p^* and the principal point c can be found according to the projection model:
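
As a hedged illustration, the sketch below uses the standard textbook forms of these three projections for a unit viewing sphere with the projection plane tangent at c: r = tan(theta) for perspective, r = 2 tan(theta/2) for stereographic, and r = theta for equi-distant, where theta is the angle between the viewing ray and the optical axis. The scaling used in Fig 2 may differ, and the function name radial_distance is hypothetical.

```python
import math


def radial_distance(theta, model):
    """Distance from the principal point c to the projected point p*.

    theta is the angle (radians) between the viewing ray and the optical
    axis; a unit viewing sphere with a tangent projection plane is assumed.
    """
    if model == "perspective":
        if not 0.0 <= theta < math.pi / 2:
            raise ValueError("perspective projection requires theta < pi/2")
        return math.tan(theta)
    if model == "stereographic":
        if not 0.0 <= theta < math.pi:
            raise ValueError("stereographic projection requires theta < pi")
        return 2.0 * math.tan(theta / 2.0)
    if model == "equidistant":
        if not 0.0 <= theta <= math.pi:
            raise ValueError("theta must lie in [0, pi]")
        return theta
    raise ValueError("unknown projection model: " + model)


if __name__ == "__main__":
    for deg in (10, 45, 80, 120, 179):
        t = math.radians(deg)
        row = [deg]
        for model in ("perspective", "stereographic", "equidistant"):
            try:
                row.append(round(radial_distance(t, model), 3))
            except ValueError:
                row.append(None)  # outside the model's field of view
        print(row)
```

Under these assumptions the perspective distance diverges at 90 degrees and the stereographic distance diverges only as theta approaches 180 degrees, while the equi-distant distance stays finite (at most pi) over the full field of view.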

In [30] we show that it is possible to easily use each one of these projections to model camera motion between image frames. As opposed to homography techniques which project images to a reference frame (e.g. the first frame in a sequence), we project images onto a plane with the principal point (i.e. the point c in Fig 2) of the projection plane as a point shared in 3D by rotating image frames. Our method projects an arbitrary point p first to the viewing sphere (p') then to its final destination (p^*). We utilize this versatile camera model in [30,31].

1.2 Representing Projections on Non-planar Surfaces

A natural domain for representing and compositing images acquired by a camera rotating around its nodal point is a unit sphere centered at the nodal point. We use the term ``plenoptic image'' for an image composited on a spherical surface representing the entire 360^o field of view. The term ``plenoptic'' was introduced by [32] and later popularized by [9]. A plenoptic function describes everything that is visible in radiant forms of energy to an observer for every possible location of the observer. A plenoptic image ² is a sample of the plenoptic function for a fixed location of the observer.

Side views of cylindrical maps [8,9,10] are often chosen to represent plenoptic images, trading the discarded top and bottom views for uniform sampling in the cylindrical coordinate system. Uniform sampling is especially desirable when images need to be translated in the target domain. We use spherical surfaces (as in [11,12,13]) as an environment to construct plenoptic images. The construction of mosaic images on spherical surfaces is complicated by the singularities at the poles [33]. Numerical errors near the poles cause irregularities in the similarity measures used for automatic registration. Using images acquired with a fish-eye lens [12] and the small relative size of polar regions with respect to such images alleviates the negative effect of singularities. Relative rotational motions between image pairs are used in [13] (based on quaternions [34]) and [11] (based on an angular motion matrix [35]) before mapping images onto a sphere to avoid the effect of singularities in registration.

Images that form a large portion of a plenoptic image can be constructed on a single image frame by using special mirrors [36,37,38]. Having a single viewpoint [36,37] in such imaging systems is important for the ability to reconstruct perspective views. Carefully calibrated and coupled mirrors [37] can capture two images that can be easily combined to form a plenoptic image. Even though this kind of approach provides a simple framework for capturing a full field of view of a scene, the limited resolution of the film frame (or sensor array) may be a serious limitation for recording images in detail. Plenoptic images constructed by mosaicing smaller images can store detailed information without being subject to such limits.

1.3 Camera Motion as a Sweeping Motion of a Strip

The complications due to parallax that are observed in the case of translational motion of a planar camera can be avoided by using a one dimensional camera (i.e. ``pushbroom camera'') to scan scenes. This action can be emulated using conventional cameras by combining strips taken from a sequence of two dimensional images as a series of neighboring segments. These cameras can directly acquire cylindrical (with a rotating motion) and orthographic (with translational motion) maps [39]. They can also acquire images along an arbitrary path [40].
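
The emulation of a pushbroom camera by combining strips can be sketched as below. This is a simplified illustration, not the method of [39]: it assumes a purely horizontal camera translation of exactly strip_width pixels between consecutive frames, frames stored as lists of rows, and a hypothetical function name strip_mosaic.

```python
def strip_mosaic(frames, strip_width):
    """Concatenate the central vertical strip of each frame, left to right.

    frames: list of images, each a list of rows of pixel values, all the
    same size. The camera is assumed to move right by strip_width pixels
    between consecutive frames, so neighboring strips abut exactly.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    height, width = len(frames[0]), len(frames[0][0])
    if not 0 < strip_width <= width:
        raise ValueError("strip_width must be between 1 and the frame width")
    start = (width - strip_width) // 2  # left edge of the central strip
    mosaic = [[] for _ in range(height)]
    for frame in frames:
        for r in range(height):
            mosaic[r].extend(frame[r][start:start + strip_width])
    return mosaic


if __name__ == "__main__":
    # A one-dimensional "scene" viewed through a sliding 6-pixel window.
    scene = list(range(20))
    frames = [[scene[2 * k:2 * k + 6] for _ in range(3)] for k in range(4)]
    print(strip_mosaic(frames, 2)[0])
```

In practice the strip width would follow from the estimated inter-frame motion rather than being fixed in advance.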

The strips that should be taken from two dimensional images are identified as the ones perpendicular to the image flow in [41]. This family of strips can handle a wide variety of motions including forward motion and optical zoom. Additional formulation is developed in [42] for these complicated cases of motion.

Images acquired as a combination of strips (along with range images and a recorded path of camera) are also shown to be effective in complete 3D reconstruction of scenes [43].

1.4 Polynomial Transformations for Arbitrary Geometric Corrections

Polynomial transformations are often referred to as ``rubber sheet transformations'' [44] to describe their capability to change shapes of objects until they appear to be in the desired shape. Using only correspondences of image points they can handle global distortions (e.g. pincushion/barrel distortion) which cannot be modeled easily as in perspective transformations. A bivariate polynomial transformation is of the form:
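
A form consistent with the coefficients a₀,..,a₃,b₀,..,b₃ named in the text is given below. The ordering of the terms is an assumption made here; the dots stand for optional higher order terms.

```latex
x' = a_0 + a_1 x + a_2 y + a_3 x y + \cdots

y' = b_0 + b_1 x + b_2 y + b_3 x y + \cdots
```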

The order of the transformation increases as the number of points that need to be matched is increased. If the transformation is a bilinear transformation (i.e. as described above, with no higher order terms), four (x,y) points and corresponding (x',y') points are sufficient to solve the above equations for coefficients a₀,..,a₃,b₀,..,b₃. If there are more control points than the order of the polynomial can match exactly, so that the system cannot be solved by direct matrix inversion, a pseudo-inverse solution can be obtained. This solution is identical to the classical least squares formulation, which yields the coefficients that best approximate the true mapping function at the control points. It also spreads the error equally. Weighted least squares solutions [45] introduce a weighting function which localizes the error.
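
A minimal sketch of the least squares fit for the bilinear case follows. It assumes the basis (1, x, y, xy) for both coordinates and solves the normal equations, which is equivalent to applying the pseudo-inverse when the control points are not degenerate. The function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def apply_bilinear(a, b, x, y):
    """Evaluate x' = a0 + a1 x + a2 y + a3 xy and the same form for y'."""
    basis = (1.0, x, y, x * y)
    return (sum(c * t for c, t in zip(a, basis)),
            sum(c * t for c, t in zip(b, basis)))


def fit_bilinear(src, dst):
    """Least squares fit of (a0..a3, b0..b3) from four or more control points."""
    if len(src) < 4 or len(src) != len(dst):
        raise ValueError("need at least four matching control points")
    Phi = [[1.0, x, y, x * y] for x, y in src]

    def normal_solve(target):
        # Normal equations: (Phi^T Phi) coeffs = Phi^T target
        AtA = [[sum(row[i] * row[j] for row in Phi) for j in range(4)]
               for i in range(4)]
        Atb = [sum(row[i] * t for row, t in zip(Phi, target)) for i in range(4)]
        return solve_linear(AtA, Atb)

    a = normal_solve([u for u, _ in dst])
    b = normal_solve([v for _, v in dst])
    return a, b
```

With exactly four control points the fit reproduces them exactly; with more, the residual error is distributed over all control points.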

Irani et al. [25,46,23,24] choose to use polynomial transformations with more degrees of freedom than the 8-parameter bilinear transformation that can accurately handle perspective distortions. They use the extra degrees of freedom in the transformation to deal with the nonlinearities due to parallax, scene change etc.

The global transformations described above impose a single mapping function on the image. They do not account for local variations (with the exception of the weighted least squares solution). Local distortions may be present in scenes due to motion parallax, movement of objects, etc.

The parameters of a local mapping transformation vary across the different regions of the image to handle local deformations. One way to do this is to partition the image into smaller sub-regions such as triangular regions with corners at the control points and then find a linear [47] transformation that exactly maps corners to desired locations. Smoother results can be obtained by a nonlinear transformation [48]. In [49] the control points are selected to be along the desired border of overlapping images. A transformation that relocates these points to align with their correspondences has an effect on the rest of the pixels inversely proportional to their distances to the control points. In [50] local variations that need to be corrected are estimated by the image flow between corresponding images that have undergone global transformations.
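
The triangle-based linear mapping can be illustrated with barycentric coordinates: a point keeps the same weights relative to the three corners before and after the warp, so the corners map exactly to their targets. This is a generic sketch rather than the specific formulation of [47], and the function names are hypothetical.

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle (corners are collinear)")
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def warp_point(p, src_tri, dst_tri):
    """Map p from the source triangle to the destination triangle."""
    weights = barycentric(p, src_tri)
    return (sum(w * q[0] for w, q in zip(weights, dst_tri)),
            sum(w * q[1] for w, q in zip(weights, dst_tri)))


if __name__ == "__main__":
    src_tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    dst_tri = [(10.0, 10.0), (12.0, 10.0), (10.0, 13.0)]
    print(warp_point((0.5, 0.5), src_tri, dst_tri))
```

Because each triangle has its own linear map, the overall warp is continuous across shared edges but not smooth, which is the motivation for the nonlinear alternatives mentioned above.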

Although local transformations can correct deformations that are not corrected by global corrections, it is difficult to justify their necessity in image mosaicing. Warping images simply to reduce local variations (e.g. aligning a moving object by warping its local neighborhood) is likely to introduce unnatural distortions in the warped areas. We address the problem of local distortions during the mosaicing process by minimizing their significance in the blended images.

A detailed discussion on spatial transformation and interpolation methods can be found in [45].

2 Image Registration

Image registration is the task of matching two or more images. It has been a central issue for a variety of problems in image processing [51] such as object recognition, monitoring satellite images, matching stereo images for reconstructing depth, matching biomedical images for diagnosis, etc.

Registration is also the central task of image mosaicing procedures. Carefully calibrated and prerecorded camera parameters may be used to eliminate the need for automatic registration. User interaction is also a reliable source for manually registering images (e.g. by choosing corresponding points and employing necessary transformations on screen with visual feedback). Automated methods for image registration used in the image mosaicing literature can be categorized as follows:

Feature based [52,27] methods rely on accurate detection of image features. Correspondences between features lead to computation of the camera motion which can be tested for alignment. In the absence of distinctive features, this kind of approach is likely to fail.

Exhaustively searching for a best match over all possible motion parameters can be computationally extremely expensive. Using hierarchical processing (i.e. coarse-to-fine [53]) results in significant speed-ups. We use this approach as well, and also take advantage of parallel processing [31] for additional performance improvement.
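
The coarse-to-fine idea can be sketched for the simplest motion model, a pure integer translation. The sketch below is an illustration under stated assumptions, not the implementation of [31] or [53]: it builds a pyramid by 2x2 averaging, searches a small window at the coarsest level, then doubles the estimate and refines it by one pixel at each finer level. All names are hypothetical, and the similarity measure is a mean squared difference over the overlap.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]


def mean_sq_diff(a, b, dx, dy):
    """Mean squared difference of a[r][c] and b[r + dy][c + dx] on the overlap."""
    total, count = 0.0, 0
    for r in range(max(0, -dy), min(len(a), len(b) - dy)):
        for c in range(max(0, -dx), min(len(a[0]), len(b[0]) - dx)):
            d = a[r][c] - b[r + dy][c + dx]
            total += d * d
            count += 1
    return total / count if count else float("inf")


def search(a, b, center, radius):
    """Exhaustive search in a (2 radius + 1)^2 window around center."""
    cx, cy = center
    candidates = [(cx + dx, cy + dy)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    return min(candidates, key=lambda s: mean_sq_diff(a, b, s[0], s[1]))


def coarse_to_fine_shift(a, b, levels=2, radius=2):
    """Return (dx, dy) such that a[r][c] best matches b[r + dy][c + dx]."""
    pyramid = [(a, b)]
    for _ in range(levels):
        a, b = downsample(a), downsample(b)
        pyramid.append((a, b))
    shift = (0, 0)
    for level, (pa, pb) in enumerate(reversed(pyramid)):
        if level > 0:
            shift = (2 * shift[0], 2 * shift[1])  # propagate to finer level
        shift = search(pa, pb, shift, radius if level == 0 else 1)
    return shift
```

With two pyramid levels and a coarse radius of 2, shifts of up to about 8 pixels are covered with 25 + 9 + 9 evaluations, most of them on small images, instead of the 289 full-resolution evaluations an exhaustive search over the same range would need.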

Frequency domain approaches for finding displacement [54] and rotation/scale [55,56] are computationally efficient but can be sensitive to noise. These methods also require the overlap extent to occupy a significant portion of the images (e.g. at least 50%).
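
The principle behind phase correlation for displacement can be sketched as follows: the normalized cross-power spectrum of two circularly shifted images is a pure phase ramp, and its inverse transform peaks at the displacement. The code is a small illustration only. It uses a naive discrete Fourier transform from the standard library so that it is self-contained (a real implementation would use an FFT), it assumes a circular shift, and its function names are hypothetical.

```python
import cmath


def dft2(img, inverse=False):
    """Naive, unnormalized 2D discrete Fourier transform (row-column)."""
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1

    def dft1(vec):
        n = len(vec)
        return [sum(vec[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                    for k in range(n)) for m in range(n)]

    rows = [dft1(row) for row in img]
    cols = [dft1([rows[r][c] for r in range(h)]) for c in range(w)]
    return [[cols[c][r] for c in range(w)] for r in range(h)]


def phase_correlation_shift(a, b, eps=1e-9):
    """Return (dy, dx) such that b is a shifted circularly by (dy, dx)."""
    Fa, Fb = dft2(a), dft2(b)
    h, w = len(a), len(a[0])
    R = [[0j] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            cross = Fb[r][c] * Fa[r][c].conjugate()
            mag = abs(cross)
            # Keep only the phase; drop frequencies with no signal energy.
            R[r][c] = cross / mag if mag > eps else 0j
    corr = dft2(R, inverse=True)
    best = max(((r, c) for r in range(h) for c in range(w)),
               key=lambda rc: corr[rc[0]][rc[1]].real)
    return best


if __name__ == "__main__":
    a = [[1.0 if 1 <= r <= 2 and 2 <= c <= 4 else 0.0 for c in range(8)]
         for r in range(8)]
    b = [[a[(r - 2) % 8][(c - 3) % 8] for c in range(8)] for r in range(8)]
    print(phase_correlation_shift(a, b))
```

Peak positions beyond half the image size correspond to negative displacements. Discarding the magnitude makes every frequency count equally, which is one way to see why the method is sensitive to noise in frequencies that carry little signal.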

Iteratively adjusting camera-motion parameters leads to local minima unless a reliable initial estimate is provided. Initial estimates can be obtained using a coarse global search or an efficiently implemented frequency domain approach [28,18].

3 Image Compositing

Images aligned after undergoing geometric corrections most likely require further processing to eliminate remaining distortions and discontinuities. Alignment of images may be imperfect due to registration errors resulting from incompatible model assumptions, dynamic scenes, etc. Furthermore, in most cases images that need to be mosaiced are not exposed evenly due to changing lighting conditions, automatic controls of cameras, printing/scanning devices, etc. These unwanted effects can be alleviated during the compositing process.

The main problem in image compositing is the problem of determining how the pixels in an overlapping area should be represented. Finding the best separation border between overlapping images [57] has the potential to eliminate remaining geometric distortions. Such a border is likely to traverse around moving objects avoiding double exposure [56,30]. The uneven exposure problem can be solved by histogram equalization [30,58], by iteratively distributing the edge effect on the border to a large area [59], or by a smooth blending function [60].
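
As a simple illustration of a smooth blending function, the sketch below feathers two horizontally overlapping images with a linear weight ramp across the overlap. This is plain linear feathering, not the multiresolution spline of [60], and it assumes the images are already registered, have the same height, and overlap by a known number of columns. The function name feather_blend is hypothetical.

```python
def feather_blend(left, right, overlap):
    """Blend two registered images whose last/first `overlap` columns coincide.

    left, right: lists of rows of pixel values with equal height.
    The weight of the right image rises linearly across the overlap.
    """
    if overlap < 0 or overlap > min(len(left[0]), len(right[0])):
        raise ValueError("overlap must fit inside both images")
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    blended = []
    for lrow, rrow in zip(left, right):
        split = len(lrow) - overlap
        row = list(lrow[:split])               # left-only region
        for i in range(overlap):
            alpha = (i + 1) / (overlap + 1)    # weight of the right image
            row.append((1 - alpha) * lrow[split + i] + alpha * rrow[i])
        row.extend(rrow[overlap:])             # right-only region
        blended.append(row)
    return blended


if __name__ == "__main__":
    # Two evenly lit patches with different exposure levels.
    left = [[10.0] * 4]
    right = [[20.0] * 4]
    print(feather_blend(left, right, 3)[0])
```

The exposure step between the two images is spread over the overlap instead of appearing as a visible seam; a wider overlap gives a gentler transition but blurs any misregistered detail over a larger area.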

4 Mosaicing Applications

The most common mosaicing applications include constructing high resolution images that cover an unlimited field of view using inexpensive equipment and creating immersive environments for effective information exchange through the internet. These applications have been extended towards the creation of completely navigable ``virtualized'' [61] environments by creating arbitrary views from a limited number of nodes [14,9,62,63]. The reconstruction of 3D scene structure from multiple nodes [15,16,13,17,18,64] has also been another active area of research.

We expect the use of image mosaicing to make a significant impact in video processing [23]. The complete representation of static scenes resulting from mosaicing video frames, in conjunction with an efficient representation for dynamic changes, provides a versatile environment for visualizing, efficiently coding, accessing, and analyzing information. Besides video compression [24] and indexing [25,26], this environment is shown to be useful for image stabilization [19,20] and building high quality images using low-cost imaging equipment [21,22].

As indicated by the recent history of newly developed applications, image mosaicing has become a major field of research. Besides a growing number of research papers, the public interest in image mosaicing has also been substantial. In recent years several construction tools and viewers for panoramic images have appeared as successful commercial products such as Adessosoft Inc.'s PanoTouch^©, Apple's Quicktime VR^TM [8], Black Diamond Inc.'s Surround Video^©, Terran Interactive Inc.'s Electrifier Pro^©, Enroute Imaging's Quickstich^©, IBM's PanoramIX^©, Infinite Pictures' SmoothMove^TM, Interactive Pictures' IPIX^TM Multimedia Builder, Live Picture Inc.'s PhotoVista^TM, Panavue's Visual Sticher^TM, PictureWorks' Spin Panorama^©, RealSpace Inc.'s RealVR^TM, RoundAbout Logic's Nodester^©, Ulead Systems' Cool 3D^©, Videobrush Corp.'s Panorama^© [65], Visdyn's Jutvision^TM.

References

[1]: P.R. Wolf. Elements of Photogrammetry. McGraw-Hill, 2 edition, 1983.
[2]: P. Kolonia. When more is better. Popular Photography, 58(1):30-34, Jan 1994.
[3]: S.B. Kang. A survey of image-based rendering techniques. Technical Report CRL 97/4, Digital Equipment Corp. Cambridge Research Lab, Aug 1997.
[4]: J. Lengyel. The convergence of graphics and vision. Computer, IEEE Computer Society Magazine, pages 46-53, July 1998.
[5]: N. Greene. Environment mapping and other applications of world projections. IEEE Transactions on Computer Graphics and Applications, pages 21-29, November 1986.
[6]: A. Watt. 3D Computer Graphics. Addison-Wesley, 2 edition, 1995.
[7]: J.D. Foley A.V. Dam S.K. Feiner J.F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley, 2 edition, 1996.
[8]: S.E. Chen. Quicktime VR - An image based approach to virtual environment navigation. In Proc. of ACM SIGGRAPH, 1995.
[9]: L. McMillan G. Bishop. Plenoptic modeling: An image-based rendering system. In Proc. of ACM SIGGRAPH, pages 39-46, 1995.
[10]: R. Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, pages 22-30, March 1996.
[11]: R. Szeliski H.Y. Shum. Creating full view panoramic image mosaics and environment maps. In Proc. of ACM SIGGRAPH, 1997.
[12]: Y. Xiong K. Turkowski. Creating image-based VR using a self-calibrating fisheye lens. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[13]: S. Coorg N. Master S. Teller. Acquisition of a large pose-mosaic dataset. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[14]: S.E. Chen L. Williams. View interpolation for image synthesis. In Proc. of ACM SIGGRAPH, 1993.
[15]: SMILE. European workshop on 3d structure from multiple images of large-scale environments (in conjunction with eccv), 1998.
[16]: H.Y. Shum M. Han R. Szeliski. Interactive construction of 3d models from panoramic mosaics. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, June 1998.
[17]: S.B. Kang R. Szeliski. 3-d scene data recovery using omnidirectional multibaseline stereo. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 364-370, 1996.
[18]: R. Szeliski. Image mosaicing for tele-reality applications. In Proc. of IEEE Workshop on Applications of Computer Vision, pages 44-53, 1994.
[19]: C. H. Morimoto R. Chellappa. Fast 3d stabilization and mosaic construction. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 660-665, June 1997.
[20]: M. Hansen P. Anandan K. Dana G. Wal P. Burt. Real-time scene stabilization and mosaic construction. In Proc. of IEEE Workshop on Applications of Computer Vision, pages 54-62, 1994.
[21]: D. Capel A. Zisserman. Automated mosaicing with super-resolution zoom. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[22]: S. Mann R.W. Picard. Virtual bellows: Constructing high quality stills from video. In Proc. of IEEE Int. Conf. on Image Processing, pages 363-367, 1994.
[23]: M. Irani P. Anandan J. Bergen R. Kumar S. Hsu. Mosaic representations of video sequences and their applications. Signal Processing: Image Communication, special issue on Image and Video Semantics: Processing, Analysis, and Application, 8(4), May 1996.
[24]: M. Irani S. Hsu P. Anandan. Video compression using mosaic representations. Signal Processing: Image Communication, 7:529-552, 1995.
[25]: M. Irani P. Anandan. Video indexing based on mosaic representations. Proceedings of the IEEE, 86(5):905-921, May 1998.
[26]: H.S. Sawhney S. Ayer. Compact representation of videos through dominant multiple motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):814-830, 1996.
[27]: I. Zoghlami O. Faugeras R. Deriche. Using geometric corners to build a 2d mosaic from a set of images. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 420-425, 1997.
[28]: H.S. Sawhney R. Kumar. True multi-image alignment and its application to mosaicing and lens distortion correction. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 450-456, 1997.
[29]: M. Fleck. Perspective projection: The wrong imaging model. Technical report, Dept. of Computer Science, University of Iowa, 1994.
[30]: S. Gümüstekin R.W. Hall. Mosaic image generation on a flattened gaussian sphere. In Proc. of IEEE Workshop on Applications of Computer Vision, pages 50-55, 1996.
[31]: S. Gümüstekin R.W. Hall. Image registration and mosaicing using a self calibrating camera. In Proc. of IEEE Int. Conf. on Image Processing, 1998.
[32]: E.H. Adelson J.R. Bergen. Computational Models of Visual Processing, chapter 1, The Plenoptic Function and the Elements of Early Vision. The MIT Press, 1991. edited by M. Landy J.A. Movshon.
[33]: E. Kreyszig. Differential Geometry. Dover Publications Inc. (republication of the 1963 edition by the University of Toronto Press), 1991.
[34]: R. Jain R. Kasturi B.G. Schunck. Machine Vision. McGraw-Hill, 1995.
[35]: N. Ayache. Vision Stereoscopique et Perception Multisensorielle. InterEditions, Paris, 1989.
[36]: Y. Onoe K. Yamazawa H. Takemura N. Yokoya. Tele-presence by real-time view-dependent image generation from omnidirectional video streams. IEEE Trans. on Computer Vision and Image Understanding, 71(2):154-165, Aug 1998.
[37]: S.K. Nayar. Catadioptric omnidirectional camera. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 482-488, 1997.
[38]: Y. Yagi S. Kawato S. Tsuji. Real-time omnidirectional image sensor (copis) for vision-guided navigation. IEEE Transactions on Robotics and Automation, 10(1):11-21, 1994.
[39]: S. Peleg J. Herman. Panoramic mosaics by manifold projection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[40]: J.Y. Zheng S. Tsuji. Panoramic representation for route recognition by a mobile robot. International Journal of Computer Vision, 9(1):56-76, 1992.
[41]: B. Rousso S. Peleg I. Finci. Mosaicing with generalized strips. In DARPA Image Understanding Workshop, May 1997.
[42]: B. Rousso S. Peleg I. Finci A. Rav-Acha. Universal mosaicing using pipe projection. In Proc. of IEEE International Conf. on Computer Vision, 1998.
[43]: P. Rademacher G. Bishop. Multiple-center-of-projection images. In Proc. of ACM SIGGRAPH, 1998.
[44]: R.C. Gonzalez R.E. Woods. Digital Image Processing. Addison-Wesley Publishing Company, 1993.
[45]: G. Wolberg. Digital Image Warping. IEEE Computer Society Press, 1990.
[46]: M. Irani P. Anandan. Robust multi-sensor image alignment. In Proc. of IEEE International Conf. on Computer Vision, India, Jan 1998.
[47]: A. Goshtasby. Piecewise linear mapping functions for image registration. Pattern Recognition, 19(6):459-468, 1986.
[48]: A. Goshtasby. Piecewise cubic mapping functions for image registration. Pattern Recognition, 20(5):525-533, 1987.
[49]: P. Jaillon A. Montanvert. Image mosaicing applied to three dimensional surfaces. In IEEE Int. Conf. on Pattern Recognition, pages 253-257, 1994.
[50]: H.Y. Shum R. Szeliski. Panoramic image mosaics. Technical Report MSR-TR-97-23, Microsoft Research, 1997.
[51]: L. G. Brown. A survey of image registration techniques. ACM Computing Surveys, 24(4):325-376, 1992.
[52]: P. Dani S. Chaudhuri. Automated assembling of images: Image montage preparation. Pattern Recognition, 28(3):431-445, 1995.
[53]: P.J. Burt. Smart sensing within a pyramid vision. Proceedings of the IEEE, 76(8):1006-1015, 1988.
[54]: C.D. Kuglin D.C. Hines. The phase correlation image alignment method. In Proc. of Int. Conf. Cybernetics Society, pages 163-165, 1975.
[55]: B.S. Reddy B.N. Chatterji. An fft-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing, 5(8), 1996.
[56]: J. Davis. Mosaics of scenes with moving objects. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[57]: D.L. Milgram. Adaptive techniques for photo mosaicing. IEEE Transactions in Computers, C-26:1175-1180, 1977.
[58]: J. Lim. Two Dimensional Signal and Image Processing. Prentice Hall, 1990.
[59]: S. Peleg. Elimination of seams from photomosaics. Computer Graphics and Image Processing, 16:90-94, 1981.
[60]: P.J. Burt E.H. Adelson. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics, 2(4):217-236, 1983.
[61]: T. Kanade P.W. Rander P.J. Narayanan. Virtualized reality: Constructing virtual worlds from real scenes. IEEE Trans. on Multimedia, 4(1):34-47, 1997.
[62]: N.L. Chang A. Zakhor. View generation for three-dimensional scenes from video sequences. IEEE Transactions on Image Processing, 6(4):584-598, 1997.
[63]: S.M. Seitz C.R. Dyer. View morphing. In Proc. of ACM SIGGRAPH, 1996.
[64]: M. Herman T. Kanade. The 3D Mosaic Scene Understanding System (chapter 14 in "From pixels to Predicates" edited by A. Pentland). Ablex Pub., 1986.
[65]: H. S. Sawhney R. Kumar G. Gendel J. Bergen D. Dixon V. Paragano. Videobrush: Experiences with consumer video mosaicing. In Proc. of IEEE Workshop on Applications of Computer Vision, pages 56-62, 1998.

Footnotes:

¹ Four point correspondences are sufficient to solve for eight unknown parameters.

² McMillan and Bishop [9] use the term ``plenoptic sample''.
