VCA
Video Coding & Architectures Research group





Multi-view 3D video acquisition, coding and rendering



A 3D video is typically obtained from a set of synchronized cameras that capture the same scene from different viewpoints (multiview video). This technique enables applications such as free-viewpoint video and 3D-TV. First, the free-viewpoint video application lets users interactively select a viewpoint of the scene. Second, with 3D-TV, the depth of the scene can be perceived using a multiview display. The corresponding display technology is based on showing several views of the same scene. Because each eye observes a slightly different view, the human brain integrates these views into a 3D representation of the scene.


Synthetic rendered image.

A 3D-TV simultaneously shows several views of the same scene so that the human brain can integrate them into a 3D representation.

We have developed a multi-view video-processing system that can be decomposed into the following sub-systems:
  • Multi-view video acquisition and calibration
  • Depth estimation
  • Texture compression
  • Depth compression

Multi-view video acquisition


Our multi-view capturing system is based on a set of FireWire Sony cameras. The cameras are synchronized and fully calibrated prior to the capture session.


Camera calibration and lens distortion


The multi-view camera acquisition setup is fully calibrated using a planar chessboard pattern. Our software performs corner-point estimation in real time, so that the camera calibration procedure can be completed within minutes.


Automatic real-time corner point detection (MPEG movie).

Calculated position of cameras and chessboard in 3D space.
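
To make the role of the chessboard corners concrete, the following is a minimal sketch of the pinhole projection model that chessboard-based calibration commonly fits, and of the reprojection error such a fit minimizes. It is not the calibration software described above. The board dimensions, camera pose, intrinsic parameters and all function names are hypothetical, lens distortion is ignored, and the "detected" corners are simulated by projecting the board with the true parameters.

```python
import math

def chessboard_corners(cols, rows, square):
    """Inner-corner coordinates of a planar board, in the board frame (Z = 0)."""
    return [(c * square, r * square, 0.0) for r in range(rows) for c in range(cols)]

def rotation_y(angle):
    """Rotation matrix for a rotation of `angle` radians about the Y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(point, R, t, fx, fy, cx, cy):
    """Project a 3D board point into pixel coordinates (no lens distortion)."""
    # Board frame -> camera frame: X_cam = R * X + t
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    # Perspective division followed by the intrinsic parameters
    return (fx * x / z + cx, fy * y / z + cy)

def rms_reprojection_error(points, pixels, R, t, fx, fy, cx, cy):
    """Root-mean-square distance between detected and reprojected corners."""
    total = 0.0
    for p, (u, v) in zip(points, pixels):
        pu, pv = project(p, R, t, fx, fy, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(points))

# Hypothetical setup: 9x6 inner corners, 25 mm squares, board about 1 m away.
board = chessboard_corners(9, 6, 0.025)
R, t = rotation_y(math.radians(20)), (-0.1, -0.06, 1.0)
intrinsics = dict(fx=800.0, fy=800.0, cx=320.0, cy=240.0)

# Stand-in for detected corners: the exact projections of the board.
detected = [project(p, R, t, **intrinsics) for p in board]

print("error with true parameters: ",
      rms_reprojection_error(board, detected, R, t, **intrinsics))
# A wrong focal length no longer explains the detections.
wrong = dict(intrinsics, fx=820.0, fy=820.0)
print("error with wrong focal length:",
      rms_reprojection_error(board, detected, R, t, **wrong))
```

A calibration routine would search for the intrinsic parameters and per-view poses that minimize this error over many board positions; the sketch only evaluates the error for given parameters.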

Depth image compression


Emerging 3-D displays show several views of the scene simultaneously. A direct transmission of a selection of these views is impractical, because different types of displays support different numbers of views and the decoder has to interpolate the intermediate views. The transmission of multiview image information can be simplified by transmitting only the texture data for the central view and a corresponding depth map. In addition to the coding of the texture data, this technique requires efficient coding of depth maps. Since the depth map represents the scene geometry and thereby determines the 3-D perception of the scene, sharp edges corresponding to object boundaries should be preserved.

We propose an algorithm that models depth maps using piecewise-linear functions (platelets). To adapt to varying scene detail, we employ a quadtree decomposition that divides the image into blocks of variable size, each block being approximated by one platelet. To preserve sharp object boundaries, the support area of each platelet is adapted to the object boundary. The subdivision of the quadtree and the selection of the platelet type are optimized such that a global rate-distortion trade-off is realized. Experimental results show that, compared to a JPEG-2000 encoder, the described method improves the picture quality of compressed depth maps by 1-3 dB.
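
The quadtree decomposition with a rate-distortion criterion can be illustrated with a simplified sketch. It is not the platelet coder itself: each block is approximated by a single linear function, and the edge-adapted platelet types that split a block along an object boundary are omitted. The bit costs per block and per split flag, the Lagrange multiplier `lam`, the 8-bit peak value and all function names are assumptions made for this example. A block is subdivided only if doing so lowers the Lagrangian cost J = D + lam * R, where D is the sum of squared errors and R the assumed bit count.

```python
import math

BITS_PER_LEAF = 3 * 8   # assumed: three 8-bit coefficients per block
BITS_PER_FLAG = 1       # assumed: one split/no-split flag per node

def fit_plane(img, x0, y0, size):
    """Least-squares fit of z = a + b*dx + c*dy over a square block.

    dx, dy are offsets from the block centre; on a regular grid this makes
    the three normal equations independent of each other.
    """
    mid = (size - 1) / 2.0
    sz = sxz = syz = sxx = 0.0
    for y in range(size):
        for x in range(size):
            z = img[y0 + y][x0 + x]
            sz += z
            sxz += (x - mid) * z
            syz += (y - mid) * z
            sxx += (x - mid) ** 2   # equals the sum of (y - mid)^2 on a square
    a = sz / (size * size)
    b = sxz / sxx if sxx else 0.0
    c = syz / sxx if sxx else 0.0
    sse = sum((img[y0 + y][x0 + x] - (a + b * (x - mid) + c * (y - mid))) ** 2
              for y in range(size) for x in range(size))
    return (a, b, c), sse

def encode(img, x0, y0, size, lam, min_size=2):
    """Return (cost, rate_bits, sse, leaves) of the cheapest subtree for a block."""
    coeffs, sse = fit_plane(img, x0, y0, size)
    rate = BITS_PER_FLAG + BITS_PER_LEAF
    best = (sse + lam * rate, rate, sse, [(x0, y0, size, coeffs)])
    if size > min_size:
        half = size // 2
        kids = [encode(img, x0 + dx, y0 + dy, half, lam, min_size)
                for dy in (0, half) for dx in (0, half)]
        rate = BITS_PER_FLAG + sum(k[1] for k in kids)
        sse_kids = sum(k[2] for k in kids)
        leaves = [leaf for k in kids for leaf in k[3]]
        split = (sse_kids + lam * rate, rate, sse_kids, leaves)
        if split[0] < best[0]:      # subdivide only if the Lagrangian cost drops
            best = split
    return best

def psnr(sse, n_pixels, peak=255.0):
    """PSNR in dB for a given sum of squared errors (peak value assumed 8-bit)."""
    return float("inf") if sse == 0 else 10 * math.log10(peak ** 2 * n_pixels / sse)

# Toy 16x16 depth map: a sloped background with a flat foreground object.
depth = [[200.0 if (x >= 8 and y < 8) else 50.0 + 2.0 * x for x in range(16)]
         for y in range(16)]

cost, rate, sse, leaves = encode(depth, 0, 0, 16, lam=1.0)
print(len(leaves), "blocks,", rate / 256, "bpp,", psnr(sse, 256), "dB")
```

Raising `lam` makes bits more expensive relative to distortion, so the tree is pruned towards fewer and larger blocks; lowering it has the opposite effect.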


Original depth image "Ballet".

Image coded with platelets at 0.032 bpp and 39 dB (no grid superimposed).

Image coded with platelets at 0.032 bpp and 39 dB (grid superimposed).
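
For context, a single texture view plus a depth map is sufficient for a multiview display because intermediate views can be synthesized from them at the decoder. The following is a minimal sketch of such depth-based view synthesis, under assumptions that are not stated in the text above: the cameras are rectified and parallel, so a pixel at depth Z shifts horizontally by a disparity d = f * b / Z, with f the focal length in pixels and b the signed horizontal offset of the virtual camera. The function name and the toy data are hypothetical, and hole filling is left out.

```python
def render_virtual_view(texture, depth, focal, baseline):
    """Warp a central view horizontally into a virtual view.

    texture: rows of pixel values; depth: rows of depth values Z (same shape).
    focal is in pixels; baseline is the signed horizontal camera offset.
    Pixels that nothing maps onto stay None (disocclusion or border holes).
    """
    height, width = len(texture), len(texture[0])
    out = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            z = depth[y][x]
            # Disparity in pixels for rectified, parallel cameras: d = f * b / Z
            xv = x - round(focal * baseline / z)
            # Keep the nearest surface when several pixels land on the same spot
            if 0 <= xv < width and z < zbuf[y][xv]:
                out[y][xv] = texture[y][x]
                zbuf[y][xv] = z
    return out

# Toy scanline: a near object (Z = 2) in front of a far background (Z = 4).
texture = [["b0", "b1", "F2", "F3", "b4", "b5", "b6", "b7"]]
depth = [[4.0, 4.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0]]

print(render_virtual_view(texture, depth, focal=8.0, baseline=0.5))
# → [['F2', 'F3', None, 'b4', 'b5', 'b6', 'b7', None]]
```

The foreground pixels shift further than the background and uncover a hole next to the object. An error in the depth value at an object boundary moves pixels to the wrong position in the synthesized view, which is why sharp depth edges matter for the coded depth map.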



Coded depth images are available at http://vca.ele.tue.nl/demos/mvc/PlateletDepthCoding.tgz