Extracting spatio-temporally consistent segments from a video sequence is a challenging problem due to the complexity of color, motion, and occlusions. Most existing spatio-temporal segmentation approaches have inherent difficulties in handling large displacements with significant occlusions. This paper presents a novel framework for spatio-temporal segmentation. With depth data estimated beforehand by a multi-view stereo technique, we project pixels to other frames to collect boundary and segmentation statistics across the video, and incorporate these statistics into the segmentation energy for spatio-temporal optimization.
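As a rough illustration of the projection step, the sketch below warps every pixel of one frame into another frame using its estimated depth, assuming a standard pinhole camera model with known intrinsics K and world-to-camera poses (R, t); these names and the NumPy formulation are illustrative assumptions, not the paper's implementation.

import numpy as np

def project_pixels(depth_i, K, R_i, t_i, R_j, t_j):
    """Warp every pixel of frame i into frame j using its estimated depth (sketch).

    depth_i : (H, W) depth map of frame i
    K       : (3, 3) camera intrinsic matrix
    R_*, t_*: world-to-camera rotation (3, 3) and translation (3,) of each frame
    Returns an (H, W, 2) array of projected pixel coordinates in frame j.
    """
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW

    # Back-project pixels of frame i to 3D camera coordinates, then to world space.
    cam_i = np.linalg.inv(K) @ pix * depth_i.reshape(1, -1)
    world = R_i.T @ (cam_i - t_i.reshape(3, 1))

    # Re-project the world points into frame j and dehomogenize.
    cam_j = R_j @ world + t_j.reshape(3, 1)
    proj = K @ cam_j
    proj = proj[:2] / np.clip(proj[2], 1e-8, None)
    return proj.T.reshape(H, W, 2)

With such a warp, the segment label and boundary evidence at each projected location can be accumulated into per-pixel statistics for the segmentation energy.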
To solve this problem effectively, we introduce an iterative optimization scheme that first initializes a segmentation map for each frame independently, then links correspondences among different frames and iteratively refines the maps with the collected statistics, so that a set of spatio-temporally consistent volume segments is finally obtained. The effectiveness and usefulness of our automatic framework are demonstrated through applications to 3D reconstruction, video editing, and semantic segmentation on a variety of challenging video examples.
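A minimal structural sketch of the iterative scheme is given below; the helper functions (segment_frame, link_correspondences, collect_statistics, refine_with_statistics) are hypothetical placeholders for the per-frame initialization, cross-frame linking, statistics collection, and energy-based refinement steps, and do not correspond to a published implementation.

def spatio_temporal_segmentation(frames, depths, cameras, num_iters=5):
    # 1. Initialize a segmentation map for each frame independently.
    #    (segment_frame is a hypothetical per-frame segmenter.)
    seg_maps = [segment_frame(f) for f in frames]

    for _ in range(num_iters):
        # 2. Link segments across frames by projecting pixels with their depths.
        links = link_correspondences(seg_maps, depths, cameras)

        # 3. Gather boundary and segmentation statistics along the links.
        stats = collect_statistics(seg_maps, links)

        # 4. Refine each frame's segmentation by minimizing an energy that
        #    incorporates the collected spatio-temporal statistics.
        seg_maps = [refine_with_statistics(seg, stats, frame_idx=i)
                    for i, seg in enumerate(seg_maps)]

    # The refined per-frame maps form spatio-temporally consistent volume segments.
    return seg_maps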