This paper proposes a structured L2,1 optimization model that simultaneously characterizes reconstruction capability and diversity, in order to provide a semantically meaningful representation of a short video clip acquired from digital cameras or a mobile robot. The model imposes a mutual inhibition penalty term to prevent similar samples from being selected simultaneously. It is flexible enough to incorporate different mutual inhibition terms, and the temporal redundancy in video is exploited to encourage diversity.
The resulting objective function is nonconvex, and an iterative algorithm is developed to solve the optimization problem. Performance is evaluated on various video clips from YouTube and on practical video captured by an indoor mobile robot. The results indicate that the proposed strategy helps the optimization model obtain more diversified key frames than existing methods.