Human parsing, namely partitioning the human body into semantic regions, has drawn much attention recently for its wide applications in human-centric analysis. Previous works often treat human pose estimation as a prerequisite for human parsing. We argue that these approaches cannot obtain optimal pixel-level parsing because the two tasks have inconsistent targets. In this work, we directly address human parsing by using the novel Parselet representation as the building block of our parsing model. Parselets are a group of parsable segments that can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modality of Parselets.
The proposed model has two unique characteristics: (1) the potentially numerous modalities of Parselet ensembles are represented by the “And-Or” structure of sub-trees; (2) to handle the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus solves human parsing directly, without intermediate tasks, by searching for the best graph configuration from a pool of Parselet hypotheses. Fast rejection based on hierarchical filtering is employed to ensure overall efficiency. Comprehensive evaluations on two datasets demonstrate the encouraging performance of the proposed approach: a new large-scale human parsing dataset crawled from the Internet, with high resolution and thorough pixel-level semantic annotations, and a benchmark dataset.
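To make the “And-Or” idea concrete, the following is a minimal Python sketch of how a best configuration could be selected from a pool of scored hypotheses. It is an illustration under simplifying assumptions, not the DMPM itself: the class names (`Leaf`, `And`, `Or`), the `absent_score` field, and all scores are hypothetical, and the deformation (pairwise) terms and hierarchical filtering are omitted. An And node sums the scores of all its children, an Or node picks its best-scoring alternative, and a leaf with an `absent_score` may be declared invisible.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Leaf:
    """A Parselet type with a non-empty pool of scored segment hypotheses."""
    name: str
    scores: List[float]                   # hypothetical appearance scores
    absent_score: Optional[float] = None  # None means the leaf must be visible


@dataclass
class And:
    """All children must be present; their scores are summed."""
    children: List["Node"]


@dataclass
class Or:
    """Exactly one child (one modality) is selected."""
    children: List["Node"]


Node = Union[Leaf, And, Or]
Assignment = Dict[str, Optional[int]]


def best(node: Node) -> Tuple[float, Assignment]:
    """Return the best score and the chosen hypothesis index per leaf.

    An index of None marks a leaf that was declared invisible.
    """
    if isinstance(node, Leaf):
        idx = max(range(len(node.scores)), key=node.scores.__getitem__)
        score, choice = node.scores[idx], idx
        if node.absent_score is not None and node.absent_score > score:
            score, choice = node.absent_score, None
        return score, {node.name: choice}

    results = [best(child) for child in node.children]
    if isinstance(node, And):
        merged: Assignment = {}
        for _, assignment in results:
            merged.update(assignment)
        return sum(score for score, _ in results), merged

    # Or node: keep only the best-scoring alternative.
    return max(results, key=lambda result: result[0])


# Toy example: a dress competes with an upper-clothes + skirt combination.
example = And([
    Leaf("hair", [0.25, 1.0]),
    Or([
        Leaf("dress", [0.5]),
        And([
            Leaf("upper", [0.75, 0.25]),
            Leaf("skirt", [-0.5], absent_score=0.0),
        ]),
    ]),
])

score, assignment = best(example)
print(score, assignment)
```

In this toy tree the upper-clothes branch wins over the dress branch, and the skirt leaf is marked invisible because its only hypothesis scores below the assumed absence score. A full model would also score the spatial relations between selected segments, which this sketch leaves out.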