Source transcription of pitched polyphonic music entails providing the pitch (F0) values of each source in a separate channel. It is an important step towards solving many problems in music and speech processing. It involves (1) estimating the multiple F0 values in each short time frame and (2) clustering these F0 values into streams corresponding to the different sources. We address the problem in an unsupervised manner, with only the total number of sources given beforehand. Probabilistic latent component analysis (PLCA) is used to decompose the polyphonic short-time magnitude spectra for multiple F0 estimation and source-specific feature extraction. The PLCA framework is further embedded into a hidden Markov random field (HMRF) structure for clustering the F0s into different sources.
This clustering is constrained by the cognitive grouping of continuous F0 contours and by the segregation of simultaneous F0s into different source streams. HMRFs model both constraints naturally. Simulated annealing is used to vary the degree to which the constraints are enforced, which improves the clustering. The paper also proposes a novel strategy that exploits the trade-off between precision and recall in multiple F0 estimation to improve the clustering. Evaluations on a variety of datasets show the efficacy of the proposed algorithm and its robustness to spurious F0s during clustering. In a set of comparative experiments, it also outperforms a state-of-the-art unsupervised source streaming algorithm.
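
To make the two clustering constraints concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes each F0 estimate already comes with a feature vector (here a hypothetical 2-D toy feature rather than PLCA-derived features), treats F0s in adjacent frames that lie within `max_jump_semitones` as must-link pairs (contour continuity), and treats F0s in the same frame as cannot-link pairs (simultaneity). The simulated annealing is replaced by a simpler stand-in: a deterministic label sweep in which the constraint weight `beta` is ramped up linearly. The function names, the squared-distance data term, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def build_constraints(frames, f0s, max_jump_semitones=0.5):
    """Derive pairwise constraints from frame indices and F0 values (Hz)."""
    must, cannot = [], []
    n = len(frames)
    for i in range(n):
        for j in range(i + 1, n):
            if frames[i] == frames[j]:
                # Simultaneous F0s should belong to different sources.
                cannot.append((i, j))
            elif abs(frames[i] - frames[j]) == 1:
                # Adjacent frames with nearly equal F0 form one contour.
                jump = 12.0 * abs(np.log2(f0s[i] / f0s[j]))
                if jump <= max_jump_semitones:
                    must.append((i, j))
    return must, cannot

def constrained_cluster(X, must, cannot, k, beta_max=5.0, n_iter=20):
    """Cluster rows of X into k groups under soft pairwise constraints."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Farthest-first initialisation keeps the sketch deterministic.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[int(np.argmax(d))])
    centroids = np.array(centroids)
    labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    # Neighbour lists for fast lookup of each point's constraints.
    ml = [[] for _ in range(n)]
    cl = [[] for _ in range(n)]
    for i, j in must:
        ml[i].append(j)
        ml[j].append(i)
    for i, j in cannot:
        cl[i].append(j)
        cl[j].append(i)
    for it in range(n_iter):
        # Assumed schedule: constraint weight grows linearly to beta_max.
        beta = beta_max * (it + 1) / n_iter
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
        for i in range(n):
            # Data term: squared distance to each cluster centroid.
            cost = np.sum((X[i] - centroids) ** 2, axis=1)
            for c in range(k):
                # Penalise broken must-links and violated cannot-links.
                cost[c] += beta * sum(labels[j] != c for j in ml[i])
                cost[c] += beta * sum(labels[j] == c for j in cl[i])
            labels[i] = int(np.argmin(cost))
    return labels

# Toy data: two sources, six frames, one F0 per source per frame.
frames = np.repeat(np.arange(6), 2)
f0s = np.tile([220.0, 330.0], 6)
X = np.tile([[0.0, 0.0], [3.0, 3.0]], (6, 1))
X[6] = [2.0, 2.0]  # frame 3 of the 220 Hz contour has a misleading feature

must, cannot = build_constraints(frames, f0s)
labels_free = constrained_cluster(X, [], [], k=2)
labels_constrained = constrained_cluster(X, must, cannot, k=2)
```

In this toy case the feature of the fourth 220 Hz estimate lies closer to the other source, so clustering on features alone assigns it to the wrong stream. Once the constraint weight is large enough, its two contour neighbours and the simultaneous 330 Hz estimate pull it back into the 220 Hz stream.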