This work explores the use of Empirical Mode Decomposition (EMD) for discriminating speech regions from music in audio recordings. The different frequency scales or Intrinsic Mode Functions (IMFs) obtained from EMD of the audio signal are found to contain discriminatory evidence for distinguishing the speech regions from the music regions of the audio signal.
Different statistical measures like mean, absolute mean, variance, skewness and kurtosis are computed from the various IMFs and investigated for speech vs music discrimination. These features on being used for classification using classifiers like Support Vector Machines (SVMs) and k-Nearest Neighbour (k-NN) on the Scheirer and Slaney database gives the best overall classification accuracy of 90.83% for the SVMs and 85.33% for the k-NN.