Document Retrieval with Unlimited Vocabulary

In this paper, we describe a classifier-based retrieval scheme for efficiently and accurately retrieving relevant documents. We use SVM classifiers for word retrieval and argue that classifier-based solutions can be superior to OCR-based solutions in many practical situations. Classifier-based solutions have two practical limitations: limited vocabulary support and limited availability of training data. To overcome them, we design a one-shot learning scheme for dynamically synthesizing classifiers: given a set of SVM classifiers, we appropriately join them to create novel classifiers.
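
As a rough illustration of what joining existing classifiers could look like, the sketch below builds a linear classifier for an unseen word from the per-character segments of classifiers for known words. It is not the paper's algorithm: the fixed number of weights per character (`SEG_DIM`), the averaging of segments, the omission of the SVM bias term, and the names `char_segments`, `synthesize` and `score` are all assumptions made for this example.

```python
from typing import Dict, List

SEG_DIM = 2  # hypothetical: number of weights per character position


def char_segments(word: str, weights: List[float]) -> List[List[float]]:
    """Split a word classifier's weight vector into one segment per character."""
    if len(weights) != len(word) * SEG_DIM:
        raise ValueError("weight vector length does not match word length")
    return [weights[i * SEG_DIM:(i + 1) * SEG_DIM] for i in range(len(word))]


def synthesize(query: str, known: Dict[str, List[float]]) -> List[float]:
    """Build a weight vector for `query` from the segments of known classifiers.

    Assumption: segments for the same character are averaged, then
    concatenated in the order the characters appear in the query.
    """
    pool: Dict[str, List[List[float]]] = {}
    for word, weights in known.items():
        for ch, seg in zip(word, char_segments(word, weights)):
            pool.setdefault(ch, []).append(seg)

    joined: List[float] = []
    for ch in query:
        if ch not in pool:
            raise KeyError(f"no known classifier covers character {ch!r}")
        segs = pool[ch]
        joined.extend(sum(col) / len(segs) for col in zip(*segs))
    return joined


def score(weights: List[float], features: List[float]) -> float:
    """Linear decision value (bias omitted in this sketch)."""
    return sum(w * x for w, x in zip(weights, features))


# Toy classifiers for two known words; the numbers are made up.
known = {
    "car": [1, 0, 0, 1, 2, 2],
    "bat": [3, 3, 0, 3, 5, 5],
}
cat_weights = synthesize("cat", known)
```

Here the classifier for "cat" takes its "c" segment from "car", its "t" segment from "bat", and averages the two available "a" segments. A real system would also have to handle characters of varying width, which this toy layout ignores.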

This extends the classifier-based retrieval paradigm to an unlimited number of classes (words) present in a language. We validate our method on multiple datasets and compare it with popular alternatives such as OCR and word spotting. Even for a language like English, where OCR is fairly mature, our method yields comparable or even superior results. These results are significant because we do not use any language-specific post-processing to obtain this performance. To improve the accuracy of the retrieved list, we use query expansion, which also allows us to seamlessly adapt our solution to new fonts, styles and collections.
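
Query expansion can take several forms, and the abstract does not say which one is used. The sketch below shows one common variant, average query expansion, purely as an illustration: the query vector is averaged with its top-ranked results and the database is ranked again. The dot-product scoring, the `top_k` parameter and the function names are assumptions for this example.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank(query: List[float], database: List[List[float]]) -> List[int]:
    """Return database indices sorted by decreasing dot-product score."""
    return sorted(range(len(database)), key=lambda i: -dot(query, database[i]))


def expand_query(query: List[float], database: List[List[float]],
                 top_k: int = 2) -> List[float]:
    """Average the query with its top_k retrieved vectors."""
    top = rank(query, database)[:top_k]
    vectors = [query] + [database[i] for i in top]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Toy word-image features; the numbers are made up.
database = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]

first_pass = rank(query, database)
expanded = expand_query(query, database, top_k=2)
second_pass = rank(expanded, database)
```

Because the expanded query absorbs features from the retrieved samples themselves, it drifts toward the appearance of the target collection, which is one plausible reading of how expansion helps with new fonts and styles.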
