Document Retrieval with Unlimited Vocabulary

In this paper, we describe a classifier-based retrieval scheme for retrieving relevant documents efficiently and accurately. We use SVM classifiers for word retrieval and argue that classifier-based solutions can be superior to OCR-based solutions in many practical situations. Classifier-based solutions have two practical limitations: limited vocabulary support and limited availability of training data. To overcome them, we design a one-shot learning scheme that dynamically synthesizes classifiers: given a set of SVM classifiers, we appropriately join them to create novel classifiers.
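The abstract does not say how the classifiers are joined, so the following is only a minimal sketch of the general idea under strong simplifying assumptions. It assumes that each word classifier is a linear weight matrix with one fixed-length segment per character, and that a classifier for an unseen word can be assembled by reusing the segments of matching characters from known classifiers. The names `synthesize_classifier` and `score`, and the per-character segment layout, are hypothetical and are not taken from the paper.

```python
import numpy as np

def synthesize_classifier(target, known):
    """Assemble a weight matrix for `target` from known word classifiers.

    known: dict mapping a word to an array of shape (len(word), d),
           i.e. one d-dimensional weight segment per character
           (an assumed layout, chosen for illustration).
    """
    # Collect every weight segment seen for each character.
    segments = {}
    for word, weights in known.items():
        for ch, seg in zip(word, weights):
            segments.setdefault(ch, []).append(seg)

    missing = [ch for ch in target if ch not in segments]
    if missing:
        raise KeyError(f"no known classifier covers characters: {missing}")

    # Average the segments available for each character, then stack them
    # in the order the characters appear in the target word.
    return np.stack([np.mean(segments[ch], axis=0) for ch in target])

def score(weights, features):
    """Linear score of a word image whose features share the weight layout."""
    return float(np.sum(weights * features))
```

For example, with a single known classifier for "cat", `synthesize_classifier("act", known)` reorders its three segments to produce a classifier for "act" without any additional training data. A practical system would likely reuse longer shared substrings rather than single characters and would need to handle characters of varying width.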

Synthesizing classifiers on demand extends the classifier-based retrieval paradigm to an unlimited number of classes (words) present in a language. We validate our method on multiple datasets and compare it with popular alternatives such as OCR and word spotting. Even for a language like English, where OCR is fairly advanced, our method yields comparable or even superior results. These results are significant because we do not use any language-specific post-processing to obtain this performance. To improve the accuracy of the retrieved list, we use query expansion, which also allows us to seamlessly adapt our solution to new fonts, styles, and collections.
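Query expansion can take several forms, and the abstract does not specify which one is used. As an illustration only, the sketch below uses average query expansion, a common variant: the top-ranked results of an initial query are averaged and folded back into the query vector, so that the refined query reflects the fonts and styles actually present in the collection. The function names, the linear scoring, and the choice of `top_k` are assumptions made for this example.

```python
import numpy as np

def retrieve(w, features, top_k):
    """Return indices of the top_k rows of `features` ranked by linear score."""
    scores = features @ w
    # Stable sort on negated scores gives a deterministic descending order.
    return np.argsort(-scores, kind="stable")[:top_k]

def expand_query(w, features, top_k=5):
    """Average query expansion: blend the query with its top-ranked results."""
    top = retrieve(w, features, top_k)
    expanded = w + features[top].mean(axis=0)
    norm = np.linalg.norm(expanded)
    # Leave the vector unscaled in the degenerate all-zero case.
    return expanded / norm if norm > 0 else expanded
```

A typical use would be to call `expand_query` once on the initial classifier weights and then rank the full collection again with `retrieve` using the expanded vector.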
