Joint Research Conference

June 24-26, 2014

Text Categorization via Similarity Search: An Efficient and Effective Supervised Learning Algorithm

Abstract:

We present an efficient and effective supervised learning algorithm for text categorization, developed by members of the University of Ottawa Data Science Group, that is highly scalable to big data. The algorithm is based on similarity search in the space of measure distributions under the Vector Space Model and associates a measure distribution to each text category. Unlike what is usual in clustering, however, this point is not a centroid of the category but rather an outlier: a uniform measure on a selection of category-specific words. Given a new text document vector, the algorithm simply assigns it to the category whose uniform distribution is closest according to the inner product. Because the algorithm uses only matrix operations, it is easily parallelized for big data. Successes with this algorithm include winning first place in the 2013 international Cybersecurity Data Mining Competition (CDMC 2013) and outperforming state-of-the-art classifiers, such as SVM and Random Forest, in both accuracy and F-measure on the Reuters-21578 dataset.
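
The classification step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how category-specific words are selected, so the word indices in `selected` are hand-picked placeholders, and the helper names `uniform_category_matrix` and `classify` are hypothetical. "Closest according to the inner product" is read here as "largest inner product".

```python
import numpy as np

def uniform_category_matrix(selected_words, vocab_size):
    """Row c is the uniform measure on the words selected for category c."""
    M = np.zeros((len(selected_words), vocab_size))
    for c, words in enumerate(selected_words):
        # Equal mass on each selected word, zero elsewhere.
        M[c, list(words)] = 1.0 / len(words)
    return M

def classify(doc_vectors, category_matrix):
    """Assign each document to the category with the largest inner product."""
    # One matrix product scores every document against every category.
    scores = doc_vectors @ category_matrix.T
    return scores.argmax(axis=1)

# Toy vocabulary of 6 words and two categories.
# ASSUMPTION: these word selections are placeholders; the abstract does not
# describe the selection procedure.
selected = [[0, 1, 2], [3, 4]]
M = uniform_category_matrix(selected, vocab_size=6)

# Two toy document vectors (e.g. term counts) in the Vector Space Model.
docs = np.array([
    [2.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 3.0, 2.0, 1.0],
])
labels = classify(docs, M)
print(labels)
```

Since scoring reduces to a single matrix product followed by a row-wise argmax, the document matrix can be split into row blocks and scored independently, which is one way to read the abstract's remark about parallelization.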


This is joint work with Vladimir Pestov, University of Ottawa, and Varun Singla, Goldman Sachs India.