Joint Research Conference

June 24-26, 2014

Risk-Aware Classification Validation

Abstract:

As an important task in big data modeling, classification has received tremendous attention in data mining and machine learning. The goal of classification is to build a machine classifier from labeled data so that it can be applied to previously unseen instances to make predictions. Previous work has mainly focused on designing algorithms that build classifiers with accurate predictions. However, little work has been done on validating classification results after the model is applied to new instances. In practice, verifying classification results tends to be complex and time-consuming. For example, when a responsive document is detected from a large corpus by a machine classifier, a group of lawyers and domain experts must read, analyze, and discuss the content to ensure the detection is not a false alarm. Moreover, one must ensure that most, if not all, responsive documents are successfully detected. Since the validation process is tedious, it is impossible to verify every scored document, and the decision to stop verifying additional cases introduces the risk of missing truly responsive instances. In this study, we analyze the risk of using machine classifiers while verifying only a subset of the scoring set, which we call the “stopping set.” Three different verification procedures are described, analyzed, and critiqued. Both theoretical analysis and experimental results show a tradeoff between verification cost and the risk of missing responsive instances.
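
To make the cost-versus-risk tradeoff concrete, the following is a minimal illustrative sketch, not one of the verification procedures analyzed in the study. The score distributions, responsive-document prevalence, and stopping-set sizes are assumed values chosen only to show how reading more of the ranked scoring set reduces the number of missed responsive documents at increasing verification cost.

```python
# Toy simulation of the verification cost vs. risk tradeoff.
# All quantities (score distributions, prevalence, stopping-set sizes)
# are illustrative assumptions, not values from the study itself.
import random

random.seed(0)

N = 10_000            # size of the scoring set
PREVALENCE = 0.05     # assumed fraction of truly responsive documents

# Simulate classifier scores: responsive documents tend to score higher,
# but the distributions overlap, so the ranking is imperfect.
docs = []
for _ in range(N):
    responsive = random.random() < PREVALENCE
    score = random.gauss(0.7, 0.15) if responsive else random.gauss(0.4, 0.15)
    docs.append((score, responsive))

# Rank documents by score and verify only the top-k (the "stopping set").
docs.sort(key=lambda d: d[0], reverse=True)
total_responsive = sum(r for _, r in docs)

print(f"{'stopping set':>12} {'cost (docs read)':>16} {'missed responsive':>18}")
for frac in (0.05, 0.10, 0.25, 0.50, 1.00):
    k = int(frac * N)
    found = sum(r for _, r in docs[:k])
    missed = total_responsive - found
    print(f"{frac:>11.0%} {k:>16d} {missed:>18d}")
```

Running the sketch prints one row per stopping-set size: the number of documents that would need manual verification (cost) and the number of responsive documents left unverified (risk), which shrinks only as the stopping set grows.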