JRC 2010 - Invited Program

Semi-Supervised Learning

Organizers: George Michailidis, Univ. of Michigan and Mark Culp, West Virginia Univ.
Session Chair: Dan Samarov, NIST

On Propagated Scoring for Semi-supervised Additive Models

Mark Culp
West Virginia University

Abstract: In this talk, a semi-supervised modeling framework that combines feature-based (X) data and graph-based (G) data for classification/regression of the response Y is presented. In this semi-supervised setting, Y is observed for a subset of the observations (labeled) and missing for the remainder (unlabeled). The Propagated Scoring algorithm proposed for fitting this model is a semi-supervised fixed point regularization approach that essentially extends the generalized additive model into the semi-supervised setting. We first articulate when semi-supervised degeneracies are expected to occur within our framework and then provide a general regularization strategy to address such circumstances. For statistical analysis we establish that the approach uses shrinking smoothers, give conditions under which the result is consistent, provide measures of inference and description, and establish clear connections to supervised models. Lastly, several semi-supervised approaches have been considered for the classification problem posed, typically motivated from an energy optimization perspective. In this work, we rigorously connect the statistically based propagated scoring framework to this class of approaches. This is particularly insightful, especially with regard to supervised comparisons, since this type of analysis is lacking for the previous work. Two applications are presented: the first involves classification of protein location on a cell using a network of protein interaction data, and the second involves classification of text documents using citation network information and text data.

Adversarial Classification

Bowei Xi
Purdue University

Many data mining applications, ranging from spam filtering to intrusion detection, are faced with active adversaries. In all these applications, the future data sets and the training data set are no longer from the same population, due to the transformations employed by the adversaries. Hence a main assumption of existing classification techniques is not satisfied, and initially successful classifiers degrade easily. This becomes a game between the adversary and the data miner: the adversary modifies its strategy to avoid being detected by the current classifier; the data miner then updates its classifier based on the new threats. We investigate the possibility of an equilibrium in this seemingly never-ending game, where neither party has an incentive to change: modifying the classifier causes too many false positives with too little increase in true positives, and changes by the adversary decrease the utility of the false negative items that are not detected. We develop a game-theoretic framework in which the equilibrium behavior of adversarial classification applications can be analyzed, and provide a solution for finding an equilibrium point. A classifier's equilibrium performance indicates its eventual success or failure. The data miner could then select attributes based on their equilibrium performance and construct an effective classifier. A case study of online lending data demonstrates how to apply the proposed game-theoretic framework to a real application.

Estimation of Heterogeneous Binary Markov Networks

George Michailidis
University of Michigan

Heterogeneous data consist of several categories that share the same variables but differ in their distributions. In this talk, we discuss a new Markov network model to characterize the heterogeneous dependence structures between a set of variables from such data. The proposed model jointly estimates a collection of Markov networks corresponding to different categories, preserving the underlying common graph structure while allowing individual links to differ between the networks. This is achieved with a group penalty that effectively removes the interaction effects that are zero across all Markov networks. We establish consistency in terms of both parameter estimation and structure selection, and demonstrate the advantage of joint estimation over separate estimation using a number of simulated examples. An application to the voting record of the US Senate on bills on different topics, such as national defense, energy, and health, illustrates the model.