On Supervised and Unsupervised Discrimination
Discrimination is a supervised problem in statistics and machine learning that begins with data from a finite number of groups. The goal is to partition the data-space into some number of regions, and assign a group to each region so that observations there are most likely to belong to the assigned group. The most popular tool for discrimination is called discriminant analysis. Unsupervised discrimination, commonly known as clustering, also begins with data from groups, but now we do not necessarily know how many groups, nor do we get to know which group each observation belongs to. Our goal when doing clustering is still to partition the data-space into regions and assign groups to those regions, however we do not have any a priori information with which to assign these groups. Common tools for clustering include the k-means algorithm and model-based clustering using either the expectation maximization (EM) or classification expectation maximization (CEM) algorithms (of which k-means is a special case).
Tools designed for clustering can also be used to do discrimination. We investigate this possibility, along with a method proposed by Yang (2013) for smoothing the transition between these problems. We use two simulations to investigate the performance of discriminant analysis and both versions of model-based clustering with various parameter settings across various datasets. These settings include using Yang’s method for modifying clustering tools to handle discrimination. Results are presented along with recommendations for data analysis when doing discrimination or clustering. Specifically, we investigate what assumptions to make about the groups’ sizes and shapes, as well as which method to use (discriminant analysis or the EM or CEM algorithms) and whether or not to apply Yang’s pre-processing procedure.