Statistical Services Offered

Cluster Analysis

Cluster analysis is a set of methods for finding a useful classification of an initially unclassified data set, using the values of the variables observed on each individual. Its objective is to partition the data into groups based on observed characteristics, so that the observations within a group are as similar as possible to one another and as dissimilar as possible from the observations in other groups. Clustering is considered an "unsupervised" learning method because the observations do not include a target (or response) variable to indicate the correct grouping.
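As a rough illustration of this objective, the sketch below compares the average distance between observations in the same group with the average distance between observations in different groups. The data, the labels, and the function name mean_within_and_between are made up for illustration, and Euclidean distance is an assumed measure of similarity; it is not a method prescribed here.

```python
from itertools import combinations
from math import dist


def mean_within_and_between(points, labels):
    """Return (mean within-group distance, mean between-group distance).

    Assumes at least one same-group pair and one different-group pair exist.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])  # Euclidean distance (an assumption)
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical data: two tight pairs of observations, far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = [0, 0, 1, 1]  # groups the nearby observations together
poor = [0, 1, 0, 1]  # groups distant observations together

good_within, good_between = mean_within_and_between(points, good)
poor_within, poor_between = mean_within_and_between(points, poor)
```

Under the first labeling the within-group distances are small relative to the between-group distances; under the second the relationship is reversed, which is what a clustering method tries to avoid.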

There are two broad types of clustering algorithms: hierarchical and partitive. Hierarchical clustering, in turn, has two forms: agglomerative (or aggregative), which starts with each observation in its own cluster and successively merges clusters, and divisive (or disaggregative), which starts with a single cluster and successively splits it. Agglomerative methods are far more commonly used. When the number of observations to be clustered is very large, partitive clustering is generally preferred, because hierarchical methods typically require the distances between all pairs of observations. A further strength of partitive methods is that they do not depend on previously found clusters: an observation can be reassigned at any iteration. Hierarchical clustering, in contrast, builds on the clusters formed at earlier steps, and those earlier merges or splits cannot be undone.
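The nesting of hierarchical clusters can be sketched with a small agglomerative example. The code below is a minimal, unoptimized sketch that assumes single linkage (the distance between two clusters is the smallest distance between their members) and Euclidean distance; the function name single_linkage_merges and the data are hypothetical.

```python
from math import dist


def single_linkage_merges(points):
    """Agglomerative clustering; returns the partition after each merge step."""
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [set(clusters)]  # step 0: every observation is its own cluster
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] | clusters[b]
        # Replace the two chosen clusters with their union; nothing is ever split.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append(set(clusters))
    return history


# Hypothetical one-dimensional observations.
points = [(0.0,), (1.0,), (5.0,), (7.0,)]
history = single_linkage_merges(points)

# Every cluster at one step is contained in some cluster at the next step.
nested = all(any(c <= d for d in later)
             for earlier, later in zip(history, history[1:])
             for c in earlier)
```

Here observations 0 and 1 are merged first and remain together in every later partition, which is the sense in which earlier steps cannot be undone.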

A drawback of partitive clustering methods is that they make assumptions about the shape of the clusters; k-means, for example, favors roughly spherical clusters of similar size. They also require the user to specify in advance the number of clusters that exist in the data. In addition, partitive methods are sensitive to outliers, to the choice of the initial cluster seeds, and even to the order in which the observations are read. Because of these drawbacks, partitive clustering methods are not as widely used as hierarchical methods, except when the data set is too large for hierarchical clustering to be practical.
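The sensitivity to initial seeds can be seen in a small example. The sketch below assumes k-means (Lloyd's algorithm) as the partitive method, which is only one of several; the function name kmeans, the data, and the seed choices are hypothetical.

```python
from math import dist


def kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm from the given seeds; returns (labels, within_ss)."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its nearest center (ties go to the lower index).
        new_labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Move each center to the mean of its members; leave it in place if empty.
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances to the assigned centers.
    within_ss = sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    return labels, within_ss


# Hypothetical data: corners of a 4-by-1 rectangle; the number of clusters is
# fixed at two by the number of seeds supplied.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

labels_a, ss_a = kmeans(points, seeds=[points[0], points[2]])  # one seed per side
labels_b, ss_b = kmeans(points, seeds=[points[0], points[1]])  # both seeds on one side
```

With one seed on each side the algorithm separates the left pair from the right pair. With both seeds on the same side it converges to a top/bottom split with a much larger within-cluster sum of squares, and no later iteration corrects it. Running the algorithm from several sets of seeds and keeping the best result is a common way to reduce this risk.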