E-learning in analysis of genomic and proteomic data

2.1.3. Class discovery (finding new groups)

Class discovery is a type of analysis in which we try to draw conclusions from the dataset without taking into account any a priori knowledge of the underlying biology. This type of analysis is also called clustering, as its aim is to partition the objects in the dataset (in our case samples or genes/proteins) into groups (clusters) such that objects within the same group are highly similar, while objects from different groups are as different as possible. In genomics and proteomics, one can search for functionally related genes or proteins by looking for groups of genes/proteins with similar expression. Another task can be to find new disease subgroups. Clustering provides a good framework for both types of analysis. It is also often used as a visualization and control tool after the selection of genes/proteins that are differentially expressed between known groups of samples. If the selection was successful, clustering the samples on this subset of genes/proteins should more or less recover the two groups that were compared. The new high-density genomic and proteomic techniques produce high-dimensional data, and exploring such data is impossible without appropriate analytic tools. An advantage of clustering techniques is that they reduce the complexity of the data sets by organizing genes (or samples) into a smaller number of groups.

Basic principle

We have a data matrix X of size n x p, where n is the number of objects (samples) and p is the number of variables (genes/proteins). We are searching for the most appropriate partition of the data, such that the discovered groups are highly homogeneous internally and heterogeneous with respect to each other.
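To make the idea of homogeneous groups more concrete, the following minimal sketch scores two candidate partitions of a small data matrix by the mean pairwise distance within groups and between groups. The toy matrix, the choice of Euclidean distance and the helper names (euclidean, mean_within_between) are assumptions made for illustration only; they are not part of the course material.

```python
import math
from itertools import combinations

# Hypothetical toy data matrix X: n = 6 samples (rows), p = 3 genes (columns).
X = [
    [1.0, 1.2, 0.9],
    [1.1, 1.0, 1.0],
    [0.9, 1.1, 1.1],
    [5.0, 5.2, 4.9],
    [5.1, 4.9, 5.0],
    [4.8, 5.0, 5.1],
]

def euclidean(a, b):
    """Euclidean distance between two rows of X."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_within_between(X, labels):
    """Mean pairwise distance within groups and between groups."""
    within, between = [], []
    for i, j in combinations(range(len(X)), 2):
        d = euclidean(X[i], X[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Two candidate partitions of the same six samples.
good = [0, 0, 0, 1, 1, 1]   # follows the visible structure of X
bad = [0, 1, 0, 1, 0, 1]    # mixes the two kinds of samples

good_w, good_b = mean_within_between(X, good)
bad_w, bad_b = mean_within_between(X, bad)
print(f"good partition: within={good_w:.2f}, between={good_b:.2f}")
print(f"bad partition:  within={bad_w:.2f}, between={bad_b:.2f}")
```

A partition that fits the data well has a small within-group distance and a large between-group distance. Clustering algorithms differ mainly in how they define and search for such a partition.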

There are many types of clustering methods and it is not possible to describe them all here, so we will focus only on the most commonly used ones. There are two major issues a reader should keep in mind before applying any of the algorithms described below:
1) Many clustering methods will find clusters even in data that contain no clusters, because they are designed to return a partition.
2) The result of clustering should never be considered an objective representation of the information hidden in the data, as it depends on the algorithm used for the analysis.
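The first issue can be illustrated with a small experiment. The sketch below runs a basic k-means procedure (our choice for this illustration; the text above does not name a specific algorithm) on points drawn uniformly at random, that is, on data with no group structure by construction. The function names, the seeds and the number of points are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Very small k-means: returns one label per point and the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

# Structureless data: 60 points spread uniformly over the unit square.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(60)]

labels, centers = kmeans(points, k=2)
sizes = [labels.count(c) for c in range(2)]
print("cluster sizes:", sizes)
```

The procedure returns two non-empty "clusters" although the data were generated without any groups. Rerunning it with a different seed or a different k will typically give a different partition, which relates to the second issue: the output reflects the algorithm and its settings as much as the data.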

Some literature may cause confusion by using the terms supervised and unsupervised clustering. Supervised clustering uses a priori knowledge about the data; this type of analysis, however, belongs rather to the next chapter on class prediction. In the following, we will consider only unsupervised methods.

In general, we can distinguish two major approaches to unsupervised clustering: distance-based methods and model-based methods. The most commonly used are the distance-based methods, which aim to group similar objects according to a similarity measure chosen a priori. These methods are non-parametric, as they do not assume that the data come from a pre-defined distribution. In contrast, model-based methods rely on statistical modeling, which places strong assumptions on the distribution of the data; they can therefore be considered parametric clustering methods.
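Because the similarity measure is chosen before the analysis, it largely determines which objects end up together. The following sketch compares two common choices, Euclidean distance and a correlation-based distance (1 minus the Pearson correlation), on three hypothetical gene expression profiles. The profiles and function names are invented for illustration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance: sensitive to absolute expression levels."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: sensitive to the shape of the profile only."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

# Hypothetical expression profiles of three genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]       # increasing, low level
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, ten times higher
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend, same level as gene_a

for name, dist in [("Euclidean", euclidean_distance),
                   ("correlation", correlation_distance)]:
    print(f"{name}: d(a,b)={dist(gene_a, gene_b):.2f}, "
          f"d(a,c)={dist(gene_a, gene_c):.2f}")
```

Under Euclidean distance gene_a is closer to gene_c, whereas under the correlation-based distance it is closer to gene_b. Neither answer is wrong; the two measures encode different notions of "similar expression", and the choice should follow the biological question.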
Most clustering techniques produce distinct clusters, meaning that each object is assigned to exactly one group. This might not be the best approach, especially when clustering genes/proteins. Many of these are involved in more than one biological pathway, which suggests that each gene/protein should be allowed to belong to more than one cluster. Model-based methods address this by assigning to each gene/protein a probability of belonging to each cluster.
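The idea of probabilistic membership can be sketched with a very small model-based example. We assume here, purely for illustration, that each gene is summarized by a single number and that the data come from a mixture of two normal distributions fitted with the EM algorithm; the values, the starting parameters and the function names are hypothetical, and real model-based clustering tools are considerably more elaborate.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(values, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture.

    Returns the membership probabilities (one pair per value) and the means.
    """
    mu = [min(values), max(values)]            # crude starting means
    mean_all = sum(values) / len(values)
    var_all = sum((v - mean_all) ** 2 for v in values) / len(values)
    var = [var_all, var_all]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: probability of each component given each value.
        resp = []
        for v in values:
            dens = [weight[k] * normal_pdf(v, mu[k], var[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var[k] = max(
                sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, values)) / nk,
                1e-6,                          # floor to avoid a zero variance
            )
            weight[k] = nk / len(values)
    return resp, mu

# Hypothetical one-number summaries for nine genes.
values = [0.1, 0.3, -0.2, 0.0, 1.5, 2.9, 3.1, 3.0, 2.8]
resp, mu = fit_two_gaussians(values)
for v, r in zip(values, resp):
    print(f"value {v:5.1f}: P(cluster 1)={r[0]:.3f}, P(cluster 2)={r[1]:.3f}")
```

Each gene receives a pair of probabilities that sum to one instead of a single label. A hard assignment can still be derived by taking the most probable cluster, but the probabilities retain the information that some genes are less clearly attributable to one group than others.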