2.1.3.1. Distance based methods

To organize objects into clusters with high similarity within clusters and high dissimilarity between clusters, we need a way to measure the similarity (or distance) between objects and between clusters. A number of metrics have been designed for this purpose. The choice of metric depends on the type of data and on the desired result.

The terms similarity and distance have opposite meanings, and a similarity measure can be converted into a distance measure and vice versa. For instance, the Pearson correlation R, which is a similarity measure, can be converted into a distance measure simply by computing 1 − R (which equals |1 − R|, because R ≤ 1).
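The conversion can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `pearson_distance` and the example profiles are hypothetical and not part of the course material.

```python
import numpy as np

def pearson_distance(x, y):
    """Turn the Pearson correlation R of two profiles into a distance 1 - R."""
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation, lies in [-1, 1]
    return 1.0 - r               # lies in [0, 2]; same as |1 - R| since R <= 1

# Hypothetical expression profiles (assumed values, for illustration only)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # perfectly correlated with a
c = np.array([4.0, 3.0, 2.0, 1.0])  # perfectly anti-correlated with a

d_ab = pearson_distance(a, b)  # close to 0: highly similar profiles
d_ac = pearson_distance(a, c)  # close to 2: opposite profiles
```

Note that `a` and `b` are different profiles yet their distance is (numerically) zero, so a correlation-based distance treats profiles that differ only by scale or shift as identical.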

Generally, a distance measure d between objects g and g′ should be:

non-negative: d(g, g′) ≥ 0
symmetric: d(g, g′) = d(g′, g)
zero only between g and itself: {d(g, g′) = 0} ⇔ {g = g′}
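These three properties can be checked numerically on a matrix of pairwise distances. The following is a rough sketch, assuming NumPy; the helper `check_distance_properties`, its tolerance parameter, and the example points are hypothetical, and the last check assumes that all objects in the matrix are distinct.

```python
import numpy as np

def check_distance_properties(D, tol=1e-12):
    """Check the three distance properties on a square matrix D of pairwise distances."""
    D = np.asarray(D, dtype=float)
    off_diag = ~np.eye(D.shape[0], dtype=bool)  # mask selecting pairs g != g'
    return {
        # d(g, g') >= 0 for every pair
        "non_negative": bool(np.all(D >= -tol)),
        # d(g, g') == d(g', g)
        "symmetric": bool(np.allclose(D, D.T, atol=tol)),
        # d(g, g) == 0, and d(g, g') > 0 whenever g != g' (objects assumed distinct)
        "zero_only_with_itself": bool(
            np.all(np.abs(np.diag(D)) <= tol) and np.all(D[off_diag] > tol)
        ),
    }

# Example: absolute difference between three one-dimensional points 0, 3 and 4
points = np.array([0.0, 3.0, 4.0])
D = np.abs(points[:, None] - points[None, :])
result = check_distance_properties(D)
```

For this example all three entries of `result` are `True`; a matrix with a negative entry or with unequal values at positions (i, j) and (j, i) would fail the corresponding check.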

An object is either a gene/protein or a cluster.

The similarity s between objects g and g′ is defined analogously:

s is non-negative with a maximum of 1: 0 ≤ s(g, g′) ≤ 1
s is symmetric: s(g, g′) = s(g′, g)
s is 1 only between g and itself: {s(g, g′) = 1} ⇔ {g = g′}

Once the metric is selected, the first step of the analysis is often to calculate the distance (or similarity) between each pair of objects. The result is an n × n matrix (where n is the number of objects) that has on its diagonal either

0’s – if the metric is a distance measure (the distance of an object from itself is 0)
1’s – if the metric is a similarity measure (the similarity of an object with itself is maximal, with the value of 1)
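As an illustration, the pairwise matrix for a distance measure might be computed as follows. This is a minimal sketch, assuming NumPy and using the Euclidean distance purely as an example metric; the function name `pairwise_euclidean` and the data are hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the n x n matrix of Euclidean distances between the rows of X.

    X has one row per object (e.g. a gene) and one column per variable
    (e.g. a sample or condition).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, p): all row differences
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

# Three hypothetical objects described by two quantitative variables
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = pairwise_euclidean(X)
# D is symmetric, has zeros on the diagonal, and D[0, 1] is 5.0
```

Because the matrix is symmetric, only the n(n − 1)/2 values above (or below) the diagonal actually need to be computed and stored.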

Later, the same measure is applied either to

compute the distances between clusters, which can be merged if the distance between them is considered small enough and the new cluster comprising both is sufficiently homogeneous
or to compute the distances between objects inside a cluster in order to divide it into two more homogeneous clusters
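How a distance between two clusters is derived from the distances between their members is not specified here. The sketch below assumes three common conventions (average, smallest, or largest pairwise distance) purely for illustration; the function name `cluster_distance`, its `how` parameter, and the example matrix are hypothetical.

```python
import numpy as np

def cluster_distance(D, cluster_a, cluster_b, how="average"):
    """Distance between two clusters, given the pairwise object distance matrix D.

    cluster_a and cluster_b are lists of object indices (rows/columns of D).
    """
    block = D[np.ix_(cluster_a, cluster_b)]  # distances between members of a and b
    if how == "average":
        return float(block.mean())  # mean of all between-cluster pairs
    if how == "single":
        return float(block.min())   # closest pair
    if how == "complete":
        return float(block.max())   # farthest pair
    raise ValueError(f"unknown method: {how}")

# Hypothetical distance matrix for four objects
D = np.array([[0.0, 1.0, 6.0, 7.0],
              [1.0, 0.0, 5.0, 6.0],
              [6.0, 5.0, 0.0, 2.0],
              [7.0, 6.0, 2.0, 0.0]])

d_avg = cluster_distance(D, [0, 1], [2, 3])               # 6.0
d_min = cluster_distance(D, [0, 1], [2, 3], "single")     # 5.0
d_max = cluster_distance(D, [0, 1], [2, 3], "complete")   # 7.0
d_one = cluster_distance(D, [0], [1])                     # 1.0, one object per cluster
```

When both clusters contain a single object, all three conventions reduce to the ordinary distance between the two objects.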

Please note that a cluster can sometimes consist of only a single object.

We will now describe the distance/similarity measures most widely used in the analysis of genomic and proteomic data. All of them require objects described by quantitative variables, which suits genomic and proteomic data, where gene expression or protein abundance is represented by real numbers.
