# Meta-analysis of microarray experiments

Meta-analysis of microarray experiments is a relatively new topic. New methods are typically created with a specific goal in mind, and to our knowledge no study compares all available methods. The aim of this article is to offer a brief review of the published methods and their software implementations.

The following attributes of each method were recorded: the goal and principle of the method, the type of microarray platform, the maximal number of studies the method can compare, input data, outputs and software implementation.

Methods can be divided with regard to any of the above-mentioned attributes as follows:

**According to type of microarray experiment:**

- methods for expression or microRNA microarrays
- methods for CGH microarrays

**According to type of input data:**

- methods computing with lists of names of genes
- methods computing with T-statistics and p-values
- methods computing with log2 ratios

**According to type of used microarray platform:**

- both cDNA and Affymetrix microarrays
- only Affymetrix microarrays
- only cDNA microarrays

**According to implementation:**

- implemented in R
- implemented as executable files
- implemented in WinBUGS

**According to number of studies being compared:**

- two
- two or more

**According to type of the method:**

- modeling approaches
- classifiers
- scoring methods

In the following, we divide the methods according to the type of microarray experiment.

## Methods for expression or microRNA microarrays

The most important step in the analysis of expression or microRNA microarrays is the search for differentially expressed genes, usually between two or more groups of samples. This step usually involves hypothesis testing with a correction for the so-called multiple testing problem. The testing yields values of test statistics, p-values and a list of significantly differentially expressed genes chosen according to the selected significance level. Each of these outputs can serve as an input for a subsequent meta-analysis. We therefore divide the methods for expression or microRNA microarrays according to the type of input data.

**Methods computing with lists of names of genes**

Input lists are lists of significantly differentially expressed genes or lists of the names of all genes ordered by the value of a test statistic. These methods are based on counting matches between two or more gene lists or binary vectors.

One of the first proposed methods, VennMapping (Smid *et al.*, 2003), uses Venn diagrams and contingency tables to find genes present in pairs of lists of differentially expressed genes. Even when the input consists of more than two such gene lists, the comparison is always performed only between pairs of lists. The method of Rhodes *et al.* (2004), called Meta-profiling, searches for genes present in several gene lists. It is a modification of a method proposed earlier by Rhodes *et al.* (2002) (for details see the next chapter), which was used to compare analogous microarray studies against one another. In contrast to that approach, Meta-profiling is aimed not at validating analogous data sets, but at comparing and assessing the intersection of many cancer type-specific gene expression data sets. The goal is to find cancer non-specific genes responsible for neoplastic transformation. The principle is based on computing a statistic for each gene. A gene enters the algorithm, and thus can be included in the final meta-list of significant genes, only if it is significant in several studies. However, as pointed out by Yang *et al.* (2005), this method is based on analysis of the significance of a gene rather than on meta-analysis.

Yang *et al.* (2005) propose a combination of VennMapping and Meta-profiling. The method is called MAP-Matches and uses analysis of binary vectors to find differentially expressed genes and common molecular mechanisms shared by different types of cancer. SOGL (Yang *et al.*, 2006) counts matches in the first or last positions of ordered gene lists and computes a similarity score for the gene lists.
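The pairwise comparison of gene lists described above can be illustrated with a minimal sketch. It builds the 2x2 contingency table for two lists and computes a one-sided hypergeometric p-value for the size of their overlap. This is not the VennMapping implementation; the function names, the explicit `universe` argument and the choice of a hypergeometric test are assumptions made for illustration.

```python
from math import comb


def overlap_table(list_a, list_b, universe):
    """Return (both, only_a, only_b, neither) for two gene lists.

    `universe` is the set of all genes measured in both studies
    (an assumption of this sketch); genes outside it are ignored.
    """
    u = set(universe)
    a = set(list_a) & u
    b = set(list_b) & u
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(u) - both - only_a - only_b
    return both, only_a, only_b, neither


def overlap_p_value(both, only_a, only_b, neither):
    """One-sided hypergeometric p-value: P(overlap >= observed overlap)."""
    n_a = both + only_a
    n_b = both + only_b
    total = both + only_a + only_b + neither
    upper = min(n_a, n_b)
    # Sum the hypergeometric probabilities of all overlaps at least as large.
    favourable = sum(
        comb(n_a, k) * comb(total - n_a, n_b - k)
        for k in range(both, upper + 1)
    )
    return favourable / comb(total, n_b)


if __name__ == "__main__":
    universe = [f"g{i}" for i in range(10)]
    table = overlap_table(["g0", "g1", "g2", "g3"], ["g2", "g3", "g4", "g5"], universe)
    print(table, overlap_p_value(*table))
```

With more than two lists, the same two functions would simply be applied to every pair of lists, which mirrors the pairwise nature of the comparison.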

**Methods computing with numerical characteristics of difference in gene expression (values of test statistics, p-values)**

These methods are based on already known concepts of meta-analysis and modeling.

The first proposed method was Fisher’s inverse chi-square method (Rhodes *et al.*, 2002). It combines p-values, the results of hypothesis testing in the individual studies, into an S-statistic, which is then used to test the hypothesis that positive results from individual studies correspond to the same genes.
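As an illustration, the textbook form of Fisher's method computes S = -2 Σ ln p_i for one gene over k studies and compares it with a chi-square distribution with 2k degrees of freedom. The sketch below assumes independent studies and p-values in (0, 1]; the function name is hypothetical, and the way significance is assessed in the cited paper may differ from the chi-square reference distribution used here.

```python
from math import exp, log


def fisher_combined(p_values):
    """Combine p-values of one gene across studies with Fisher's method.

    Returns (S, combined_p), where S = -2 * sum(ln p_i) and combined_p is
    the upper tail of a chi-square distribution with 2k degrees of freedom.
    Assumes independent studies and 0 < p_i <= 1.
    """
    k = len(p_values)
    s = -2.0 * sum(log(p) for p in p_values)
    # For even degrees of freedom 2k the chi-square survival function is
    # exp(-s/2) * sum_{j=0}^{k-1} (s/2)^j / j!, so no external library is needed.
    half = s / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return s, exp(-half) * total


if __name__ == "__main__":
    print(fisher_combined([0.05, 0.05]))
```

Two studies that each report p = 0.05 for a gene give a combined p-value of about 0.017, which shows how moderately significant results reinforce one another.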

The other methods use a modeling approach. Effect-size modeling (Choi *et al.*, 2003) models the effect size (Hedges and Olkin, 1985) with random-effects or fixed-effects models. The LASSO method (Ghosh *et al.*, 2003) models T-statistics and estimates the parameters of the model by the LASSO method (Tibshirani, 1996). Generally, a model of a statistic is fitted for each gene. Choi defines IDD as Integration-Driven Discovery and IDR as Integration-Driven Discovery Rate. An IDD is a gene that is identified as differentially expressed only by the meta-analysis (it was not significant in any of the individual analyses). The IDR is the ratio of the number of IDDs to the number of all significant genes. The Integration-Driven Discovery Rate and the Integration-Driven Revision Rate were later used to compare different types of Bayesian models. An Integration-Driven Revision was defined by Stevens and Doerge (2005) as a gene that was identified as significant by a previous analysis but was not significant in the meta-analysis.
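A minimal sketch of effect-size combination for a single gene follows. It computes a bias-corrected standardized mean difference (Hedges' g) per study and pools the studies with inverse-variance weights, optionally adding a between-study variance estimated by the DerSimonian-Laird moment estimator. The function names and the choice of that estimator are assumptions for illustration, not a reproduction of the cited implementation.

```python
from math import sqrt
from statistics import mean, variance


def hedges_g(group1, group2):
    """Bias-corrected standardized mean difference and its approximate variance."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = sqrt(
        ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    )
    d = (mean(group1) - mean(group2)) / pooled_sd
    g = (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)) * d  # small-sample correction
    var_g = (n1 + n2) / (n1 * n2) + g * g / (2.0 * (n1 + n2))
    return g, var_g


def combine_effects(effects, variances, random_effects=True):
    """Pool per-study effect sizes of one gene.

    Returns (pooled_effect, standard_error, z_score, tau2), where tau2 is the
    between-study variance (0.0 for the fixed-effects model).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    mu_fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
    tau2 = 0.0
    if random_effects and k > 1:
        # Cochran's Q and the DerSimonian-Laird moment estimate of tau^2.
        q = sum(wi * (gi - mu_fixed) ** 2 for wi, gi in zip(w, effects))
        c = sum(w) - sum(wi * wi for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * gi for wi, gi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return mu, se, mu / se, tau2


if __name__ == "__main__":
    studies = [hedges_g([2.1, 2.9, 3.4], [1.0, 1.6, 1.9]),
               hedges_g([5.0, 5.8, 6.1, 6.4], [4.2, 4.9, 5.1, 5.0])]
    effects, variances = zip(*studies)
    print(combine_effects(effects, variances))
```

In a full analysis the z-score would be computed for every gene and then passed through a multiple-testing correction.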

**Methods computing with expression profiles**

This group of methods is very heterogeneous, and individual methods use different approaches, from data mining to common statistical approaches such as ANOVA. The following methods belong here.

Some of them were created to find gene markers that could be used to classify a new sample by, for example, PCR. These methods look for a minimal number of significant genes, in contrast to other methods that focus on finding as many relevant genes as possible. Among them are Gene Shaving Random Forests (GSRF) and Gene Shaving Fisher’s Linear Discrimination (GSFLD), created by Jiang *et al.* (2004). The authors combined so-called *Gene Shaving*, introduced by Hastie *et al.* (2000), with classification methods. In GSRF and GSFLD, the importance of a particular gene is set according to the decrease in misclassification rate achieved by that gene.

The TSP classifier (Geman *et al.*, 2004) has a similar goal. It creates classification rules based on expression profiles. Each classification rule involves a pair of genes, and the deciding factor is the relationship between the expression values of those two genes. These rules are then used to classify new samples.
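The idea of a pair-based rule can be sketched as follows, assuming two classes labeled 0 and 1. For every pair of genes the sketch estimates how often the first gene is expressed below the second in each class and keeps the pair with the largest difference between the two frequencies. The data layout (one dictionary per sample) and the function names are hypothetical, and tie-breaking between equally scoring pairs is ignored here.

```python
from itertools import combinations


def top_scoring_pair(samples, labels):
    """Find the gene pair whose ordering best separates two classes.

    samples: list of dicts mapping gene name -> expression value
    labels:  list of class labels (0 or 1), one per sample
    Returns (score, gene_i, gene_j, less_means_class1).
    """
    genes = sorted(samples[0])
    class0 = [s for s, y in zip(samples, labels) if y == 0]
    class1 = [s for s, y in zip(samples, labels) if y == 1]
    best = None
    for gi, gj in combinations(genes, 2):
        # Frequency of the ordering "gi < gj" within each class.
        p0 = sum(s[gi] < s[gj] for s in class0) / len(class0)
        p1 = sum(s[gi] < s[gj] for s in class1) / len(class1)
        score = abs(p0 - p1)
        if best is None or score > best[0]:
            best = (score, gi, gj, p1 > p0)
    return best


def classify(sample, rule):
    """Assign class 0 or 1 to a new sample using the selected pair."""
    _, gi, gj, less_means_class1 = rule
    return int((sample[gi] < sample[gj]) == less_means_class1)


if __name__ == "__main__":
    train = [{"A": 1.0, "B": 2.0, "C": 5.0}, {"A": 1.5, "B": 3.0, "C": 0.5},
             {"A": 4.0, "B": 2.0, "C": 5.0}, {"A": 5.0, "B": 1.0, "C": 0.2}]
    rule = top_scoring_pair(train, [0, 0, 1, 1])
    print(rule, classify({"A": 0.3, "B": 1.2, "C": 3.0}, rule))
```

Because the rule depends only on the ordering of two values within one sample, it does not change under monotone rescaling of the data, which is convenient when samples come from different platforms.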

Papers by Conlon from 2006 and 2007 compare pairs of Bayesian models (Conlon, 2007; Conlon *et al.*, 2006, 2007). These models describe the distribution of the expression levels of each gene and can divide genes into three categories: up-regulated, down-regulated and non-differentially expressed. The parameters of the Bayesian models are estimated by a Markov Chain Monte Carlo algorithm.

The same article that introduced the above-mentioned LASSO method (Ghosh *et al.*, 2003) also introduced methods that estimate the FDR. These methods build on models of test statistics and on estimates of the FDR and q-values. Differentially expressed genes are identified by their q-values.
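To show how q-values can be used to select genes, the sketch below computes Benjamini-Hochberg adjusted p-values, which are often used as simple q-value estimates. This is a stand-in chosen for illustration and is not the FDR estimator of the cited article; the function name and the 0.05 threshold in the usage example are assumptions.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    p = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.03, "geneD": 0.5}
    q = dict(zip(p, bh_q_values(list(p.values()))))
    # Genes whose q-value falls below the chosen FDR level are reported.
    print([gene for gene, value in q.items() if value <= 0.05])
```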

Whereas methods like Choi *et al.* (2003) or Rhodes *et al.* (2004) are heavily discussed or cited, Two-stage ANOVA (Park *et al.*, 2006) and Z-statistics (Wang *et al.*, 2004) are …. Two-stage ANOVA first removes the variability caused by different laboratories and then looks for differentially expressed genes using hypothesis testing (the null hypothesis being no difference in gene expression between groups). The Z-statistic method estimates the variance of gene expression by pooling expression values from all genes with a similar average expression, and then analyzes the difference in average gene expression between two groups using a Z-statistic. The idea of pooling information from all genes comes from Bayesian modeling.

A somewhat special method is the Latent variable method (Choi *et al.*, 2007). One of the difficulties in meta-analysis of microarray data is that expression values from different microarray platforms have different ranges. The Latent variable method addresses this problem by transforming gene expression data into the interval [-1, 1] (probability of expression) using the maximum likelihood method or Bayesian models. The transformed data are then analyzed by known methods of microarray analysis.

Table 1 offers a review of the methods together with their detailed characteristics.

**Table 1. Review of methods of meta-analysis of microarrays**

## Methods for meta-analysis of CGH microarrays

Data from CGH microarrays are more homogeneous than data from expression microarrays. Perhaps for this reason, there are far fewer methods for meta-analysis of CGH microarrays: only one such method is known. This method transforms the data to a common format and then uses known methods for simultaneous search for altered segments in CGH microarrays. The transformation consists of these five steps:

- Clone mapping
- Smoothing of log2 ratios (Jong et al., 2004)
- Dividing each chromosome into 100 positions
- Assigning log2 ratios to the positions
- Converting log2 ratios to Z-statistics
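The last three steps can be sketched as follows for one chromosome of one sample. The sketch assumes that clones are already mapped to base-pair positions and smoothed, that a position receives the mean log2 ratio of the clones falling into it, and that the Z-statistic is a plain standardization over the observed positions. The source does not specify these details, so they and the function names are assumptions.

```python
from statistics import mean, pstdev


def bin_log2_ratios(clones, chrom_length, n_bins=100):
    """Assign log2 ratios to equally sized positions along a chromosome.

    clones: list of (position_bp, log2_ratio) tuples
    Returns a list of n_bins values; positions without clones are None.
    """
    buckets = [[] for _ in range(n_bins)]
    for position, ratio in clones:
        index = min(int(position / chrom_length * n_bins), n_bins - 1)
        buckets[index].append(ratio)
    return [mean(bucket) if bucket else None for bucket in buckets]


def to_z_scores(values):
    """Standardize the observed positions; empty positions stay None.

    Assumes at least one position is observed.
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), pstdev(observed)
    if sd == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sd for v in values]


if __name__ == "__main__":
    clones = [(1_000_000, 0.1), (1_200_000, 0.3), (60_000_000, -0.8), (119_000_000, 0.0)]
    z = to_z_scores(bin_log2_ratios(clones, chrom_length=120_000_000))
    print([(i, round(v, 3)) for i, v in enumerate(z) if v is not None])
```

After this transformation every study is described by the same 100 positions per chromosome, so data from different array designs can be analyzed together.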