
2.2.1.11.5 Discussion

We have introduced the building blocks of gene set analysis methods for microarray data. The two main components of any such method are the biological database used to define the gene sets and the statistical/mathematical method used to score them. It is our impression that most software packages focus on only one of these two components. Many commercial pathway analysis packages, such as Ingenuity Pathways Analysis (http://www.ingenuity.com/products/pathways_analysis.html) or Metacore (http://www.genego.com/metacore.php), give access to much more detailed biological information than the databases we discussed, but tend to use very simple statistical analyses. The numerous Bioconductor packages, on the other hand, offer much more sophisticated algorithms but can only access mainstream publicly available databases, which often do not meet the needs of a specific microarray study. As we come from a statistical background ourselves, the largest part of this text was devoted to the different philosophies behind the statistical methods used in this context, with a particular focus on the difference between self-contained and competitive gene set tests.

As we showed, the p-values from competitive tests (e.g. Fisher’s exact test) must be interpreted with caution because they a) concern a rather unusual null hypothesis and b) are calculated under the very unrealistic assumption of gene-to-gene independence. The argument often used to defend these methods is that they are not really used for formal hypothesis testing but rather as a way to rank a list of gene sets. We still feel that the same can be achieved with self-contained tests, which additionally give valid p-values that are uniformly distributed under the null hypothesis and thus enable us to use methods for false discovery rate control.
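
To make the two ingredients of this argument concrete, the following is a minimal Python sketch, not taken from any particular package. The function names `fisher_enrichment_pvalue` and `benjamini_hochberg` and all counts are our own illustrative assumptions. The first function computes the one-sided Fisher's exact test p-value of a competitive test directly from the hypergeometric distribution, which is exactly where the gene-to-gene independence assumption enters. The second applies the Benjamini-Hochberg adjustment, shown here as one common method for false discovery rate control; it is only meaningful when the input p-values are valid, as is the case for self-contained tests.

```python
from math import comb


def fisher_enrichment_pvalue(n_genes, n_de, set_size, n_de_in_set):
    """One-sided Fisher's exact test for over-representation (competitive test).

    Returns the hypergeometric probability of observing at least
    `n_de_in_set` differentially expressed (DE) genes in a gene set of
    `set_size` genes, when `n_de` of the `n_genes` genes on the array are DE.
    The calculation treats genes as independent draws, which is the
    unrealistic assumption discussed in the text.
    """
    upper = min(set_size, n_de)
    total = comb(n_genes, set_size)
    tail = sum(
        comb(n_de, k) * comb(n_genes - n_de, set_size - k)
        for k in range(n_de_in_set, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumes the inputs are valid p-values (uniform under the null).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative usage with made-up counts: 10 genes on the array, 3 of them DE,
# and a gene set of 4 genes containing 2 DE genes.
p_competitive = fisher_enrichment_pvalue(n_genes=10, n_de=3, set_size=4, n_de_in_set=2)
print(p_competitive)

# Made-up p-values for four gene sets, adjusted for multiple testing.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Gene sets whose adjusted p-value falls below a chosen threshold (for example 0.05) would be reported with the false discovery rate controlled at that level, provided the p-values entering the adjustment are valid.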

One common feature of all the methods we discussed is that they use no knowledge other than the composition of the corresponding gene set. In the case of pathways, the knowledge about gene interactions contained in the pathway map is completely ignored. Some articles address this issue, for example the impact analysis suggested by Draghici et al. [2007]. It is beyond the scope of this article to discuss these more complex tools in detail, but we think they represent one important strand of future developments in the area: in addition to the automatic scoring of large databases of gene sets, we expect more activity in future towards the more detailed analysis of a smaller number of selected pathways. Real improvements in this area will need close collaboration between biologists/life scientists on the one hand and bioinformaticians/statisticians on the other.