E-learning in analysis of genomic and proteomic data 2. Data analysis 2.2. Analysis of high-density genomic data 2.2.1. DNA microarrays Example

One of the objects of microarray experiments is to find new biological markers, which could help doctors to make correct diagnose as soon as possible. Particularly in oncology is important to capture the disease in the beginnings, because the lower stage cancer during diagnose is, the higher probability of remission or at least partial remission. Stage of cancer is the most powerful predictor of survival. In higher stages there is large mortality. Finding genes that are over or under expressed in cancer tissues will help in making faster and more accurately diagnose, which is important with respect to treatment. The heterogeneity of some types of tumours requires different treatments, but it is difficult to distinguish it.

Example dataset

In this example we will compare two groups of colorectal tumors One group contains sample of microsatellite stable colorectal tumor (MSS) and the other contains microsatellite instable colorectal tumor (MSI). The dataset is from Gene Expression Omnibus. In total it contains 38 microsatellie stable and 13 microsatellite instable colorectal tumors, each from a separate patient. All mRNA assays were hybridized on Affymetrix HuGeneFL array, with 7129 probesets.

The normalized dataset can be downloaded here.

To follow the analysis, we have to run R and copy the commands into the R console. If you are not familiar with R and Bioconductor, please, read the chapter on Software.

Differential expession

First of all we have to load the data.

> load(“GSE11543.rdata”)

To check if the data is loaded correctly, we use the following command that shows a list all variables in our session.

> ls()
[1] "Xnrm"

Clearly, there is only one variable called Xnrm. This variable is an object of class ExpressionSet, which stores the information about the probesets on the microarray slide, samples and gene expression intensities. Calling the variable returns the following summary:

> Xnrm
Loading required package: Biobase
Loading required package: tools

Welcome to Bioconductor

Vignettes contain introductory material. To view, type
'openVignette()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation(pkgname)'.

ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 51 samples
element names: exprs
sampleNames: GSM290693, GSM290694, ..., GSM290743 (51 total)
varLabels and varMetadata description:
sample: arbitrary numbering
featureNames: A28102_at, AB000114_at, ..., Z97074_at (7129 total)
fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: hu6800

The assayData stores the gene expression matrix, the phenoData contains the information about samples and any meta-description of the dataset, featureData stores information about the probesets names. experimentData provides us with information about the type of the microarray slide the samples are hybridized on. For more information about the ExpressionSet class, type:


The matrix of gene expression intensities can be extracted from this object using the exprs()function:

> data.exprs = exprs(Xnrm)

Now we need to define the grouping variable that provides the information which sample belongs to which of groups we are about to compare. The group variable can be extracted from the ExpressionSet file

If for any reason, the group information is not provided, we can upload it from an external file.

> groups = read.csv2("./Pheno.11543.csv", sep=";", header=FALSE)

The probeset annotation file should be uploaded from the separate file, too:

> annotation = read.csv2("./Annotation.csv", sep=";", header=FALSE)

Now we have all the data necessary to run the analysis. In this example we will apply four methods for differential expression analysis.

1) Two-sample t-test

The R function for T-test is implemented in the basic stats package.

> ?t.test

The groups to test can be defined either each in one variable:
> a = data.exprs[1,] [groups[,2]==”S”] #selecting expression values of the first probeset of the group MSS
> b = data.exprs[1,] [groups[,2]==”I”] #selecting expression values of the first probeset of the group MSI
> t.test(a,b)

or we can use the model type of assignment:

> t.test(data.exprs[1,]~groups[,2])

Because we are going to test each of the thousands of probesets in our dataset, it is not reasonable to tape the same command thousands times in the R console. We can automate this by creating a for cycle. Let’s say we want store all the p-values in one vector, thus we define a variable

> result.p.val = c()

Next, we create the “for cycle”, and in each run we add to our vector the p-value for each probeset.

> for (i in c(1:nrow(data.exprs)))
result.p.val = c(result.p.val, t.test(data.exprs[i,]~groups[,2])$p.value)

This way we can create multiple vectors, storing for instance T-statistics, etc.

There is however, much faster way how to obtain results for all of the comparisons than the for cycle. The functions apply or tapply are much faster and provide us with results for all the probesets in just one command:

> ?apply
> result.p.val = apply(data.exprs, 1, FUN=function(x) t.test(x~groups[,2])$p.value)

Now we have to correct for the multiple testing problem described in detail in the chapter ….
There are different functions available in R and Bioconductor for this purpose. One of them is the function p.adjust offering 7 methods for p-value correction. To adjust our p-values we will use the Benjamini&Hochberg [1995] method for controlling false discovery rate and also Bonferroni correction which control familywise error rate.

> result.p.val adjust.bonf = p.adjust(result.p.val, method="bonferroni")
> result.p.val.adjust.BH = p.adjust(result.p.val, method="BH")

To compare result of these two corrections we will plot histograms

> png("./Comparison.of.t.test.p.val.adjustment.png", width=1100, height=400)
> par(mfrow=c(1,3))
> hist(result.p.val, ylim=c(0,7000), xlab = "P.values", main ="P.values without correction", cex.main=2, cex.lab=1.5, cex.axis=1.5)
> hist(result.p.val.adjust.bonf, ylim=c(0,7000), xlab = "P.values", main = "P.values-Bonferroni correction", cex.main=2, cex.lab=1.5, cex.axis=1.5)
> hist(result.p.val.adjust.BH, ylim=c(0,7000), xlab = "P.values", main = "P.values-Benjamini & Hochberg correction", cex.main=2, cex.lab=1.5, cex.axis=1.5)
> dev.off()
null device


The picture shows us three histograms, on the left is a histogram of unadjusted p-values, in the middle of p-values after Bonferroni correction and on the right side a histogram of p-values after B&H correction. Clearly there are big differences between the two adjustments. The Bonferroni correction is too conservative for such a high number of comparisons and in further calculation we have selected the B&H adjustment.

> res.t.test.final = cbind(rownames(data.exprs),result.p.val.adjust.BH)

T-test is based on several assumptions. When they cannot be confirmed, then we have to use nonparametric tests. Let’s check assumption of normality for our data.

> shap.wilk.res.MSS = apply(data.exprs[,groups[,2]==”MSS”], 1, FUN=function(x) shapiro.test(x)$p.value)
> shap.wilk.res.MSI = apply(data.exprs[,groups[,2]==”MSI”], 1, FUN=function(x) shapiro.test(x)$p.value)
> SW.MSS.p.adjust = p.adjust(shap.wilk.res.MSS, method="BH")
> SW.MSI.p.adjust = p.adjust(shap.wilk.res.MSI, method="BH")

> length(which(SW.MSS.p.adjust<0.05))
[1] 1694
> length(which(SW.MSI.p.adjust<0.05))
[1] 162

Shapiro Wilk test of normality reveals that on significance level α = 0.05 we cannot confirm data normality for 162 probesets in MSI group and for 1694 probesets in MSS group.
In such case, we have to consider a nonparametric approach. The second method we apply is thus the Wilcoxon rank test. The processing is very similar like in case of t-test, we just need an appropriate R function (wilcox.test).