E-learning in analysis of genomic and proteomic data

2. Data analysis

2.1. General analysis workflow

2.1.1. Data pre-processing

Pre-processing is one of the most important steps in the analysis of any type of data. It is often data-type specific, but some features are common to all data types.

First, there are two main types of data pre-processing:

  1. Quality control - the data should be checked for errors and inconsistencies before the analysis starts. This usually includes:
    • Detection of typing errors (a character value instead of a number, an extra zero)
    • Checking the consistency of variables (are all values on the same measurement scale? - e.g. the height of some patients measured in centimetres and of others in metres)
    • Detection of outlier values indicating a problem in the measurement system (e.g. pixel saturation in image analysis, negative values where only positive ones are allowed, etc.)

  2. Normalization - data transformations that allow direct comparison of values across samples/experiments, and/or bring the data closer to normality
    • Logarithmic (or other) transform - applied to change the shape of the distribution, usually in the case of extremely skewed distributions. The transformed data often follow an approximately Gaussian distribution and can then be analysed with parametric methods, for which normality is one of the assumptions.
    • Global normalization - ensures that all samples have the same median/mean
    • Scale normalization - a transformation that unifies the scale (spread) of the data across samples
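The quality-control checks listed above can be sketched in a few lines of code. The following is a minimal, hypothetical example: the column names, the toy values, and the saturation threshold (16-bit pixel maximum) are illustrative assumptions, not part of any specific pipeline.

```python
import pandas as pd

# Toy data table with deliberately introduced problems
data = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "height": ["172", "1.68", "180", "17o"],   # mixed units (cm vs m) and a typo
    "intensity": [1200.0, 980.5, -3.2, 65535.0],
})

# 1. Typing errors: values that cannot be parsed as numbers
height = pd.to_numeric(data["height"], errors="coerce")
typos = data.loc[height.isna(), "height"].tolist()

# 2. Unit consistency: heights below 3 are presumably in metres, so convert to cm
height = height.mask(height < 3, height[height < 3] * 100)

# 3. Outliers pointing to measurement problems
negative = data["intensity"] < 0          # impossible negative intensities
saturated = data["intensity"] >= 65535    # e.g. 16-bit pixel saturation

print(typos)                                               # ['17o']
print(data.loc[negative | saturated, "sample"].tolist())   # ['s3', 's4']
```

In practice the thresholds and parsing rules would come from the experiment's metadata rather than being hard-coded.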
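The three normalization steps can likewise be illustrated on a toy intensity matrix. The values below are made up for illustration; real data would of course not normalize this cleanly.

```python
import numpy as np

# Raw intensities: three samples (rows) by four features (columns),
# strongly right-skewed and on different overall levels per sample
raw = np.array([
    [100.0, 400.0, 1600.0,  6400.0],
    [ 50.0, 200.0,  800.0,  3200.0],
    [200.0, 800.0, 3200.0, 12800.0],
])

# Logarithmic transform: compresses the skewed distribution
logged = np.log2(raw)

# Global normalization: subtract each sample's median so that all
# samples share the same median (zero after centring)
centered = logged - np.median(logged, axis=1, keepdims=True)

# Scale normalization: divide by each sample's standard deviation so
# that the spread is comparable across samples
scaled = centered / centered.std(axis=1, keepdims=True)

print(np.median(centered, axis=1))   # approximately [0, 0, 0]
print(scaled.std(axis=1))            # approximately [1, 1, 1]
```

Median centring and unit-variance scaling are only one common choice; depending on the platform, quantile or lowess-based normalization may be preferred, as discussed in the data-type-specific chapters.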

Concrete pre-processing methods are data-type specific and are discussed in the following chapters.