E-learning in analysis of genomic and proteomic data

2. Data analysis

2.1. General analysis workflow

2.1.1. Data pre-processing
Pre-processing is one of the most important steps in the analysis of any type of data. It is often data-type specific, but some general features are common.
There are two main types of data pre-processing:
- Quality control - the data should be checked for errors before the analysis starts. This usually includes:
- Detection of typing errors (a character value instead of a number, additional zeros)
- Checking the consistency of variables (are all values on the same measurement scale? - e.g. the height of some patients measured in centimetres, of others in metres)
- Detection of outlier values indicating a problem in the measurement system (e.g. pixel saturation in image analysis, negative values where only positive ones are allowed, etc.)
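The quality-control checks above can be sketched in a few lines of code. The function below is a hypothetical illustration, not part of any specific pipeline: it flags typing errors (non-numeric entries) and out-of-range values, with the bounds chosen by the analyst for the variable at hand.

```python
def quality_control(raw_values, lower=None, upper=None):
    """Flag typing errors and out-of-range values in a list of raw entries.

    Returns (clean_values, problems): the numeric values that passed all
    checks and a list of human-readable problem descriptions.
    """
    clean, problems = [], []
    for v in raw_values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            # typing error: a character value instead of a number
            problems.append(f"non-numeric entry: {v!r}")
            continue
        if lower is not None and x < lower:
            # e.g. a negative value where only positive ones are allowed,
            # or a height entered in metres among centimetre values
            problems.append(f"value below {lower}: {x}")
        elif upper is not None and x > upper:
            # e.g. pixel saturation in image analysis
            problems.append(f"value above {upper}: {x}")
        else:
            clean.append(x)
    return clean, problems

# Heights in cm: one typo ('17o'), one entered in metres, one negative
heights = ["172", "17o", 1.75, 180, -3]
clean, problems = quality_control(heights, lower=50, upper=250)
print(clean)          # [172.0, 180.0]
print(len(problems))  # 3
```

In practice the plausible range (here 50-250 cm) must come from domain knowledge about the variable being measured.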
- Normalization - involves data transformations that allow direct comparison of values across samples / experiments, and/or ensure the normality of the data
- Logarithmic (or other functional) transform - applied to change the shape of the distribution, usually for extremely skewed data. The transformed values typically follow a Gaussian-shaped distribution, so they can be analysed with parametric methods, for which normality of the data is one of the assumptions.
- Global normalization - ensures that all the samples have the same median/mean
- Scale normalization - transformation that unifies the scale of the data across samples
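The three normalization steps above can be combined into one small routine. The sketch below uses only the Python standard library and synthetic log-normal data as a stand-in for, e.g., expression intensities; the choice of log2, the median as the global location estimate, and the median absolute deviation as the scale estimate are illustrative assumptions, not a prescribed method.

```python
import math
import random
import statistics

random.seed(0)
# Two hypothetical samples with right-skewed intensities; sample_b is
# measured at roughly twice the overall level of sample_a.
sample_a = [random.lognormvariate(5.0, 1.0) for _ in range(1001)]
sample_b = [2.0 * random.lognormvariate(5.0, 1.0) for _ in range(1001)]

def normalize(values):
    # Logarithmic transform: compresses the long right tail towards a
    # Gaussian-shaped distribution.
    logged = [math.log2(v) for v in values]
    # Global normalization: shift the sample so its median is zero,
    # making all samples share the same central value.
    med = statistics.median(logged)
    centred = [v - med for v in logged]
    # Scale normalization: divide by the median absolute deviation so
    # all samples are brought onto a common scale.
    mad = statistics.median(abs(v) for v in centred)
    return [v / mad for v in centred]

norm_a, norm_b = normalize(sample_a), normalize(sample_b)
# After normalization both samples have median 0 and unit spread,
# so their values are directly comparable.
```

After this step the systematic twofold level difference between the two samples is gone, and only the relative variation within each sample remains.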
Concrete pre-processing methods are data-type specific and are discussed in the following chapters.