E-learning in analysis of genomic and proteomic data
2.2.2. DNA sequence analysis

author: Natália Martínková

The amount of sequence data increases rapidly on a weekly basis, fuelled by intensive genome sequencing projects as well as the research of individual laboratories. Analysis of the original image files starts with establishing the nucleotide sequence, resolving possible polymorphisms and annotating the sequence. Sequence annotation includes identification of functional parts of the genome, such as the different RNA molecules, and especially the prediction of proteins. Annotated sequences submitted to public databases can subsequently be used to analyse unknown sequences: the possible origin of such sequences can be identified, and protein function can be estimated from the similarity of the query sequence to other available data. Sequence comparison can lead to the identification of mutations and their possible links to genetically associated diseases, and to a reconstruction of the evolution of function.

The human genome contains a sequence of 3.2 billion nucleotides. Written in the genetic alphabet of A, T, C and G, the human genome would fill a library of about six thousand novels. All this information is available on the internet. Apart from being written into a database, the human genome has been intensely analysed. Thousands of scientists have dedicated their attention to understanding individual proteins, their mutual interactions, and metabolism. As a result, the human genome, together with the genomes of a few model organisms, is the best-studied genome with the most reliable annotations.

The human genome sequence forms only a fraction of the sequence information available in the genetic databases, but it is a backbone from which we expand our knowledge. Sequences of pathogens, agriculturally important organisms and organisms of cultural interest contribute substantially to our knowledge base. The current sequencing effort produces such vast quantities of data every day that newly uploaded sequences represent about 5% of all internet content in the world. In other words, five percent of new internet information is written in A, T, C, G for adenine, thymine, cytosine and guanine, the four nucleotides that dominate all DNA sequences.

Largely, we do not understand this information, and the role of a bioinformatician is to sort through it, find patterns, recognise function and discover.

From Sanger to shotgun sequencing

The amount of obtained sequence information is limited by technology and funding. Currently, Sanger sequencing is a well-established sequencing method readily available to the scientific public, with sequencers working in most molecular labs. It reads about 800 base-pairs in a single reaction, and each sequence is processed individually. For that reason, the first genome assembly was a long-term effort of several major laboratories. On the other hand, new whole-genome sequencing methods provide shorter reads, from 30 to about 400 base-pairs, but sequencing is massively parallel. Where up to 368 different sequence fragments can be read in a single run of a conventional sequencer, a single run on a genomic sequencer reads hundreds of thousands of different DNA molecule fragments.
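The difference in scale can be made concrete with a little arithmetic. The sketch below estimates how many reads are needed for the total read length to equal the genome size. The function name and the `coverage` parameter are illustrative assumptions; real projects sequence each position many times over, so these numbers are lower bounds.

```python
import math

HUMAN_GENOME_BP = 3_200_000_000  # genome size quoted in the text


def reads_for_coverage(genome_size_bp, read_length_bp, coverage=1.0):
    """Reads needed so that total read length equals coverage x genome size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# About 800 bp per Sanger read, about 400 bp per long shotgun read.
sanger_reads = reads_for_coverage(HUMAN_GENOME_BP, 800)
shotgun_reads = reads_for_coverage(HUMAN_GENOME_BP, 400)

print(sanger_reads)   # 4000000
print(shotgun_reads)  # 8000000
```

Short reads need more reactions in total, but because hundreds of thousands of them are read in one run, far fewer runs are required.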

Sanger sequencing
Sanger sequencing is based on a polymerase chain reaction (PCR). In a PCR, the template DNA fragment intended for sequencing is mixed in a reaction with two oligonucleotide primers, four nucleotide solutions and salts. An enzyme, DNA polymerase, is added to ensure elongation of the fragment. In a sequencing reaction, the mix of nucleotide bases is modified. A certain proportion of nucleotides is provided in the form of fluorescently labelled dideoxynucleotides. Once incorporated into a DNA fragment, these molecules prevent further elongation of the fragment. DNA polymerase is unable to add further nucleotides to a chain that ends with a dideoxynucleotide, and each such terminated molecule carries only one fluorescent label. As a result, a terminated chain is shorter than the intended PCR fragment and is fluorescently labelled according to the last incorporated nucleotide.
The sequence is read by capillary electrophoresis. Nucleic acid has a negative charge and thus migrates towards the positive pole of the electric field. The capillary is filled with a polymer. DNA molecules move through the polymer at different speeds depending on their size: shorter fragments move faster than longer ones. The very high resolution of capillary electrophoresis enables separation of DNA molecules that differ by a single base-pair. The fluorescent signal of the last incorporated nucleotide is detected and the intensity of the signal is scored.
The result of such a read is a wave signal where peaks are coloured according to nucleotides.
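The logic of the read can be illustrated with a toy simulation. It is a deliberately simplified sketch: it assumes one terminated fragment of every possible length and ignores signal intensity and noise. The function names are hypothetical.

```python
import random


def chain_terminated_fragments(new_strand):
    """One terminated fragment per position: (length, label of the last base)."""
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_by_electrophoresis(fragments):
    """Shorter fragments reach the detector first; read the labels in that order."""
    return "".join(label for _, label in sorted(fragments))


strand = "ATGCGTAC"
fragments = chain_terminated_fragments(strand)
random.shuffle(fragments)  # in the reaction tube the fragments are mixed
print(read_by_electrophoresis(fragments))  # ATGCGTAC
```

Sorting by length plays the role of the capillary: once the fragments are ordered by size, the labels of their last bases spell out the sequence.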

Short-fragment reads – shotgun
Current (autumn 2009) technologies for whole genome sequencing in the Czech Republic are limited to pyrosequencing on the 454 FLX Genome Sequencer by Roche.
The technology is also referred to as sequencing by synthesis. In essence, a detectable flash of light is released at the moment when a new nucleotide is incorporated into the replicated DNA strand. (In Sanger sequencing, the fluorescent signal is read after the reaction is completed and the fragments are sorted by electrophoresis.)
Pyrosequencing is based on a biological process known as bioluminescence. The reaction occurs in an individual well, where thorough pre-processing isolates a single-stranded DNA molecule about 400 base-pairs long. This template DNA is fixed in the well on tiny beads. One bead binds the template DNA, others carry the necessary enzymes, and additional beads hold all reactants in the well. Extra chemicals necessary for the reactions are subsequently washed over the wells in cycles.
Importantly, each of the four nucleotides is washed over the well separately. This is crucial for the method of sequence detection. DNA is copied (synthesised) from the template as follows. A primer binds to a free end of the fragment, and DNA polymerase can then synthesise the complementary strand starting from the primer. A nucleotide is washed over the well. If it matches the next base on the template beyond the end of the primer, it can be incorporated. When DNA polymerase adds a nucleotide to a growing DNA chain, the reaction releases pyrophosphate. The enzyme sulfurylase, provided on the enzyme bead, converts pyrophosphate into ATP. ATP is the most common molecule that carries and provides energy for enzymatic reactions in living organisms. In pyrosequencing, the reaction that requires energy is bioluminescence: the enzyme luciferase uses ATP to oxidise luciferin, and the reaction produces light.
This means that if a new nucleotide can be incorporated into the growing DNA strand, the well where this happens lights up. If it cannot, the well remains dark, because the whole chain of chemical reactions stays inactive until the well is washed with a nucleotide that complements the template and can be incorporated.
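The wash cycle described above can be sketched for a single well as follows. This is a simplified model under stated assumptions: the flow order `TACG` is chosen for illustration, the template is given in the order in which the polymerase reads it, and the light signal is taken to be exactly proportional to the number of nucleotides incorporated in one wash (so a run of identical bases gives one stronger flash). The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flowgram(template, flow_order="TACG", max_cycles=100):
    """Simulate one well: return a list of (washed nucleotide, light signal)."""
    signals = []
    pos = 0  # next template base to be copied
    for _ in range(max_cycles):
        for nucleotide in flow_order:
            if pos >= len(template):
                return signals  # the whole template has been copied
            incorporated = 0
            # The nucleotide is incorporated as long as it complements the template.
            while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
                incorporated += 1
                pos += 1
            signals.append((nucleotide, incorporated))  # 0 means the well stays dark
    return signals


def call_bases(signals):
    """Translate light signals back into the synthesised sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in signals)


signals = flowgram("ATTGC")
print(signals)              # [('T', 1), ('A', 2), ('C', 1), ('G', 1)]
print(call_bases(signals))  # TAACG
```

The called sequence is that of the newly synthesised strand, which is complementary to the template.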
Thanks to hundreds of thousands of separate wells on a single sequencing plate, this procedure records the sequence of nucleotides in a massive number of DNA fragments simultaneously. The resulting files, called flowgrams, record the light intensity signal in each well and can be translated into a nucleotide sequence.

Bioinformatic analysis

Sequencing of DNA produces a raw result. It contains a plethora of information, but in a form that cannot be interpreted without subsequent analysis. To extract this information and discover new insights, the simple sequences of nucleotides must be rigorously investigated.

Sequence identification
The first step in such an investigation is identification of the obtained sequence. Public sequence databases already contain information about many organisms and genes, so identification is usually straightforward, subject to the completeness of the databases and the limits of the search methods.

Genetic databases
Databases that store DNA sequence data are maintained on three continents. In Europe, the EMBL database can be found at http://www.ebi.ac.uk/embl/. In North America, the GenBank database is accessible from http://www.ncbi.nlm.nih.gov/ and in Asia, DDBJ from http://www.ddbj.nig.ac.jp/. Information in all three databases is mutually synchronised every day, so that the same data can be retrieved from any of them at any time.
A typical entry in a nucleotide sequence database is identified with an Accession Number unique to each sequence followed by a version number. Definition of the sequence is a text line provided by the researcher who uploads the sequence. Usually, it contains all basic information necessary to identify the sequence – organism name, gene name and location and haplotype identification. These data are provided in the actual sequence entry with more detail in respective fields. Currently, most scientific journals require that all mentioned sequences are submitted to public databases. Such information is then cross-referenced between the journal article, where the Accession Numbers are provided, and a nucleotide database, where the article is cited and where available, identified with its PubMed code. The sequence itself is provided with annotations. These represent detailed information on sequence origin (organism, location of the gene, sample voucher number, sample locality, collection date, etc.), genes that are present in the sequence and information about proteins coded by the sequence, including their translation into an amino-acid sequence.
All fields are searchable through any database interface. At the moment, the easiest tool to search the sequence databases is Sequence Retrieval System at EBI (http://srs.ebi.ac.uk/).

Sequence comparison
BLAST is an abbreviation of Basic Local Alignment Search Tool, available at NCBI (http://blast.ncbi.nlm.nih.gov/). It is an algorithm for searching sequence databases where the query is an unknown sequence provided by the researcher. The tool is used so often that the word blast has become both a noun and a verb in bioinformatics slang. Whenever a researcher obtains a sequence from his/her experiments, BLAST is the first step in working with the raw data.
In effect, the BLAST search returns a list of sequences most similar to the query. Such a list informs the researcher what kinds of sequence the query most likely represents (a protein - which protein, a RNA sequence - which RNA, non-coding region) and from which organism it might have originated. For a project, where the gene and organism is known, a BLAST search provides basic reassurance that the researcher sequenced his/her target organism and gene, not an artefact or contamination. It is a routine operation to be executed for virtually any unknown or markedly divergent sequence a scientist encounters.
BLAST does not compare the whole query sequence against the billions of base-pairs in the databases. It breaks the query down into words. In a typical DNA search, the words are eleven bases long. The formation of words for a BLAST search is fairly straightforward, with a small catch. First, low-complexity regions must be removed. Low-complexity regions, such as repeats, long stretches of the same nucleotide or ambiguous regions, return many hits across the database that are most likely not homologous. The remaining sequence is broken down into words in such a way that the first word represents the first 11 positions of the remaining sequence, the second word is formed from positions 2-12, the third from positions 3-13, and so on.
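The sliding window that forms the words can be sketched in a few lines. This is a minimal illustration, not BLAST's own code: it assumes that low-complexity regions have already been removed from the sequence, and the function name is hypothetical.

```python
def query_words(sequence, word_size=11):
    """Overlapping words: positions 1-11, 2-12, 3-13, ... of the sequence."""
    return [
        sequence[start:start + word_size]
        for start in range(len(sequence) - word_size + 1)
    ]


words = query_words("ACGTACGTACGTA")
print(words)  # ['ACGTACGTACG', 'CGTACGTACGT', 'GTACGTACGTA']
```

A sequence of length n therefore yields n - 10 words of eleven bases, and a sequence shorter than the word size yields none.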
Next, words are scored according to their similarity. For DNA words, a match is scored as +5 and a mismatch as -4. The best matches return the highest score, and BLAST retains and lists the highest scoring words and discards those whose score is below threshold. The remaining high-scoring words are then organised into an efficient search tree. Finally, the database is searched for exact matches of high-scoring words.
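The final step, finding exact word matches in the database, can be sketched as below. As an assumption for brevity, a Python dictionary stands in for the search tree, the score threshold is skipped (every query word is kept), and the toy database and the word size of 4 are invented for the example.

```python
from collections import defaultdict


def index_query_words(query, word_size):
    """Map each query word to the positions where it starts in the query."""
    index = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        index[query[pos:pos + word_size]].append(pos)
    return index


def scan_database(index, database, word_size):
    """Return (query position, sequence name, database position) for exact hits."""
    hits = []
    for name, sequence in database.items():
        for db_pos in range(len(sequence) - word_size + 1):
            word = sequence[db_pos:db_pos + word_size]
            for q_pos in index.get(word, ()):
                hits.append((q_pos, name, db_pos))
    return hits


# Hypothetical toy data; a real search uses 11-base words and a public database.
database = {"seq1": "TTACGTTG", "seq2": "GGGGGGGG"}
index = index_query_words("ACGTTGCA", word_size=4)
hits = scan_database(index, database, word_size=4)
print(hits)  # [(0, 'seq1', 2), (1, 'seq1', 3), (2, 'seq1', 4)]
```

Each hit records where a word starts in the query and in the database sequence, which is the information needed to seed an alignment.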
If the algorithm finds an exact match of a word, that word is used as a seed for an alignment between the query and the database sequence. The alignment is extended to the left and right of the matched word and the score is recalculated. The resulting high-scoring segment pairs are then sorted, and the algorithm attempts to evaluate their significance and to combine and align multiple hits. The resulting image shows gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
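Seed extension can be illustrated with an ungapped sketch that uses the +5/-4 scores mentioned above. The stopping rule is an assumption made for this example: extension in each direction stops once the score has fallen more than `x_drop` below the best score seen so far. The function names and the default `x_drop` value are hypothetical, and gapped alignment is not attempted here.

```python
MATCH, MISMATCH = 5, -4  # DNA scores quoted in the text


def extend(query, subject, q, s, step, x_drop):
    """Extend in one direction (step +1 or -1); return (best gain, best length)."""
    score = best = best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += MATCH if query[q] == subject[s] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score dropped too far; give up in this direction
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size, x_drop=20):
    """Grow an exact word match to the left and right without gaps."""
    right_gain, right_len = extend(
        query, subject, q_pos + word_size, s_pos + word_size, +1, x_drop)
    left_gain, left_len = extend(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    score = MATCH * word_size + left_gain + right_gain
    return score, q_pos - left_len, q_pos + word_size + right_len


# Seed "ACGT" found at query position 0 and subject position 2.
score, q_start, q_end = extend_seed("ACGTTGCA", "TTACGTTG", 0, 2, word_size=4)
print(score, q_start, q_end)  # 30 0 6
```

Here the seed of four bases grows by two matching bases to the right, giving the segment `ACGTTG` with a score of 6 x 5 = 30.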
The matched database sequences are provided with basic statistics that should help the researcher evaluate the reliability of the matches found. In particular, the E-value is the number of matches with an equal or better score that would be expected to occur in a database of this size simply by chance. For relatively long sequences that are variable and complex, E-values tend to be very low.
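The E-value of an ungapped local alignment is commonly written as E = K * m * n * exp(-lambda * S), where S is the raw score, m and n are the lengths of the query and the database, and K and lambda are statistical parameters that depend on the scoring system. The sketch below evaluates this formula. The default values of `k` and `lam` are placeholders chosen for illustration, not the parameters BLAST actually uses.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.1, lam=0.3):
    """Expected number of chance hits scoring at least raw_score.

    k and lam are placeholder values for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# A higher score makes a chance match less likely; a larger database, more likely.
low = expect_value(raw_score=100, query_len=500, db_len=1_000_000_000)
high = expect_value(raw_score=50, query_len=500, db_len=1_000_000_000)
print(low < high)  # True
```

The formula shows why the E-value has to be read together with the database size: the same alignment score is less remarkable in a larger database.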

Sequence alignment
There are two basic types of sequence alignment: local and global. As the names indicate, the most important feature of a local alignment is a nearly exact match between sequences on a local scale. Local alignment is used for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a dynamic programming algorithm often used for local alignment.
Global alignment, on the other hand, tries to align two sequences along their whole length and is most useful when the sequences in the query set are similar and of roughly equal size. The Needleman-Wunsch dynamic programming algorithm is used in global alignment programs.
A set of similar sequences aligns in such a way that there is no difference between local and global alignments.

Smith-Waterman algorithm for local alignment
Aligning the whole sequence at once is a computationally intensive task. For that reason, Smith-Waterman algorithm breaks the problem down to smaller tasks, finds solutions to them and then puts them all together to form an optimal alignment. It utilises dynamic programming to assign score to individual alignment options. All residues of each compared sequence are compared to the other in a two-dimensional array, and all possible alignments are represented as pathways through this array. The optimal alignment is the pathway with the best score.
Local alignment achieves nearly perfect matches between sequences on a local scale, with large gapped areas where the sequences do not match. It is most suitable for sequences of different lengths and for distantly related sequences, where only conserved regions of similarity can be expected. The BLAST algorithm uses the local alignment technique.
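A compact sketch of the Smith-Waterman recurrence follows. The scores (match +2, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for the example; practical programs use tuned scoring schemes and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # the two-dimensional score array
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # A score never drops below zero, which lets an alignment restart.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace the pathway back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACGTACCC", "TTACGTATT"))  # (10, 'ACGTA', 'ACGTA')
```

The two sequences differ at both ends, yet the shared core `ACGTA` is recovered because the flanking mismatches are simply left out of the alignment.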

Needleman-Wunsch algorithm for global alignment
Global alignment maximises a similarity score to return the largest number of residues of one sequence that can be matched with another allowing for all possible deletions. However, global alignment does so with respect to matching the whole length of aligned sequences, rather than finding the perfect match in short segments of respective sequences.
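For comparison with the local case, here is a minimal sketch of the Needleman-Wunsch recurrence. The scores (match +1, mismatch -1, gap -1) and the linear gap penalty are assumptions for the example. Unlike in a local alignment, the first row and column carry gap penalties and the traceback always runs from the last cell to the first, so both sequences are aligned along their whole length.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap  # a aligned against leading gaps only
    for j in range(1, cols):
        F[0][j] = j * gap  # b aligned against leading gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Three matches and one gap give a score of 3 - 1 = 2, and the gap marks the base that is missing from the shorter sequence.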


Phylogenetic analysis attempts to reconstruct relationships between sequences based on their homology. The information is presented in the form of a phylogenetic tree. A phylogenetic tree places the ancestor of all analysed sequences at the base (root) of the tree and shows the course of evolution along its length, with the tips representing the most recent events. The length of each branch reflects both the time since divergence and the rate of evolution. As such, longer branches and divergences closer to the root represent older evolutionary events, subject to variation in the rate of evolution.

Interpretation of a phylogenetic tree
In essence, phylogenetic trees are one-dimensional. Only direct distance from root to tip is of consequence. This information is however most often spaced out in the second dimension to be readable and discernible. Then, if the root of the tree is located on the left side and tips on the right side, all vertical distances are meaningless and only horizontal distances represent evolutionary information.
The first group that branches away closest to the root is the oldest. The junction represents the most recent common ancestor of the branched group and the remaining tree. Subsequently, the further from the root, the more recent the divergence events and, by extension, the more closely related the investigated sequences are.
Sequences that descend from a common node in the tree form a monophyletic relationship. Generally, they represent an evolutionary group with a common history and are all related by descent, but there are some exceptions. A monophyletic group, supported in a phylogenetic tree, might be considered for taxonomic classification and interpretation of evolutionary relationships.
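The idea that only the distance along the branches carries information can be shown by summing branch lengths from the root to each tip. The tree below, its branch lengths and the nested-dictionary representation are all invented for illustration.

```python
def root_to_tip(node, distance=0.0, out=None):
    """Sum branch lengths from the root to every tip of a nested-dict tree."""
    if out is None:
        out = {}
    distance += node.get("length", 0.0)
    children = node.get("children", [])
    if not children:
        out[node["name"]] = distance  # a tip: record the accumulated distance
    for child in children:
        root_to_tip(child, distance, out)
    return out


# Hypothetical tree: A branches off first; B and C share a more recent ancestor.
tree = {
    "children": [
        {"name": "A", "length": 1.0},
        {"length": 0.5, "children": [
            {"name": "B", "length": 0.5},
            {"name": "C", "length": 0.25},
        ]},
    ],
}

print(root_to_tip(tree))  # {'A': 1.0, 'B': 1.0, 'C': 0.75}
```

The order in which the children are listed, which corresponds to the vertical arrangement of a drawn tree, has no effect on these distances.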
A polyphyletic group is a group of samples that are scattered across the tree. Such information may point out potential taxonomic misidentifications, hybridisation events followed by introgression or, in terms of within-species diversity, mixing of evolutionary lineages.
A paraphyletic relationship in a phylogenetic tree occurs when a target group of sequences belongs to the same clade, but the clade also contains additional samples that were not previously recognised as related to the target group. In other words, the target group includes some, but not all, descendants of its most recent common ancestor.
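Whether a group of samples is monophyletic can be checked mechanically: find the smallest clade that contains the whole group and test whether that clade contains anything else. The sketch below assumes a hypothetical nested-dictionary tree and a non-empty group of tip names that all occur in the tree. It only separates monophyletic groups from the rest and does not distinguish paraphyletic from polyphyletic groups.

```python
def tip_names(node):
    """Set of all tip names below (and including) a node."""
    children = node.get("children", [])
    if not children:
        return {node["name"]}
    names = set()
    for child in children:
        names |= tip_names(child)
    return names


def smallest_clade(node, group):
    """Tips of the smallest clade that contains every member of group."""
    for child in node.get("children", []):
        if group <= tip_names(child):
            return smallest_clade(child, group)
    return tip_names(node)


def is_monophyletic(tree, group):
    """True if the group is exactly the set of tips descending from one node."""
    group = set(group)
    return smallest_clade(tree, group) == group


# Hypothetical tree in bracket notation: (((B, C), A), D)
tree = {"children": [
    {"children": [
        {"children": [{"name": "B"}, {"name": "C"}]},
        {"name": "A"},
    ]},
    {"name": "D"},
]}

print(is_monophyletic(tree, {"B", "C"}))  # True
print(is_monophyletic(tree, {"A", "B"}))  # False
```

The group A and B fails the test because the smallest clade that holds both also contains C.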

The role of a bioinformatician in DNA sequence research is essential. The vast amount of available data makes research design an intellectual pursuit that must be carefully executed and thoroughly analysed. A bioinformatician might design a research question that uses information stored in public-domain sequence databases, analyse it using Open Source software and publish his/her discoveries in an Open Access journal. The trend is clear. Many scientists today promote collaboration and open access to the knowledge base for everyone. With such tools available, genomic DNA research has great potential both today and in the foreseeable future.