CISC636, S08, Lec23, Liao
CISC 636 Intro to Bioinformatics (Spring 2008)

Comparative Genomics
• Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome.

What are the comparative genome sizes of humans and other organisms being studied?
organism                             estimated size       estimated gene number  average gene density      chromosome number
Homo sapiens (human)                 2,900 million bases  ~30,000                1 gene per 100,000 bases  46
Rattus norvegicus (rat)              2,750 million bases  ~30,000                1 gene per 100,000 bases  42
Mus musculus (mouse)                 2,400 million bases  ~30,000                1 gene per 100,000 bases  40
Drosophila melanogaster (fruit fly)  180 million bases    13,600                 1 gene per 9,000 bases    8
Arabidopsis thaliana (plant)         125 million bases    25,500                 1 gene per 4,000 bases    10
Caenorhabditis elegans (roundworm)   97 million bases     19,100                 1 gene per 5,000 bases    12
Saccharomyces cerevisiae (yeast)     12 million bases     6,300                  1 gene per 2,000 bases    32
Escherichia coli (bacteria)          4.7 million bases    3,200                  1 gene per 1,400 bases    1
H. influenzae (bacteria)             1.8 million bases    1,700                  1 gene per 1,000 bases    1

*Information extracted from genome publication papers.
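The average gene density column appears to be a rounded ratio of genome size to gene number. A minimal sketch of that arithmetic follows, using three rows of the table; the function name `bases_per_gene` is a hypothetical helper introduced here, not part of the lecture.

```python
def bases_per_gene(genome_size_bases, gene_count):
    """Average number of bases per gene (genome size / gene count)."""
    return genome_size_bases / gene_count


# (estimated size in bases, estimated gene number), taken from the table.
genomes = {
    "Homo sapiens": (2_900_000_000, 30_000),
    "Escherichia coli": (4_700_000, 3_200),
    "H. influenzae": (1_800_000, 1_700),
}

for name, (size, genes) in genomes.items():
    density = bases_per_gene(size, genes)
    print(f"{name}: about 1 gene per {density:,.0f} bases")
```

The results (roughly 97,000, 1,500 and 1,100 bases per gene) are close to the rounded table entries of 100,000, 1,400 and 1,000.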
Genomic Distances
• Task: to explain differences in gene order between two or more genomes in terms of a limited number of rearrangement processes.
• For single-chromosome genomes, this requires calculating an edit distance between two linear orders on the same set of objects, representing the ordering of homologous genes in the two genomes.
• In the "signed" version of the problem, a plus or minus is associated with each gene, representing the direction of transcription. One edit operation is the inversion, or reversal, of any number of consecutive terms in the ordered set, which, in the case of signed orders, also reverses the polarity of each term within the scope of the inversion.
• Calculating the distance for unsigned genomes with inversions only is NP-hard; for the signed problem it can be done in polynomial time.
• For multi-chromosome genomes, another important edit operation is reciprocal translocation, the exchange of terminal fragments between two chromosomes. Some formulations of the distance problem with translocations can be solved in polynomial time (Hannenhalli-Pevzner algorithms).

Gene-order Phylogenies
• Task: the reconstruction of ancestral gene orders. This is NP-hard, even with only three input genomes.
• The number of breakpoints is an alternative, easily computed genomic distance, but it is also theoretically hard to generalize to the phylogenetic context.

Comparative Mapping
• The simplest model of genomic divergence, deriving from a 1984 study by Nadeau and Taylor, assumes the spatial homogeneity of both breakpoint and gene distributions along the chromosomes.
• The main focus has been the severe underestimation of the number of segments in comparisons where the data sets for a pair of species have relatively few genes in common.

Genome rearrangement
[Figure: genome rearrangement operations illustrated on short gene orders such as wxabcyz: reciprocal translocation between chromosome 1 and a second chromosome, inversion, and transposition.]
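The signed inversion operation and the breakpoint count described under Genomic Distances and Gene-order Phylogenies can be sketched as follows. This is a minimal illustration, assuming genes are labeled 1..n with a sign for transcription direction and that the reference genome is the identity order; the function names `invert` and `breakpoints` are hypothetical, and the code does not compute the Hannenhalli-Pevzner distance.

```python
def invert(order, i, j):
    """Reverse order[i:j] and flip the sign of every gene in that segment."""
    segment = [-g for g in reversed(order[i:j])]
    return order[:i] + segment + order[j:]


def breakpoints(order):
    """Count breakpoints of a signed order relative to the identity 1..n.

    The order is framed with 0 and n+1; an adjacent pair (a, b) is a
    breakpoint unless b - a == 1, which also covers pairs such as (-3, -2).
    """
    framed = [0] + list(order) + [len(order) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


genome = [1, -3, -2, 4]          # genes 2 and 3 are inverted
print(breakpoints(genome))        # breakpoints before the inversion
repaired = invert(genome, 1, 3)   # invert the segment holding -3, -2
print(repaired, breakpoints(repaired))
```

Here a single inversion restores the identity order, so the breakpoint count drops from 2 to 0. In general the breakpoint count is easy to compute but is only a proxy for the number of inversions needed.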
Gates and Papadimitriou (1979): any permutation of n elements can be sorted by at most 5(n+1)/3 prefix reversals.

Profiling
• Treat all metabolic pathways independently.
• Impose an arbitrary order for enumerating the pathways for all organisms.
• Encode presence/absence as 1s and 0s.

Implication of profiling
• In a metabolic pathway profile, an organism is represented as a string of zeros and ones.
• The task of comparing genomes by their pathway profiles boils down to a comparison of strings.
• No alignment is needed, but a scoring scheme is.
• Profiles can be used from either an organism perspective or a pathway perspective (e.g. find patterns of correlation among pathways, perform pathway reconstruction).

Information-based weight approach
• To capture the correlations
• An intuitive way to include the hierarchical structure of pathways
  – correlation 1: pi and pj are siblings, or distantly related
  – correlation 2: frequency (probability) of mismatching at pi
• Definitions
  – Master Tree: all known metabolic pathways are represented as a tree using WIT's categories
  – p-tree: derived from the master tree by dropping leaves that correspond to pathways absent from an organism

The Algorithm
1. Overlay two p-trees.
2. Score mismatches/matches between the two p-trees, and label the scores at the corresponding leaves on the master tree.
3. Propagate scores bottom-up as follows:
   - average the scores of siblings
   - multiply the averaged score by the inverse of the depth

Example
[Figure: two example trees with leaf scores of 1 and 0; the resulting scores are 0.5 and 0.9375.]
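The profile encoding described under Profiling can be sketched as follows. The pathway names, the two organisms, and the helper names `encode_profile` and `simple_matching` are hypothetical, and the score used here is a plain unweighted fraction of matching positions, standing in for the scoring scheme the slides call for. It is not the information-based, tree-weighted score of the algorithm above.

```python
def encode_profile(present, pathway_order):
    """Encode an organism as a 0/1 string over a fixed pathway order."""
    return "".join("1" if p in present else "0" for p in pathway_order)


def simple_matching(profile_a, profile_b):
    """Fraction of positions at which two equal-length profiles agree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must use the same pathway order")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


# Arbitrary but fixed enumeration of pathways, shared by all organisms.
pathway_order = ["glycolysis", "tca_cycle", "pentose_phosphate", "urea_cycle"]

# Hypothetical organisms, given as sets of pathways they possess.
organism_a = {"glycolysis", "tca_cycle", "pentose_phosphate"}
organism_b = {"glycolysis", "urea_cycle"}

profile_a = encode_profile(organism_a, pathway_order)
profile_b = encode_profile(organism_b, pathway_order)
print(profile_a, profile_b, simple_matching(profile_a, profile_b))
```

Because every organism uses the same pathway order, position k always refers to the same pathway, which is why no alignment step is needed before comparing the strings.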