Docsity
Bioinformatics 1: Dynamic Programming Algorithm with Affine Gap Penalties and Alignment - Study notes of Biology

Part of a university lecture series, Bioinformatics 1. It covers the dynamic programming algorithm in bioinformatics, focusing on affine gap penalties and the types of alignment: global, local-global, and local. The document also discusses using SeqLab and optimizing parameters for BestFit.

Typology: Study notes

Pre 2010

Uploaded on 08/09/2009

koofers-user-ipc
Bioinformatics 1: lecture 5

Dynamic Programming Algorithm, continued
End gaps
Affine versus linear gap penalties
Global, local-global and local alignment
SeqLab: using GenBank Features
SeqLab: optimizing parameters for BestFit

Thinking about gaps
•Each gap represents an evolutionary event (duplication, polymerase stutter, deletion/ligation, etc.)
•If the alignment has "evolutionary distance" meaning, then the gap penalty score should be proportional to the number of gaps.
•Two problems:
What about long gaps versus short gaps? Are they equally probable?
What about gaps at the ends of the sequence? How many evolutionary events took place there?

Affine gap penalty scoring
gap initiation = -5
gap extension = -1

AGGCTACT~T~TCA
 GGCTACTATATCA
Two separate one-residue gaps: -5 + -5 = -10

AGGCTACTTT~~CA
 GGCTACTATATCA
One two-residue gap: -5 + -1 = -6

Affine Gap DP
•Optimal alignment is the highest scoring.
•Alignments entering the last box on the bottom row can have 5 types of arrows, instead of just three.
(1) Match
(2) Open a gap in first sequence.
(3) Open a gap in second sequence.
(4) Extend a gap in first sequence.
(5) Extend a gap in second sequence.

Affine gap penalty worksheet
M = match matrix: scores for alignments currently in a match state
I = insertion matrix: scores for alignments with gap in first sequence.
D = deletion matrix: scores for alignments with gap in second sequence.
M(i,j) is the max over three diagonal arrows
I(i,j) is the max over three down arrows
D(i,j) is the max over three right arrows
[Worksheet figure: the M, I and D matrices for the sequences ADPQFG and AKLKLDQFGP. Note: I wrote the sequences in the gap rows!!]

Affine gap penalty worksheet
M = match matrix
I = insertion matrix
D = deletion matrix
scores for alignments currently in a match state
scores for alignments with gap in first sequence.
scores for alignments with gap in second sequence.
[Worksheet figure: path through the three matrices, labeled M M M D M I I I I M M]

Traceback
The traceback is a sequence of M, I or D, but this is NOT the traceback of the letters. Instead it is a traceback of the location of the letters (which matrix they are in), since the location (the matrix) defines the direction of the next arrow.

MIIIIMDMMMD
A~~~~DPQFG~
AKLKLD~QFGP

Should we penalize gaps at the ends?
Example: here is an alignment of mouse nitric oxide synthase (thick black line). It has multiple domains which are homologous to several shorter proteins.
If we penalize end gaps, what happens to the score of the true alignment?
Did "end gaps" evolve the same way as internal gaps? (no!)
Unless the two proteins are known to be single domains, it makes more sense NOT to penalize end gaps.

Global in one sequence, local in the other
If we penalize end gaps in sequence 2 but not in sequence 1, we are asking for an alignment that contains all of sequence 2 within sequence 1.
[Figure: DP matrix for the letters G T T C A G C T T T C A C T, with ten border cells set to 0]

Global in one sequence, local in the other
If we penalize end gaps in sequence 1 but not in sequence 2, we are asking for an alignment that contains all of sequence 1 within sequence 2.
[Figure: DP matrix for the same letters, with six border cells set to 0]

Global, global-local, and local alignment
•Global alignment (with end gaps) requires that all 4 termini are counted. In general, the two sequences should be about the same length.
•Global-local alignment (no end gaps in 1 or both seqs) requires that one of the two sequences be completely contained in the other or that 2 of the 4 termini be included.
•Local alignment finds subsequences in both. Does not require that the termini be included in the alignment.
The choice of alignment method makes a statement about how the sequences are related. Was one sequence inserted into the other?
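
The three-matrix scheme and the end-gap choices above can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's own worksheet: the function name `affine_align`, the flags `free_overhang_a` and `free_overhang_b`, and the match/mismatch scores (+1/-1) are all hypothetical. The gap convention (a gap of length k costs initiation plus k-1 extensions) is inferred from the slide's -5 and -1 example. Each matrix takes the max over three incoming arrows, as in the worksheet, and the function returns the traceback as a string of M, I and D. With `free_overhang_a=True` the ends of `a` may hang past `b` at no cost, which corresponds to asking for all of `b` within `a`.

```python
NEG = float("-inf")

def affine_align(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1,
                 free_overhang_a=False, free_overhang_b=False):
    """Return (score, state_path) for aligning a with b under affine gaps.

    States: M = residues paired, I = gap in a (first sequence),
    D = gap in b (second sequence). A gap of length k costs
    gap_open + (k - 1) * gap_extend (assumed convention).
    """
    n, m = len(a), len(b)
    # S[k][i][j]: best score for a[:i] vs b[:j] with the last column in state k.
    S = {k: [[NEG] * (m + 1) for _ in range(n + 1)] for k in "MID"}
    prev = {k: [[None] * (m + 1) for _ in range(n + 1)] for k in "MID"}
    S["M"][0][0] = 0
    for i in range(1, n + 1):  # leading residues of a opposite gaps in b
        S["D"][i][0] = 0 if free_overhang_a else gap_open + (i - 1) * gap_extend
        prev["D"][i][0] = "D" if i > 1 else "M"
    for j in range(1, m + 1):  # leading residues of b opposite gaps in a
        S["I"][0][j] = 0 if free_overhang_b else gap_open + (j - 1) * gap_extend
        prev["I"][0][j] = "I" if j > 1 else "M"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            options = {
                # arrows that pair a[i-1] with b[j-1]
                "M": [(S[k][i - 1][j - 1] + sub, k) for k in "MID"],
                # arrows that consume b[j-1] only: open or extend a gap in a
                "I": [(S[k][i][j - 1] + (gap_extend if k == "I" else gap_open), k)
                      for k in "MID"],
                # arrows that consume a[i-1] only: open or extend a gap in b
                "D": [(S[k][i - 1][j] + (gap_extend if k == "D" else gap_open), k)
                      for k in "MID"],
            }
            for state, cands in options.items():
                S[state][i][j], prev[state][i][j] = max(cands, key=lambda t: t[0])

    # Where the alignment may end: the corner, plus any cell where the rest of
    # a (or of b) is left hanging for free.
    ends = [(n, m)]
    if free_overhang_a:
        ends += [(i, m) for i in range(n)]
    if free_overhang_b:
        ends += [(n, j) for j in range(m)]
    score, state, i, j = max(((S[k][i][j], k, i, j) for i, j in ends for k in "MID"),
                             key=lambda t: t[0])

    # Trace back through the matrices, recording which matrix each column is in.
    path = []
    while i > 0 or j > 0:
        if (j == 0 and free_overhang_a) or (i == 0 and free_overhang_b):
            break  # free leading overhang, not part of the scored alignment
        path.append(state)
        nxt = prev[state][i][j]
        if state in "MD":
            i -= 1
        if state in "MI":
            j -= 1
        state = nxt
    return score, "".join(reversed(path))


print(affine_align("ACGT", "ACGT"))                               # (4, 'MMMM')
print(affine_align("TTTACGTTTT", "ACGT", free_overhang_a=True))  # (4, 'MMMM')
```

With both flags left at `False` this is a global alignment with penalized end gaps. Ties between equally good arrows are broken arbitrarily (here in the order M, I, D), so other tracebacks with the same score may exist.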
The optimal alignment may be no alignment
If the maximum score in the alignment matrix is < 0, then the optimal local alignment has score = 0 and looks like this:
ATSFM~~~~~~~
~~~~~PGTSFEP

Structure-based alignments are "correct"
2DRC:A   1/2      MISLIAALAVDRVIGMENAM-PFNLPADLAWFKRNTL-------DKPVIMGRHTWESIG-
1DRF:_   3/4      SLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPE

2DRC:A   52/53    --RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEIMVIGGGRVYE
1DRF:_   63/64    KNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK

2DRC:A   102/103  QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPDDWESVF------SEFHDADAQNSHSYCF
1DRF:_   123/124  EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE---KGIKYKF

2DRC:A   154/155  EILERR
1DRF:_   180/181  EVYEKN

The closest thing to a "Gold Standard" for protein alignments is the sequence alignment that comes from a structure superposition.
Note: Lots of mismatches (id=38%), few gaps (8), gaps are long (1-7).

Structure-based alignment
Two similar structures may be superimposed. The parts that overlay well are the matches (purple and green), and the parts that do not overlay well are the insertions (yellow and red).
Aligned positions have similar chemical 3D environment.

In class exercise: Features
(3) Set the Display to "Features coloring". Double click on a blue shaded region of ECDHFOLG. A features window appears. Select "Features at Cursor." Note that the region is now selected. You can copy it.
(4) Create a new "feature": Find the sequence "CGATCG" in ECDHFOLG. Select it. Open the Features window and Add a feature for this region. Call it "restriction_site" and put "PvuI" in the comments area. Give it a Diamond Shape. Back in the Editor, set Display to Graphical Features, change the scale to 16:1 and find the Diamond. Is it in the CDS?

In class exercise: translation
(5) Double-click on the CDS and select the CDS feature in the window that pops up. Close the window. The coding region is still selected. Copy the selected region. Create a new DNA sequence.
Paste the selected region (text) into that new line. Remove any gaps if necessary. Translate that gene to amino acids in frame 1 only (one letter code). The amino acid sequence should start with "MISLIAA...". Rename the sequence "ecdhfr" using the INFO window. Remove all gaps. Do the same for the CDS region of LBADHFR. Label the new protein sequence "lcdhfr". It should start "MTAFL..."

In class exercise: pairwise alignments
(6) Remove any gaps you may have created in both sequences. Check that the sequences agree with the sequences in the corresponding "features" (in the DNA sequences). If so, go ahead and delete the DNA sequences.
(7) Select the two protein sequences, now named ecdhfr and lcdhfr. Align them using Functions-->Pairwise-->Bestfit. Open Options... Set different gap penalty schemes. At the bottom, select both "New sequence file..." and give them unique names ending in .gap. For example, for gap penalty 10 use "ec_10.gap". Run. When finished, select the two .gap files in the Output Manager window, and "Add to Editor". Compare results as on the following page.
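
To get a feel for what step (7) is probing, here is a rough Python stand-in for comparing gap penalty schemes. It is not BestFit: BestFit reports the best local alignment using a substitution matrix, whereas this sketch assumes simple match/mismatch scores (+2/-1), and the function name `local_affine_score` and the toy sequences are made up for illustration. It computes only the best local score with affine gaps, with 0 meaning "no alignment".

```python
NEG = float("-inf")

def local_affine_score(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Best local alignment score with affine gaps; 0 means no alignment."""
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # last column pairs two residues
    I = [[NEG] * (m + 1) for _ in range(n + 1)]  # last column is a gap in a
    D = [[NEG] * (m + 1) for _ in range(n + 1)]  # last column is a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment start fresh at this pair.
            M[i][j] = sub + max(0, M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1])
            I[i][j] = max(M[i][j - 1] + gap_open, I[i][j - 1] + gap_extend,
                          D[i][j - 1] + gap_open)
            D[i][j] = max(M[i - 1][j] + gap_open, D[i - 1][j] + gap_extend,
                          I[i - 1][j] + gap_open)
            best = max(best, M[i][j])  # a local alignment ends on a paired column
    return best


# Toy sequences (hypothetical): b is a with "PPP" inserted in the middle.
a = "MKVLATGHRDEW"
b = "MKVLATPPPGHRDEW"
for gap_open in (-5, -8, -15):
    print(gap_open, local_affine_score(a, b, gap_open=gap_open))
```

On these toy sequences a mild gap initiation penalty (-5) lets the alignment bridge the three-residue insertion and join both matching blocks (score 17), while a harsh one (-15) makes the gap too expensive, so the best local alignment shrinks to a single block (score 12). The same trade-off is what to look for when comparing the .gap files.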