Lecture Slides on An Introduction to Bioinformatics | BSC 5936, Study notes of Biology

Material Type: Notes; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Unknown 2003;

Typology: Study notes

Pre 2010

Uploaded on 08/30/2009

koofers-user-chj

Special Topics BSC5936: An Introduction to Bioinformatics
Florida State University, The Department of Biological Science, www.bio.fsu.edu
Sept. 9, 2003

The Dot Matrix Method
Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)

The Dot Matrix Method

Gets you started thinking about sequence alignment in general. Provides a ‘Gestalt’ of all possible alignments between two sequences.

To begin, I will use a very simple 1, 0 (match, no-match) identity scoring function without any windowing. As you will see later today, more complex scoring functions will normally be used in sequence analysis (especially with amino acid sequences). This example is based on an illustration in Sequence Analysis Primer (Gribskov and Devereux, editors, 1991).

The sequences to be compared are written out along the x and y axes of a matrix. Put a dot wherever symbols match; identities are highlighted. A general way to see similarities in pair-wise comparisons (• = match, . = no match):

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
A . . . . . . . . • . • . . . . . . . . . . .
N . . . . . • . . . • . . . . . . . . . . . .
A . . . . . . . . • . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
S • . . . . . . . . . . . . • . • . . . . . .
I . . . . . . . . . . . . . . • . . . • . . .
S • . . . . . . . . . . . . • . • . . . . . .
P . . . . . . . . . . . . . . . . • . . . . .
R . . . . . . . . . . . . . . . . . • . . . •
I . . . . . . . . . . . . . . • . . . • . . .
M . . . . . . . . . . . . . . . . . . . • . .
E . • . . • . . • . . . . . . . . . . . . • .
R . . . . . . . . . . . . . . . . . • . . . •

Since this is a comparison between two of the same sequences, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”

Reconsider the same plot. Notice the extraneous dots that indicate neither runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, that is, the composition of the sequences themselves.
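The identity scoring described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Wisconsin Package implementation; the function names `dot_matrix` and `render` are hypothetical, and `.` is used here simply as a placeholder for an empty cell.

```python
def dot_matrix(seq_x, seq_y):
    """Identity dot matrix: one row per symbol of seq_y, one column per
    symbol of seq_x; a cell is True wherever the two symbols match."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x, seq_y, matrix):
    """Draw the matrix as text, with '•' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for symbol, row in zip(seq_y, matrix):
        cells = " ".join("•" if hit else "." for hit in row)
        lines.append(symbol + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    seq = "SEQUENCEANALYSISPRIMER"
    # Intra-sequence comparison: the main diagonal is always filled.
    print(render(seq, seq, dot_matrix(seq, seq)))
```

Comparing a sequence with itself always fills the main diagonal; every other dot is either a repeated element or composition noise of the kind described above.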
How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met. That is, a dot is placed at the middle of a window of some defined size if, and only if, some defined criterion is met within that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.

Filtered windowing: the only remaining dots indicate the two runs of identity between the two sequences; however, any indication of the palindrome “ANA” has been lost. This is because our filtering approach was too stringent to catch such a short element. In general you need to make your window about the same size as the element you are attempting to locate. In the case of our palindrome, “AN” and “NA” are the inverted repeat sequences, and since our window was set to three, we will not be able to see an element only two letters long. Had we set our stringency filter to one in a window of two, then these would be visible.

The Wisconsin Package’s implementation of dot matrix analysis, the paired programs Compare and DotPlot, uses the window/stringency method by default.

  S E Q U E N C E A N A L Y S I S P R I M E R
A . . . . . . . . • . . . . . . . . . . . . .
N . . . . . . . . . • . . . . . . . . . . . .
A . . . . . . . . . . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
Z . . . . . . . . . . . . . . . . . . . . . .
E . . . . . . . . . . . . . . . . . . . . . .
S • . . . . . . . . . . . . . . . . . . . . .
E . • . . . . . . . . . . . . . . . . . . . .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . . . . • . . . . . . . . . . . . . . . . .
N . . . . . • . . . . . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . . . . . . . • . . . . . . . . . . . . . .
S . . . . . . . . . . . . . . . . . . . . . .

In this plot a window of size three and a stringency of two are used to considerably improve the signal-to-noise ratio (remember, I am using a 1:0 identity scoring function).

You need to be careful with window/stringency dot matrix methods. Default window sizes and stringencies may not be appropriate for the analysis at hand. The Wisconsin Package default window size and stringency for protein sequences are 30 and 10 respectively (based on BLOSUM scores [soon to be explained in Dr. Quine’s lecture]). Sometimes this is perfectly reasonable.
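The window/stringency filter can be sketched as follows, using the same 1:0 identity scoring. This is a minimal sketch under stated assumptions, not the Compare/DotPlot code: the function name `filtered_dot_matrix` is hypothetical, and windows that run off the edge of the matrix are simply truncated, which is an assumption of this sketch rather than documented Wisconsin Package behavior.

```python
def filtered_dot_matrix(seq_x, seq_y, window=3, stringency=2):
    """Window/stringency dot matrix with 1:0 identity scoring.

    A dot is placed at the middle of a diagonal window when the number of
    matches inside that window reaches the stringency. Windows are
    truncated at the matrix edges (an assumption of this sketch).
    """
    half = window // 2
    rows, cols = len(seq_y), len(seq_x)
    plot = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            score = 0
            # Walk along the diagonal through cell (i, j).
            for k in range(-half, window - half):
                y, x = i + k, j + k
                if 0 <= y < rows and 0 <= x < cols and seq_y[y] == seq_x[x]:
                    score += 1
            plot[i][j] = score >= stringency
    return plot


if __name__ == "__main__":
    seq_x = "SEQUENCEANALYSISPRIMER"
    seq_y = "ANALYZESEQUENCES"
    plot = filtered_dot_matrix(seq_x, seq_y, window=3, stringency=2)
    print("  " + " ".join(seq_x))
    for symbol, row in zip(seq_y, plot):
        print(symbol + " " + " ".join("•" if hit else "." for hit in row))
```

With a window of three and a stringency of two, only the “ANALY” and “SEQUENCE” diagonals survive for these two sequences. Setting `window=1, stringency=1` reduces the filter to the plain identity plot.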
Take for instance the next real-life example: the human calmodulin protein sequence compared to itself.

Filtered dot plot techniques: human calmodulin x itself

What’s your interpretation? Do you know what the EF-hand is?

The calmodulin structure: the four EF-hand helix-loop-helix conformations (at positions 20, 56, 93, and 129) bind Ca++ ions and thereby affect several biological systems, including mediating control of a large number of Ca++-dependent enzymes, in particular several protein kinases and phosphatases, many of which affect systems ranging from muscle action and cAMP to insulin release.

Calmodulin x alpha actinin:
default parameters → some confusion
window=24/stringency=24 → clearer picture
Alpha actinin has two EF-hand motifs to calmodulin’s four.

That same region ‘zoomed in on’ has some small direct repeats seen by comparing the sequence against itself without reversal. But looking at the same region of the sequence against its reverse-complement shows a wealth of potential stem-loop structure in the transfer RNA:

22 GAGCGCCAGACT G
   || | ||||| |   A
48 CTGGAGGTCTAG A

Base position 22 through position 33 base pairs with (think: is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm, uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However, the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows that ‘reality’ lies somewhere in between.

FOR EVEN MORE INFO... http://bio.fsu.edu/~stevet/workshop.html
Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.

What about these alike areas? What’s the best ‘path’ through the dot matrix? How long do I extend it?
How can I ‘zoom-in’ on it to see exactly what’s happening? Where, specifically, is this alignment; how can I see the ‘best’ ones? And, what can I learn from these alignments?

This brings up the alignment problem. It is easy to see that two sequences are aligned when they have identical symbols at identical positions, but what happens when symbols are not identical or the sequences are not the same length? How can we know that the most alike portions of our sequences are aligned, when is an alignment optimal, and does optimal mean biologically correct?

But, how to do all of this? A ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an N^2 problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an N^(4N) problem. Waterman illustrated the problem in 1989, stating that to align two sequences 300 symbols long, 10^88 comparisons would be required, about the same number of elementary particles estimated to exist in the universe!

Part of a better solution . . . enter the dynamic programming algorithm and Dr. Jack Quine’s lecture.

Conclusions