Introduction to Bioinformatics - Lecture Slides | BSC 5936, Study notes of Biology

Material Type: Notes; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Spring 2003;


Uploaded on 08/30/2009

Partial preview of the text

Bioinformatics

1. INTRODUCTION TO BIOINFORMATICS
Robert van Engelen

2. Overview
- Part I: Algorithms on Strings, Trees, and Sequences
  - Why are computers used in Biology and what is the role of Computer Science in Biology?
- Part II: Neural Nets and Genetic Algorithms
  - How can we use nature’s biological computing mechanisms to solve complex problems in Computer Science?

3. Part I
- Why are computers used in Biology and what is the role of Computer Science in Biology?
  - Growth of data such as DNA sequence data
  - Pattern search and pattern analysis

4. Algorithms on Strings, Trees, and Sequences
- Many molecular biology problems on sequences can be formulated as string matching problems
  - Storing, retrieving, and comparing DNA strings
  - Comparing two or more strings for similarities
  - Searching databases for related strings and substrings
  - Defining and exploring different notions of string relationships
  - Looking for new or ill-defined patterns occurring frequently in DNA

5. Strings, Trees, and Sequences (cont’d)
- Reconstructing long strings of DNA from overlapping string fragments
- Determining the physical and genetic maps from probe data under various experimental protocols
- Looking for structural patterns in DNA and protein
- Determining secondary (2D) structure of RNA
- Finding conserved but faint patterns in many DNA and protein sequences
- And much more...

6. Matching and Alignment of Strings and Sequences
- Exact string matching
  - Knuth-Morris-Pratt and Boyer-Moore
- Exact matching with a set of patterns
  - Aho-Corasick
- Inexact matching
  - Edit Distance and dynamic programming
  - Sequence alignment problems
  - Multiple alignment problems

7. What is a String?
- Definitions
  - A string S is an ordered list of characters of a given alphabet written contiguously from left to right
  - S(i) denotes the character at position i in string S
  - |S| denotes the length of string S
  - S[i..j] is the contiguous substring of S starting at position i and ending at position j

8. Example
- Alphabet = {a,b,c,1,2,3,#,$}
- Let S = a1#33$
- S(1) = a
- S(3) = #
- |S| = 6
- S[4..5] = 33

9. String ≠ Sequence
- A string is not the same as the concept of a (sub)sequence in biology!
- (Sub)sequences in the biological literature refer to strings that might be interspersed with other characters, such as gaps

10. What are Prefixes, Suffixes, and Substrings?
- Definitions
  - S[1..i] is a prefix of string S
  - S[i..|S|] is a suffix of string S
  - S[i..j] is an empty string if i>j
  - A proper prefix, suffix, or substring of a string is a prefix, suffix, or substring that is neither the entire string nor the empty string

11. Example
- Let S = abcd
- S[1..2] = ab is a proper prefix of S
- S[2..3] = bc is a proper substring of S
- S[2..4] = bcd is a proper suffix of S
- S[1..4] = abcd is a prefix, suffix, and substring of S
- S[4..3] is empty

12. Exact String Matching
- We call the string P the pattern of length n=|P|
- We call the string T the text of length m=|T|
- The exact matching problem: find all occurrences of P in T (if any)
- Let P = aba
- Let T = bbabaxababay
- Then P occurs in T at locations 3, 7, and 9

25. Bad Character Rule
- Definition
  - For each character x in the alphabet, let R(x) be the position of the right-most occurrence of x in P
  - R(x) = 0 if x does not occur in P
- For any alignment of P and T, if the first mismatch comparing P to T from right to left occurs at position i in P and k in T, then shift P right by max(1,i-R(T(k)))

26. Bad Character Rule Example
- Let P = tpabxab
- Let T = xpbctbxabpqxctbpq

    xpbctbxabpqxctbpq
      tpabxab
        tpabxab

- P(3) = a
- T(5) = t
- R(a) = 6, R(b) = 7, R(p) = 2, R(t) = 1, R(x) = 5
- Shift 3-R(t) = 2 places

27. Extended Bad Character Rule

    xpbctbxabtqxctbpq
      tpabxabt

- Since R(t)=8, the shift is only max(1,3-8)=1
- The extended rule uses index sets for R, e.g. R(t)={1,8}, and uses the value of R that is closest to the left of the current position i
- With this extension, the shift is max(1,3-1)=2

28. Problem
- The (extended) bad character rule is effective for matching English text, but less effective for small alphabets, e.g. ACTG, and does not lead to O(n+m) worst case asymptotic time

29. Boyer-Moore
- The Boyer-Moore algorithm shifts by the largest amount given by the (extended) bad character rule and the (strong) good suffix rule, resulting in an O(n+m) time algorithm

30. The Strong Good Suffix Rule
- When P(i-1) mismatches T(k-1) and suffix S = P[i..n] matches substring T[k..j]
  - Shift P to the right such that the right-most copy of S in P (that is not a suffix of P) matches T[k..j] and the character P(i-1) differs from the character to the left of the copy of S in P

    _________xS____
     __zS_yS_yS_
           __zS_yS_yS_

31. The Strong Good Suffix Rule
- If a copy of S in P does not exist, shift P by the least amount so that a prefix of P matches a suffix of S in T
- Suppose S = ab

    _________xSc___
     bc______ySc_
              bc______ySc_

- If no such shift is possible, shift P n places to the right

32. The Boyer-Moore Algorithm
- For any alignment of P and T, if the first mismatch comparing P to T from right to left occurs at position i-1 in P and k in T and L'(i)>0, then shift P right by n-L'(i) positions
- Define N(j) = the length of the longest suffix of P[1..j] that is also a suffix of P
- Let P = cabdabdab, then N(3) = 2 and N(6) = 5

    for i := 1 to n do L'(i) := 0
    for j := 1 to n-1 do
        i := n-N(j)+1
        L'(i) := j

33. Exact Matching of Multiple Patterns
- Boyer-Moore is faster than Knuth-Morris-Pratt in practice
- The KMP algorithm forms the basis for the Aho-Corasick algorithm for matching multiple patterns
- Multiple pattern matching in O(n+m+k) time where k = the number of occurrences in T of any of the patterns

34. Application: Sequence Tagged Site (STS)
- An STS is a DNA string of length 200-300 nucleotides whose right and left ends, of length 20-30 nucleotides each, occur only once in the entire genome
- Each STS occurs uniquely in the DNA of interest
- Hundreds of thousands of STSs in databases
- Problem: find which STSs are contained in anonymous DNA
- Use Aho-Corasick to find STSs in newly sequenced DNA to find the map locations

35. Exact Matching With Wildcards
- Zinc Finger DNA transcription factor: CYS??CYS?????????????HIS??HIS
- If the number of wildcards ? is limited and can be bounded by a fixed constant, a linear time O(n+m) algorithm exists
- If the number of wildcards is unbounded, it is not known if the problem can be solved in linear time

36. Regular Expression Matching
- A regular expression (RE) is
  - A character from the alphabet
  - The “empty” symbol ε
  - Concatenation of two REs, written as R1R2
  - Alternation of two REs, written as R1+R2
  - Repetition of an RE, written as R*
- O(n*m) time

37. Regular Expression Example
- (a+b)yk(pp+ε)q*
- aykqqqq is a string that matches
- bykppq is a string that matches
- byk is a string that matches
- yk is a string that does not match

38. Inexact Matching, Sequence Comparison and Alignment
- Some types of errors are acceptable in valid matches
- Sequence data may contain errors
- The characters in a subsequence embedded in a string need not be contiguous
- Comparison of similar sequences
- Sequence alignment allows mismatches

39. First Fact of Biological Sequence Analysis
- “In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity”

40. Edit Distance
- Edit distance = minimum number of edit operations needed to transform the first string into the second
  - Insert character (in second string)
  - Delete character (from first string)
  - Replace character

    R I   D   D     I    edit transcript
    v   i n t n e r      first string
    w r i   t   e r s    second string

41. Alignment and Edit Distance
- Global string alignment

    q a c _ d b d
    q a w x _ b _

- A string alignment can be converted into an edit transcript (edit operations) and vice versa
- Alignment displays string relationship as the product of evolutionary events
- Edit distance emphasizes mutational events as a process

42. Dynamic Programming
- Use Dynamic Programming methodology to find the minimum number of edit operations
- O(n*m) time algorithm where n is the length of the first string and m is the length of the second string

55. Pattern Recognition with Hopfield Networks
- Assume m patterns $x^1, \ldots, x^m$
- Each pattern x is a vector over {-1,+1}
- Set $w_{ij} = \sum_k x_i^k x_j^k$
- Given an input pattern, the states of the nodes of the Hopfield network converge to one of the patterns that is similar to the input pattern

56. Boltzmann Networks
- Generalization of Hopfield networks for solving combinatorial optimization problems
- Combined state of nodes forms the solution

57. Kohonen Networks
- Model the brain
- Self organization

58. Feedforward Networks
- Multilayer
  - Input nodes
  - Hidden nodes
  - Output nodes
- Nodes compute
  - $s_i = +1$ if $\sum_j w_{ij} s_j > \phi$
  - $s_i = -1$ otherwise

59. Backpropagation Networks
- Backpropagation of error
- Adjusts the weights by propagating errors back into the network upon mismatch of network’s output and true answer
- Supervised training

60. Applications of Neural Nets
- Pattern analysis (images, sound, …)
- Modeling the auditory cortex of a bat
- Traveling salesman problem
- Modeling the somatotopic map of the body surface

61. Genetic Algorithms
- A family of computational models inspired by evolution
- Encode a potential solution of a specific problem in a simple chromosome-like data structure
- Apply selection, reproduction, and recombination operators to a population of chromosomes to evolve new solutions

62. Optimization Problems
- Find suitable problem encoding in binary “chromosome” (string of 0 and 1)
- Define an evaluation function (a.k.a. fitness function)

63. Selection and Reproduction
- A population forms a pool of partial solutions
- Calculate fitness values for all chromosomes in the population
- A selected group of individuals with larger-than-average fitness value is allowed to survive and make copies into a new population

64. Recombination
- Crossover (one-point, uniform)
  - Form new solution by combining two solutions
- Mutations with low probability
  - Insert/destroy material

65. Example
- Find x on [0,31] such that $f(x) = x^2$ is maximized
- Encode x as string of 5 bits ($2^5 = 32$)
- Fitness function f
- Generate random population of size 4
- 01101 11000 01000 10011

66. Example (Cont’d)

         Pool    f(x)   %    %/ave   #copies
    1    01101   169    14   0.58    1
    2    11000   576    49   1.97    2
    3    01000    64     6   0.22    0
    4    10011   361    31   1.23    1

67. Example (Cont’d)

    Pool       Mate   X-Site   New Pool   f(x)
    0110 | 1   2      4        01100      144
    1100 | 0   1      4        11001      625
    11 | 000   4      2        11011      729
    10 | 011   3      2        10000      256

68. Applications of GA
- Machine learning
- Traveling salesman problem
- Robot trajectory generation
- Parametric design (e.g. pipelines, aircraft, …)

69. Questions?
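
The bad character rule of slides 25 to 27 can be checked with a short Python sketch. The function names below are hypothetical, positions are 1-based as on the slides, and R is recomputed on every call for clarity rather than precomputed as a real implementation would do.

```python
def rightmost_positions(pattern):
    """R(x): 1-based position of the right-most occurrence of x in pattern."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos  # later occurrences overwrite earlier ones
    return R


def bad_char_shift(pattern, i, text_char):
    """Simple rule: shift by max(1, i - R(T(k))), with R(x) = 0 if x is absent."""
    R = rightmost_positions(pattern)
    return max(1, i - R.get(text_char, 0))


def extended_bad_char_shift(pattern, i, text_char):
    """Extended rule: use the occurrence of text_char closest to the left of i."""
    left = [pos for pos, ch in enumerate(pattern, start=1)
            if ch == text_char and pos < i]
    closest = max(left) if left else 0
    return max(1, i - closest)


print(bad_char_shift("tpabxab", 3, "t"))            # slide 26: prints 2
print(bad_char_shift("tpabxabt", 3, "t"))           # slide 27, simple rule: prints 1
print(extended_bad_char_shift("tpabxabt", 3, "t"))  # slide 27, extended rule: prints 2
```

In both calls i = 3 is the mismatch position in P and "t" is the mismatching text character T(5).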
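
The N(j) and L'(i) definitions of slide 32 can be sketched in Python as follows. This is a quadratic-time reference version for checking small examples, not the linear-time preprocessing Boyer-Moore needs; the function names are hypothetical, and skipping the case N(j) = 0 (which would give the out-of-range index i = n+1) is an assumption not stated on the slide.

```python
def suffix_match_lengths(P):
    """N(j) for j = 1..n: length of the longest suffix of P[1..j] that is
    also a suffix of P. Index 0 of the returned list is unused."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        # Compare P[1..j] and P backwards, starting from their right ends.
        while length < j and P[j - 1 - length] == P[n - 1 - length]:
            length += 1
        N[j] = length
    return N


def strong_good_suffix_table(P):
    """L'(i) for i = 1..n, following the two loops on slide 32."""
    n = len(P)
    N = suffix_match_lengths(P)
    L = [0] * (n + 1)
    for j in range(1, n):
        if N[j] > 0:  # assumption: skip N(j) = 0, which would give i = n+1
            i = n - N[j] + 1
            L[i] = j
    return L


P = "cabdabdab"
N = suffix_match_lengths(P)
L = strong_good_suffix_table(P)
print(N[3], N[6])  # the slide's values: 2 5
print(L[5], L[8])  # 6 3
```

With these values, a first mismatch at position 7 of P (so i = 8) shifts P right by n - L'(8) = 9 - 3 = 6 positions.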
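
The regular expression example of slide 37 can be tried with Python's `re` module. The translation is an assumption about notation: the slide's `+` (alternation) becomes `|`, and `(pp+ε)` becomes the optional group `(pp)?`. Python's engine uses backtracking, so this illustrates which strings match, not the O(n*m) algorithm mentioned on slide 36.

```python
import re

# Slide notation: (a+b)yk(pp+e)q*, where "+" is alternation and e is the empty symbol.
PATTERN = re.compile(r"(a|b)yk(pp)?q*")

for s in ["aykqqqq", "bykppq", "byk", "yk"]:
    # fullmatch requires the whole string to match, as on the slide.
    print(s, bool(PATTERN.fullmatch(s)))
```

The first three strings match and `yk` does not, because the leading `(a|b)` requires exactly one of a or b.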
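
The dynamic programming approach to edit distance (slides 40 and 42) can be sketched as below. This is a minimal version assuming unit cost for each insert, delete, and replace; the function name `edit_distance` is hypothetical. It fills an (n+1) by (m+1) table, which gives the O(n*m) running time stated on slide 42.

```python
def edit_distance(first, second):
    """Minimum number of insert/delete/replace operations turning first into second."""
    n, m = len(first), len(second)
    # D[i][j] = edit distance between first[:i] and second[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters of first
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters of second
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if first[i - 1] == second[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # delete from first
                          D[i][j - 1] + 1,         # insert into second
                          D[i - 1][j - 1] + cost)  # match or replace
    return D[n][m]


print(edit_distance("vintner", "writers"))  # prints 5
```

The result 5 agrees with the five operations (R, I, D, D, I) in the edit transcript on slide 40.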
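
The Hopfield weight rule of slide 55 can be illustrated with a small pure-Python sketch. Several choices here are assumptions not stated on the slides: self-connections are set to zero, nodes are updated one at a time in index order, and the threshold is 0. The function names are hypothetical.

```python
def hopfield_weights(patterns):
    """w_ij = sum over patterns k of x_i^k * x_j^k, for vectors over {-1,+1}."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # assumption: no self-connections
                    W[i][j] += x[i] * x[j]
    return W


def recall(W, state, max_sweeps=10):
    """Update nodes one at a time until no node changes (or max_sweeps is hit)."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(W[i][j] * s[j] for j in range(len(s)))
            new = 1 if h > 0 else -1  # assumption: threshold 0
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


stored = [1, -1, 1, -1]
W = hopfield_weights([stored])
print(recall(W, [1, 1, 1, -1]))  # input differs from the stored pattern in one position
```

In this toy case the network settles on the stored pattern `[1, -1, 1, -1]` after correcting the one flipped node.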
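
The genetic algorithm example of slides 65 to 67 can be reproduced with a few lines of Python. The sketch below uses the four chromosomes whose fitness values appear in the slide 66 table (169, 576, 64, 361) and the mates and crossover sites listed on slide 67. The function names are hypothetical, and selection is not simulated because it is random.

```python
def fitness(chromosome):
    """f(x) = x^2, where x is the integer encoded by the 5-bit string."""
    x = int(chromosome, 2)
    return x * x


def one_point_crossover(a, b, site):
    """Swap the tails of a and b after the first `site` bits."""
    return a[:site] + b[site:], b[:site] + a[site:]


pool = ["01101", "11000", "01000", "10011"]
values = [fitness(c) for c in pool]
average = sum(values) / len(values)
for c, f in zip(pool, values):
    # chromosome, fitness, fitness relative to the pool average
    print(c, f, round(f / average, 2))

# Mating pool, mates, and crossover sites as listed on slide 67.
print(one_point_crossover("01101", "11000", 4))  # ('01100', '11001')
print(one_point_crossover("11000", "10011", 2))  # ('11011', '10000')
```

The offspring have fitness 144, 625, 729, and 256, so the best value in the pool rises from 576 to 729 after one generation.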