Introduction to Bioinformatics Algorithms
Lectures 1-2
Dr. Max Alekseyev
USC, 2009

Organization
• Lecturer: Dr. Max Alekseyev
• Time/Place: MWF 11:15am-12:30pm / SWGN 2A22
• Office hours: after each lecture or by appointment, SWGN 3A48
• Course webpage: http://cse.sc.edu/~maxal/csce590b/
• Textbook: «An Introduction to Bioinformatics Algorithms» by N. Jones and P. Pevzner, http://www.bioalgorithms.info

Bioinformatics Bottlenecks
• Formalization (Biological Problem → Computational Problem (Model)): How accurate?
• Algorithmic solution (Computational Problem → Algorithm): Does it exist?
• Execution (Algorithm → Practical Results): Is it efficient?
• Interpretation (of Practical Results): Are results meaningful?
• Data: May be “noisy”

Why Bioinformatics Algorithms?
• Usually a biological problem can be transformed into a computational problem in a number of ways that feature different levels of accuracy and complexity.
• Highly accurate models often result in intractable computational problems, while less accurate models may produce meaningless results.
• Goal: to maintain an acceptable level of accuracy while keeping the computational problem effectively solvable.

Plan
• Brief introduction to Algorithms (Chapter 2)
• Brief introduction to Biology (Chapter 3)
• Study various biological problems and their computational counterparts (Chapter 4 and up)

An everyday algorithm
• Good algorithm
• Bad algorithm

A computational problem
• Defined by inputs and outputs, e.g.
  Input: amount of money to change, M
  Output: set of coins summing to M
• Requires precise formulation
• Solved by algorithms
• A prerequisite to an algorithm

Pseudocode: Assignment
Assignment
Format: a ← b
Effect: Sets the variable a to the value b.
Example: b ← 2
         a ← b
Result: The value of a is 2.
Pseudocode: Loops
for loops
Format: for i ← a to b
            B
Effect: Sets i to a and executes instructions B. Sets i to a + 1 and executes instructions B again. Repeats for i = a + 2, a + 3, ..., b - 1, b.
Example: SUMINTEGERS(n)
    sum ← 0
    for i ← 1 to n
        sum ← sum + i
    return sum
Result: SUMINTEGERS(n) computes the sum of integers from 1 to n. SUMINTEGERS(10) returns 1 + 2 + ··· + 10 = 55.
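The for loop above translates almost line for line into Python. The following is a minimal sketch, not part of the original slides; the function name sum_integers and the variable name total are illustrative choices.

```python
def sum_integers(n: int) -> int:
    """Return 1 + 2 + ... + n using an explicit for loop."""
    total = 0
    # range(1, n + 1) visits i = 1, 2, ..., n, matching "for i <- 1 to n".
    for i in range(1, n + 1):
        total += i
    return total


print(sum_integers(10))  # 55
```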
while loops
Format: while A is true
            B
Effect: Checks the condition A. If it is true, then executes instructions B. Checks A again; if it’s true, it executes B again. Repeats until A is not true.
Example: ADDUNTIL(b)
    i ← 1
    total ← i
    while total ≤ b
        i ← i + 1
        total ← total + i
    return i
Result: ADDUNTIL(b) computes the smallest integer i such that 1 + 2 + ··· + i is larger than b. For example, ADDUNTIL(25) returns 7, since 1 + 2 + ··· + 7 = 28, which is larger than 25, but 1 + 2 + ··· + 6 = 21, which is smaller than 25.
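As a rough Python counterpart of the while-loop example, the sketch below assumes the loop condition is "total ≤ b", which is what the stated result (the first partial sum strictly larger than b) requires. The name add_until is illustrative.

```python
def add_until(b: int) -> int:
    """Return the smallest i such that 1 + 2 + ... + i is larger than b."""
    i = 1
    total = i
    # Keep extending the sum while it has not yet exceeded b.
    while total <= b:
        i += 1
        total += i
    return i


print(add_until(25))  # 7, since 1 + ... + 7 = 28 > 25 but 1 + ... + 6 = 21
```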
Pseudocode: Array access
Array access
Format: a_i
Effect: The ith number of array a = (a_1, ..., a_i, ..., a_n). For example, if F = (1, 1, 2, 3, 5, 8, 13), then F_3 = 2, and F_4 = 3.
Example: FIBONACCI(n)
    F_1 ← 1
    F_2 ← 1
    for i ← 3 to n
        F_i ← F_{i-1} + F_{i-2}
    return F_n
Result: FIBONACCI(n) computes the nth Fibonacci number. FIBONACCI(8) returns 21.
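The array-based Fibonacci pseudocode can be sketched in Python as follows. Python lists are 0-indexed, so this sketch pads index 0 with an unused placeholder to keep the indices aligned with F_1, F_2, ...; that padding and the name fibonacci are assumptions made for illustration.

```python
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number, with F_1 = F_2 = 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # f[0] is unused padding so that f[i] corresponds to F_i in the pseudocode.
    f = [0, 1, 1]
    for i in range(3, n + 1):
        f.append(f[i - 1] + f[i - 2])
    return f[n]


print(fibonacci(8))  # 21
```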
More specifically
USCHANGE(M)
Give the integer part of M/25 quarters to customer.
Let remainder be the remaining amount due the customer.
Give the integer part of remainder/10 dimes to customer.
Let remainder be the remaining amount due the customer.
Give the integer part of remainder/5 nickels to customer.
Let remainder be the remaining amount due the customer.
Give remainder pennies to customer.
Inelegant, but correct
USCHANGE(M)
    r ← M
    q ← r/25
    r ← r - 25 · q
    d ← r/10
    r ← r - 10 · d
    n ← r/5
    r ← r - 5 · n
    p ← r
    return (q, d, n, p)
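A possible Python rendering of USCHANGE is sketched below. It assumes M is a non-negative whole number of cents and reads each division as "integer part of", as in the prose version. The name us_change is illustrative, and divmod combines the quotient and remainder steps.

```python
def us_change(m: int) -> tuple:
    """Return (quarters, dimes, nickels, pennies) for m cents."""
    r = m
    q, r = divmod(r, 25)  # q = integer part of r/25, r = what is still owed
    d, r = divmod(r, 10)
    n, r = divmod(r, 5)
    p = r                 # whatever remains is paid in pennies
    return (q, d, n, p)


print(us_change(40))  # (1, 1, 1, 0)
```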
But what about, say, South Africa?
Generalized Problem
Change Problem:
Convert some amount of money M into given denominations, using the
smallest possible number of coins.
Input: An amount of money M, and an array of d denominations c = (c_1, c_2, ..., c_d), in decreasing order of value (c_1 > c_2 > ··· > c_d).
Output: A list of d integers i_1, i_2, ..., i_d such that c_1·i_1 + c_2·i_2 + ··· + c_d·i_d = M, and i_1 + i_2 + ··· + i_d is as small as possible.
How fast is it?
• # iterations of first index: M/c_1
• # iterations of second index: M/c_2
• ...
• Each “check” does 2d + k operations (k is a constant)
Hence, the total number of operations (running time complexity) is:
M/c_1 · M/c_2 · … · M/c_d · (2d + k) = 2/(c_1 · ... · c_d) · d · M^d + k/(c_1 · ... · c_d) · M^d = O(d · M^d)

Asymptotic Complexity
• Finding the exact complexity, f(n) = number of basic operations, of an algorithm is difficult.
• We approximate f(n) by a function g(n) in a way that does not substantially change the magnitude of f(n), i.e., g(n) is sufficiently close to f(n) for large values of the input size n.
• This "approximate" measure of efficiency is called asymptotic complexity.
• Thus the asymptotic complexity measure does not give the exact number of operations of an algorithm, but it shows how that number grows with the size of the input.
• This gives us a measure that will work for different operating systems, compilers and CPUs.

Order notation
BRUTEFORCECHANGE(M, c, d)
    smallestNumberOfCoins ← ∞
    for each (i_1, ..., i_d) from (0, ..., 0) to (M/c_1, ..., M/c_d)
        valueOfCoins ← Σ_{k=1..d} i_k · c_k
        if valueOfCoins = M
            numberOfCoins ← Σ_{k=1..d} i_k
            if numberOfCoins < smallestNumberOfCoins
                smallestNumberOfCoins ← numberOfCoins
                bestChange ← (i_1, i_2, ..., i_d)
    return (bestChange)
Running time: O(d · M^d)
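The brute-force search can be sketched in Python with itertools.product, which enumerates every tuple of coin counts. This is a hedged illustration rather than the lecture's own code: the name brute_force_change, returning None when no exact change exists, and the example denominations (25, 20, 10, 5, 1) are assumptions.

```python
from itertools import product
from math import inf


def brute_force_change(m, c):
    """Try every combination of coin counts and return a smallest one.

    m is the amount; c is a sequence of denominations. Returns a tuple of
    counts aligned with c, or None if m cannot be formed exactly.
    """
    smallest_number_of_coins = inf
    best_change = None
    # The count for denomination c_k ranges over 0, 1, ..., floor(m / c_k).
    ranges = [range(m // c_k + 1) for c_k in c]
    for counts in product(*ranges):
        value_of_coins = sum(i_k * c_k for i_k, c_k in zip(counts, c))
        if value_of_coins == m:
            number_of_coins = sum(counts)
            if number_of_coins < smallest_number_of_coins:
                smallest_number_of_coins = number_of_coins
                best_change = counts
    return best_change


print(brute_force_change(40, (25, 20, 10, 5, 1)))  # (0, 2, 0, 0, 0)
```

With these denominations, the quarter-first procedure would hand out three coins (25 + 10 + 5) for 40 cents, while the exhaustive search finds two (20 + 20). The price is the exponential number of combinations examined.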
Big-Omega Notation
• Similarly, Ω(g(n)) is used to give a lower bound on a positive runtime function f(n), where n is the input size.
Definition: For a function f(n) that is non-negative for all n ≥ 0, we say that f(n) = Ω(g(n)) (“f(n) is big-Omega of g(n)”) if there exist n_0 ≥ 0 and a constant c > 0 such that f(n) ≥ c·g(n) for all n ≥ n_0.

Big-Theta Notation
• Similarly, Θ(g(n)) is used to give a tight bound on a positive runtime function f(n), where n is the input size.
Definition: For a function f(n) that is non-negative for all n ≥ 0, we say that f(n) = Θ(g(n)) (“f(n) is big-Theta of g(n)”) if f(n) = O(g(n)) and f(n) = Ω(g(n)).

NP-completeness
• There is a class of problems that might require exponential time.
• Any problem in this class is, in some way, equivalent to any other problem.
• It is very unlikely that a polynomial time algorithm exists that can solve any of this class of problems.

The bad news...
• Many useful problems in biology are NP-complete (e.g., Traveling Salesman Problem)
• Heuristic or statistical approaches aren’t “correct”, but are usually the best choice
• Proving NP-completeness for a problem is involved
• Take-away lesson: consider the possibility that your problem is NP-complete

Good vs. Bad
• Problems:
  Good: model system well; clear; precise
  Bad: allows silly/mean solutions
• Algorithms:
  Good: poly-time, correct
  Bad: exponential, or worse; incorrect
• Implementations:
  Good: as fast as the algorithm
  Bad: dumb coding

Next steps
• Sorting Problem
• Quadratic vs log time
• Towers of Hanoi Problem
• Recursion and recurrences
• Trees

Sorting problem
Sorting Problem:
Sort a list of integers.
Input: A list of n distinct integers a = (a_1, a_2, ..., a_n).
Output: Sorted list of integers, that is, a reordering b = (b_1, b_2, ..., b_n) of integers from a such that b_1 < b_2 < ··· < b_n.
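One simple way to solve the Sorting Problem is selection sort: repeatedly locate the smallest remaining element and swap it into the next position. The Python sketch below is a hedged illustration; the helper names index_of_min and selection_sort and the choice to return a new list instead of sorting in place are assumptions.

```python
def index_of_min(a, first):
    """Return the index of the smallest element in a[first:] (one O(n) scan)."""
    smallest = first
    for k in range(first + 1, len(a)):
        if a[k] < a[smallest]:
            smallest = k
    return smallest


def selection_sort(a):
    """Return a sorted copy of a using about n calls to index_of_min."""
    b = list(a)  # work on a copy so the input is left untouched
    for i in range(len(b) - 1):
        j = index_of_min(b, i)
        b[i], b[j] = b[j], b[i]  # constant-time swap
    return b


print(selection_sort([7, 92, 87, 1, 4, 3, 2, 6]))  # [1, 2, 3, 4, 6, 7, 87, 92]
```

Each of the roughly n outer iterations performs a linear scan, which is where the quadratic running time comes from.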
Intuitive approach
• Find the smallest element. Put it first.
• Find the next smallest element. Put it next.
• Repeat until done.

Asymptotic Complexity
• IndexOfMin ~ O(n)
• SelectionSort:
  • Calls IndexOfMin O(n) times
  • Also performs constant time operations
  • O(n·n), or O(n^2)

A faster way
• There is a faster way of sorting
• MergeSort
• Will be covered in “Divide and Conquer”.
• Think about it for a while, see if you can’t figure it out.

Towers of Hanoi Problem
Formal Problem
Towers of Hanoi Problem:
Output a list of moves that solves the Towers of Hanoi.
Input: An integer n.
Output: A sequence of moves that will solve the n-disk
Towers of Hanoi puzzle.
Easy values of n
• n=0; done
• n=1; move from left to right peg; done
• n=2; small to middle, large to right, small to right; done.
• n=3?

n=3 (moves shown in the figure):
Move disk from peg 1 to peg 3
Move disk from peg 1 to peg 2
Move disk from peg 3 to peg 2
Move disk from peg 1 to peg 3
Move disk from peg 2 to peg 1
Move disk from peg 2 to peg 3
Move disk from peg 1 to peg 3
But we “assumed”!
• Key observation: we know how to solve it for small values of n.
• So we have HanoiTowers(1,a,b). We can construct HanoiTowers(2,a,b), HT(3,a,b), HT(4,a,b), etc. out of it.

The impossible trick
• “Assume can opener!”
• Assume we have HanoiTowers(k,a,b) that solves correctly the k-disk (general) HT problem for some k
• HanoiTowers(k+1,a,b) is easy to write if it can call HanoiTowers(k,a,b):
  • HanoiTowers(k,a,c)
  • move largest from a to b
  • HanoiTowers(k,c,b)

Complete algorithm
HANOITOWERS(n, fromPeg, toPeg)
    if n = 1
        output “Move disk from peg fromPeg to peg toPeg”
        return
    unusedPeg ← 6 - fromPeg - toPeg
    HANOITOWERS(n - 1, fromPeg, unusedPeg)
    output “Move disk from peg fromPeg to peg toPeg”
    HANOITOWERS(n - 1, unusedPeg, toPeg)
    return
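The recursive algorithm can be sketched in Python as follows. Pegs are assumed to be numbered 1, 2 and 3, so that 6 - from_peg - to_peg names the third peg. Two departures from the pseudocode are illustrative assumptions: the moves are collected in a list instead of being printed directly, and n ≤ 0 is treated as "nothing to do".

```python
def hanoi_towers(n, from_peg, to_peg, moves=None):
    """Return the list of (from, to) moves solving the n-disk puzzle."""
    if moves is None:
        moves = []
    if n <= 0:
        return moves  # assumption: no disks means no moves
    if n == 1:
        moves.append((from_peg, to_peg))
        return moves
    unused_peg = 6 - from_peg - to_peg
    # Move the n - 1 smaller disks out of the way, move the largest disk,
    # then move the n - 1 smaller disks on top of it.
    hanoi_towers(n - 1, from_peg, unused_peg, moves)
    moves.append((from_peg, to_peg))
    hanoi_towers(n - 1, unused_peg, to_peg, moves)
    return moves


for src, dst in hanoi_towers(3, 1, 3):
    print(f"Move disk from peg {src} to peg {dst}")
```

For n = 3 this prints seven moves, and in general the recursion produces 2^n - 1 moves, since each call makes two recursive calls on n - 1 disks plus one move.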