String Searching

String Search

String search. Given a pattern string, find the first match in the text.
Model. Can't afford to preprocess the text.
Parameters. N = length of text, M = length of pattern.

    Text:     i n a h a y s t a c k a n e e d l e i n a
    Pattern:  n e e d l e
    M = 6, N = 21; typically N >> M.

Applications

- Parsers.
- Lexis/Nexis.
- Spam filters.
- Virus scanning.
- Digital libraries.
- Screen scrapers.
- Word processors.
- Web search engines.
- Natural language processing.
- Carnivore surveillance system.
- Computational molecular biology.
- Feature detection in digitized images.

Brute Force: Typical Case

[Figure: pattern "needle" slid one position at a time along the text "hayneedsanneedlex"; most alignments fail on the first character, and the match is found at offset 10.]

Brute Force

Brute force. Check for pattern starting at every text position.

    public static int search(String pattern, String text) {
        int M = pattern.length();
        int N = text.length();
        for (int i = 0; i <= N - M; i++) {   // last valid offset is N - M
            int j;
            for (j = 0; j < M; j++) {
                if (text.charAt(i+j) != pattern.charAt(j)) break;
            }
            if (j == M) return i;            // return offset i of match
        }
        return -1;                           // not found
    }

Brute Force: Worst Case

[Figure: pattern "aaaaab" slid along a text made of a long run of a's ending in b; every alignment except the last matches five a's before failing on the final character.]

Analysis of Brute Force

Analysis of brute force.
- Running time depends on pattern and text.
- Slow if M and N are large and the strings have lots of repetition.

Character comparisons to search for an M-character pattern in an N-character text:

    Implementation   Typical    Worst
    Brute            1.1 N †    M N

    † assumes appropriate model

Screen Scraping

Goal. Find current stock price of Google.
http://finance.yahoo.com/q?s=goog   (goog is the ticker symbol)

String Search Implementation Cost Summary

Karp-Rabin summary.
- Create fingerprint of each substring and compare fingerprints.
- Expected running time is linear.
- Idea generalizes, e.g., to 2D patterns.

Character comparisons to search for an M-character pattern in an N-character text:

    Implementation   Typical    Worst
    Brute            1.1 N †    M N
    Karp-Rabin       Θ(N)       Θ(N) ‡

    † assumes appropriate model
    ‡ randomized

Knuth-Morris-Pratt

[Photos: Don Knuth (1974 Turing Award), Vaughan Pratt, Jim Morris.]

Knuth-Morris-Pratt: DFA Simulation

KMP algorithm. [over binary alphabet]
- Build DFA from pattern.
- Run DFA on text.

[Figure: DFA for the pattern aabaaa, with states 0 to 5 and an accept state, shown stepping through a sample text of a's and b's.]

Interpretation of state i. Length of longest prefix of search pattern that is a suffix of input string.

Ex. End in state 4 iff text ends in aaba.
Ex. End in state 2 iff text ends in aa (but not aabaa or aabaaa).

DFA Representation

DFA used in KMP has special property.
- Upon character match in state j, go forward to state j+1.
- Upon character mismatch in state j, go back to state next[j].

    state     0  1  2  3  4  5
    b         0  0  3  0  0  3
    a         1  2  2  4  5  6
    next      0  0  2  0  0  3    (only need to store this row)
    pattern   a  a  b  a  a  a

KMP Algorithm

Two key differences from brute force.
- Text pointer i never "backs up."
- Need to precompute next[] table.

Simulation of KMP DFA (assumes binary alphabet):

    public static int search(String p, String t, int[] next) {
        int M = p.length(), N = t.length();
        int j = 0;
        for (int i = 0; i < N; i++) {
            if (t.charAt(i) == p.charAt(j)) j++;   // match
            else j = next[j];                      // mismatch
            if (j == M) return i - M + 1;          // found
        }
        return -1;                                 // not found
    }

Knuth-Morris-Pratt: DFA Construction

Iterative construction. Suppose you've created DFA for pattern aabaaa. How to extend to DFA for pattern aabaaab?
- Easy: transition from state 6 if next char matches.
- Challenge: transition from state 6 if next char mismatches.

Key idea (wishful thinking). Simulate aabaaaa on DFA.
Efficient version. Pre-compute simulation of aabaaa.

[Figure: DFA for aabaaa extended with state 7 for the pattern aabaaab.]

DFA construction for KMP. DFA builds itself!

State 6. Given DFA for aabaaa and the state X reached by simulating abaaa (the pattern without its first character; here X = 2), compute DFA for aabaaab and the corresponding state X for aabaaab.
- next[6] = δ(X, a) = 2.
- Update X = δ(X, b) = 3.

State 7. Given DFA for aabaaab and its state X = 3, compute DFA for aabaaabb and the corresponding state X for aabaaabb.
- next[7] = δ(X, a) = 4.
- Update X = δ(X, b) = 0.

Here δ(X, c) denotes the DFA transition from state X on character c.

DFA Construction for KMP: Java Implementation

Build DFA for KMP.
- Takes O(M) time.
- Requires O(M) extra space to store next[] table.

DFA construction for KMP (assumes binary alphabet):

    public static int[] buildNext(String p) {
        int M = p.length();
        int X = 0;
        int[] next = new int[M];
        for (int j = 1; j < M; j++) {
            if (p.charAt(X) == p.charAt(j)) {   // char match
                next[j] = next[X];
                X = X + 1;
            } else {                            // char mismatch
                next[j] = X + 1;
                X = next[X];
            }
        }
        return next;
    }

Optimized KMP Implementation

Ultimate search program for aabaaabb pattern.
- Specialized C program.
- Machine language version of C program.
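The goto targets in the specialized program are the next[] entries for aabaaabb. As a check, here is a minimal Java sketch that combines the construction loop and the DFA simulation from the earlier slides, assuming a two-letter alphabet such as {a, b}. The class and method names (KmpBinaryDemo, buildNext, search) are illustrative and do not come from the slides.

```java
import java.util.Arrays;

final class KmpBinaryDemo {

    // Build next[] with the construction loop from the slides.
    // Assumption: pattern and text use a two-letter alphabet, e.g. {a, b}.
    static int[] buildNext(String p) {
        int M = p.length();
        int[] next = new int[M];
        int X = 0;                               // state after simulating p without its first character
        for (int j = 1; j < M; j++) {
            if (p.charAt(X) == p.charAt(j)) {    // states X and j expect the same character
                next[j] = next[X];
                X = X + 1;
            } else {                             // the character that mismatches in j matches in X
                next[j] = X + 1;
                X = next[X];
            }
        }
        return next;
    }

    // Run the DFA on the text; returns the offset of the first match, or -1.
    static int search(String p, String t) {
        int M = p.length();
        if (M == 0) return 0;                    // choice made in this sketch: empty pattern matches at 0
        int[] next = buildNext(p);
        int j = 0;
        for (int i = 0; i < t.length(); i++) {
            if (t.charAt(i) == p.charAt(j)) j++; // match: advance one state
            else j = next[j];                    // mismatch: fall back, text pointer never backs up
            if (j == M) return i - M + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("aabaaabb")));  // [0, 0, 2, 0, 0, 3, 2, 4]
        System.out.println(search("aabaaabb", "abaabaaabba"));       // 2
    }
}
```

The printed table 0 0 2 0 0 3 2 4 lines up with the fallback labels s0, s0, s2, s0, s0, s3, s2, s4 in the C program.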
    int kmpsearch(char t[]) {
        /* assumes pattern is in text (otherwise use a sentinel) */
        int i = 0;
        /* each test uses a character from pattern[]; each goto target comes from next[] */
        s0: if (t[i++] != 'a') goto s0;
        s1: if (t[i++] != 'a') goto s0;
        s2: if (t[i++] != 'b') goto s2;
        s3: if (t[i++] != 'a') goto s0;
        s4: if (t[i++] != 'a') goto s0;
        s5: if (t[i++] != 'a') goto s3;
        s6: if (t[i++] != 'b') goto s2;
        s7: if (t[i++] != 'b') goto s4;
        return i - 8;
    }

Bad Character Rule: Java Implementation

    public static int search(String pattern, String text) {
        int M = pattern.length(), N = text.length();

        // right[c] = rightmost occurrence of c in pattern (-1 if absent);
        // the table size assumes character codes below 256
        int[] right = new int[256];
        for (int c = 0; c < 256; c++) right[c] = -1;
        for (int j = 0; j < M; j++) right[pattern.charAt(j)] = j;

        int i = 0;                          // offset
        while (i <= N - M) {                // last valid offset is N - M
            int skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pattern.charAt(j) != text.charAt(i + j)) {
                    // bad character rule
                    skip = Math.max(1, j - right[text.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) return i;        // found
            i = i + skip;
        }
        return -1;
    }

Bad Character Rule: Analysis

Bad character rule analysis.
- Highly effective in practice, particularly for English text: O(N / M).
- Takes time proportional to MN in the worst case.

[Figure: worst case, a pattern such as baaaaaa against a text of all a's; each alignment scans the whole pattern right to left, fails on the b, and shifts by only 1.]

Strong Good Suffix Rule

Strong good suffix rule. [a KMP-like suffix rule]
- Right-to-left scanning.
- Suppose text matches suffix t of pattern but mismatches in previous character c.
- Find rightmost copy of t in pattern whose preceding letter is not c, and shift; if no such copy, shift M positions.

[Figure: pattern x c a b d a b d a b aligned under a text ending in ... b a b, with t = "ab" and c = 'b'. Strong good suffix rule: can skip over the copy "dab" since we already know dab doesn't match. Bad character rule: skip only 1 position.]

Boyer-Moore

Boyer-Moore.
- Right-to-left scanning.
- Bad character rule.
- Strong good suffix rule.

Boyer-Moore analysis.
- O(N / M) average case if given letter usually doesn't occur in string.
    - time decreases as pattern length increases
    - sublinear in input size!
- At most 3N comparisons to find a match.

Always take the best of the two shifts (bad character rule, strong good suffix rule).

Boyer-Moore in the wild. Unix grep, emacs.

String Search Implementation Cost Summary

Character comparisons to search for an M-character pattern in an N-character text:

    Implementation   Typical    Worst
    Brute            1.1 N †    M N
    Karp-Rabin       Θ(N)       Θ(N) ‡
    KMP              1.1 N †    2N
    Boyer-Moore      N / M †    3N

    † assumes appropriate model
    ‡ randomized

Boyer-Moore and Alphabet Size

Boyer-Moore space requirement. Θ(M + |Σ|)

Big alphabets.
- Direct implementation may be impractical, e.g., Unicode.
- Fix: search one byte at a time.

Small alphabets.
- Loses effectiveness when Σ is too small, e.g., DNA.
- Fix: group characters together, e.g., aaaa, aaac, ….

Finding All Matches

Karp-Rabin. Can find all matches in O(M + N) expected time using Muthukrishnan variant.
Knuth-Morris-Pratt. Can find all matches in O(M + N) time via simple modification.
Boyer-Moore. Can find all matches in O(M + N) time using Galil variant.

[Figure: DFA for search pattern aabaaa, with states 0 to 6 and outgoing a and b transitions added to the accept state.]

Multiple String Search

Multiple string search. Search for any of k different patterns.
- Naïve KMP: O(kN + M1 + … + Mk).
- Aho-Corasick: O(N + M1 + … + Mk).
- Ex: screen out dirty words from a text stream.

[Figure: DFA for ( aaa or abb or baba ), states 0 to 9.]
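One way to read the "simple modification" listed for Knuth-Morris-Pratt under Finding All Matches: give the accept state a fallback too, so the scan continues after each hit instead of returning. The sketch below assumes the binary-alphabet next[] representation from the KMP slides. Storing the fallback in an extra slot next[M] is a choice made here, and the names KmpAllMatches, buildNext and searchAll are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class KmpAllMatches {

    // Construction loop from the KMP slides, with one extra slot:
    // next[M] holds the state to resume from after a full match.
    // Assumption: pattern and text use a two-letter alphabet, e.g. {a, b}.
    static int[] buildNext(String p) {
        int M = p.length();
        int[] next = new int[M + 1];
        int X = 0;                               // state after simulating p without its first character
        for (int j = 1; j < M; j++) {
            if (p.charAt(X) == p.charAt(j)) {
                next[j] = next[X];
                X = X + 1;
            } else {
                next[j] = X + 1;
                X = next[X];
            }
        }
        next[M] = X;                             // longest proper prefix of p that is also a suffix of p
        return next;
    }

    // Returns the offsets of all matches, including overlapping ones.
    static List<Integer> searchAll(String p, String t) {
        List<Integer> hits = new ArrayList<>();
        int M = p.length();
        if (M == 0) return hits;                 // choice made in this sketch: empty pattern yields no hits
        int[] next = buildNext(p);
        int j = 0;
        for (int i = 0; i < t.length(); i++) {
            if (t.charAt(i) == p.charAt(j)) j++; // match
            else j = next[j];                    // mismatch
            if (j == M) {
                hits.add(i - M + 1);             // record the match
                j = next[M];                     // keep scanning; text pointer never backs up
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(searchAll("aabaaa", "aabaaabaaa"));  // [0, 4]
        System.out.println(searchAll("aa", "aaaa"));            // [0, 1, 2]
    }
}
```

Each text character is still examined once, so the scan stays linear in N after the O(M) table construction.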