String Search Algorithms: Brute Force, Karp-Rabin, Knuth-Morris-Pratt, Boyer-Moore, Slides of Data Representation and Algorithm Design

An overview of four string search algorithms: brute force, Karp-Rabin, Knuth-Morris-Pratt, and Boyer-Moore. Each algorithm is explained along with its running time, advantages, and disadvantages. The slides also include examples and code snippets.

Typology: Slides

2011/2012

Uploaded on 07/15/2012

saandeep

String Searching

Slide 2: String Search

String search. Given a pattern string, find first match in text.
Model. Can't afford to preprocess the text.
Parameters. N = length of text, M = length of pattern.

  Text:    inahaystackaneedleina
  Pattern: needle

M = 6, N = 21; typically N >> M.

Slide 3: Applications

Applications.
- Parsers.
- Lexis/Nexis.
- Spam filters.
- Virus scanning.
- Digital libraries.
- Screen scrapers.
- Word processors.
- Web search engines.
- Natural language processing.
- Carnivore surveillance system.
- Computational molecular biology.
- Feature detection in digitized images.

Slide 4: Brute Force: Typical Case

[Figure: brute-force trace of the pattern "needle" slid one position at a time across the text "hayneedsanneedlex".]

Slide 5: Brute Force

Brute force. Check for pattern starting at every text position.

    public static int search(String pattern, String text) {
        int M = pattern.length();
        int N = text.length();
        // i <= N - M (not i < N - M), so a match ending at the last text character is also checked
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++) {
                if (text.charAt(i+j) != pattern.charAt(j)) break;
            }
            if (j == M) return i;   // return offset i of match
        }
        return -1;                  // not found
    }

Slide 6: Brute Force: Worst Case

[Figure: brute-force trace of the pattern "aaaaab" slid one position at a time across a text made of a long run of a's followed by b.]

Slide 7: Analysis of Brute Force

Analysis of brute force.
- Running time depends on pattern and text.
- Slow if M and N are large, and have lots of repetition.

Character comparisons to search for an M-character pattern in an N-character text:

  Implementation   Typical    Worst
  Brute            1.1 N †    M N

  † assumes appropriate model

Slide 8: Screen Scraping

Goal. Find current stock price of Google.
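As a rough sketch of how this goal reduces to string search: find a marker that precedes the price, then read the text between the next pair of tags. The HTML fragment, the "Last Trade:" marker, the price value, and the class and method names below are all made-up placeholders for illustration; the layout of the real page is not shown in this preview. The search routine is the brute-force method from the earlier slide, extended with a hypothetical starting offset.

```java
// Sketch only: the HTML fragment and the "Last Trade:" marker are invented placeholders.
class ScreenScrapeSketch {

    // Brute-force search for pattern in text, starting at offset 'from'.
    // Returns the offset of the first match, or -1 if there is none.
    static int search(String pattern, String text, int from) {
        int M = pattern.length();
        int N = text.length();
        for (int i = from; i <= N - M; i++) {
            int j = 0;
            while (j < M && text.charAt(i + j) == pattern.charAt(j)) j++;
            if (j == M) return i;
        }
        return -1;
    }

    // Returns the text inside the first <b>...</b> pair that follows 'marker',
    // or null if the marker or the tags are missing.
    static String extractAfter(String html, String marker) {
        int start = search(marker, html, 0);
        if (start < 0) return null;
        int open = search("<b>", html, start);
        if (open < 0) return null;
        int close = search("</b>", html, open);
        if (close < 0) return null;
        return html.substring(open + 3, close);
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, not real data.
        String html = "<tr><td>Last Trade:</td><td><b>123.45</b></td></tr>";
        System.out.println(extractAfter(html, "Last Trade:"));   // prints 123.45
    }
}
```

Each call to search restarts from the offset of the previous hit, so the three searches together scan the page at most once from left to right.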
Page to search: http://finance.yahoo.com/q?s=goog (goog is the stock symbol)

Slide 17: String Search Implementation Cost Summary

Karp-Rabin summary.
- Create fingerprint of each substring and compare fingerprints.
- Expected running time is linear.
- Idea generalizes, e.g., to 2D patterns.

Character comparisons to search for an M-character pattern in an N-character text:

  Implementation   Typical    Worst
  Brute            1.1 N †    M N
  Karp-Rabin       Θ(N)       Θ(N) ‡

  † assumes appropriate model
  ‡ randomized

Slide 18: Knuth-Morris-Pratt

[Photos: Don Knuth (1974 Turing award), Vaughan Pratt, Jim Morris.]

Slide 19: Knuth-Morris-Pratt: DFA Simulation

KMP algorithm. [over binary alphabet]
- Build DFA from pattern.
- Run DFA on text.

[Figure: DFA with states 0 to 5 plus an accept state, traced character by character over a sample text of a's and b's.]

Slide 20: Knuth-Morris-Pratt: DFA Simulation

Interpretation of state i. Length of longest prefix of search pattern that is a suffix of input string.
Ex. End in state 4 iff text ends in aaba.
Ex. End in state 2 iff text ends in aa (but not aabaa or aabaaa).

[Figure: DFA for the pattern aabaaa, with accept state.]

Slide 21: DFA Representation

DFA used in KMP has special property.
- Upon character match in state j, go forward to state j+1.
- Upon character mismatch in state j, go back to state next[j].

  Pattern: aabaaa

  state    0  1  2  3  4  5
  a        1  2  2  4  5  6
  b        0  0  3  0  0  3
  next     0  0  2  0  0  3    (only need to store this row)

[Figure: DFA for the pattern aabaaa, with accept state.]

Slide 22: KMP Algorithm

Two key differences from brute force.
- Text pointer i never "backs up."
- Need to precompute next[] table.

Simulation of KMP DFA (assumes binary alphabet); t is the text of length N, p is the pattern of length M:

    int j = 0;
    for (int i = 0; i < N; i++) {
        if (t.charAt(i) == p.charAt(j)) j++;   // match
        else j = next[j];                      // mismatch
        if (j == M) return i - M + 1;          // found
    }
    return -1;                                 // not found

Slides 23-24: Knuth-Morris-Pratt: DFA Construction

Iterative construction. Suppose you've created DFA for pattern aabaaa. How to extend to DFA for pattern aabaaab?
- Easy: transition from state 6 if next char matches.
- Challenge: transition from state 6 if next char mismatches.

Wishful thinking. Simulate aabaaaa on DFA.
Key idea. Simulate abaaaa (the same string without its first character) on DFA; it is one character shorter, so it only visits states that are already built.
Efficient version. Pre-compute simulation of abaaa.

[Figure: DFA for aabaaa extended with state 6 and, in the second build, state 7.]

Slide 25: Knuth-Morris-Pratt: DFA Construction

DFA construction for KMP. DFA builds itself!

State 6. Given DFA for aabaaa and state X of simulating abaaa (the pattern without its first character), compute DFA for aabaaab and state X of simulating abaaab.
- next[6] = (state reached from X on a) = 2.
- Update X = (state reached from X on b) = 3.

[Figure: DFA for aabaaab, with X = 2 before the update.]

Slide 26: DFA Construction for KMP

DFA construction for KMP. DFA builds itself!

State 7. Given DFA for aabaaab and state X of simulating abaaab, compute DFA for aabaaabb and state X of simulating abaaabb.
- next[7] = (state reached from X on a) = 4.
- Update X = (state reached from X on b) = 0.

[Figure: DFA for aabaaabb, with X = 3 before the update.]

Slide 27: DFA Construction for KMP: Java Implementation

Build DFA for KMP.
- Takes O(M) time.
- Requires O(M) extra space to store next[] table.

DFA construction for KMP (assumes binary alphabet):

    int X = 0;
    int[] next = new int[M];
    for (int j = 1; j < M; j++) {
        if (p.charAt(X) == p.charAt(j)) {   // char match
            next[j] = next[X];
            X = X + 1;
        }
        else {                              // char mismatch
            next[j] = X + 1;
            X = next[X];
        }
    }

Slide 28: Optimized KMP Implementation

Ultimate search program for aabaaabb pattern.
- Specialized C program.
- Machine language version of C program.
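For comparison with a program hard-wired to a single pattern, the table construction and the DFA simulation from the preceding slides can be combined into one small runnable class. This is a sketch under the slides' own assumption of a two-letter alphabet; the class and method names (KmpBinarySketch, buildNext, search) are illustrative and not from the slides, and the empty-pattern check is an added guard.

```java
import java.util.Arrays;

// Sketch: table-driven KMP as on the slides. Valid only when text and pattern
// use the same two-letter alphabet (e.g. 'a' and 'b').
class KmpBinarySketch {

    // next[j] = state to fall back to on a mismatch in state j.
    static int[] buildNext(String p) {
        int M = p.length();
        int[] next = new int[M];
        int X = 0;   // state reached by running the pattern minus its first char
        for (int j = 1; j < M; j++) {
            if (p.charAt(X) == p.charAt(j)) {   // char match
                next[j] = next[X];
                X = X + 1;
            } else {                            // char mismatch
                next[j] = X + 1;
                X = next[X];
            }
        }
        return next;
    }

    // Returns the offset of the first match of p in t, or -1 if not found.
    static int search(String p, String t) {
        int M = p.length(), N = t.length();
        if (M == 0) return 0;                   // added guard: empty pattern
        int[] next = buildNext(p);
        int j = 0;
        for (int i = 0; i < N; i++) {           // i never backs up
            if (t.charAt(i) == p.charAt(j)) j++;
            else j = next[j];
            if (j == M) return i - M + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("aabaaa")));   // prints [0, 0, 2, 0, 0, 3]
        System.out.println(search("aabaaa", "baabaabaaab"));        // prints 4
    }
}
```

For the pattern aabaaabb the same construction yields next = 0 0 2 0 0 3 2 4, which agrees with the values next[6] = 2 and next[7] = 4 derived on the construction slides.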
Specialized C program for the pattern aabaaabb. Each label is a DFA state: the character tested comes from pattern[], the goto target from next[]. Assumes pattern is in text (otherwise use a sentinel).

    int kmpsearch(char t[]) {
        int i = 0;
        s0: if (t[i++] != 'a') goto s0;
        s1: if (t[i++] != 'a') goto s0;
        s2: if (t[i++] != 'b') goto s2;
        s3: if (t[i++] != 'a') goto s0;
        s4: if (t[i++] != 'a') goto s0;
        s5: if (t[i++] != 'a') goto s3;
        s6: if (t[i++] != 'b') goto s2;
        s7: if (t[i++] != 'b') goto s4;
        return i - 8;
    }

Slide 37: Bad Character Rule: Java Implementation

    public static int search(String pattern, String text) {
        int M = pattern.length(), N = text.length();

        // right[c] = rightmost occurrence of c in pattern (-1 if absent);
        // the 256-entry table assumes 8-bit character codes
        int[] right = new int[256];
        for (int c = 0; c < 256; c++) right[c] = -1;
        for (int j = 0; j < M; j++) right[pattern.charAt(j)] = j;

        int i = 0;   // offset
        // i <= N - M (not i < N - M), so a match ending at the last text character is also checked
        while (i <= N - M) {
            int skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pattern.charAt(j) != text.charAt(i + j)) {
                    skip = Math.max(1, j - right[text.charAt(i + j)]);   // bad character rule
                    break;
                }
            }
            if (skip == 0) return i;   // found
            i = i + skip;
        }
        return -1;
    }

Slide 38: Bad Character Rule: Analysis

Bad character rule analysis.
- Highly effective in practice, particularly for English text: O(N / M).
- Takes Θ(MN) time in worst case.

[Figure: worst-case trace in which the pattern baaaaaa advances one position at a time across a text consisting almost entirely of a's.]

Slide 39: Strong Good Suffix Rule

Strong good suffix rule. [a KMP-like suffix rule]
- Right-to-left scanning.
- Suppose text matches suffix t of pattern but mismatches in previous character c.
- Find rightmost copy of t in pattern whose preceding letter is not c, and shift; if no such copy, shift M positions.

[Figure: pattern xcabdabdab aligned against a text, with t = "ab" and c = 'b'. Strong good suffix rule: can skip over the next alignment since we already know dab doesn't match. Bad character rule: skip only 1 position.]

Slide 40: Boyer-Moore

Boyer-Moore.
- Right-to-left scanning.
- Bad character rule.
- Strong good suffix rule.
(Always take the best of the two shifts.)

Boyer-Moore analysis.
- O(N / M) average case if given letter usually doesn't occur in string.
  - time decreases as pattern length increases
  - sublinear in input size!
- At most 3N comparisons to find a match.

Boyer-Moore in the wild. Unix grep, emacs.

Slide 41: String Search Implementation Cost Summary

Search for M-character pattern in N-character text:

  Implementation   Typical    Worst
  Brute            1.1 N †    M N
  Karp-Rabin       Θ(N)       Θ(N) ‡
  KMP              1.1 N †    2N
  Boyer-Moore      N / M †    3N

  † assumes appropriate model
  ‡ randomized

Slide 42: Boyer-Moore and Alphabet Size

Boyer-Moore space requirement. Θ(M + |Σ|)

Big alphabets.
- Direct implementation may be impractical, e.g., Unicode.
- Fix: search one byte at a time.

Small alphabets.
- Loses effectiveness when Σ is too small, e.g., DNA.
- Fix: group characters together, e.g., aaaa, aaac, ….

Slide 43: Finding All Matches

Karp-Rabin. Can find all matches in O(M + N) expected time using Muthukrishnan variant.
Knuth-Morris-Pratt. Can find all matches in O(M + N) time via simple modification.
Boyer-Moore. Can find all matches in O(M + N) time using Galil variant.

[Figure: DFA for the search pattern aabaaa, with a transition added out of the accept state so that scanning can continue after a match.]

Slide 44: Multiple String Search

Multiple string search. Search for any of k different patterns.
- Naïve KMP: O(kN + M1 + … + Mk).
- Aho-Corasick: O(N + M1 + … + Mk).
- Ex: screen out dirty words from a text stream.

[Figure: DFA for ( aaa or abb or baba ).]
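One way to read the "simple modification" mentioned for Knuth-Morris-Pratt on the Finding All Matches slide: after a match, fall back as if a mismatch had occurred and keep scanning instead of returning. The sketch below is an assumption about what that modification could look like, not the slides' own code. It swaps the binary-alphabet next[] table for the standard failure-function formulation of KMP so that it works for any alphabet; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: KMP with a failure function, reporting every match offset.
class KmpAllMatchesSketch {

    // fail[j] = length of the longest proper prefix of p[0..j]
    // that is also a suffix of p[0..j].
    static int[] failure(String p) {
        int M = p.length();
        int[] fail = new int[M];
        int k = 0;
        for (int j = 1; j < M; j++) {
            while (k > 0 && p.charAt(j) != p.charAt(k)) k = fail[k - 1];
            if (p.charAt(j) == p.charAt(k)) k++;
            fail[j] = k;
        }
        return fail;
    }

    // Returns the offsets of all (possibly overlapping) matches of p in t.
    static List<Integer> searchAll(String p, String t) {
        List<Integer> matches = new ArrayList<>();
        int M = p.length(), N = t.length();
        if (M == 0) return matches;   // added guard: report nothing for an empty pattern
        int[] fail = failure(p);
        int j = 0;                    // number of pattern characters currently matched
        for (int i = 0; i < N; i++) { // text pointer never backs up
            while (j > 0 && t.charAt(i) != p.charAt(j)) j = fail[j - 1];
            if (t.charAt(i) == p.charAt(j)) j++;
            if (j == M) {
                matches.add(i - M + 1);
                j = fail[M - 1];      // keep going instead of returning
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(searchAll("aa", "aaaa"));   // prints [0, 1, 2]
    }
}
```

Because the text pointer only moves forward and each fallback shortens the current partial match, the total work stays linear in M + N even when matches overlap.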
Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved