String Matching Algorithms: Brute Force and Karp-Rabin Fingerprinting, Study notes of Computer Science

These notes cover two algorithms for string matching: brute force and Karp-Rabin fingerprinting. The brute force algorithm searches for a pattern in a text by comparing each substring of the text with the pattern. The Karp-Rabin fingerprinting algorithm uses a hash function to identify potential matches, which are then verified using a brute-force comparison. The document also discusses the time complexity of each algorithm and the importance of choosing a good prime number for the hash function.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-8hb-1

CS 373 Non-Lecture C: String Matching Fall 2002

C String Matching

C.1 Brute Force

The basic object that we’re going to talk about for the next two lectures is a string, which is really just an array. The elements of the array come from a set Σ called the alphabet; the elements themselves are called characters. Common examples are ASCII text, where each character is a seven-bit integer [1]; strands of DNA, where the alphabet is the set of nucleotides {A, C, G, T}; or proteins, where the alphabet is the set of 22 amino acids.

The problem we want to solve is the following. Given two strings, a text T[1 .. n] and a pattern P[1 .. m], find the first substring of the text that is the same as the pattern. (It would be easy to extend our algorithms to find all matching substrings, but we will resist.) A substring is just a contiguous subarray. For any shift s, let $T_s$ denote the substring T[s .. s + m − 1]. So more formally, we want to find the smallest shift s such that $T_s = P$, or report that there is no match. For example, if the text is the string ‘AMANAPLANACATACANALPANAMA’ [2] and the pattern is ‘CAN’, then the output should be 15. If the pattern is ‘SPAM’, then the answer should be ‘none’. In most cases the pattern is much smaller than the text; to make this concrete, I’ll assume that m < n/2.

Here’s the ‘obvious’ brute force algorithm, but with one immediate improvement. The inner while loop compares the substring $T_s$ with P. If the two strings are not equal, this loop stops at the first character mismatch.

    AlmostBruteForce(T[1 .. n], P[1 .. m]):
        for s ← 1 to n − m + 1
            equal ← true
            i ← 1
            while equal and i ≤ m
                if T[s + i − 1] ≠ P[i]
                    equal ← false
                else
                    i ← i + 1
            if equal
                return s
        return ‘none’

In the worst case, the running time of this algorithm is O((n − m)m) = O(nm), and we can actually achieve this running time by searching for the pattern AAA...AAAB with m − 1 A’s, in a text consisting of n A’s.

In practice, though, breaking out of the inner loop at the first mismatch makes this algorithm quite practical. We can wave our hands at this by assuming that the text and pattern are both random. Then on average, we perform a constant number of comparisons at each shift s, so the total expected number of comparisons is O(n). Of course, neither English nor DNA is really random, so this is only a heuristic argument.

[1] Yes, seven. Most computer systems use some sort of 8-bit character set, but there’s no universally accepted standard. Java supposedly uses the Unicode character set, which has variable-length characters and therefore doesn’t really fit into our framework. Just think, someday you’ll be able to write ‘¶ = ℵ[∞++]/f;’ in your Java code! Joy!

[2] Dan Hoey (or rather, his computer program) found the following 540-word palindrome in 1984. We have better online dictionaries now, so I’m sure you could do better.

A man, a plan, a caret, a ban, a myriad, a sum, a lac, a liar, a hoop, a pint, a catalpa, a gas, an oil, a bird, a yell, a vat, a caw, a pax, a wag, a tax, a nay, a ram, a cap, a yam, a gay, a tsar, a wall, a car, a luger, a ward, a bin, a woman, a vassal, a wolf, a tuna, a nit, a pall, a fret, a watt, a bay, a daub, a tan, a cab, a datum, a gall, a hat, a fag, a zap, a say, a jaw, a lay, a wet, a gallop, a tug, a trot, a trap, a tram, a torr, a caper, a top, a tonk, a toll, a ball, a fair, a sax, a minim, a tenor, a bass, a passer, a capital, a rut, an amen, a ted, a cabal, a tang, a sun, an ass, a maw, a sag, a jam, a dam, a sub, a salt, an axon, a sail, an ad, a wadi, a radian, a room, a rood, a rip, a tad, a pariah, a revel, a reel, a reed, a pool, a plug, a pin, a peek, a parabola, a dog, a pat, a cud, a nu, a fan, a pal, a rum, a nod, an eta, a lag, an eel, a batik, a mug, a mot, a nap, a maxim, a mood, a leek, a grub, a gob, a gel, a drab, a citadel, a total, a cedar, a tap, a gag, a rat, a manor, a bar, a gal, a cola, a pap, a yaw, a tab, a raj, a gab, a nag, a pagan, a bag, a jar, a bat, a way, a papa, a local, a gar, a baron, a mat, a rag, a gap, a tar, a decal, a tot, a led, a tic, a bard, a leg, a bog, a burg, a keel, a doom, a mix, a map, an atom, a gum, a kit, a baleen, a gala, a ten, a don, a mural, a pan, a faun, a ducat, a pagoda, a lob, a rap, a keep, a nip, a gulp, a loop, a deer, a leer, a lever, a hair, a pad, a tapir, a door, a moor, an aid, a raid, a wad, an alias, an ox, an atlas, a bus, a madam, a jag, a saw, a mass, an anus, a gnat, a lab, a cadet, an em, a natural, a tip, a caress, a pass, a baronet, a minimax, a sari, a fall, a ballot, a knot, a pot, a rep, a carrot, a mart, a part, a tort, a gut, a poll, a gateway, a law, a jay, a sap, a zag, a fat, a hall, a gamut, a dab, a can, a tabu, a day, a batt, a waterfall, a patina, a nut, a flow, a lass, a van, a mow, a nib, a draw, a regular, a call, a war, a stay, a gam, a yap, a cam, a ray, an ax, a tag, a wax, a paw, a cat, a valley, a drib, a lion, a saga, a plat, a catnip, a pooh, a rail, a calamus, a dairyman, a bater, a canal, Panama!

C.2 Strings as Numbers

For the rest of the lecture, let’s assume that the alphabet consists of the numbers 0 through 9, so we can interpret any array of characters as either a string or a decimal number. In particular, let p be the numerical value of the pattern P, and for any shift s, let $t_s$ be the numerical value of $T_s$:

$$p = \sum_{i=1}^{m} 10^{m-i} \cdot P[i] \qquad\qquad t_s = \sum_{i=1}^{m} 10^{m-i} \cdot T[s+i-1]$$

For example, if T = 31415926535897932384626433832795028841971 and m = 4, then $t_{17} = 2384$. Clearly we can rephrase our problem as follows: Find the smallest s, if any, such that $p = t_s$.
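As a rough illustration of the rephrasing above, the following Python sketch puts the AlmostBruteForce procedure from Section C.1 next to a direct evaluation of the sum that defines the numerical value of a substring. The function names (`almost_brute_force`, `numeric_value`, `smallest_equal_value`) are hypothetical and not part of the notes. Shifts are reported 1-based to match the text, and `None` stands in for ‘none’.

```python
def almost_brute_force(text, pattern):
    """Return the smallest 1-based shift s with text[s .. s+m-1] == pattern, or None."""
    n, m = len(text), len(pattern)
    for s in range(1, n - m + 2):          # s = 1 .. n-m+1
        i = 1
        # Stop at the first mismatch; T[s+i-1] is text[s+i-2] with 0-based indexing.
        while i <= m and text[s + i - 2] == pattern[i - 1]:
            i += 1
        if i > m:                          # all m characters matched
            return s
    return None


def numeric_value(digits, s, m):
    """Decimal value of the m digits starting at 1-based position s (the sum defining t_s)."""
    return sum(10 ** (m - i) * int(digits[s + i - 2]) for i in range(1, m + 1))


def smallest_equal_value(text, pattern):
    """Smallest 1-based shift s with p == t_s, recomputing each t_s from scratch."""
    n, m = len(text), len(pattern)
    p = numeric_value(pattern, 1, m)
    for s in range(1, n - m + 2):
        if numeric_value(text, s, m) == p:
            return s
    return None
```

On digit strings the two searches should agree, since two equal-length digit strings are equal exactly when their decimal values are equal. Recomputing each value from scratch costs O(m) per shift, so this version is no faster than brute force.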
We can compute p in O(m) arithmetic operations, without having to explicitly compute powers of ten, using Horner’s rule:

$$p = P[m] + 10\bigl(P[m-1] + 10\bigl(P[m-2] + \cdots + 10\bigl(P[2] + 10 \cdot P[1]\bigr)\cdots\bigr)\bigr)$$

We could also compute any $t_s$ in O(m) operations using Horner’s rule, but this leads to essentially the same brute-force algorithm as before. But once we know $t_s$, we can actually compute $t_{s+1}$ in constant time just by doing a little arithmetic: subtract off the most significant digit $T[s] \cdot 10^{m-1}$, shift everything up by one digit, and add the new least significant digit T[s + m]:

$$t_{s+1} = 10\bigl(t_s - 10^{m-1} \cdot T[s]\bigr) + T[s+m]$$

To make this fast, we need to precompute the constant $10^{m-1}$. (And we know how to do that quickly. Right?) So it seems that we can solve the string matching problem in O(n) worst-case time using the following algorithm:

    NumberSearch(T[1 .. n], P[1 .. m]):
        σ ← 10^(m−1)
        p ← 0
        t_1 ← 0
        for i ← 1 to m
            p ← 10 · p + P[i]
            t_1 ← 10 · t_1 + T[i]
        for s ← 1 to n − m + 1
            if p = t_s
                return s
            if s ≤ n − m
                t_(s+1) ← 10 · (t_s − σ · T[s]) + T[s + m]
        return ‘none’

Unfortunately, the most we can say is that the number of arithmetic operations is O(n). These operations act on numbers with up to m digits. Since we want to handle arbitrarily long patterns, we can’t assume that each operation takes only constant time!
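The NumberSearch procedure above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `horner_value` and `number_search` are hypothetical, the text and pattern are assumed to be strings of decimal digits with 1 ≤ m ≤ n (anything else returns `None`), and shifts are reported 1-based as in the notes. Python integers have arbitrary precision, so each arithmetic step on an m-digit value is not a constant-time operation, which is exactly the caveat raised in the last paragraph.

```python
def horner_value(digits):
    """Decimal value of a digit string via Horner's rule (no explicit powers of ten)."""
    value = 0
    for d in digits:
        value = 10 * value + int(d)
    return value


def number_search(text, pattern):
    """Smallest 1-based shift s with p == t_s, updating t_s incrementally; None if no match."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:                    # assumption: outside 1 <= m <= n, report no match
        return None
    sigma = 10 ** (m - 1)                  # weight of the most significant digit
    p = horner_value(pattern)
    t = horner_value(text[:m])             # t_1
    for s in range(1, n - m + 2):          # s = 1 .. n-m+1
        if p == t:
            return s
        if s + m <= n:                     # T[s+m] exists only while s <= n-m
            # Drop the leading digit T[s], shift up one place, append T[s+m].
            t = 10 * (t - sigma * int(text[s - 1])) + int(text[s + m - 1])
    return None
```

For instance, with the digit text used earlier in this section, `number_search(T, "2384")` should report shift 17, matching the worked value of $t_{17}$.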