Spell Checking - Data Structures and Algorithms - Project | CSCI 2300, Study Guides, Projects, Research of Data Structures and Algorithms

Material Type: Project; Class: INTRODUCTION TO ALGORITHMS; Subject: Computer Science; University: Rensselaer Polytechnic Institute; Term: Spring 2006;

Uploaded on 08/09/2009

koofers-user-r2s

CSCI 2300: Data Structures and Algorithms
Project 1: Spell Checking
Spring 2006

Due Date

The due date is Friday, February 10, 2006 by 11:59:59pm. See the syllabus for late policies and academic integrity policies. See below for submission guidelines.

Spell Checking

Spell checking is the process of verifying that a particular word is spelled properly according to some dictionary. Spell checkers are used in many applications, including word processors (such as Microsoft Word), electronic dictionaries, and optical character recognition (OCR) systems that need to turn images of printed text (or even handwriting) into coherent text.

Spell checking itself is trivial, requiring only a simple lookup in a dictionary. However, most applications of spell checking also require that the spell checker provide a list of potentially correct spellings (“near matches”) when the word was spelled improperly. For instance, if I type “speling” into an online dictionary, it will provide suggestions of similar words that I may have meant to type, including “spelling”, “spoiling”, “sapling”, and “splendid”.

Your task is to implement a spell checker that determines if a given word is spelled correctly based on a dictionary lookup. When the word is not spelled correctly, it will provide a list of similar-sounding words based on your implementation of the Metaphone algorithm, ordered by their edit distance from the string that the user typed. We could use only the edit distance to find near matches, but by first using the Metaphone algorithm we achieve better results because it can find the word you are looking for even if you have no idea how to spell it. Additionally, we can isolate the set of near matches more quickly using the Metaphone algorithm.
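
One way to isolate near matches quickly, as the paragraph above suggests, is to group dictionary words by their phonetic key so that a misspelled word needs only a single bucket lookup instead of a scan of the whole dictionary. The sketch below illustrates that idea under stated assumptions: toy_key is a hypothetical stand-in and is not the Metaphone algorithm (the real rules come from rules.txt), and the names PhoneticIndex, build_index, and candidates are invented here for illustration.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// ASSUMPTION: placeholder key, NOT Metaphone. It uppercases the word,
// drops vowels that are not the first letter, collapses repeated
// letters, and keeps at most 4 characters.
std::string toy_key(const std::string& word) {
    std::string key;
    for (std::size_t i = 0; i < word.size() && key.size() < 4; ++i) {
        char c = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[i])));
        bool vowel = (c == 'A' || c == 'E' || c == 'I' || c == 'O' || c == 'U');
        if (vowel && i > 0) continue;                   // skip non-leading vowels
        if (!key.empty() && key.back() == c) continue;  // collapse repeats
        key.push_back(c);
    }
    return key;
}

// Maps a phonetic key to every dictionary word that shares it.
using PhoneticIndex = std::unordered_map<std::string, std::vector<std::string>>;

// Build the index once, when the dictionary is read.
PhoneticIndex build_index(const std::vector<std::string>& dictionary) {
    PhoneticIndex index;
    for (const std::string& word : dictionary) {
        index[toy_key(word)].push_back(word);
    }
    return index;
}

// Return the words that share a key with the typed word (possibly none).
std::vector<std::string> candidates(const PhoneticIndex& index,
                                    const std::string& typed) {
    auto it = index.find(toy_key(typed));
    if (it == index.end()) return {};
    return it->second;
}
```

With a real Metaphone function substituted for toy_key, the bucket returned by candidates would be the set of homophones that is later ranked by edit distance. A hash map is only one possible choice; a sorted container keyed by pronunciation would serve the same purpose.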

Overview

Your program will be run from the command line, and will take a rule file (rules.txt, described later), a dictionary file containing about 10,000 words (dictionary.txt, from the web page), and some words. It will check the spelling of each word. If the word is spelled properly, it will print “Word ‘[the word]’ is spelled correctly.” Otherwise, it will print the word and its pronunciation, followed by a list of like-sounding words ordered by edit distance (shown in brackets). For example, if your program is named spellcheck.exe, here is a sample program run:

    spellcheck.exe rules.txt dictionary.txt speling supress save
    Word ‘speling’ not in dictionary. It sounds like ‘SPLN’.
    Perhaps you meant:
    spelling [1]
    sapling [2]
    spellings [2]
    spoiling [2]
    splendid [5]
    Word ‘supress’ not in dictionary. It sounds like ‘SPRS’.
    Perhaps you meant:
    suppress [1]
    spares [3]
    suppressed [3]
    suppresses [3]
    surprise [3]
    surprises [3]
    suppressing [4]
    surprised [4]
    spurious [5]
    surprising [5]
    surprisingly [7]
    Word ‘save’ is spelled correctly.

Read on to learn about the Metaphone algorithm, which will be used to isolate the set of near matches, and an edit distance algorithm, which will be used to rank the results (notice that “spelling” comes before “spoiling”). Fig. 1 gives an overview of the spell-checking process. Please write your program to mimic the output given exactly. We will provide more example output on the project web page.

    input word --(Metaphone)--> Metaphone pronunciation --(dictionary lookup)--> set of homophones --(edit distance)--> ordered list of homophones

Figure 1: Overview: Sequence of Transformations for Spell Checking

The Metaphone Algorithm

The Metaphone algorithm, created by Lawrence Philips, takes a word and returns a very rough approximation of the sound of that word. The rough approximation eliminates the distinction between some characters; for instance, ‘S’ and ‘Z’ sound very similar, so the Metaphone algorithm maps them to the same sound (“s”).
Some letters may make many different types of sounds: ‘C’, for instance, may make the “sh” sound (denoted by ‘X’) when it is part of “cia” (as in “social”), the “s” sound when it is followed by ‘E’, ‘I’, or ‘Y’ (as in “since”), or even the “k” sound when it is by itself, or followed by a ‘K’ (as in “clack”).

The Metaphone algorithm maps every sound in a word down to one of the following sounds: *, B, X, S, K, J, T, F, H, L, M, N, P, R, @, W, Y. The ‘*’ represents the sound of a vowel at the beginning of the word, ‘X’ represents the “sh” sound, and ‘@’ represents the “th” sound. All of the other symbols in the list sound like the consonant they name. Vowels not at the beginning of the word are considered silent. See Fig. 2(a) for some examples of strings generated by the Metaphone algorithm for some common words. We (arbitrarily) limit the length of the Metaphone output strings to 4 characters, which has been found to produce better results in many cases than using longer strings.

Implementing Metaphone

To implement the Metaphone algorithm, you will use a table-driven algorithm to transform the input string into its Metaphone pronunciation. Fig. 2(b) contains several of the transformation rules used in the Metaphone algorithm. The complete set of rules will be provided on the project web page in the file rules.txt.

Each of the rules contains a pattern that should be matched to the input word. Patterns are first matched at the first character of the string. When the first matching pattern is found (the order of the rules is significant), it first consumes part of the input string and then the pronunciation for that pattern is added to the

[...]

3. If the strings with two deletions/replacements are still never equal, allow for exactly three symbol deletions/replacements and repeat.
4. Repeat the process, stopping when the strings with deletions/replacements are equal. Return the number of deletions and replacements that were required.

This naive approach is not only tedious to implement, but is also very inefficient. The implementation involves a process that redundantly computes many sub-problems. To make the algorithm more efficient, we will use a technique called dynamic programming, which systematically records the answers to sub-problems in a table. Dynamic programming eliminates the redundant computation of sub-problems through the use of additional memory.

The dynamic programming approach is recursive. We compute the edit distance of two strings incrementally by first computing the edit distance of the prefixes of the two strings. In turn, the edit distance of each prefix can be computed from the edit distance of even smaller prefixes. This process continues until the prefix size is one. At this point, the problem of determining edit distance is as simple as comparing the first two symbols in the strings.

To perform this recursive computation, we must build a table where each table cell corresponds to the edit distance of a pair of prefixes. The initial table is shown in Figure 3(a). One string Y is aligned across the top and the other X is aligned down the left. Each cell of the table T can be identified using indices. For example, T[i][j] corresponds to the cell in row i and column j.

Row zero and column zero are always initialized as shown in the figure, that is, T[0][j] = j and T[i][0] = i. The initialization has meaning. For example, the value zero in T[0][0] indicates that the edit distance of two empty strings is zero. The value five in T[0][5] indicates that the edit distance between Y[0, ..., 4] = agtac and the empty string is five. And the value nine in T[9][0] indicates that the edit distance between X[0, ..., 8] = gtatcgtat and the empty string is nine. In general, the value T[i][j] represents the edit distance between the strings Y[0, ..., j-1] and X[0, ..., i-1].
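
The table layout and initialization just described can be sketched in C++ as follows. This is an illustrative sketch, not a reference solution: the function name edit_distance_table is an assumption, and the cells are filled with the rule the handout states under “Computing T[i][j]” (copy the diagonal value when the two symbols match, otherwise take one plus the smallest of the top, diagonal, and left neighbors).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Edit distance using the full (|x|+1) by (|y|+1) table.
// x runs down the left (row index i), y across the top (column index j),
// so T[i][j] is the distance between x[0..i-1] and y[0..j-1].
int edit_distance_table(const std::string& x, const std::string& y) {
    const std::size_t rows = x.size() + 1;
    const std::size_t cols = y.size() + 1;
    std::vector<std::vector<int>> T(rows, std::vector<int>(cols, 0));

    // Row zero and column zero: distance from a prefix to the empty string.
    for (std::size_t i = 0; i < rows; ++i) T[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j < cols; ++j) T[0][j] = static_cast<int>(j);

    // Fill row by row, left to right, so all three neighbors are ready.
    for (std::size_t i = 1; i < rows; ++i) {
        for (std::size_t j = 1; j < cols; ++j) {
            if (x[i - 1] == y[j - 1]) {
                T[i][j] = T[i - 1][j - 1];            // symbols match
            } else {
                T[i][j] = 1 + std::min({T[i - 1][j],      // top
                                        T[i - 1][j - 1],  // diagonal
                                        T[i][j - 1]});    // left
            }
        }
    }
    return T[rows - 1][cols - 1];  // bottom-right corner
}
```

For the handout's example strings, edit_distance_table("gtatcgtat", "agtacgtcat") returns 3, which agrees with the value reported for Figure 4.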

[Figure 3: The initial dynamic programming table. (a) The table with row zero and column zero initialized, for X = gtatcgtat down the left (row index i) and Y = agtacgtcat across the top (column index j). (b) Neighbors of T[i][j]: (1) left, (2) top, (3) diagonal top-left.]

Computing the Table

• The table values should be computed row by row starting from the first row and moving down to the bottom row.
• Each row is computed from left to right.
• To compute a new cell, three neighbors must already be computed (see Figure 3(b)): (1) the left neighbor, (2) the top neighbor, and (3) the diagonal top-left neighbor.

Recall that T[i][j] is the edit distance between X[0, ..., i-1] and Y[0, ..., j-1]. If the two symbols that correspond to T[i][j] are equal, then no deletion or replacement is needed. Thus, the new edit distance is equal to the edit distance between X[0, ..., i-2] and Y[0, ..., j-2], which was previously computed and stored in T[i-1][j-1]. If the two symbols are not equal, then a deletion or replacement must be performed. To perform a replacement, we take the edit distance from the diagonal top-left neighbor and increase it by one. To delete, we must examine the edit distance stored in the top and left neighbors. If the top neighbor has a smaller edit distance, then the symbol along the left should be deleted. If the left neighbor has a smaller edit distance, then the symbol along the top should be deleted. This decision ensures that we choose the proper deletions to minimize the edit distance.

Computing T[i][j]

• Each cell in the table corresponds to two symbols: the symbol along the top row and the symbol along the left column.
• If the two symbols are equal, we copy the diagonal top-left value to the new cell.
• If the two symbols are not equal, we examine the top cell, diagonal top-left cell, and the left cell. We choose the cell with the smallest value, add one to that value, and write the result into the new cell.

Figure 4 shows the entire computed table. The purpose of the entire table is to arrive at the value in the bottom right corner, which represents the edit distance between the two strings.

[Figure 4: The computed dynamic programming table. The edit distance between the two strings is 3.]

Dynamic programming saves an incredible amount of computation by recording the edit distance between the prefixes of the strings, rather than recomputing them as needed. The naive algorithm we first described implicitly recomputes this information over and over again.

The problem with dynamic programming is that the table requires a lot of memory. Given two input strings of length n and m, the size of the table would be (n+1)(m+1). For example, if you were given two input strings that were each one million symbols long, the table would contain about 10^12 cells, which is too large to store in memory.

However, storing the entire table is not necessary. In fact, we can compute the final edit distance by storing only two rows of the table. If we align a string of length n along the top of the table, the amount of memory required is 2(n+1). This is described as a linear memory algorithm because the amount of memory is linearly proportional to the length of the input strings, which is a very nice feature.¹

Grading criteria

• Compilation (30 points): Your program must compile using VC++ .NET or g++ (version 2.95 or higher; check with g++ --version). If you comment out most of your code or if you do not attempt to implement the algorithms you will not receive this credit.
• Metaphone algorithm: 20 points.
• Spell checker: 15 points. This includes reading in the dictionary, spell-checking words, and presenting the results.
• Edit distance algorithm: 15 points.
• Design and Documentation (20 points): Your algorithms must be well-designed and well-structured, and your code well-commented. Make sure you implement an efficient solution to the problem. For example, you should not search the entire dictionary to find all homophones. You are encouraged (but not required) to use STL containers and algorithms in your code.

Notes and suggestions

• When you write code to read in rules.txt, make absolutely sure your program has read everything properly before passing the resulting data structure to the Metaphone algorithm. You don’t want to spend time debugging the algorithm when it’s just your input reading causing problems.
• Write and test the Metaphone algorithm implementation and the edit distance implementation separately. Writing and testing in isolation makes the whole program easier to debug.
• You may want to write the Metaphone algorithm first, then the spell checker (without sorting the homophones), and finally implement the edit distance algorithm to rank the results.
• Start early!

Submission Guidelines

Your submission must include only your source code (.h and .cpp suffix) and a brief report as a readme.txt file. Your submission should NOT include any files with a .exe, .dsp, .dsw, .ncb, or .opt suffix. Every file should have your name in a comment line at the top. Your readme.txt file should have a brief description of your program design, the breakdown of the files, which compiler you used (VC++ .NET will be taken as the default), and a summary of what you think works and fails in your program.

Exact details of the web-based submission procedure you must follow will be posted on the project web page.
Briefly, you should submit a single zip, tar, or gzip file containing your source code files and readme.txt file. You can submit your project multiple times; only the most recent project submission will be graded. The most recent project submission will also be used to compute late days, if any.

¹ You are not required to implement this optimization.
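
For readers curious about the optional optimization named in the footnote, the two-row idea can be sketched as below. This is a minimal sketch under assumptions, not part of the required work: the function name edit_distance_two_rows is hypothetical, and the string passed as y plays the role of the string aligned along the top of the table, so memory use is 2(n+1) integers for a top string of length n.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Edit distance that keeps only the previous and current table rows.
// x indexes the rows, y the columns, matching the handout's layout.
int edit_distance_two_rows(const std::string& x, const std::string& y) {
    const std::size_t cols = y.size() + 1;
    std::vector<int> prev(cols), curr(cols);

    // Row zero: distance from each prefix of y to the empty string.
    for (std::size_t j = 0; j < cols; ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= x.size(); ++i) {
        curr[0] = static_cast<int>(i);  // column zero of row i
        for (std::size_t j = 1; j < cols; ++j) {
            if (x[i - 1] == y[j - 1]) {
                curr[j] = prev[j - 1];                // diagonal
            } else {
                curr[j] = 1 + std::min({prev[j],      // top
                                        prev[j - 1],  // diagonal
                                        curr[j - 1]}); // left
            }
        }
        std::swap(prev, curr);  // the finished row becomes the previous row
    }
    return prev[cols - 1];  // last computed row, rightmost cell
}
```

Each row depends only on the row directly above it, which is why discarding older rows does not change the result.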