Table of contents

Preface
1 Introduction
  1.1 Parsing as a craft
  1.2 The approach used
  1.3 Outline of the contents
  1.4 The annotated bibliography
2 Grammars as a generating device
  2.1 Languages as infinite sets
    2.1.1 Language
    2.1.2 Grammars
    2.1.3 Problems
    2.1.4 Describing a language through a finite recipe
  2.2 Formal grammars
    2.2.1 Generating sentences from a formal grammar
    2.2.2 The expressive power of formal grammars
  2.3 The Chomsky hierarchy of grammars and languages
    2.3.1 Type 1 grammars
    2.3.2 Type 2 grammars
    2.3.3 Type 3 grammars
    2.3.4 Type 4 grammars
  2.4 VW grammars
    2.4.1 The human inadequacy of CS and PS grammars
    2.4.2 VW grammars
    2.4.3 Infinite symbol sets
    2.4.4 BNF notation for VW grammars
    2.4.5 Affix grammars
  2.5 Actually generating sentences from a grammar
    2.5.1 The general case
    2.5.2 The CF case
  2.6 To shrink or not to shrink
  2.7 A characterization of the limitations of CF and FS grammars
    2.7.1 The uvwxy theorem
    2.7.2 The uvw theorem
  2.8 Hygiene in grammars
    2.8.1 Undefined non-terminals
    2.8.2 Unused non-terminals
    2.8.3 Non-productive non-terminals
    2.8.4 Loops
  2.9 The semantic connection
    2.9.1 Attribute grammars
    2.9.2 Transduction grammars
  2.10 A metaphorical comparison of grammar types
3 Introduction to parsing
  3.1 Various kinds of ambiguity
  3.2 Linearization of the parse tree
  3.3 Two ways to parse a sentence
    3.3.1 Top-down parsing
    3.3.2 Bottom-up parsing
    3.3.3 Applicability
  3.4 Non-deterministic automata
    3.4.1 Constructing the NDA
    3.4.2 Constructing the control mechanism
  3.5 Recognition and parsing for Type 0 to Type 4 grammars
    3.5.1 Time requirements
    3.5.2 Type 0 and Type 1 grammars
    3.5.3 Type 2 grammars
    3.5.4 Type 3 grammars
    3.5.5 Type 4 grammars
  3.6 An overview of parsing methods
    3.6.1 Directionality
    3.6.2 Search techniques
    3.6.3 General directional methods
    3.6.4 Linear methods
    3.6.5 Linear top-down and bottom-up methods
    3.6.6 Almost deterministic methods
    3.6.7 Left-corner parsing
    3.6.8 Conclusion
4 General non-directional methods
  4.1 Unger's parsing method
    4.1.1 Unger's method without ε-rules or loops
    4.1.2 Unger's method with ε-rules
  4.2 The CYK parsing method
    4.2.1 CYK recognition with general CF grammars
    4.2.2 CYK recognition with a grammar in Chomsky Normal Form
    4.2.3 Transforming a CF grammar into Chomsky Normal Form

  9.9 Non-canonical parsers
  9.10 LR(k) as an ambiguity test
10 Error handling
  10.1 Detection versus recovery versus correction
  10.2 Parsing techniques and error detection
    10.2.1 Error detection in non-directional parsing methods
    10.2.2 Error detection in finite-state automata
    10.2.3 Error detection in general directional top-down parsers
    10.2.4 Error detection in general directional bottom-up parsers
    10.2.5 Error detection in deterministic top-down parsers
    10.2.6 Error detection in deterministic bottom-up parsers
  10.3 Recovering from errors
  10.4 Global error handling
  10.5 Ad hoc methods
    10.5.1 Error productions
    10.5.2 Empty table slots
    10.5.3 Error tokens
  10.6 Regional error handling
    10.6.1 Backward/forward move
  10.7 Local error handling
    10.7.1 Panic mode
    10.7.2 FOLLOW set error recovery
    10.7.3 Acceptable-sets derived from continuations
    10.7.4 Insertion-only error correction
    10.7.5 Locally least-cost error recovery
  10.8 Suffix parsing
11 Comparative survey
  11.1 Considerations
  11.2 General parsers
    11.2.1 Unger
    11.2.2 Earley
    11.2.3 Tomita
    11.2.4 Notes
  11.3 Linear-time parsers
    11.3.1 Requirements
    11.3.2 Strong-LL(1) versus LALR(1)
    11.3.3 Table size
12 A simple general context-free parser
  12.1 Principles of the parser
  12.2 The program
    12.2.1 Handling left recursion
  12.3 Parsing in polynomial time
13 Annotated bibliography
  13.1 Miscellaneous literature
  13.2 Unrestricted PS and CS grammars
  13.3 Van Wijngaarden grammars and affix grammars
  13.4 General context-free parsers
  13.5 LL parsing
  13.6 LR parsing
  13.7 Left-corner parsing
  13.8 Precedence and bounded-context parsing
  13.9 Finite-state automata
  13.10 Natural language handling
  13.11 Error handling
  13.12 Transformations on grammars
  13.13 General books on parsing
  13.14 Some books on computer science
Author index
Index

Preface

Parsing (syntactic analysis) is one of the best understood branches of computer science. Parsers are already being used extensively in a number of disciplines: in computer science (for compiler construction, database interfaces, self-describing data-bases, artificial intelligence), in linguistics (for text analysis, corpora analysis, machine translation, textual analysis of biblical texts), in document preparation and conversion, in typesetting chemical formulae and in chromosome recognition, to name a few; they can be used (and perhaps are) in a far larger number of disciplines. It is therefore surprising that there is no book which collects the knowledge about parsing and explains it to the non-specialist. Part of the reason may be that parsing has a name for being "difficult". In discussing the Amsterdam Compiler Kit and in teaching compiler construction, it has, however, been our experience that seemingly difficult parsing techniques can be explained in simple terms, given the right approach. The present book is the result of these considerations.

This book does not address a strictly uniform audience.
On the contrary, while writing this book, we have consistently tried to imagine giving a course on the subject to a diffuse mixture of students and faculty members of assorted faculties, sophisticated laymen, the avid readers of the science supplement of the large newspapers, etc. Such a course was never given; a diverse audience like that would be too uncoordinated to convene at regular intervals, which is why we wrote this book, to be read, studied, perused or consulted wherever or whenever desired.

Addressing such a varied audience has its own difficulties (and rewards). Although no explicit math was used, it could not be avoided that an amount of mathematical thinking should pervade this book. Technical terms pertaining to parsing have of course been explained in the book, but sometimes a term on the fringe of the subject has been used without definition. Any reader who has ever attended a lecture on a non-familiar subject knows the phenomenon. He skips the term, assumes it refers to something reasonable and hopes it will not recur too often. And then there will be passages where the reader will think we are elaborating the obvious (this paragraph may be one such place). The reader may find solace in the fact that he does not have to doodle his time away or stare out of the window until the lecturer progresses.

On the positive side, and that is the main purpose of this enterprise, we hope that by means of a book with this approach we can reach those who were dimly aware of the existence and perhaps of the usefulness of parsing but who thought it would forever

1.1 PARSING AS A CRAFT

Parsing is no longer an arcane art; it has not been so since the early 70's when Aho, Ullman, Knuth and many others put various parsing techniques solidly on their theoretical feet. It need not be a mathematical discipline either; the inner workings of a parser can be visualized, understood and modified to fit the application, with not much more than cutting and pasting strings.

There is a considerable difference between a mathematician's view of the world and a computer-scientist's. To a mathematician all structures are static: they have always been and will always be; the only time dependence is that we just haven't discovered them all yet. The computer scientist is concerned with (and fascinated by) the continuous creation, combination, separation and destruction of structures: time is of the essence. In the hands of a mathematician, the Peano axioms create the integers without reference to time, but if a computer scientist uses them to implement integer addition, he finds they describe a very slow process, which is why he will be looking for a more efficient approach. In this respect the computer scientist has more in common with the physicist and the chemist; like these, he cannot do without a solid basis in several branches of applied mathematics, but, like these, he is willing (and often virtually obliged) to take on faith certain theorems handed to him by the mathematician. Without the rigor of mathematics all science would collapse, but not all inhabitants of a building need to know all the spars and girders that keep it upright. Factoring off certain detailed knowledge to specialists reduces the intellectual complexity of a task, which is one of the things computer science is about.
This is the vein in which this book is written: parsing for anybody who has parsing to do: the compiler writer, the linguist, the data-base interface writer, the geologist or musicologist who want to test grammatical descriptions of their respective objects of interest, and so on. We require a good ability to visualize, some programming experience and the willingness and patience to follow non-trivial examples; there is nothing better for understanding a kangaroo than seeing it jump. We treat, of course, the popular parsing techniques, but we will not shun some weird techniques that look as if they are of theoretical interest only: they often offer new insights and a reader might find an application for them.

1.2 THE APPROACH USED

This book addresses the reader at at least three different levels. The interested non-computer scientist can read the book as "the story of grammars and parsing"; he or she can skip the detailed explanations of the algorithms: each algorithm is first explained in general terms. The computer scientist will find much technical detail on a wide array of algorithms. To the expert we offer a systematic bibliography of over 400 entries, which is intended to cover all articles on parsing that have appeared in the readily available journals. Each entry is annotated, providing enough material for the reader to decide if the referred article is worth reading.

No ready-to-run algorithms have been given, except for the general context-free parser of Chapter 12. The formulation of a parsing algorithm with sufficient precision to enable a programmer to implement and run it without problems requires a considerable supporting mechanism that would be out of place in this book and in our experience does little to increase one's understanding of the process involved. The popular methods are given in algorithmic form in most books on compiler construction. The less widely used methods are almost always described in detail in the original publication, for which see Chapter 13.

1.3 OUTLINE OF THE CONTENTS

Since parsing is concerned with sentences and grammars and since grammars are themselves fairly complicated objects, ample attention is paid to them in Chapter 2. Chapter 3 discusses the principles behind parsing and gives a classification of parsing methods. In summary, parsing methods can be classified as top-down or bottom-up and as directional or non-directional; the directional methods can be further distinguished into deterministic and non-deterministic. This scheme dictates the contents of the next few chapters. In Chapter 4 we treat non-directional methods, including Unger and CYK. Chapter 5 forms an intermezzo with the treatment of finite-state automata, which are needed in the subsequent chapters. Chapters 6 through 9 are concerned with directional methods. Chapter 6 covers non-deterministic directional top-down parsers (recursive descent, Definite Clause Grammars), Chapter 7 non-deterministic directional bottom-up parsers (Earley). Deterministic methods are treated in Chapters 8 (top-down: LL in various forms) and 9 (bottom-up: LR, etc.). A combined deterministic/non-deterministic method (Tomita) is also described in Chapter 9. That completes the parsing methods per se. Error handling for a selected number of methods is treated in Chapter 10. The comparative survey of parsing methods in Chapter 11 summarizes the properties of the popular and some less popular methods.
Chapter 12 contains the full code in Pascal for a parser that will work for any context-free grammar, to lower the threshold for experimenting.

1.4 THE ANNOTATED BIBLIOGRAPHY

The annotated bibliography is presented in Chapter 13 and is an easily accessible supplement of the main body of the book. Rather than listing all publications in alphabetic order, it is divided into fourteen named sections, each concerned with a particular aspect of parsing; inside the sections, the publications are listed chronologically. An author index replaces the usual alphabetic list. The section name plus year of publication, placed in brackets, are used in the text to refer to an author's work. For instance, the annotated reference to Earley's publication of the Earley parser [CF 1970] can be found in the section CF at the position of the papers of 1970. Since the name of the first author is printed in bold letters, the actual reference is then easily located.

2 Grammars as a generating device

2.1 LANGUAGES AS INFINITE SETS

In computer science as in everyday parlance, a "grammar" serves to "describe" a "language". If taken on face value, this correspondence, however, is misleading, since the computer scientist and the naive speaker mean slightly different things by the three terms. To establish our terminology and to demarcate the universe of discourse, we shall examine the above terms, starting with the last one.

2.1.1 Language

To the larger part of mankind, language is first and foremost a means of communication, to be used almost unconsciously, certainly so in the heat of a debate. Communication is brought about by sending messages, through air vibrations or through written symbols. Upon a closer look the language messages ("utterances") fall apart into sentences, which are composed of words, which in turn consist of symbol sequences when written. Languages can differ on all these three levels of composition. The script can be slightly different, as between English and Irish, or very different, as between English and Chinese. Words tend to differ greatly and even in closely related languages people call un cheval or ein Pferd, that which is known to others as a horse. Differences in sentence structure are often underestimated; even the closely related Dutch often has an almost Shakespearean word order: "Ik geloof je niet", "I believe you not", and unrelated languages readily come up with constructions like the Hungarian "Pénzem van", "Money-my is", where the English say "I have money".

The computer scientist takes a very abstracted view of all this. Yes, a language has sentences, and these sentences possess structure; whether they communicate something or not is not his concern, but information may possibly be derived from their structure and then it is quite all right to call that information the meaning of the sentence. And yes, sentences consist of words, which he calls tokens, each possibly carrying a piece of information, which is its contribution to the meaning of the whole sentence. But no, words cannot be broken down any further. The computer scientist is not worried by this. With his love of telescoping solutions and multi-level techniques, he blithely claims that if words turn out to have structure after all, they are sentences in a different language, of which the letters are the tokens.
2.1.3.1 Infinite sets from finite descriptions

In fact there is nothing wrong with getting a single infinite set from a single finite description: "the set of all positive integers" is a very finite-size description of a definitely infinite-size set. Still, there is something disquieting about the idea, so we shall rephrase our question: "Can all languages be described by finite descriptions?". As the lead-up already suggests, the answer is "No", but the proof is far from trivial. It is, however, very interesting and famous, and it would be a shame not to present at least an outline of it here.

2.1.3.2 Descriptions can be enumerated

The proof is based on two observations and a trick. The first observation is that descriptions can be listed and given a number. This is done as follows. First, take all descriptions of size one, that is, those of only one letter long, and sort them alphabetically. This is the beginning of our list. Depending on what, exactly, we accept as a description, there may be zero descriptions of size one, or 27 (all letters + space), or 128 (all ASCII characters) or some such; this is immaterial to the discussion which follows. Second, we take all descriptions of size two, sort them alphabetically to give the second chunk on the list, and so on for lengths 3, 4 and further. This assigns a position on the list to each and every description. Our description "the set of all positive integers", for instance, is of size 32, not counting the quotation marks. To find its position on the list, we have to calculate how many descriptions there are with less than 32 characters, say L. We then have to generate all descriptions of size 32, sort them and determine the position of our description in it, say P, and add the two numbers L and P. This will, of course, give a huge number† but it does ensure that the description is on the list, in a well-defined position; see Figure 2.1.

[Figure 2.1 List of all descriptions of length 32 or less: the descriptions of sizes 1 through 31 occupy the first L positions; the description "the set of all positive integers" sits at position P within the block of size-32 descriptions.]

Two things should be pointed out here. The first is that just listing all descriptions alphabetically, without reference to their lengths, would not do: there are already infinitely many descriptions starting with an "a" and no description starting with a higher letter could get a number on the list. The second is that there is no need to actually do all this. It is just a thought experiment that allows us to examine and draw conclusions about the behaviour of a system in a situation which we cannot possibly examine physically. Also, there will be many nonsensical descriptions on the list; it will turn out that this is immaterial to the argument. The important thing is that all meaningful descriptions are on the list, and the above argument ensures that.

† Some (computer-assisted) calculations tell us that, under the ASCII-128 assumption, the number is 248 17168 89636 37891 49073 14874 06454 89259 38844 52556 26245 57755 89193 30291, or roughly 2.5×10^67.
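The numbering scheme just described is easy to mimic mechanically. The following fragment is our own minimal sketch, not taken from the book: it enumerates all strings over a small alphabet in the same shortest-first, then alphabetical order and reports the list position of a given description. The two-letter alphabet and the example string are arbitrary choices made for illustration only.

from itertools import count, product

ALPHABET = "ab"                        # any finite, ordered alphabet will do

def descriptions():
    # all strings over ALPHABET, shortest first, alphabetical within each length
    for length in count(0):
        for letters in product(ALPHABET, repeat=length):
            yield "".join(letters)

def position(description):
    # 1-based position of 'description' on the list (the L + P of the text)
    for index, candidate in enumerate(descriptions(), start=1):
        if candidate == description:
            return index

print(position("ba"))                  # 6: preceded by '', 'a', 'b', 'aa' and 'ab'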
2.1.3.3 Languages are infinite bit-strings

We know that words (sentences) in a language are composed of a finite set of symbols; this set is called quite reasonably the alphabet. We will assume that the symbols in the alphabet are ordered. Then the words in the language can be ordered too. We shall indicate the alphabet by Σ.

Now the simplest language that uses alphabet Σ is that which consists of all words that can be made by combining letters from the alphabet. For the alphabet Σ = {a, b} we get the language { , a, b, aa, ab, ba, bb, aaa, ...}. We shall call this language Σ*, for reasons to be explained later; for the moment it is just a name.

The set notation Σ* above started with "{ , a,", a remarkable construction; the first word in the language is the empty word, the word consisting of zero a's and zero b's. There is no reason to exclude it, but, if written down, it may easily get lost, so we shall write it as ε (epsilon), regardless of the alphabet. So, Σ* = {ε, a, b, aa, ab, ba, bb, aaa, ...}. In some natural languages, forms of the present tense of the verb "to be" are the empty word, giving rise to sentences of the form "I student"; Russian and Hebrew are examples of this.

Since the symbols in the alphabet Σ are ordered, we can list the words in the language Σ*, using the same technique as in the previous section: First all words of size zero, sorted; then all words of size one, sorted; and so on. This is actually the order already used in our set notation for Σ*.

The language Σ* has the interesting property that all languages using alphabet Σ are subsets of it. That means that, given another possibly less trivial language over Σ, called L, we can go through the list of words in Σ* and put ticks on all words that are in L. This will cover all words in L, since Σ* contains any possible word over Σ.

Suppose our language L is "the set of all words that contain more a's than b's". L = {a, aa, aab, aba, baa, ...}. The beginning of our list, with ticks, will look as follows:

ε
a     ✔
b
aa    ✔
ab
ba
bb
aaa   ✔
aab   ✔
aba   ✔
abb
baa   ✔
bab
bba
bbb
aaaa  ✔
...

Given the alphabet with its ordering, the list of blanks and ticks alone is entirely sufficient to identify and describe the language. For convenience we write the blank as a 0 and the tick as a 1 as if they were bits in a computer, and we can now write L = 0101000111010001... (and Σ* = 1111111111111111...). It should be noted that this is true for any language, be it a formal language like L, a programming language like Pascal or a natural language like English. In English, the 1's in the bit-string will be very scarce, since hardly any arbitrary sequence of words is a good English sentence (and hardly any arbitrary sequence of letters is a good English word, depending on whether we address the sentence/word level or the word/letter level).
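The bit-string view is easy to reproduce by machine for any language with a decidable membership test. The sketch below is our own illustration, not the book's: it reuses the length-then-alphabetical enumeration of Σ* and prints the first bits of L and of Σ* itself; the membership predicate for L is the only assumption made.

from itertools import count, product

SIGMA = "ab"

def words():
    # the words of Sigma*, shortest first, alphabetical within each length
    for length in count(0):
        for letters in product(SIGMA, repeat=length):
            yield "".join(letters)

def in_L(word):
    # membership test for L: "more a's than b's"
    return word.count("a") > word.count("b")

def bit_string(membership, n):
    # the first n bits of the language recognized by 'membership'
    gen = words()
    return "".join("1" if membership(next(gen)) else "0" for _ in range(n))

print(bit_string(in_L, 16))            # 0101000111010001
print(bit_string(lambda w: True, 16))  # 1111111111111111, i.e. Sigma* itself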
2.1.3.4 Diagonalization

The previous section attaches the infinite bit-string 0101000111010001... to the description "the set of all the words that contain more a's than b's". In the same vein we can attach such bit-strings to all descriptions; some descriptions may not yield a language, in which case we can attach an arbitrary infinite bit-string to it. Since all descriptions can be put on a single numbered list, we get, for instance, the following picture:

Description       Language
Description #1    000000100...
Description #2    110010001...
Description #3    011011010...
Description #4    110011010...
Description #5    100000011...
Description #6    111011011...
...               ...

At the left we have all descriptions, at the right all languages they describe. We now claim that many languages exist that are not on the list of languages above: the above list is far from complete, although the list of descriptions is complete.

We shall prove this by using the diagonalization process ("Diagonalverfahren") of Cantor. Consider the language C = 100110..., which has the property that its n-th bit is unequal to the n-th bit of the language described by Description #n. The first bit of C is a 1, because the first bit for Description #1 is a 0; the second bit of C is a 0, because the second bit for Description #2 is a 1, and so on. C is made by walking the NW to SE diagonal of the language field and copying the opposites of the bits we meet. The language C cannot be on the list! It cannot be on line 1, since its first bit differs (is made to differ, one should say) from that on line 1, and in general it cannot be on line n, since its n-th bit will differ from that on line n, by definition.

So, in spite of the fact that we have exhaustively listed all possible finite descriptions, we have at least one language that has no description on the list. Moreover, any broken diagonal yields such a language, where a diagonal is "broken" by replacing a

0. Name -> tom
   Name -> dick
   Name -> harry
1. Sentence -> Name
   Sentence -> List End
2. List -> Name
   List -> List , Name
3. , Name End -> and Name
4. the start symbol is Sentence

Figure 2.2 A finite recipe for generating strings in the t, d & h language

terminals left.

2.2 FORMAL GRAMMARS

The above recipe form, based on replacement according to rules, is strong enough to serve as a basis for formal grammars. Similar forms, often called "rewriting systems", have a long history among mathematicians, but the specific form of Figure 2.2 was first studied extensively by Chomsky [Misc 1959]. His analysis has been the foundation for almost all research and progress in formal languages, parsers and a considerable part of compiler construction and linguistics.

Since formal languages are a branch of mathematics, work in this field is done in a special notation which can be a hurdle to the uninitiated. To allow a small peep into the formal linguist's kitchen, we shall give the formal definition of a grammar and then explain why it describes a grammar like the one in Figure 2.2. The formalism used is indispensable for correctness proofs, etc., but not for understanding the principles; it is shown here only to give an impression and, perhaps, to bridge a gap.

Definition 2.1: A generative grammar is a 4-tuple (VN, VT, R, S) such that (1) VN and VT are finite sets of symbols, (2) VN ∩ VT = ∅, (3) R is a set of pairs (P, Q) such that (3a) P ∈ (VN ∪ VT)+ and (3b) Q ∈ (VN ∪ VT)*, and (4) S ∈ VN.

A 4-tuple is just an object consisting of 4 identifiable parts; they are the non-terminals, the terminals, the rules and the start symbol, in that order; the above definition does not tell this, so this is for the teacher to explain. The set of non-terminals is named VN and the set of terminals VT. For our grammar we have:

VN = {Name, Sentence, List, End}
VT = {tom, dick, harry, ,, and}

(note the , in the set of terminal symbols). The intersection of VN and VT (2) must be empty, that is, the non-terminals and the terminals may not have a symbol in common, which is understandable.

R is the set of all rules (3), and P and Q are the left-hand sides and right-hand sides, respectively. Each P must consist of sequences of one or more non-terminals and terminals and each Q must consist of sequences of zero or more non-terminals and terminals. For our grammar we have:

R = {(Name, tom), (Name, dick), (Name, harry),
     (Sentence, Name), (Sentence, List End), (List, Name), (List, List , Name),
     (, Name End, and Name)}

Note again the two different commas. The start symbol S must be an element of VN, that is, it must be a non-terminal:

S = Sentence

This concludes our field trip into formal linguistics; the reader can be assured that there is lots and lots more. A good simple introduction is written by Révész [Books 1985].

2.2.1 Generating sentences from a formal grammar

The grammar in Figure 2.2 is what is known as a phrase structure grammar for our t,d&h language (often abbreviated to PS grammar). There is a more compact notation, in which several right-hand sides for one and the same left-hand side are grouped together and then separated by vertical bars, |. This bar belongs to the formalism, just as the arrow ->, and can be read "or else". The right-hand sides separated by vertical bars are also called alternatives. In this more concise form our grammar becomes:

0. Name -> tom | dick | harry
1. SentenceS -> Name | List End
2. List -> Name | Name , List
3. , Name End -> and Name

where the non-terminal with the subscript S is the start symbol. (The subscript S identifies the symbol, not the rule.)

Now let's generate our initial example from this grammar, using replacement according to the above rules only. We obtain the following successive forms for Sentence:

Intermediate form          Rule used                   Explanation
Sentence                                               the start symbol
List End                   Sentence -> List End        rule 1
Name , List End            List -> Name , List         rule 2
Name , Name , List End     List -> Name , List         rule 2
Name , Name , Name End     List -> Name                rule 2
Name , Name and Name       , Name End -> and Name      rule 3
tom , dick and harry                                   rule 0, three times

The intermediate forms are called sentential forms; if a sentential form contains no non-terminals it is called a sentence and belongs to the generated language. The transitions from one line to the next are called production steps and the rules are often called production rules, for obvious reasons.

The production process can be made more visual by drawing connective lines between corresponding symbols, as shown in Figure 2.3. Such a picture is called a production graph or syntactic graph, because it depicts the syntactic structure (with regard to the given grammar) of the final sentence.

[Figure 2.3 Production graph for a sentence: the production of tom , dick and harry, with connective lines between corresponding symbols.]

We see that the production graph normally fans out downwards, but occasionally we may see starlike constructions, which result from rewriting a group of symbols.

It is patently impossible to have the grammar generate tom, dick, harry, since any attempt to produce more than one name will drag in an End and the only way to get rid of it again (and get rid of it we must, since it is a non-terminal) is to have it absorbed by rule 3, which will produce the and. We see, to our amazement, that we have succeeded in implementing the notion "must replace" in a system that only uses "may replace"; looking more closely, we see that we have split "must replace" into "may replace" and "must not be a non-terminal".

Apart from our standard example, the grammar will of course also produce many other sentences; examples are:

harry and tom
harry
tom, tom, tom and tom

and an infinity of others.
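The production process shown in the table above can be carried out mechanically. The following fragment is our own sketch, not code from the book: it encodes the compact form of the grammar of Figure 2.2 as pairs of symbol lists (the rule names 0a-3 anticipate the rule/alternative numbering used in Section 2.5.2) and replays one possible derivation of tom , dick and harry by always rewriting the leftmost occurrence of the chosen left-hand side.

# A tiny phrase-structure rewriting helper (a sketch; representation is ours).
RULES = {
    "0a": (["Name"], ["tom"]),
    "0b": (["Name"], ["dick"]),
    "0c": (["Name"], ["harry"]),
    "1a": (["Sentence"], ["Name"]),
    "1b": (["Sentence"], ["List", "End"]),
    "2a": (["List"], ["Name", ",", "List"]),
    "2b": (["List"], ["Name"]),
    "3":  ([",", "Name", "End"], ["and", "Name"]),
}

def apply_rule(form, rule):
    # replace the leftmost occurrence of the rule's left-hand side
    lhs, rhs = rule
    for i in range(len(form) - len(lhs) + 1):
        if form[i:i + len(lhs)] == lhs:
            return form[:i] + rhs + form[i + len(lhs):]
    raise ValueError("left-hand side does not occur in the sentential form")

form = ["Sentence"]
for name in ["1b", "2a", "2a", "2b", "3", "0a", "0b", "0c"]:
    form = apply_rule(form, RULES[name])
    print(" ".join(form))
# the last line printed is: tom , dick and harry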
A determined and foolhardy attempt to generate the incorrect form without the and will lead us to sentential forms like:

tom, dick, harry End

which are not sentences and to which no production rule applies. Such forms are called blind alleys. Note that production rules may not be applied in the reverse direction.

in which three symbols are replaced by two. By restricting this freedom, we obtain Type 1 grammars. Strangely enough there are two, intuitively completely different definitions of Type 1 grammars, which can be proved to be equivalent.

A grammar is Type 1 monotonic if it contains no rules in which the left-hand side consists of more symbols than the right-hand side. This forbids, for instance, the rule , N E -> and N.

A grammar is Type 1 context-sensitive if all of its rules are context-sensitive. A rule is context-sensitive if actually only one (non-terminal) symbol in its left-hand side gets replaced by other symbols, while we find the others back undamaged and in the same order in the right-hand side. Example:

Name Comma Name End -> Name and Name End

which tells that the rule

Comma -> and

may be applied if the left context is Name and the right context is Name End. The contexts themselves are not affected. The replacement must be at least one symbol long; this means that context-sensitive grammars are always monotonic; see Section 2.6.

Here is a monotonic grammar for our t,d&h example. In writing monotonic grammars one has to be careful never to produce more symbols than will be produced eventually. We avoid the need to delete the end-marker by incorporating it into the rightmost name.

Name -> tom | dick | harry
SentenceS -> Name | List
List -> EndName | Name , List
, EndName -> and Name

where EndName is a single symbol. And here is a context-sensitive grammar for it.

Name -> tom | dick | harry
SentenceS -> Name | List
List -> EndName | Name Comma List
Comma EndName -> and EndName        context is ... EndName
and EndName -> and Name             context is and ...
Comma -> ,

Note that we need an extra non-terminal Comma to be able to produce the terminal and in the correct context.

Monotonic and context-sensitive grammars are equally powerful: for each language that can be generated by a monotonic grammar a context-sensitive grammar exists that generates the same language, and vice versa. They are less powerful than the Type 0 grammars, that is, there are languages that can be generated by a Type 0 grammar but not by any Type 1. Strangely enough no simple examples of such languages are known. Although the difference between Type 0 and Type 1 is fundamental and is not just a whim of Mr. Chomsky, grammars for which the difference matters are too complicated to write down; only their existence can be proved (see e.g., Hopcroft and Ullman [Books 1979, pp. 183-184] or Révész [Books 1985, p. 98]).
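The monotonicity condition is easy to check mechanically for a grammar given as a list of rules. The snippet below is a small sketch of our own (not from the book), with rules written as (left-hand side, right-hand side) pairs of symbol lists.

def is_monotonic(rules):
    # Type 1 monotonic: no rule's left-hand side is longer than its right-hand side.
    return all(len(lhs) <= len(rhs) for lhs, rhs in rules)

MONOTONIC_TDH = [                      # the monotonic t,d&h grammar above
    (["Name"], ["tom"]), (["Name"], ["dick"]), (["Name"], ["harry"]),
    (["Sentence"], ["Name"]), (["Sentence"], ["List"]),
    (["List"], ["EndName"]), (["List"], ["Name", ",", "List"]),
    ([",", "EndName"], ["and", "Name"]),
]
TYPE_0_RULE = [([",", "Name", "End"], ["and", "Name"])]   # 3 symbols -> 2 symbols

print(is_monotonic(MONOTONIC_TDH))   # True
print(is_monotonic(TYPE_0_RULE))     # False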
Of course any Type 1 grammar is also a Type 0 grammar, since the class of Type 1 grammars is obtained from the class of Type 0 grammars by applying restrictions. But it would be confusing to call a Type 1 grammar a Type 0 grammar; it would be like calling a cat a mammal: correct but not informative enough. A grammar is named after the smallest class (that is, the highest type number) in which it will still fit. We saw that our t,d&h language, which was first generated by a Type 0 grammar, could also be generated by a Type 1 grammar. We shall see that there is also a Type 2 and a Type 3 grammar for it, but no Type 4 grammar. We therefore say that the t,d&h language is a Type 3 language, after the most restricted (and simple and amenable) grammar for it. Some corollaries of this are: A Type n language can be generated by a Type n grammar or anything stronger, but not by a weaker Type n+1 grammar; and: If a language is generated by a Type n grammar, that does not necessarily mean that there is no (weaker) Type n+1 grammar for it. The use of a Type 0 grammar for our t,d&h language was a serious case of overkill, just for demonstration purposes.

The standard example of a Type 1 language is the set of words that consist of equal numbers of a's, b's and c's, in that order:

a a . . . . a    b b . . . . b    c c . . . . c
  (n of them)      (n of them)      (n of them)

2.3.1.1 Constructing a Type 1 grammar

For the sake of completeness and to show how one writes a Type 1 grammar if one is clever enough, we shall now derive a grammar for this toy language. Starting with the simplest case, we have the rule

0. S -> abc

Having got one instance of S, we may want to prepend more a's to the beginning; if we want to remember how many there were, we shall have to append something to the end as well at the same time, and that cannot be a b or a c. We shall use a yet unknown symbol Q. The following rule pre- and postpends:

1. S -> abc | aSQ

If we apply this rule, for instance, three times, we get the sentential form

aaabcQQ

Now, to get aaabbbccc from this, each Q must be worth one b and one c, as was to be expected, but we cannot just write

Q -> bc

because that would allow b's after the first c. The above rule would, however, be all right if it were allowed to do replacement only between a b and a c; there, the newly inserted bc will do no harm:

2. bQc -> bbcc

Still, we cannot apply this rule since normally the Q's are to the right of the c; this can be remedied by allowing a Q to hop left over a c:

3. cQ -> Qc

We can now finish our derivation:

aaabcQQ      (3 times rule 1)
aaabQcQ      (rule 3)
aaabbccQ     (rule 2)
aaabbcQc     (rule 3)
aaabbQcc     (rule 3)
aaabbbccc    (rule 2)

It should be noted that the above derivation only shows that the grammar will produce the right strings, and the reader will still have to convince himself that it will not generate other and incorrect strings.

SS -> abc | aSQ
bQc -> bbcc
cQ -> Qc

Figure 2.6 Monotonic grammar for a^n b^n c^n

The grammar is summarized in Figure 2.6; since a derivation tree of a^3 b^3 c^3 is already rather unwieldy, a derivation tree for a^2 b^2 c^2 is given in Figure 2.7. The grammar is monotonic and therefore of Type 1; it can be proved that there is no Type 2 grammar for the language.

[Figure 2.7 Derivation of a^2 b^2 c^2]
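To gain some confidence that the grammar of Figure 2.6 produces only strings of the form a^n b^n c^n, one can enumerate its derivations mechanically up to a bounded length. The following sketch is our own illustration of such a brute-force experiment (it is not a proof); since the grammar is monotonic, bounding the length of the sentential forms is safe.

from collections import deque

RULES = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]

def sentences(max_len):
    # all terminal strings of length <= max_len derivable from S, breadth-first
    seen, queue, found = {"S"}, deque(["S"]), set()
    while queue:
        form = queue.popleft()
        for lhs, rhs in RULES:
            start = 0
            while (i := form.find(lhs, start)) != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                start = i + 1
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                    if "S" not in new and "Q" not in new:   # terminal symbols only
                        found.add(new)
    return found

for s in sorted(sentences(9), key=len):
    n = len(s) // 3
    assert s == "a" * n + "b" * n + "c" * n    # a^n b^n c^n indeed
    print(s)                                   # abc, aabbcc, aaabbbccc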
SentenceS -> Subject Verb Object
Subject -> NounPhrase
Object -> NounPhrase
NounPhrase -> the QualifiedNoun
QualifiedNoun -> Noun | Adjective QualifiedNoun
Noun -> castle | caterpillar | cats
Adjective -> well-read | white | wistful | ...
Verb -> admires | bark | criticize | ...

which produces sentences like:

the well-read cats criticize the wistful caterpillar

Since, however, no context is incorporated, it will equally well produce the incorrect

the cats admires the white well-read castle

For keeping context we could use a phrase structure grammar (for a simpler language):

SentenceS -> Noun Number Verb
Number -> Singular | Plural
Noun Singular -> castle Singular | caterpillar Singular | ...
Singular Verb -> Singular admires | ...
Singular -> ε
Noun Plural -> cats Plural | ...
Plural Verb -> Plural bark | Plural criticize | ...
Plural -> ε

where the markers Singular and Plural control the production of actual English words. Still this grammar allows the cats to bark.... For a better way to handle context, see the section on van Wijngaarden grammars (2.4.1).

The bulk of examples of CF grammars originate from programming languages. Sentences in these languages (that is, programs) have to be processed automatically (that is, by a compiler) and it was soon recognized (around 1958) that this is a lot easier if the language has a well-defined formal grammar. The syntaxes of almost all programming languages in use today are defined through a formal grammar.†

† COBOL and FORTRAN also have grammars but theirs are informal and descriptive, and were never intended to be generative.

Some authors (for instance, Chomsky) and some parsing algorithms require a CF grammar to be monotonic. The only way a CF rule can be non-monotonic is by having an empty right-hand side; such a rule is called an ε-rule and a grammar that contains no such rules is called ε-free. The requirement of being ε-free is not a real restriction, just a nuisance. Any CF grammar can be made ε-free by systematic substitution of the ε-rules (this process will be explained in detail in 4.2.3.1), but this in general does not improve the appearance of the grammar. The issue will be discussed further in Section 2.6.

2.3.2.1 Backus-Naur Form

There are several different styles of notation for CF grammars for programming languages, each with endless variants; they are all functionally equivalent. We shall show two main styles here. The first is Backus-Naur Form (BNF) which was first used for defining ALGOL 60. Here is a sample:

<name>::= tom | dick | harry
<sentence>S::= <name> | <list> and <name>
<list>::= <name>, <list> | <name>

This form's main properties are the use of angle brackets to enclose non-terminals and of ::= for "may produce". In some variants, the rules are terminated by a semicolon.

2.3.2.2 van Wijngaarden form

The second style is that of the CF van Wijngaarden grammars. Again a sample:

name: tom symbol; dick symbol; harry symbol.
sentenceS: name; list, and symbol, name.
list: name, comma symbol, list; name.

The names of terminal symbols end in ...symbol; their representations are hardware-dependent and are not defined in the grammar. Rules are properly terminated (with a period). Punctuation is used more or less in the traditional way; for instance, the comma binds tighter than the semicolon. The punctuation can be read as follows:

:  "is defined as a(n)"
;  ", or as a(n)"
,  "followed by a(n)"
.  ", and as nothing else."

The second rule in the above grammar would be read as: "a sentence is defined as a name, or as a list followed by an and-symbol followed by a name, and as nothing else." Although this notation achieves its full power only when applied in the two-level van Wijngaarden grammars, it also has its merits on its own: it is formal and still quite readable.

2.3.2.3 Extended CF grammars

CF grammars are often made both more compact and more readable by introducing special short-hands for frequently used constructions. If we return to the Book grammar of Figure 2.9, we see that rules like:

SomethingSequence -> Something | Something SomethingSequence

occur repeatedly. In an extended context-free grammar (ECF grammar), we can write Something+ meaning "one or more Somethings" and we do not need to give a rule for Something+; the rule

Something+ -> Something | Something Something+

is implicit. Likewise we can use Something* for "zero or more Somethings" and Something? for "zero or one Something" (that is, "optionally a Something"). In these examples, the operators +, * and ? work on the preceding symbol; their range can be extended by using parentheses: (Something ;)? means "optionally a Something-followed-by-a-;". These facilities are very useful and allow the Book grammar to be written more efficiently (Figure 2.10). Some styles even allow constructions like Something+4 meaning "one or more Somethings with a maximum of 4" or Something+, meaning "one or more Somethings separated by commas"; this seems to be a case of overdoing a good thing.

BookS -> Preface Chapter+ Conclusion
Preface -> "PREFACE" Paragraph+
Chapter -> "CHAPTER" Number Paragraph+
Paragraph -> Sentence+
Sentence -> ...
...
Conclusion -> "CONCLUSION" Paragraph+

Figure 2.10 An extended CF grammar of a book

The extensions of an ECF grammar do not increase its expressive powers: all implicit rules can be made explicit and then a normal CF grammar results. Their strength lies in their user-friendliness. The star in the notation X* with the meaning "a sequence of zero or more X's" is called the Kleene star. If X is a set, X* should be read as "a sequence of zero or more elements of X"; it is the same star that we saw in Σ* in Section 2.1.3.3. Forms involving the repetition operators *, + or ? and possibly the separators ( and ) are called regular expressions. ECF's, which have regular expressions for their right-hand sides, are for that reason sometimes called regular right part grammars (RRP grammars), which is more descriptive than "extended context free", but which is perceived to be a tongue twister by some.

There are two different schools of thought about the structural meaning of a regular right-hand side. One school maintains that a rule like:

Book -> Preface Chapter+ Conclusion

is an abbreviation of

Book -> Preface α Conclusion
α -> Chapter | Chapter α

as shown above. This is the "(right)recursive" interpretation; it has the advantage that it is easy to explain and that the transformation to "normal" CF is simple. Disadvantages are that the transformation entails anonymous rules (identified by α here) and that the lopsided parse tree for, for instance, a book of four chapters does not correspond to our idea of the structure of the book; see Figure 2.11. The second school claims that
[Figure 2.14 Production chain for a regular (Type 3) grammar]

most common one is the use of square brackets to indicate "one out of a set of characters": [tdh] is an abbreviation for t|d|h:

SS -> [tdh] | L
L -> [tdh] T
T -> , L | & [tdh]

which may look more cryptic at first but is actually much more convenient and in fact allows simplification of the grammar to

SS -> [tdh] | L
L -> [tdh] , L | [tdh] & [tdh]

A second way is to allow macros, names for pieces of the grammar that are substituted properly into the grammar before it is used:

Name -> t | d | h
SS -> $Name | L
L -> $Name , L | $Name & $Name

The popular parser generator for regular grammars lex (designed and written by Lesk and Schmidt [FS 1975]) features both facilities. Note that if we adhere to the Chomsky definition of Type 3, our grammar will not get smaller than:

SS -> t | d | h | tM | dM | hM
M -> ,N | &P
N -> tM | dM | hM
P -> t | d | h

This form is evidently easier to process but less user-friendly than the lex version. We observe here that while the formal-linguist is interested in and helped by minimally sufficient means, the computer scientist values a form in which the concepts underlying the grammar ($Name, etc.) are easily expressed, at the expense of additional processing.

There are two interesting observations about regular grammars which we want to make here. First, when we use a RE grammar for generating a sentence, the sentential forms will only contain one non-terminal and this will always be at the end; that's where it all happens (using the grammar of Figure 2.13):

SentenceS
List
t ListTail
t , List
t , d ListTail
t , d & h

The second observation is that all regular grammars can be reduced considerably in size by using the regular expression operators *, + and ? introduced in Section 2.3.2 for "zero or more", "one or more" and "optionally one", respectively. Using these operators and ( and ) for grouping, we can simplify our grammar to:

SS -> (( [tdh], )* [tdh]& )? [tdh]

Here the parentheses serve to demarcate the operands of the * and ? operators. Regular expressions exist for all Type 3 grammars. Note that the * and the + work on what precedes them; to distinguish them from the normal multiplication and addition operators, they are often printed higher than the level text in print, but in computer input they are in line with the rest.
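The regular expression given above can be tried out directly with an off-the-shelf regular-expression engine. The fragment below is our own transliteration into Python's re syntax; the compressed spacing and the use of re.fullmatch are our choices, not the book's.

import re

# (( [tdh], )* [tdh]& )? [tdh]  written in Python's regular-expression syntax;
# the terminals here are the single letters t, d and h.
TDH = re.compile(r"(([tdh],)*[tdh]&)?[tdh]")

for sentence in ["t", "t,d&h", "d,h&h", "t,d,h", "t&"]:
    print(sentence, bool(TDH.fullmatch(sentence)))
# t      True
# t,d&h  True
# d,h&h  True
# t,d,h  False   (a list must end with "& name")
# t&     False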
2.3.4 Type 4 grammars

The last restriction we shall apply to what is allowed in a production rule is a pretty final one: no non-terminal is allowed in the right-hand side. This removes all the generative power from the mechanism, except for the choosing of alternatives. The start symbol has a (finite) list of alternatives from which we are allowed to choose; this is reflected in the name finite-choice grammar (FC grammar).

There is no FC grammar for our t,d&h language; if, however, we are willing to restrict ourselves to lists of names of a finite length (say, no more than a hundred), then there is one, since one could enumerate all combinations. For the obvious limit of three names, we get:

SS -> [tdh] | [tdh] & [tdh] | [tdh] , [tdh] & [tdh]

for a total of 3 + 3×3 + 3×3×3 = 39 production rules.

FC grammars are not part of the official Chomsky hierarchy, that is, they are not identified by Chomsky. They are nevertheless very useful and are often required as a tail-piece in some process or reasoning. The set of reserved words (keywords) in a programming language can be described by a FC grammar. Although not many grammars are FC in their entirety, some of the rules in many grammars are finite-choice. E.g., the first rule of our first grammar (Figure 2.2) was FC. Another example of a FC rule was the macro introduced in Section 2.3.3; we do not need the macro mechanism if we change

zero or more terminals

in the definition of a regular grammar to

zero or more terminals or FC non-terminals

In the end, the FC non-terminals will only introduce a finite number of terminals.

2.4 VW GRAMMARS

2.4.1 The human inadequacy of CS and PS grammars

In the preceding paragraphs we have witnessed the introduction of a hierarchy of grammar types: phrase structure, context-sensitive, context-free, regular and finite-choice. Although each of the boundaries between the types is clear-cut, some boundaries are more important than others. Two boundaries specifically stand out: that between context-sensitive and context-free and that between regular (finite-state) and finite-choice; the significance of the latter is trivial, being the difference between productive and non-productive, but the former is profound.

The border between CS and CF is that between global correlation and local independence. Once a non-terminal has been produced in a sentential form in a CF grammar, its further development is independent of the rest of the sentential form; a non-terminal in a sentential form of a CS grammar has to look at its neighbours on the left and on the right, to see what production rules are allowed for it. The local production independence in CF grammars means that certain long-range correlations cannot be expressed by them. Such correlations are, however, often very interesting, since they embody fundamental properties of the input text, like the consistent use of variables in a program or the recurrence of a theme in a musical composition. When we describe such input through a CF grammar we cannot enforce the proper correlations; one (often-used) way out is to settle for the CF grammar, accept the parsing it produces and then check the proper correlations with a separate program. This is, however, quite unsatisfactory since it defeats the purpose of having a grammar, that is, having a concise and formal description of all the properties of the input.

The obvious solution would seem to be the use of a CS grammar to express the correlations (= the context-sensitivity) but here we run into another, non-fundamental but very practical problem: CS grammars can express the proper correlations but not in a way a human can understand. It is in this respect instructive to compare the CF grammars in Section 2.3.2 to the one CS grammar we have seen that really expresses a context-dependency, the grammar for a^n b^n c^n in Figure 2.6. The grammar for the contents of a book (Figure 2.9) immediately suggests the form of the book, but the

ci: c symbol.
cii: c symbol, ci.
ciii: c symbol, cii.
...

Here the i's count the number of a's, b's and c's. Next we introduce a special kind of name called a metanotion. Rather than being capable of producing (part of) a sentence in the language, it is capable of producing (part of) a name in a grammar rule. In our example we want to catch the repetitions of i's in a metanotion N, for which we give a context-free production rule (a metarule):

N :: i ; i N .
Note that we use a slightly different notation for metarules: left-hand side and right-hand side are separated by a double colon (::) rather than by a single colon and members are separated by a blank ( ) rather than by a comma. The metanotion N produces i, ii, iii, etc., which are exactly the parts of the non-terminal names we need. We can use the production rules of N to collapse the four infinite groups of rules into four finite rule templates called hyper-rules.

textS: a N, b N, c N.

a i: a symbol.
a i N: a symbol, a N.

b i: b symbol.
b i N: b symbol, b N.

c i: c symbol.
c i N: c symbol, c N.

Each original rule can be obtained from one of the hyper-rules by substituting a production of N from the metarules for each occurrence of N in that hyper-rule, provided that the same production of N is used consistently throughout. To distinguish them from normal names, these half-finished combinations of small letters and metanotions (like a N or b i N) are called hypernotions. Substituting, for instance, N=iii in the hyper-rule

b i N: b symbol, b N.

yields the CF rule for the CF non-terminal biiii

biiii: b symbol, biii.

We can also use this technique to condense the finite parts of a grammar by having a metarule A for the symbols a, b and c. Again the rules of the game require that the metanotion A be replaced consistently. The final result is shown in Figure 2.17.

N :: i ; i N .
A :: a ; b ; c .

textS: a N, b N, c N.
A i: A symbol.
A i N: A symbol, A N.

Figure 2.17 A VW grammar for the language a^n b^n c^n

This grammar gives a clear indication of the language it describes: once the "value" of the metanotion N is chosen, production is straightforward. It is now trivial to extend the grammar to a^n b^n c^n d^n. It is also clear how long-range relations are established without having confusing messengers in the sentential form: they are established before they become long-range, through consistent substitution of metanotions in simple right-hand sides. The "consistent substitution rule" for metanotions is essential to the two-level mechanism; without it, VW grammars would be equivalent to CF grammars (Meersman and Rozenberg [VW 1978]).

A very good and detailed explanation of VW grammars has been written by Craig Cleaveland and Uzgalis [VW 1977], who also show many applications. Sintzoff [VW 1967] has proved that VW grammars are as powerful as PS grammars, which also shows that adding a third level to the building cannot increase its powers. Van Wijngaarden [VW 1974] has shown that the metagrammar need only be regular (although simpler grammars may be possible if it is allowed to be CF).
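The effect of the consistent substitution rule can be demonstrated by expanding the hyper-rules of Figure 2.17 into ordinary CF rules for a few values of the metanotions. The sketch below is our own illustration, not part of the book; the template representation with {N} and {A} placeholders and the small bound on N are arbitrary choices made for the demonstration.

# Expand the VW grammar of Figure 2.17 into CF rules for small values of N.
HYPER_RULES = [
    ("text: a{N}, b{N}, c{N}.",      {"N"}),
    ("{A}i: {A} symbol.",            {"A"}),
    ("{A}i{N}: {A} symbol, {A}{N}.", {"A", "N"}),
]

def productions_of_N(limit):
    return ["i" * k for k in range(1, limit + 1)]      # i, ii, iii, ...

def expand(limit=3):
    rules = []
    for template, metanotions in HYPER_RULES:
        n_values = productions_of_N(limit) if "N" in metanotions else [""]
        a_values = ["a", "b", "c"] if "A" in metanotions else [""]
        for n in n_values:                 # the same production of N ...
            for a in a_values:             # ... and of A used consistently per rule
                rules.append(template.format(N=n, A=a))
    return rules

for rule in expand(2):
    print(rule)
# text: ai, bi, ci.
# text: aii, bii, cii.
# ai: a symbol.
# ...
# cii: c symbol, ci.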
Figure 2.18 A grammar handling an infinite alphabet 2.4.4 BNF notation for VW grammars There is a different notation for VW grammars, sometimes used in formal language theory (for instance, Greibach [VW 1974]), which derives from the BNF notation (see Section 2.3.2.1). A BNF form of our grammar from Figure 2.17 is given in Figure 2.19; hypernotions are demarcated by angle brackets and terminal symbols are represented 46 Grammars as a generating device [Ch. 2 by themselves. N -> i | i N A -> a | b | c <text>S -> <aN> <bN> <cN> <Ai> -> A <AiN> -> A <AN> Figure 2.19 The VW grammar of Figure 2.17 in BNF notation 2.4.5 Affix grammars Like VW grammars, affix grammars establish long-range relations by duplicating information in an early stage; this information is, however, not part of the non-terminal name, but is passed as an independent parameter, an affix, which can, for instance, be an integer value. Normally these affixes are passed on to the members of a rule, until they are passed to a special kind of non-terminal, a primitive predicate. Rather than producing text, a primitive predicate contains a legality test. For a sentential form to be legal, all the legality tests in it have to succeed. The affix mechanism is equivalent to the VW metanotion mechanism, is slightly easier to handle while parsing and slightly more difficult to use when writing a grammar. An affix grammar for a nb nc n is given in Figure 2.20. The first two lines are affix definitions for N, M, A and B. Affixes in grammar rules are traditionally preceded by a +. The names of the primitive predicates start with where. To produce abc, start with text + 1; this produces list + 1 + a, list + 1 + b, list + 1 + c The second member of this, for instance, produces letter + b, where is decreased + 0 + 1, list + 0 + b the first member of which produces where is + b + b, b symbol. All the primitive predicates in the above are fulfilled, which makes the final sentence legal. An attempt to let letter + b produce a symbol introduces the primitive predicate where is + a + b which fails, invalidating the sentential form. Affix grammars have largely been replaced by attribute grammars, which achieve roughly the same effect through similar but conceptually different means (see Section 2.9.1). Sec. 2.5] Actually generating sentences from a grammar 49 formal-linguist says “It is undecidable whether a PS grammar produces the empty set”, which means that there cannot be an algorithm that will for every PS grammar correctly tell if the grammar produces at least one sentence. This does not mean that we cannot prove for some given grammar that it generates nothing, if that is the case, only that the proof method used will not work for all grammars: we could have a program that correctly says Yes in finite time if the answer is Yes but that takes infinite time if the answer is No; in fact, our generating procedure above is such an algorithm that gives the correct Yes/No answer in infinite time (although we can have an algorithm that gives a Yes/Don’t know answer in finite time). Although it is true that because of some deep theorem in formal linguistics we cannot always get exactly the answer we want, this does not prevent us from obtaining all kinds of useful information that gets close. We shall see that this is a recurring phenomenon. The computer scientist is aware of but not daunted by the impossibilities from formal linguistics. 
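The Yes/Don't know algorithm mentioned above is easy to sketch in code. The following is a minimal Python rendering of ours (the grammar encoding, the single-character symbols and the cut-off bound are our own choices, not the book's): it enumerates sentential forms breadth-first and answers Yes as soon as an all-terminal form turns up; when its budget is exhausted it answers Don't know, since the exact answer may require infinite time.

    from collections import deque

    def produces_a_sentence(rules, start, terminals, max_forms=10000):
        # Breadth-first enumeration of sentential forms of a PS grammar.
        # rules: (lhs, rhs) pairs of strings; one production step replaces one
        # occurrence of lhs in the current sentential form by rhs.
        # Answers "Yes" if an all-terminal form is found within the budget,
        # "Don't know" otherwise: the exact answer may need infinite time.
        seen = {start}
        queue = deque([start])
        while queue and len(seen) <= max_forms:
            form = queue.popleft()
            if all(sym in terminals for sym in form):
                return "Yes"                      # a sentence has turned up
            for lhs, rhs in rules:
                pos = form.find(lhs)
                while pos >= 0:                   # apply the rule at every position
                    new_form = form[:pos] + rhs + form[pos + len(lhs):]
                    if new_form not in seen:
                        seen.add(new_form)
                        queue.append(new_form)
                    pos = form.find(lhs, pos + 1)
        return "Don't know"

    # The monotonic grammar for a^n b^n c^n of Figure 2.6:
    rules = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]
    print(produces_a_sentence(rules, "S", set("abc")))     # prints: Yes

Run on the grammar of Figure 2.6 it answers Yes almost immediately; on a grammar that produces nothing it will never answer Yes.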
The second remark is that when we do get sentences from the above production process, they may be produced in an unpredictable order. For non-monotonic grammars the sentential forms may grow for a while and then suddenly shrink again, perhaps to the empty string. Formal linguistics says that there cannot be an algorithm that for all PS grammars will produce their sentences in increasing (actually “non-decreasing”) length. The production of all sentences from a van Wijngaarden grammar poses a special problem in that there are effectively infinitely many left-hand sides to match with. For a technique to solve this problem, see Grune [VW 1984]. 2.5.2 The CF case When we generate sentences from a CF grammar, many things are a lot simpler. It can still happen that our grammar will never produce a sentence, but now we can test for that beforehand, as follows. First scan the grammar to find all non-terminals that have a right-hand side that contains terminals only or is empty. These non-terminals are guaranteed to produce something. Now scan again to find non-terminals that have a right-hand side that consists of only terminals and non-terminals that are guaranteed to produce something. This will give us new non-terminals that are guaranteed to produce something. Repeat this until we find no more new such non-terminals. If we have not met the start symbol this way, it will not produce anything. Furthermore we have seen that if the grammar is CF, we can afford to just rewrite the left-most non-terminal every time (provided we rewrite it into all its alternatives). Of course we can also consistently rewrite the right-most non-terminal; both approaches are similar but different. Using the grammar 0. N -> t | d | h 1. SS -> N | L & N 2. L -> N , L | N let us follow the adventures of the sentential form that will eventually result in d,h&h. Although it will go several times up and down the production queue, we only depict here what changes are made to it. We show the sentential forms for left-most and right-most substitution, with the rules and alternatives involved; for instance, (1b) means rule 1 alternative b. 50 Grammars as a generating device [Ch. 2 S S 1b 1b L&N L&N 2a 0c N,L&N L&h 0b 2a d,L&N N,L&h 2b 2b d,N&N N,N&h 0c 0c d,h&N N,h&h 0c 0b d,h&h d,h&h The sequences of production rules used are not as similar as we would expect; of course, in grand total the same rules and alternatives are applied but the sequences are neither equal nor each other’s mirror image, nor is there any other obvious relationship. Still both define the same production tree: S L N N L N d , h & h but if we number the non-terminals in it in the order they were rewritten, we would get different numberings: S L N N L N d , h & h 1 2 6 3 4 5 Left-most derivation order S L N N L N d , h & h 1 3 2 6 4 5 Right-most derivation order The sequence of production rules used in left-most rewriting is called the left-most derivation of a sentence. We do not have to indicate where each rule must be applied Sec. 2.5] Actually generating sentences from a grammar 51 and need not even give its rule number; both are implicit in the left-most substitution. A right-most derivation is defined in the obvious way. The production sequence S → L&N → N,L&N → d,L&N → d,N&N → d,h&N → d,h&h can be abbreviated to S →*l d,h&h. Likewise, the sequence S → L&N → L&h → N,L&h → N,N&h → N,h&h → d,h&h can be abbreviated to S →*r d,h&h. The fact that S produces d,h&h in any way is written as S →* d,h&h. 
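The left-most rewriting process itself is easy to mechanize. The sketch below is a minimal Python rendering of ours (the dictionary encoding and the rule labels follow the grammar above): it always rewrites the left-most non-terminal, tries every alternative breadth-first, and records which rule and alternative it used, so that the sentence d,h&h comes back paired with its left-most derivation.

    from collections import deque

    grammar = {                       # the grammar used above
        "S": [("1a", ["N"]), ("1b", ["L", "&", "N"])],
        "L": [("2a", ["N", ",", "L"]), ("2b", ["N"])],
        "N": [("0a", ["t"]), ("0b", ["d"]), ("0c", ["h"])],
    }

    def leftmost_derivation(target, start="S", limit=100000):
        # Search for a left-most derivation of target (a list of terminals).
        queue = deque([([start], [])])        # (sentential form, rules used so far)
        steps = 0
        while queue and steps < limit:
            steps += 1
            form, used = queue.popleft()
            if len(form) > len(target):       # this grammar never shrinks: prune
                continue
            nt = next((i for i, s in enumerate(form) if s in grammar), None)
            if nt is None:                    # all terminals: is it the target?
                if form == target:
                    return used
                continue
            for label, rhs in grammar[form[nt]]:
                queue.append((form[:nt] + rhs + form[nt + 1:], used + [label]))
        return None

    print(leftmost_derivation(list("d,h&h")))
    # prints: ['1b', '2a', '0b', '2b', '0c', '0c'], the left-most column above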
The task of parsing is to reconstruct the parse tree (or graph) for a given input string, but some of the most efficient parsing techniques can be understood more easily if viewed as attempts to reconstruct a left- or right-most derivation of the input string; the parse tree then follows automatically. This is why the notion “[left|right]-most derivation” will occur frequently in this book (note the FC grammar used here). 2.6 TO SHRINK OR NOT TO SHRINK In the previous paragraphs, we have sometimes been explicit as to the question if a right-hand side of a rule may be shorter than its left-hand side and sometimes we have been vague. Type 0 rules may definitely be of the shrinking variety, monotonic rules definitely may not, and Type 2 and 3 rules can shrink only by producing empty (ε), that much is sure. The original Chomsky hierarchy [Misc 1959] was very firm on the subject: only Type 0 rules are allowed to make a sentential form shrink. Type 1 to 3 rules are all monotonic. Moreover, Type 1 rules have to be of the context-sensitive variety, which means that only one of the non-terminals in the left-hand side is actually allowed to be replaced (and then not by ε). This makes for a proper hierarchy in which each next class is a proper subset of its parent and in which all derivation graphs except for those of Type 0 grammars are actually derivation trees. As an example consider the grammar for the language a nb nc n given in Figure 2.6: 1. SS -> abc | aSQ 2. bQc -> bbcc 3. cQ -> Qc which is monotonic but not context-sensitive in the strict sense. It can be made CS by expanding the offending rule 3 and introducing a non-terminal for c: 1. SS -> abC | aSQ 2. bQC -> bbCC 3a. CQ -> CX 3b. CX -> QX 3c. QX -> QC 4. C -> c Now the production graph of Figure 2.7 turns into a production tree: 54 Grammars as a generating device [Ch. 2 2.7 A CHARACTERIZATION OF THE LIMITATIONS OF CF AND FS GRAMMARS When one has been working for a while with CF grammars, one gradually gets the feel- ing that almost anything could be expressed in a CF grammar. That there are, however, serious limitations to what can be said by a CF grammar is shown by the famous uvwxy theorem, which is explained below. 2.7.1 The uvwxy theorem When we have obtained a sentence from a CF grammar, we may look at each (termi- nal) symbol in it, and ask: How did it get here? Then, looking at the production tree, we see that it was produced as, say, the n-th member of the right-hand side of rule number m. The left-hand side of this rule, the parent of our symbol, was again produced as the p-th member of rule q, and so on, until we reach the start symbol. We can, in a sense, trace the lineage of the symbol in this way. If all rule/member pairs in the lineage of a symbol are different, we call the symbol original, and if all the symbols in a sentence are original, we call the sentence “original”. Now there is only a finite number of ways for a given symbol to be original. This is easy to see as follows. All rule/member pairs in the lineage of an original symbol must be different, so the length of its lineage can never be more than the total number of different rule/member pairs in the grammar. There are only so many of these, which yields only a finite number of combinations of rule/member pairs of this length or shorter. 
In theory the number of original lineages of a symbol can be very large, but in practice it is very small: if there are more than, say, ten ways to produce a given symbol from a grammar by original lineage, your grammar will be very convoluted!

This puts severe restrictions on original sentences. If a symbol occurs twice in an original sentence, both its lineages must be different: if they were the same, they would describe the same symbol in the same place. This means that there is a maximum length to original sentences: the sum of the numbers of original lineages of all symbols. For the average grammar of a programming language this length is in the order of some thousands of symbols, i.e., roughly the size of the grammar. So, since there is a longest original sentence, there can only be a finite number of original sentences, and we arrive at the surprising conclusion that any CF grammar produces a finite-size kernel of original sentences and (probably) an infinite number of unoriginal sentences!

Figure 2.23 An unoriginal sentence: uvwxy (a production tree in which the non-terminal A occurs twice on one path: the outer A produces vwx, the inner A produces w, and u and y surround them)

What do "unoriginal" sentences look like? This is where we come to the uvwxy theorem. An unoriginal sentence has the property that it contains at least one symbol in the lineage of which a repetition occurs. Suppose that symbol is a q and the repeated rule is A. We can then draw a picture similar to Figure 2.23, where w is the part produced by the most recent application of A, vwx the part produced by the other application of A and uvwxy is the entire unoriginal sentence. Now we can immediately find another unoriginal sentence, by removing the smaller triangle headed by A and replacing it by a copy of the larger triangle headed by A; see Figure 2.24.

Figure 2.24 Another unoriginal sentence, uv²wx²y

This new tree produces the sentence uvvwxxy and it is easy to see that we can, in this way, construct a complete family of sentences uvⁿwxⁿy for all n≥0; the w is nested in a number of v and x brackets, in an indifferent context of u and y.

The bottom line is that when we examine longer and longer sentences in a context-free language, the original sentences become exhausted and we meet only families of closely related sentences telescoping off into infinity. This is summarized in the uvwxy theorem: any sentence generated by a CF grammar, that is longer than the longest original sentence from that grammar, can be cut into five pieces u, v, w, x and y in such a way that uvⁿwxⁿy is a sentence from that grammar for all n≥0. The uvwxy theorem has several variants; it is also called the pumping lemma for context-free languages.

Two remarks must be made here. The first is that if a language keeps on being original in longer and longer sentences without reducing to families of nested sentences, there cannot be a CF grammar for it.
We have already encountered the context-sensitive language aⁿbⁿcⁿ and it is easy to see (but not quite so easy to prove!) that it does not decay into such nested sentences, as sentences get longer and longer. Consequently, there is no CF grammar for it.

The second is that the longest original sentence is a property of the grammar, not of the language. By making a more complicated grammar for a language we can increase the set of original sentences and push away the border beyond which we are forced to resort to nesting. If we make the grammar infinitely complicated, we can push the border to infinity and obtain a phrase structure language from it. How we can make a CF grammar infinitely complicated, is described in the Section on two-level grammars, 2.4.

2.7.2 The uvw theorem

Figure 2.25 Repeated occurrence of A may result in repeated occurrence of v (the sentential forms of a FS production process, listed one by one, with A appearing twice)

A simpler form of the uvwxy theorem applies to regular (Type 3) languages. We have seen that the sentential forms occurring in the production process for a FS grammar all contain only one non-terminal, which occurs at the end. During the production of a very long sentence, one or more non-terminals must occur two or more times, since there are only a finite number of non-terminals. Figure 2.25 shows what we see, when we list the sentential forms one by one; the substring v has been produced from one occurrence of A to the next, u is a sequence that allows us to reach A, and w is a sequence that allows us to terminate the production process. It will be clear that, starting from the second A, we could have followed the same path as from the first A, and thus have produced uvvw. This leads us to the uvw theorem, or the pumping lemma for regular languages: any sufficiently long string from a regular language can be cut into three pieces u, v and w, so that uvⁿw is a string in the language for all n≥0.

2.8 HYGIENE IN GRAMMARS

Although the only requirement for a CF grammar is that there is exactly one non-terminal in the left-hand sides of all its rules, such a general grammar can suffer from a (small) number of ailments.

2.8.1 Undefined non-terminals

The right-hand sides of some rules may contain non-terminals for which no production rule is given. Remarkably, this does not seriously affect the sentence generation process described in 2.5.2: if a sentential form containing an undefined non-terminal turns up for processing in a left-most production process, there will be no match, and the sentential form is a blind alley and will be discarded. The rule with the right-hand side containing the undefined non-terminal will never have issue and can indeed be removed from the grammar. (If we do this, we may of course remove the last definition of another non-terminal, which will then in turn become undefined, etc.) From a theoretical point of view there is nothing wrong with an undefined non-terminal, but if a user-specified grammar contains one, there is almost certainly an error, and any grammar-processing program should mark such an occurrence as an error.
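Such a check is easily programmed. The sketch below is a small Python rendering of ours (the grammar representation and the convention that non-terminals start with an upper-case letter are our assumptions, not the book's); it collects every non-terminal that occurs in some right-hand side and reports those for which no rule is given.

    def undefined_nonterminals(grammar):
        # grammar: dict mapping each defined non-terminal to a list of
        # right-hand sides, each a list of symbols. A symbol is taken to be
        # a non-terminal exactly when it starts with an upper-case letter
        # (a convention of this sketch only).
        used = {sym
                for rhss in grammar.values()
                for rhs in rhss
                for sym in rhs
                if sym[:1].isupper()}
        return used - set(grammar)

    demo = {
        "S":    [["Noun", "Verb"]],
        "Noun": [["cat"], ["dog"]],
        # "Verb" is used but never defined: almost certainly an error
    }
    print(undefined_nonterminals(demo))   # prints: {'Verb'}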
This way the attribute values (semantics) percolate up the tree, finally reach the start symbol and provide as with the semantics of the whole sentence, as shown in Figure 2.28. Attribute grammars are a very powerful method of handling the semantics of a language. + + 1 3 5 A 0=9 A 0=8 A 0=1 A 0=3 A 0=5 Figure 2.28 Fully attributed parse tree 2.9.2 Transduction grammars Transduction grammars define the semantics of a string (the “input string”) as another string, the “output string” or “translation”, rather than as the final attribute of the start symbol. This method is less powerful but much simpler than using attributes and often sufficient. The semantic clause in a rule just contains the string that should be output for the corresponding node. We assume that the string for a node is output just after the strings for all its children. Other variants are possible and in fact usual. We can now write a transduction grammar which translates a sum of digits into instructions to cal- culate the value of the sum. 1. SumS -> Digit {"make it the result"} 2. Sum -> Sum + Digit {"add it to the previous result"} 3a. Digit -> 0 {"take a 0"} ... ... 3j. Digit -> 9 {"take a 9"} This transduction grammar translates 3+5+1 into: take a 3 make it the result take a 5 add it to the previous result take a 1 add it to the previous result 60 Grammars as a generating device [Ch. 2 which is indeed what 3+5+1 “means”. 2.10 A METAPHORICAL COMPARISON OF GRAMMAR TYPES Text books claim that “Type n grammars are more powerful than Type n +1 grammars, for n=0,1,2”, and one often reads statements like “A regular (Type 3) grammar is not powerful enough to match parentheses”. It is interesting to see what kind of power is meant. Naively, one might think that it is the power to generate larger and larger sets, but this is clearly incorrect: the largest possible set of strings, Σ* , is easily generated by the straightforward Type 3 grammar: SS -> [Σ] S | ε where [Σ] is an abbreviation for the symbols in the language. It is just when we want to restrict this set, that we need more powerful grammars. More powerful grammars can define more complicated boundaries between correct and incorrect sentences. Some boundaries are so fine that they cannot be described by any grammar (that is, by any generative process). This idea has been depicted metaphorically in Figure 2.29, in which a rose is approximated by increasingly finer outlines. In this metaphor, the rose corresponds to the language (imagine the sentences of the language as molecules in the rose); the grammar serves to delineate its silhouette. A regular grammar only allows us straight horizontal and vertical line segments to describe the flower; ruler and T-square suffice, but the result is a coarse and mechanical-looking picture. A CF grammar would approximate the outline by straight lines at any angle and by circle segments; the draw- ing could still be made using the classical tools of compasses and ruler. The result is stilted but recognizable. A CS grammar would present us with a smooth curve tightly enveloping the flower, but the curve is too smooth: it cannot follow all the sharp turns and it deviates slightly at complicated points; still, a very realistic picture results. An unrestricted phrase structure grammar can represent the outline perfectly. The rose itself cannot be caught in a finite description; its essence remains forever out of our reach. 
A more prosaic and practical example can be found in the successive sets of Pas- cal† programs that can be generated by the various grammar types.  The set of all lexically correct Pascal programs can be generated by a regular grammar. A Pascal program is lexically correct if there are no newlines inside strings, comment is terminated before end-of-file, all numerical constants have the right form, etc.  The set of all syntactically correct Pascal programs can be generated by a context-free grammar. These programs conform to the (CF) grammar in the manual.  The set of all semantically correct Pascal programs can be generated by a CS grammar (although a VW grammar would be more practical). These are the   † We use the programming language Pascal here because we expect that most of our readers will be more or less familiar with it. Any programming language for which the manual gives a CF grammar will do. Sec. 2.10] A metaphorical comparison of grammar types 61 Figure 2.29 The silhouette of a rose, approximated by Type 3 to Type 0 grammars programs that pass through a Pascal compiler without drawing error messages.  The set of all Pascal programs that would terminate in finite time when run with a given input can be generated by an unrestricted phrase structure grammar. Such a grammar would, however, be very complicated, even in van Wijngaarden form, since it would incorporate detailed descriptions of the Pascal library routines and the Pascal run-time system.  The set of all Pascal programs that solve a given problem (for instance, play chess) cannot be generated by a grammar (although the description of the set is finite). Note that each of the above sets is a subset of the previous set. 64 Introduction to parsing [Ch. 3 3.2 LINEARIZATION OF THE PARSE TREE Often it is inconvenient and unnecessary to construct the actual production tree: many parsers produce a list of rule numbers instead, which means that they linearize the parse tree. There are three main ways to linearize a tree, prefix, postfix and infix. In prefix notation, each node is listed by listing its number followed by prefix listings of the subnodes in left-to-right order; this gives us the left-most derivation (for the right tree in Figure 3.2): left-most: 2 2 1 3c 1 3e 1 3a In postfix notation, each node is listed by listing in postfix notation all the subnodes in left-to-right order, followed by the number of the rule in the node itself; this gives us the right-most derivation (for the same tree): right-most: 3c 1 3e 1 2 3a 1 2 In infix notation, each node is listed by first giving an infix listing between parentheses of the first n subnodes, followed by the rule number in the node, followed by an infix listing between parentheses of the remainder of the subnodes; n can be chosen freely and can even differ from rule to rule, but n =1 is normal. Infix notation is not common for derivations, but is occasionally useful. The case with n =1 is called the left-corner derivation; in our example we get: left-corner: (((3c)1) 2 ((3e)1)) 2 ((3a)1) The infix notation requires parentheses to enable us to reconstruct the production tree from it. The left-most and right-most derivations can do without, provided we have the grammar ready to find the number of subnodes for each node. 
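Producing these linearizations from a finished parse tree takes only a few lines. In the Python sketch below the nested-tuple encoding of the tree is an assumption of ours, and terminal leaves are omitted since only rule numbers are listed; the tree carries the same rule labels as the right tree of Figure 3.2, so the prefix and postfix walks reproduce the left-most and right-most derivations given above.

    # Each node is (rule_number, subnodes); terminal symbols are left out
    # because the linearizations list rule numbers only.
    tree = ("2", [("2", [("1", [("3c", [])]),
                         ("1", [("3e", [])])]),
                  ("1", [("3a", [])])])

    def prefix(node):                 # left-most derivation
        rule, subs = node
        return [rule] + [r for s in subs for r in prefix(s)]

    def postfix(node):                # right-most derivation
        rule, subs = node
        return [r for s in subs for r in postfix(s)] + [rule]

    print(" ".join(prefix(tree)))     # 2 2 1 3c 1 3e 1 3a
    print(" ".join(postfix(tree)))    # 3c 1 3e 1 2 3a 1 2

The infix (left-corner) listing can be produced by an analogous walk that emits the rule number after the parenthesized listing of the first subnode.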
Note that it is easy to tell if a derivation is left-most or right-most: a left-most derivation starts with a rule for the start symbol, a right-most derivation starts with a rule that produces terminal symbols only (if both conditions hold, there is only one rule, which is both left-most and right- most derivation). The existence of several different derivations should not be confused with ambi- guity. The different derivations are just notational variants for one and the same pro- duction tree. No semantic significance can be attached to their differences. 3.3 TWO WAYS TO PARSE A SENTENCE The basic connection between a sentence and the grammar it derives from is the parse tree, which describes how the grammar was used to produce the sentence. For the reconstruction of this connection we need a parsing technique. When we consult the extensive literature on parsing techniques, we seem to find dozens of them, yet there are only two techniques to do parsing; all the rest is technical detail and embellishment. The first method tries to imitate the original production process by rederiving the sentence from the start symbol. This method is called top-down, because the production tree is reconstructed from the top downwards.†   † Trees grow from their roots downwards in computer science; this is comparable to electrons Sec. 3.3] Two ways to parse a sentence 65 The second methods tries to roll back the production process and to reduce the sentence back to the start symbol. Quite naturally this technique is called bottom-up. 3.3.1 Top-down parsing Suppose we have the monotonic grammar for the language a nb nc n from Figure 2.6, which we repeat here: SS -> aSQ S -> abc bQc -> bbcc cQ -> Qc and suppose the (input) sentence is aabbcc. First we try the top-down parsing method. We know that the production tree must start with the start symbol: S Now what could the second step be? We have two rules for S: S->aSQ and S->abc. The second rule would require the sentence to start with ab, which it does not; this leaves us S->aSQ: S a S Q This gives us a good explanation of the first a in our sentence. Again two rules apply: S->aSQ and S->abc. Some reflection will reveal that the first rule would be a bad choice here: all production rules of S start with an a, and if we would advance to the stage aaSQQ, the next step would inevitably lead to aaa...., which contradicts the input string. The second rule, however, is not without problems either: S a S Q a a b c Q since now the sentence starts with aabc..., which also contradicts the input sentence. Here, however, there is a way out: cQ->Qc:   having a negative charge in physics. 66 Introduction to parsing [Ch. 3 S a S Q a b c Q a a b Q c Now only one rule applies: bQc->bbcc, and we obtain our input sentence (together with the production tree): S a S Q a b c Q b Q c a a b b c c Top-down parsing tends to identify the production rules (and thus to characterize the parse tree) in prefix order. 3.3.2 Bottom-up parsing Using the bottom-up technique, we proceed as follows. One production step must have been the last and its result must still be visible in the string. We recognize the right- hand side of bQc->bbcc in aabbcc. This gives us the final step in the production (and the first in the reduction): a a b Q c a a b b c c Now we recognize the Qc as derived by cQ->Qc: a a b c Q b Q c a a b b c c Again we find only one recognizable substring: abc: Sec. 
3.4] Non-deterministic automata 69 the system of partially processed input, internal administration and partial parse tree consistent. This has the consequence that we may move the NDA any way we choose: it may move in circles, it may even get stuck, but if it ever gives us an answer, i.e., a finished parse tree, that answer will be correct. It is also essential that the NDA can make all correct moves, so that it can produce all parsings if the control mechanism is clever enough to guide the NDA there. This property of the NDA is also easily arranged. The inherent correctness of the NDA allows great freedom to the control mechan- ism, the “control” for short. It may be naive or sophisticated, it may be cumbersome or it may be efficient, it may even be wrong, but it can never cause the NDA to produce an incorrect parsing; and that is a comforting thought. (If it is wrong it may, however, cause the NDA to miss a correct parsing, to loop infinitely or to get stuck in a place where it should not). 3.4.1 Constructing the NDA The NDA derives directly from the grammar. For a top-down parser its moves consist essentially of the production rules of the grammar and the internal administration is ini- tially the start symbol. The control moves the machine until the internal administration is equal to the input string; then a parsing has been found. For a bottom-up parser the moves consist essentially of the reverse of the production rules of the grammar (see 3.3.2) and the internal administration is initially the input string. The control moves the machine until the internal administration is equal to the start symbol; then a parsing has been found. A left-corner parser works like a top-down parser in which a carefully chosen set of production rules has been reversed and which has special moves to undo this reversion when needed. 3.4.2 Constructing the control mechanism Constructing the control of a parser is quite a different affair. Some controls are independent of the grammar, some consult the grammar regularly, some use large tables precalculated from the grammar and some even use tables calculated from the input string. We shall see examples of each of these: the “hand control” that was demonstrated at the beginning of this section comes in the category “consults the gram- mar regularly”, backtracking parsers often use a grammar-independent control, LL and LR parsers use precalculated grammar-derived tables, the CYK parser uses a table derived from the input string and Earley’s and Tomita’s parsers use several tables derived from the grammar and the input string. Constructing the control mechanism, including the tables, from the grammar is almost always done by a program. Such a program is called a parser generator; it is fed the grammar and perhaps a description of the terminal symbols and produces a program which is a parser. The parser often consists of a driver and one or more tables, in which case it is called table-driven. The tables can be of considerable size and of extreme complexity. The tables that derive from the input string must of course be calculated by a rou- tine that is part of the parser. It should be noted that this reflects the traditional setting in which a large number of different input strings is parsed according to a relatively static and unchanging grammar. The inverse situation is not at all unthinkable: many grammars are tried to explain a given input string (for instance, an observed sequence of events). 70 Introduction to parsing [Ch. 
3 3.5 RECOGNITION AND PARSING FOR TYPE 0 TO TYPE 4 GRAMMARS Parsing a sentence according to a grammar if we know in advance that the string indeed derives from the grammar, is in principle always possible. If we cannot think of any- thing better, we can just run the general production process of 2.5.1 on the grammar and sit back and wait until the sentence turns up (and we know it will); this by itself is not exactly enough, we must extend the production process a little, so that each senten- tial form carries its own partial production tree, which must be updated at the appropri- ate moments, but it is clear that this can be done with some programming effort. We may have to wait a little while (say a couple of million years) for the sentence to show up, but in the end we will surely obtain the parse tree. All this is of course totally impractical, but it still shows us that at least theoretically any string can be parsed if we know it is parsable, regardless of the grammar type. 3.5.1 Time requirements When parsing strings consisting of more than a few symbols, it is important to have some idea of the time requirements of the parser, i.e., the dependency of the time required to finish the parsing on the number of symbols in the input string. Expected lengths of input range from some tens (sentences in natural languages) to some tens of thousands (large computer programs); the length of some input strings may even be vir- tually infinite (the sequence of buttons pushed on a coffee vending machine over its life-time). The dependency of the time requirements on the input length is also called time complexity. Several characteristic time dependencies can be recognized. A time dependency is exponential if each following input symbol multiplies the required time by a constant factor, say 2: each additional input symbol doubles the parsing time. Exponential time dependency is written O(C n) where C is the constant multiplication factor. Exponential dependency occurs in the number of grains doubled on each field of the famous chess board; this way lies bankrupcy. A time dependency is linear if each following input symbol takes a constant amount of time to process; doubling the input length doubles the processing time. This is the kind of behaviour we like to see in a parser; the time needed for parsing is pro- portional to the time spent on reading the input. So-called real-time parsers behave even better: they can produce the parse tree within a constant time after the last input symbol was read; given a fast enough computer they can keep up indefinitely with an input stream of constant speed. (Note that the latter is not necessarily true of linear- time parsers: they can in principle read the entire input of n symbols and then take a time proportional to n to produce the parse tree.) Linear time dependency is written O(n). A time dependency is called quadratic if the processing time is proportional to the square of the input length (written O(n 2)) and cubic if it is proportional to the to the third power (written O(n 3)). In general, a depen- dency that is proportional to any power of n is called polynomial (written O(n p)). 3.5.2 Type 0 and Type 1 grammars It is a remarkable result in formal linguistics that the recognition problem for a arbi- trary Type 0 grammar cannot be solved. This means that there cannot be an algorithm that accepts an arbitrary Type 0 grammar and an arbitrary string and tells us in finite time if the grammar can produce the string or not. 
This statement can be proven, but the proof is very intimidating and, what is worse, does not provide any insight into the Sec. 3.5] Recognition and parsing for Type 0 to Type 4 grammars 71 cause of the phenomenon. It is a proof by contradiction: we can prove that, if such an algorithm existed, we could construct a second algorithm of which we can prove that it only terminates if it never terminates. Since the latter is a logical impossibility and since all other premisses that went into the intermediate proof are logically sound we are forced to conclude that our initial premiss, the existence of a recognizer for Type 0 grammars, is a logical impossibility. Convincing, but not food for the soul. For the full proof see Hopcroft and Ullman [Books 1979, pp. 182-183] or R  v  sz [Books 1985, p. 98]. It is quite possible to construct a recognizer that works for a certain number of Type 0 grammars, using a certain technique. This technique, however, will not work for all Type 0 grammars. In fact, however many techniques we collect, there will always be grammars for which they do not work. In a sense we just cannot make our recognizer complicated enough. For Type 1 grammars, the situation is completely different. The seemingly incon- sequential property that Type 1 production rules cannot make a sentential form shrink allows us to construct a control mechanism for a bottom-up NDA that will at least work in principle, regardless of the grammar. The internal administration of this control consists of a set of sentential forms that could have played a role in the production of the input sentence; it starts off containing only the input sentence. Each move of the NDA is a reduction according to the grammar. Now the control applies all possible moves of the NDA to all sentential forms in the internal administration in an arbitrary order, and adds each result to the internal administration if it is not already there. It continues doing so until each move on each sentential form results in a sentential form that has already been found. Since no move of the NDA can make a sentential form longer (because all right-hand sides are at least as long as their left-hand sides) and since there are only a finite number of sentential forms as long as or shorter than the input string, this must eventually happen. Now we search the sentential forms in the internal administration for one that consists solely of the start symbol; if it is there, we have recognized the input string, if it is not, the input string does not belong to the language of the grammar. And if we still remember, in some additional administration, how we got this start symbol sentential form, we have obtained the parsing. All this requires a lot of book-keeping, which we are not going to discuss, since nobody does it this way anyway. To summarize the above, we cannot always construct a parser for a Type 0 gram- mar, but for a Type 1 grammar we always can. The construction of a practical and rea- sonably efficient parser for such grammars is a very difficult subject on which slow but steady progress has been made during the last 20 years (see the bibliography on “Unrestricted PS and CS Grammars”). It is not a hot research topic, mainly because Type 0 and Type 1 grammars are well-known to be human-unfriendly and will never see wide application. 
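For what it is worth (nobody does it this way in practice), the control just described fits in a few lines of code. The sketch below is a Python rendering of ours, with symbols written as single characters: it keeps the set of sentential forms, applies every possible reduction everywhere, and finally looks for the start symbol; because the rules are monotonic, reductions never lengthen a form and the set must stop growing.

    def monotonic_recognize(rules, start, sentence):
        # Exhaustive bottom-up recognition for a monotonic (Type 1) grammar.
        # rules: (lhs, rhs) productions; a reduction replaces an occurrence
        # of rhs by lhs. Since no rhs is shorter than its lhs, reductions
        # never lengthen a form, so the set of forms is finite and the loop
        # below must terminate.
        forms = {sentence}
        todo = [sentence]
        while todo:
            form = todo.pop()
            for lhs, rhs in rules:
                pos = form.find(rhs)
                while pos >= 0:
                    reduced = form[:pos] + lhs + form[pos + len(rhs):]
                    if reduced not in forms:
                        forms.add(reduced)
                        todo.append(reduced)
                    pos = form.find(rhs, pos + 1)
        return start in forms

    # The monotonic grammar for a^n b^n c^n of Figure 2.6:
    rules = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]
    print(monotonic_recognize(rules, "S", "aabbcc"))   # True
    print(monotonic_recognize(rules, "S", "aabbc"))    # False

Keeping, in some additional administration, the reduction that created each new form would turn this recognizer into the (equally impractical) parser described above.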
Yet it is not completely devoid of usefulness, since a good parser for Type 0 grammars would probably make a good starting point for a theorem prover.† The human-unfriendliness consideration does not apply to two-level grammars. Having a practical parser for two-level grammars would be marvellous, since it would allow parsing techniques (with all their built-in automation) to be applied in many more   † A theorem prover is a program that, given a set of axioms and a theorem, proves or disproves the theorem without or with minimal human intervention. 74 Introduction to parsing [Ch. 3 production tree of Figure 2.14 and if we turn it 45° counterclockwise, we get the pro- duction line of Figure 3.7. The sequence of non-terminals roll on to the right, producing terminals symbols as they go. In parsing, we are given the terminals symbols and are supposed to construct the sequence of non-terminals. The first one is given, the start symbol (hence the preference for top-down). If only one rule for the start symbol starts with the first symbol of the input we are lucky and know which way to go. Very often, however, there are many rules starting with the same symbol and then we are in need of more wisdom. As with Type 2 grammars, we can of course find the correct continua- tion by trial and error, but far more efficient methods exist that can handle any regular grammar. Since they form the basis of some advanced parsing techniques, they are treated separately, in Chapter 5. Sentence List ListTail List ListTail t , d & h Figure 3.7 The production tree of Figure 2.14 as a production line 3.5.5 Type 4 grammars Finite-choice (FC) grammars do not involve production trees, and membership of a given input string to the language of the FC grammar can be determined by simple look-up. This look-up is generally not considered to be “parsing”, but is still mentioned here for two reasons. First it can benefit from parsing techniques and second it is often required in a parsing environment. Natural languages have some categories of words that have only a very limited number of members; examples are the pronouns, the prepositions and the conjunctions. It is often important to decide quickly if a given word belongs to one of these finite-choice categories or will have to be analysed further. The same applies to reserved words in a programming language. One approach is to consider the FC grammar as a regular grammar and apply the techniques of Chapter 5. This is often amazingly efficient. A second often-used approach is that using a hash table. See any book on algo- rithms, for instance, Smith [CSBooks 1989]. 3.6 AN OVERVIEW OF PARSING METHODS The reader of literature about parsing is confronted with a large number of techniques with often unclear interrelationships. Yet (almost) all techniques can be placed in a sin- gle framework, according to some simple criteria; see Figure 3.10. We have already seen that a parsing technique is either top-down or bottom-up. The next division is that between non-directional and directional. 3.6.1 Directionality A non-directional method constructs the parse tree while accessing the input in any order it sees fit; this of course requires the entire input to be in memory before parsing can start. There is a top-down and a bottom-up version. Sec. 3.6] An overview of parsing methods 75 3.6.1.1 Non-directional methods The non-directional top-down method is simple and straightforward and has probably been invented independently by many people. 
It was first described by Unger [CF 1968] but in his article he gives the impression that the method already existed. The method has not received much attention in the literature but is more important than one might think, since it is used anonymously in a number of other parsers. We shall call it Unger’s method; it is treated in Section 4.1. The non-directional bottom-up method has also been discovered independently by a number of people, among whom Cocke, Younger [CF 1967] and Kasami [CF 1969]; an earlier description is by Sakai [CF 1962]. It is named CYK (or sometimes CKY) after the three best-known inventors. It has received considerable attention since its naive implementation is much more efficient than that of Unger’s method. The effi- ciency of both methods can be improved, however, arriving at roughly the same perfor- mance (see Sheil [CF 1976]). The CYK method is treated in Section 4.2. 3.6.1.2 Directional methods The directional methods process the input symbol by symbol, from left to right. (It is also possible to parse from right to left, using a mirror image of the grammar; this is occasionally useful.) This has the advantage that parsing can start, and indeed progress, considerably before the last symbol of the input is seen. The directional methods are all based explicitly or implicitly on the parsing automaton described in Section 3.5.3, where the top-down method performs predictions and matches and the bottom-up method performs shifts and reduces. 3.6.2 Search techniques The next subdivision concerns the search technique used to guide the (non- deterministic!) parsing automaton through all its possibilities to find one or all parsings. There are in general two methods for solving problems in which there are several alternatives in well-determined points: depth-first search, and breadth-first search. In depth-first search we concentrate on one half-solved problem; if the problem bifurcates at a given point P, we store one alternative for later processing and keep concentrating on the other alternative. If this alternative turns out to be a failure (or even a success, but we want all solutions), we roll back our actions until point P and continue with the stored alternative. This is called backtracking. In breadth-first search we keep a set of half-solved problems. From this set we calculate a new set of (better) half-solved prob- lems by examining each old half-solved problem; for each alternative, we create a copy in the new set. Eventually, the set will come to contain all solutions. Depth-first search has the advantage that it requires an amount of memory that is proportional to the size of the problem, unlike breadth-first search, which may require exponential memory. Breadth-first search has the advantage that it will find the sim- plest solution first. Both methods require in principle exponential time; if we want more efficiency (and exponential requirements are virtually unacceptable), we need some means to restrict the search. See any book on algorithms, for instance, Sedgewick [CSBooks 1988], for more information on search techniques. These search techniques are not at all restricted to parsing and can be used in a wide array of contexts. A traditional one is that of finding an exit from a maze. Figure 3.8(a) shows a simple maze with one entrance and two exits. Figure 3.8(b) depicts the path a depth-first search will take; this is the only option for the human maze-walker: 76 Introduction to parsing [Ch. 3 he cannot duplicate himself and the maze. 
Dead ends make the depth-first search back- track to the most recent untried alternative. If the searcher will also backtrack at each exit, he will find all exits. Figure 3.8(c) shows which rooms are examined in each stage of the breadth-first search. Dead ends (in stage 3) cause the search branches in question to be discarded. Breadth-first search will find the shortest way to an exit (the shortest solution) first; if it continues until all there are no branches left, it will find all exits (all solutions). (a) (b) .. .. . . . . . . . . . . . .. ........... .. . . . . . . . . . .. .. . . . . . . ............... .. .. .. .. ............... .. .................... (c) 0 1 2 2 3 3 3 4 4 5 5 6 Figure 3.8 A simple maze with depth-first and breadth-first visits 3.6.3 General directional methods Combining depth-first or breadth-first with top-down or bottom-up gives four classes of parsing techniques. The top-down techniques are treated in Chapter 6. The depth- first top-down technique allows a very simple implementation called recursive descent; this technique, which is explained in Section 6.6, is very suitable for writing parsers by hand. The bottom-up techniques are treated in Chapter 7. The combination of breadth- first and bottom-up leads to the class of Earley parsers, which have among them some very effective and popular parsers for general CF grammars. See Section 7.2. 3.6.4 Linear methods Most of the general search methods indicated in the previous paragraph have exponen- tial time dependency in the worst case: each symbol more in the input multiplies the parsing time by a constant factor. Such methods are unusable except for very small input length, where 20 symbols is about the maximum. Even the best of the above methods require cubic time in the worst case: for 10 tokens they do 1000 actions, for 100 tokens 1000 000 actions and for 1000 tokens 1 000 000 000 actions, which, at 10 microseconds per action will already take almost 3 hours. It is clear that for real speed we should like to have a linear-time general parsing method. Unfortunately no such method has been discovered to date. On the other hand, there is no proof and not even an indication that such a method could not exist. (Compare this to the situation around unrestricted phrase structure parsing, where it has been proved that no algorithm for it can exist; see Section 3.5.2.) Worse even, nobody has ever come up with a specific CF grammar for which no ad hoc linear-time parser could be designed. The only thing is that we have at present no way to construct such a parser in the general case. This is a theoretically and practically unsatisfactory state of affairs that awaits further clarifica- tion.†   † There is a theoretically interesting but impractical method by Valiant [CF 1975] which does general CF parsing in O(n 2.81). Since this is only very slightly better than O(n 3.00) and since Sec. 3.6] An overview of parsing methods 79 3.6.6 Almost deterministic methods When our attempt to construct a deterministic control for a parser fails and leaves us with an almost deterministic one, we need not despair yet. We can fall back on breadth-first search to solve the remnants of non-determinism at run-time. The better our original method was, the less non-determinism will be left, the less often breadth- first search will be needed and the more efficient our parser will be. 
This avenue of thought has been explored for bottom-up parsers by Tomita [CF 1986], who achieves with it what is probably the best general CF parser available today. Of course, by reintroducing breadth-first search we are taking chances. The gram- mar and the input could conspire so that the non-determinism gets hit by each input symbol and our parser will again have exponential time dependency. In practice, how- ever, they never do so and such parsers are very useful. Tomita’s parser is treated in Section 9.8. No corresponding research on top-down parsers has been reported in the literature. This is perhaps due to the fact that no amount of breadth-first searching can handle left-recursion in a grammar (left- recursion is explained in Section 6.3.2). 3.6.7 Left-corner parsing In Section 3.6 we wrote that “almost” all parsing methods could be assigned a place in Figure 3.10. The principal class of methods that has been left out concerns “left-corner parsing”. It is a third division alongside top-down and bottom-up, and since it is a hybrid between the two it should be assigned a separate column between these. In left-corner parsing, the right-hand side of each production rule is divided into two parts: the left part is called the left corner and is identified by bottom-up methods. The division of the right-hand side is done so that once its left corner has been identi- fied, parsing of the right part can proceed by a top-down method. Although left-corner parsing has advantages of its own, it tends to combine the disadvantages or at least the problems of top-down and bottom-up parsing, and is hardly used in practice. For this reason it has not been included in Figure 3.10. From a certain point of view, top-down and bottom-up can each be considered special cases of left-corner, which gives it some theoretical significance. See Section 13.7 for literature references. 3.6.8 Conclusion Figure 3.10 summarizes parsing techniques as they are treated in this book. Nijholt [Misc 1981] paints a more abstract view of the parsing landscape, based on left-corner parsing. See Deussen [Misc 1979] for an even more abstracted overview. An early sys- tematic survey was given by Griffiths and Petrick [CF 1965]. 80 Introduction to parsing [Ch. 3 Top-down Bottom-up   Non-directional methods Unger parser CYK parser   Directional methods The predict/match automaton Depth-first search (backtrack) Breadth-first search (Greibach) Recursive descent Definite Clause grammars The shift/reduce automaton Depth-first search (backtrack) Breadth-first search Breadth-first search, restricted (Earley)   Linear directional methods: breadth-first, with breadth restricted to 1 There is only one top-down method: LL(k) There is a whole gamut of methods: precedence bounded-context LR(k) LALR(1) SLR(1)  Efficient general directional methods: maximally restricted breadth-first search (no research reported) Tomita                                                                                    Figure 3.10 An overview of parsing techniques 4 General non-directional methods In this chapter we will present two general parsing methods, both non-directional: Unger’s method and the CYK method. These methods are called non-directional because they access the input in an seemingly arbitrary order. They require the entire input to be in memory before parsing can start. Unger’s method is top-down; if the input belongs to the language at hand, it must be derivable from the start symbol of the grammar. 
Therefore, it must be derivable from a right-hand side of the start symbol, say A 1A 2 . . . Am . This, in turn, means that A 1 must derive a first part of the input, A 2 a second part, etc. If the input sentence is z 1z 2 . . . zn , this demand can be depicted as follows: A 1 . . . Ai . . . Am z 1 . . . zk . . . zn Unger’s method tries to find a partition of the input that fits this demand. This is a recursive problem: if a non-terminal Ai is to derive a certain part of the input, there must be a partition of this part that fits a right-hand side of Ai . Ultimately, such a right-hand side must consist of terminal symbols only, and these can easily be matched with the current part of the input. The CYK method approaches the problem the other way around: it tries to find occurrences of right-hand sides in the input; whenever it finds one, it makes a note that the corresponding left-hand side derives this part of the input. Replacing the occurrence of the right-hand side with the corresponding left-hand side results in some sentential forms that derive the input. These sentential forms are again the subject of a search for right-hand sides, etc. Ultimately, we may find a sentential form that both derives the input sentence and is a right-hand side of the start symbol. In the next two sections, these methods are investigated in detail. 84 General non-directional methods [Ch. 4  Expr  Expr + Term   ( i +i)×i ( i+ i)×i ( i+i )×i ( i+i) ×i ( i+i)× i (i + i)×i (i +i )×i (i +i) ×i (i +i)× i (i+ i )×i (i+ i) ×i (i+ i)× i (i+i ) ×i (i+i )× i (i+i) × i  Even a small example like this already results in 15 partitions, and we will not examine them all here, although the unoptimized version of the algorithm requires this. We will only examine the partitions that have at least some chance of succeeding: we can elim- inate all partitions that do not match the terminal symbol of the right-hand side. So, the only partition worth investigating further is:  Expr  Expr + Term   (i + i)×i  The first sub-problem here is to find out whether and, if so, how Expr derives (i. We cannot partition (i into three non-empty parts because it only consists of 2 symbols. Therefore, the only rule that we can apply is the rule Expr -> Term. Similarly, the only rule that we can apply next is the rule Term -> Factor. So, we now have  Expr  Term  Factor   (i  However, this is impossible, because the first right-hand side of Factor has too many symbols, and the second one consists of one terminal symbol only. Therefore, the par- tition we started with does not fit, and it must be rejected. The other partitions were already rejected, so we can conclude that the rule Expr -> Expr + Term does not derive the input. The second right-hand side of Expr consists of only one symbol, so we only have one partition here, consisting of one part. Partitioning this part for the first right-hand side of Term again results in 15 possibilities, of which again only one has a chance of Sec. 4.1] Unger’s parsing method 85 succeeding: Expr Term Term × Factor (i+i) × i Continuing our search, we will find the following derivation: Expr -> Term -> Term × Factor -> Factor × Factor -> ( Expr ) × Factor -> ( Expr + Term ) × Factor -> ( Term + Term ) × Factor -> ( Factor + Term ) × Factor -> ( i + Term ) × Factor -> ( i + Factor ) × Factor -> ( i + i ) × Factor -> ( i + i ) × i and this is the only derivation to be found. 
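For grammars without ε-rules the whole search fits in a short program. The sketch below is a minimal Python rendering of ours (the grammar encoding and the partition generator are our own; ε-rules and cycles of unit rules are assumed absent, which is what makes the recursion terminate): for every alternative of a non-terminal it tries every partition of the string into non-empty consecutive parts.

    grammar = {
        "Expr":   [["Expr", "+", "Term"], ["Term"]],
        "Term":   [["Term", "×", "Factor"], ["Factor"]],
        "Factor": [["(", "Expr", ")"], ["i"]],
    }

    def partitions(s, m):
        # All ways to split string s into m non-empty consecutive parts.
        if m == 1:
            if s:
                yield (s,)
            return
        for k in range(1, len(s) - m + 2):
            for rest in partitions(s[k:], m - 1):
                yield (s[:k],) + rest

    def derives(symbol, s):
        # Does `symbol` derive the string s?  (Unger's method, no ε-rules.)
        if symbol not in grammar:             # a terminal symbol
            return s == symbol
        for rhs in grammar[symbol]:
            if len(rhs) > len(s):             # more members than symbols: reject
                continue
            for parts in partitions(s, len(rhs)):
                if all(derives(member, part)
                       for member, part in zip(rhs, parts)):
                    return True
        return False

    print(derives("Expr", "(i+i)×i"))   # True
    print(derives("Expr", "i+)i"))      # False

The early rejection of partitions whose terminal members do not match is implicit here in the terminal case of derives; the extra checks mentioned below (minimum derivable lengths, etc.) could be added in the same place.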
This example demonstrates several aspects of the method: even small examples require a considerable amount of work, but even some simple checks can result in huge savings. For instance, matching the terminal symbols in a right-hand side with the par- tition at hand often leads to the rejection of the partition without investigating it any further. Unger [CF 1968] presents several more of these checks. For instance, one can compute the minimum length of strings of terminal symbols derivable from each non- terminal. Once it is known that a certain non-terminal only derives terminal strings of length at least n, all partitions that fit this non-terminal with a substring of length less than n can be immediately rejected. 4.1.2 Unger’s method with ε-rules So far, we only have dealt with grammars without ε-rules, and not without reason. Complications arise when the grammar contains ε-rules, as is demonstrated by the fol- lowing example: consider the grammar rule S→ABC and input sentence pqr. If we want to examine whether this rule derives the input sentence, and we allow for ε-rules, many more partitions will have to be investigated, because each of the non-terminals A, B, and C may derive the empty string. In this case, generating all partitions proceeds just as above, except that we first generate the partitions that have no marble at all in the first cup, then the partitions that have marble 1 in the first cup, etc.: 86 General non-directional methods [Ch. 4 S A B C pqr p qr pq r pqr p qr p q r p qr pq r pq r pqr Now suppose that we are investigating whether B derives pqr, and suppose there is a rule B→SD. Then, we will have to investigate the following partitions:  B  S D   pqr p qr pq r pqr  It is the last of these partitions that will cause trouble: in the process of finding out whether S derives pqr, we end up asking the same question again, in a different context. If we are not careful and do not detect this, our parser will loop forever, or run out of memory. When searching along this path, we are looking for a derivation that is using a loop in the grammar. This may even happen if the grammar does not contain loops. If this loop actually exists in the grammar, there are infinitely many derivations to be found along this path, provided that there is one, so we will never be able to present them all. The only interesting derivations are the ones without the loop. Therefore, we will cut off the search process in these cases. On the other hand, if the grammar does not contain such a loop, a cut-off will not do any harm either, because the search is doomed to fail anyway. So, we can avoid the problem altogether by cutting off the search process in these cases. Fortunately, this is not a difficult task. All we have to do is to maintain a list of questions that we are currently investigating. Before starting to investigate a new question (for instance “does S derive pqr?”) we first check that the question does not already appear in the list. If it does, we do not investigate this ques- tion. Instead, we proceed as if the question were answered negatively. Consider for instance the following grammar: S -> LSD | ε L -> ε D -> d This grammar generates sequences of d’s in an awkward way. The complete search for Sec. 4.2] The CYK parsing method 89 grammar and an input sentence. The first phase of the algorithm constructs a table tel- ling us which non-terminal(s) derive which substrings of the sentence. This is the recognition phase. 
It ultimately also tells us whether the input sentence can be derived from the grammar. The second phase uses this table and the grammar to construct all possible derivations of the sentence. We will first concentrate on the recognition phase, which really is the distinctive feature of the algorithm.

4.2.1 CYK recognition with general CF grammars

To see how the CYK algorithm solves the recognition and parsing problem, let us consider the grammar of Figure 4.4. This grammar describes the syntax of numbers in scientific notation. An example sentence produced by this grammar is 32.5e+1. We will now use this grammar and sentence as an example.

NumberS  -> Integer | Real
Integer  -> Digit | Integer Digit
Real     -> Integer Fraction Scale
Fraction -> . Integer
Scale    -> e Sign Integer | Empty
Digit    -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Empty    -> ε
Sign     -> + | -

Figure 4.4  A grammar describing numbers in scientific notation

The CYK algorithm first concentrates on substrings of the input sentence, shortest substrings first, and then works its way up. The following derivations of substrings of length 1 can be read directly from the grammar:

Digit     Digit              Digit              Sign    Digit
  3         2         .        5         e        +       1

This means that Digit derives 3, Digit derives 2, etc. Note, however, that this picture is not yet complete. For one thing, there are several other non-terminals deriving 3. This complication arises because the grammar contains so-called unit rules, rules of the form A→B, where A and B are non-terminals. Such rules are also called single rules or chain rules. We can have chains of them in a derivation. So, the next step consists of applying the unit rules, repetitively, for instance to find out which other non-terminals derive 3. This gives us the following result:

Number,    Number,              Number,                       Number,
Integer,   Integer,             Integer,              Sign    Integer,
Digit      Digit                Digit                         Digit
  3          2          .         5           e         +       1

Now, we already see some combinations that we recognize from the grammar: for instance, an Integer followed by a Digit is again an Integer, and a . (dot) followed by an Integer is a Fraction. We get (again also using unit rules):

|- Number, Integer -|   |--- Fraction ---|   |-------- Scale --------|
Number,    Number,              Number,                       Number,
Integer,   Integer,             Integer,              Sign    Integer,
Digit      Digit                Digit                         Digit
  3          2          .         5           e         +       1

At this point, we see that the Real-rule is applicable in several ways, and then the Number-rule, so we get:

|--------------------------- Number, Real ----------------------------|
|------------- Number, Real ------------|
|- Number, Integer -|   |--- Fraction ---|   |-------- Scale --------|
Number,    Number,              Number,                       Number,
Integer,   Integer,             Integer,              Sign    Integer,
Digit      Digit                Digit                         Digit
  3          2          .         5           e         +       1

We find that Number does indeed derive 32.5e+1.

In the example above, we have seen that unit rules complicate things a bit. Another complication, one that we have avoided until now, is formed by ε-rules. For instance, if we want to recognize the input 43.1 according to the example grammar, we have to realize that Scale derives ε here, so we get the following picture:

|---------------- Number, Real -----------------|
|- Number, Integer -|   |- Fraction -|   |- Scale (ε) -|
Number,    Number,             Number,
Integer,   Integer,            Integer,
Digit      Digit               Digit
  4          3          .        1

In general this is even more complicated. We must take into account the fact that several non-terminals can derive ε between any two adjacent terminal symbols in the input sentence, and also in front of the input sentence or at the back. However, as we shall see, the problems caused by these kinds of rules can be solved, albeit at a certain cost. In the meantime, we will not let these problems discourage us. In the example,
we have seen that the CYK algorithm works by determining which non-terminals derive which substrings, shortest substrings first. Although we skipped them in the example, the shortest substrings of any input sentence are, of course, the ε-substrings. We shall have to be able to recognize them in arbitrary position, so let us first see if we can compute R_ε, the set of non-terminals that derive ε.

Initially, this set R_ε consists of the set of non-terminals A for which A→ε is a grammar rule. For the example grammar, R_ε is initially the set { Empty }. Next, we check each grammar rule: if a right-hand side consists only of symbols that are a member of R_ε, we add the left-hand side to R_ε (it derives ε, because all symbols in the right-hand side do). In the example, Scale would be added. This process is repeated until no new non-terminals can be added to the set. For the example, this results in R_ε = { Empty, Scale }.

Now, we direct our attention to the non-empty substrings of the input sentence. Suppose we have an input sentence z = z_1 z_2 ... z_n and we want to compute the set of non-terminals that derive the substring of z starting at position i, of length l. We will use the notation s_i,l for this substring, so s_i,l = z_i z_i+1 ... z_i+l-1. Figure 4.5 presents this notation graphically, using a sentence of 4 symbols.

s_1,4
s_1,3   s_2,3
s_1,2   s_2,2   s_3,2
s_1,1   s_2,1   s_3,1   s_4,1
z_1     z_2     z_3     z_4

Figure 4.5  A graphical presentation of substrings

We will use the notation R_s_i,l for the set of non-terminals deriving the substring s_i,l. This notation can be extended to deal with substrings of length 0: s_i,0 = ε, and R_s_i,0 = R_ε. Because shorter substrings are dealt with first, we can assume that we are at a stage in the algorithm where all information on substrings with length smaller than a certain l is available. Using this information, we check each right-hand side in the grammar, to see if it derives s_i,l, as follows: suppose we have a right-hand side A_1 ... A_m. Then we divide s_i,l into m (possibly empty) segments, such that A_1 derives the first segment, A_2 the second, etc. We start with A_1. If A_1 ... A_m is to derive s_i,l, A_1 has to derive a first part of it, say of length k. That is, A_1 must derive s_i,k (be a member of R_s_i,k), and A_2 ... A_m must derive the rest:

B a member of a set on the V arrow, and C a member of the corresponding set on the W arrow. For B, substrings are taken starting at position i, with increasing length k, so the V arrow is vertical and rising, visiting R_s_i,1, R_s_i,2, ..., R_s_i,k, ..., R_s_i,l-1; for C, substrings are taken starting at position i+k, with length l-k and end-position i+l-1, so the W arrow is diagonally descending, visiting R_s_i+1,l-1, R_s_i+2,l-2, ..., R_s_i+k,l-k, ..., R_s_i+l-1,1.

As described above, the recognition table is computed in the order depicted in Figure 4.7(a). We could also compute the recognition table in the order depicted in Figure 4.7(b). In this last order, R_s_i,l is computed as soon as all sets and input symbols needed for its computation are available. For instance, when computing R_s_3,3, R_s_5,1 is relevant, but R_s_6,1 is not, because the substring at position 3 with length 3 does not contain the substring at position 6 with length 1.
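As an aside, the computation of R_ε described at the beginning of this section is a simple closure (fixed-point) computation. The following Python fragment is our own illustration, not code from the book; it encodes the number grammar of Figure 4.4 as a dictionary, with ε written as an empty right-hand side and the start symbol simply called Number.

# A grammar maps each non-terminal to a list of right-hand sides;
# a right-hand side is a list of symbols, and [] stands for ε.
NUMBER_GRAMMAR = {
    "Number":   [["Integer"], ["Real"]],
    "Integer":  [["Digit"], ["Integer", "Digit"]],
    "Real":     [["Integer", "Fraction", "Scale"]],
    "Fraction": [[".", "Integer"]],
    "Scale":    [["e", "Sign", "Integer"], ["Empty"]],
    "Digit":    [[d] for d in "0123456789"],
    "Empty":    [[]],
    "Sign":     [["+"], ["-"]],
}

def nullable_set(grammar):
    # R_ε: the set of non-terminals that derive the empty string.
    r_eps = {a for a, rhss in grammar.items() if [] in rhss}
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            if a in r_eps:
                continue
            # Add A when some right-hand side consists only of members of R_ε.
            if any(all(sym in r_eps for sym in rhs) for rhs in rhss):
                r_eps.add(a)
                changed = True
    return r_eps

print(sorted(nullable_set(NUMBER_GRAMMAR)))    # ['Empty', 'Scale']

Every pass over the rules either adds at least one non-terminal to the set or terminates the loop, so at most as many passes are made as there are non-terminals.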
This on-line order makes the algorithm particularly suitable for on-line parsing, where the number of symbols in the input is not known in advance, and additional information is computed each time a symbol is entered.

Figure 4.7  Different orders in which the recognition table can be computed: (a) off-line order, (b) on-line order

Now, let us examine the cost of this algorithm. Figure 4.6 shows that there are n(n+1)/2 substrings to be examined. For each substring, at most n-1 different k-positions have to be examined. All other operations are independent of n, so the algorithm operates in a time at most proportional to the cube of the length of the input sentence. As such, it is far more efficient than exhaustive search, which needs a time that is exponential in the length of the input sentence.

4.2.3 Transforming a CF grammar into Chomsky Normal Form

The previous section has demonstrated that it is certainly worthwhile to try to transform a general CF grammar into CNF. In this section, we will discuss this transformation, using our number grammar as an example. The transformation is split up into several stages:
- first, ε-rules are eliminated;
- then, unit rules are eliminated;
- then, non-productive non-terminals are removed;
- then, non-reachable non-terminals are removed;
- finally, the remaining grammar rules are modified, and rules are added, until they all have the desired form, that is, either A→a or A→BC.
None of these transformations changes the language defined by the grammar. This is not proven here; most books on formal language theory discuss these transformations more formally and provide proofs, see for example Hopcroft and Ullman [Books 1979].

4.2.3.1 Eliminating ε-rules

Suppose we have a grammar G, with an ε-rule A→ε, and we want to eliminate this rule. We cannot just remove the rule, as this would change the language defined by the non-terminal A, and probably also the language defined by the grammar G. So, something has to be done about the occurrences of A in the right-hand sides of the grammar rules. Whenever A occurs in a grammar rule B→αAβ, we replace this rule with two others: B→αA′β, where A′ is a new non-terminal, for which we shall add rules later (these rules will be the non-empty grammar rules of A), and B→αβ, which handles the case where A derives ε in a derivation using the B→αAβ rule. Notice that the α and β in the rules above could also contain A; in this case, each of the new rules must be replaced in the same way, and this process must be repeated until all occurrences of A are removed. When we are through, there will be no occurrence of A left in the grammar. Every ε-rule must be handled in this way. Of course, during this process new ε-rules may originate. This is only to be expected: the process makes all ε-derivations explicit. The newly created ε-rules must be dealt with in exactly the same way. Ultimately, this process will stop, because the number of non-terminals deriving ε is limited and, in the end, none of these non-terminals occurs in any right-hand side.

The next step in eliminating the ε-rules is the addition of grammar rules for the new non-terminals. If A is a non-terminal for which an A′ was introduced, we add a rule A′→α for all non-ε-rules A→α. Since all ε-rules have been made explicit, we can be sure that if a rule does not derive ε directly, it cannot do so indirectly. A problem that may arise here is that there may not be a non-ε-rule for A.
In this case, A only derives ε, so we remove all rules using A′.

All this leaves us with a grammar that still contains ε-rules. However, none of the non-terminals having an ε-rule occurs in any right-hand side. These occurrences have just been carefully removed. So, these non-terminals can never play a role in any derivation from the start symbol S, with one important exception: S itself. In particular, we now have a rule S→ε if and only if ε is a member of the language defined by the grammar G. All other non-terminals with ε-rules can be removed safely. Cleaning up the grammar is left to later transformations.

S -> L a M
L -> L M
L -> ε
M -> M M
M -> ε

Figure 4.8  An example grammar to test ε-rule elimination schemes

The grammar of Figure 4.8 is a nasty grammar to test your ε-rule elimination scheme. Our scheme transforms this grammar into the grammar of Figure 4.9. This grammar still has ε-rules, but these will be eliminated by the removal of non-productive and/or non-reachable non-terminals. Cleaning up this mess will leave only one rule: S→a.

S  -> L' a M' | a M' | L' a | a
L  -> L' M' | L' | M' | ε
M  -> M' M' | M' | ε
L' -> L' M' | L' | M'
M' -> M' M' | M'

Figure 4.9  Result after our ε-rule elimination scheme

Removing the ε-rules in our number grammar results in the grammar of Figure 4.10. Note that the two rules to produce ε, Empty and Scale, are still present but are not used any more.

NumberS  -> Integer | Real
Integer  -> Digit | Integer Digit
Real     -> Integer Fraction Scale' | Integer Fraction
Fraction -> . Integer
Scale'   -> e Sign Integer
Scale    -> e Sign Integer | ε
Empty    -> ε
Digit    -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Sign     -> + | -

Figure 4.10  Our number grammar after elimination of ε-rules

4.2.3.2 Eliminating unit rules

The next trouble-makers to be eliminated are the unit rules, that is, rules of the form A→B. It is important to realize that, if such a rule A→B is used in a derivation, it must be followed at some point by the use of a rule B→α. Therefore, if we have a rule A→B, and the rules for B are

B -> α1 | α2 | ... | αn,

we can replace the rule A→B with

A -> α1 | α2 | ... | αn.

In doing this, we can of course introduce new unit rules. In particular, when repeating this process, we could at some point again get the rule A→B. In this case, we have an infinitely ambiguous grammar, because B derives B. Now this may seem to pose a problem, but we can just leave such a unit rule out; the effect is that we short-cut derivations like A → B → ... → B → ... Also rules of the form A→A are left out. In fact, a pleasant side-effect of removing ε-rules and unit rules is that the resulting grammar is not infinitely ambiguous any more. Removing the unit rules in our ε-free number grammar results in the grammar of Figure 4.11.

4.2.3.3 Removing non-productive non-terminals

Non-productive non-terminals are non-terminals that have no terminal derivation: every sentential form that can be derived from such a non-terminal will still contain non-terminals. These are

NumberS  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
NumberS  -> Integer Digit
NumberS  -> N1 Scale' | Integer Fraction
N1       -> Integer Fraction
Integer  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Integer  -> Integer Digit
Fraction -> T1 Integer
T1       -> .
Scale'   -> N2 Integer
N2       -> T2 Sign
T2       -> e
Digit    -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Sign     -> + | -

Figure 4.13  Our number grammar in CNF

4.2.4 The example revisited

Now, let us see how the CYK algorithm works with our example grammar, which we have just transformed into CNF. Again, our input sentence is 32.5e+1. The recognition table is given in Figure 4.14. The bottom row is read directly from the grammar; for instance, the only non-terminals having a production rule with right-hand side 3 are Number, Integer, and Digit. Notice that for each symbol a in the sentence there must be at least one non-terminal A with a production rule A→a, or else the sentence cannot be derived from the grammar.

l=7:  Number
l=6:  ∅ | Number
l=5:  ∅ | ∅ | ∅
l=4:  Number, N1 | ∅ | ∅ | ∅
l=3:  ∅ | Number, N1 | ∅ | ∅ | Scale'
l=2:  Number, Integer | ∅ | Fraction | ∅ | N2 | ∅
l=1:  Number, Integer, Digit | Number, Integer, Digit | T1 | Number, Integer, Digit | T2 | Sign | Number, Integer, Digit
i:    1 (3) | 2 (2) | 3 (.) | 4 (5) | 5 (e) | 6 (+) | 7 (1)

Figure 4.14  The recognition table for the input sentence 32.5e+1
(the entries of row l are R_s_1,l, R_s_2,l, ..., from left to right)

The other rows are computed as described before. Actually, there are two ways to compute a certain R_s_i,l. The first method is to check each right-hand side in the grammar; for instance, to check whether the right-hand side N1 Scale' derives the substring 2.5e (= s_2,4). The recognition table derived so far tells us that
- N1 is not a member of R_s_2,1 or R_s_2,2,
- N1 is a member of R_s_2,3, but Scale' is not a member of R_s_5,1,
so the answer is no. Using this method, we have to check each right-hand side in this way, adding the left-hand side to R_s_2,4 if we find that the right-hand side derives s_2,4.

The second method is to compute possible right-hand sides from the recognition table computed so far; for instance, R_s_2,4 is the set of non-terminals that have a right-hand side AB where either
- A is a member of R_s_2,1 and B is a member of R_s_3,3, or
- A is a member of R_s_2,2 and B is a member of R_s_4,2, or
- A is a member of R_s_2,3 and B is a member of R_s_5,1.
This gives as possible combinations for AB: N1 T2 and Number T2. Now we check all rules in the grammar to see if they have a right-hand side that is a member of this set. If so, the left-hand side is added to R_s_2,4.

4.2.5 CYK parsing with Chomsky Normal Form

We now have an algorithm that determines whether a sentence belongs to a language or not, and it is much faster than exhaustive search. Most of us, however, not only want to know whether a sentence belongs to a language, but also, if so, how it can be derived from the grammar. If it can be derived in more than one way, we probably want to know all possible derivations. As the recognition table contains the information on all derivations of substrings of the input sentence that we could possibly make, it also contains the information we want. Unfortunately, this table contains too much information, so much that it hides what we want to know. The table may contain information about non-terminals deriving substrings, where these derivations cannot be used in the derivation of the input sentence from the start symbol S. For instance, in the example above, R_s_2,3 contains N1, but the fact that N1 derives 2.5 cannot be used in the derivation of 32.5e+1 from Number.

The key to the solution of this problem lies in the simple observation that the derivation must start with the start symbol S. The first step of the derivation of the input sentence z, with length n, can be read from the grammar, together with the recognition table.
If n = 1, there must be a rule S→z; if n ≥ 2, we have to examine all rules S→AB, where A derives the first k symbols of z, and B the rest, that is, A is a member of R_s_1,k and B is a member of R_s_k+1,n-k, for some k. There must be at least one such rule, or else S would not derive z.

Now, for each of these combinations AB we have the same problem: how does A derive s_1,k and B derive s_k+1,n-k? These problems are solved in exactly the same way. It does not matter which non-terminal is examined first. Consistently taking the left-most one results in a left-most derivation, consistently taking the right-most one results in a right-most derivation. Notice that we can use an Unger-style parser for this. However, it would not have to generate all partitions any more, because we already know which partitions will work.

Let us try to find a left-most derivation for the example sentence and grammar, using the recognition table of Figure 4.14. We begin with the start symbol, Number. Our sentence contains seven symbols, which is certainly more than one, so we have to use one of the rules with a right-hand side of the form AB. The Integer Digit rule is not applicable here, because the only instance of Digit that could lead to a derivation of the sentence is the one in R_s_7,1, but Integer is not a member of R_s_1,6. The Integer Fraction rule is not applicable either, because there is no Fraction deriving the last part of the sentence. This leaves us with the production rule Number -> N1 Scale', which is indeed applicable, because N1 is a member of R_s_1,4, and Scale' is a member of R_s_5,3, so N1 derives 32.5 and Scale' derives e+1.

Next, we have to find out how N1 derives 32.5. There is only one applicable rule: N1 -> Integer Fraction, and it is indeed applicable, because Integer is a member of R_s_1,2, and Fraction is a member of R_s_3,2, so Integer derives 32, and Fraction derives .5. In the end, we find the following derivation:

Number -> N1 Scale'
       -> Integer Fraction Scale'
       -> Integer Digit Fraction Scale'
       -> 3 Digit Fraction Scale'
       -> 3 2 Fraction Scale'
       -> 3 2 T1 Integer Scale'
       -> 3 2 . Integer Scale'
       -> 3 2 . 5 Scale'
       -> 3 2 . 5 N2 Integer
       -> 3 2 . 5 T2 Sign Integer
       -> 3 2 . 5 e Sign Integer
       -> 3 2 . 5 e + Integer
       -> 3 2 . 5 e + 1

Unfortunately, this is not exactly what we want, because this is a derivation that uses the rules of the grammar of Figure 4.13, not the rules of the grammar of Figure 4.4, the one that we started with.

4.2.6 Undoing the effect of the CNF transformation

When we examine the grammar of Figure 4.4 and the recognition table of Figure 4.14, we see that the recognition table contains the information we need on most of the non-terminals of the original grammar. However, there are a few non-terminals missing in the recognition table: Scale, Real, and Empty. Scale and Empty were removed because they became non-reachable after the elimination of ε-rules. Empty was removed altogether, because it only derived the empty string, and Scale was replaced by Scale', where Scale' derives exactly the same as Scale, except for the empty string.
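To tie the recognition phase and the derivation reconstruction together, here is a compact sketch in Python; it is our own illustration, not code from the book. It encodes the CNF number grammar of Figure 4.13 (with the start symbol called Number and the primed names kept as ordinary strings), fills the recognition table shortest substrings first, and then reconstructs one parse tree top-down, always expanding the left-most candidate first, as described in Section 4.2.5.

TERMINAL_RULES = {           # rules of the form A -> a
    "Number":  list("0123456789"),
    "Integer": list("0123456789"),
    "Digit":   list("0123456789"),
    "T1":      ["."],
    "T2":      ["e"],
    "Sign":    ["+", "-"],
}
BINARY_RULES = {             # rules of the form A -> B C
    "Number":   [("Integer", "Digit"), ("N1", "Scale'"), ("Integer", "Fraction")],
    "N1":       [("Integer", "Fraction")],
    "Integer":  [("Integer", "Digit")],
    "Fraction": [("T1", "Integer")],
    "Scale'":   [("N2", "Integer")],
    "N2":       [("T2", "Sign")],
}

def recognition_table(z):
    # R[(i, l)] is R_s_i,l: the set of non-terminals deriving the substring of z
    # that starts at position i (1-based) and has length l.
    n = len(z)
    R = {}
    for i in range(1, n + 1):            # substrings of length 1
        R[(i, 1)] = {a for a, ts in TERMINAL_RULES.items() if z[i - 1] in ts}
    for l in range(2, n + 1):            # longer substrings, shortest first
        for i in range(1, n - l + 2):
            R[(i, l)] = {a for a, rhss in BINARY_RULES.items()
                           for (b, c) in rhss
                           for k in range(1, l)
                           if b in R[(i, k)] and c in R[(i + k, l - k)]}
    return R

def parse(a, i, l, z, R):
    # Return one parse tree (as nested tuples) for non-terminal a deriving s_i,l;
    # assumes a is a member of R[(i, l)].
    if l == 1 and a in TERMINAL_RULES and z[i - 1] in TERMINAL_RULES[a]:
        return (a, z[i - 1])
    for (b, c) in BINARY_RULES.get(a, ()):
        for k in range(1, l):            # try the left-most split first
            if b in R[(i, k)] and c in R[(i + k, l - k)]:
                return (a, parse(b, i, k, z, R), parse(c, i + k, l - k, z, R))
    raise ValueError("no parse")

z = "32.5e+1"
R = recognition_table(z)
print(R[(1, len(z))])                    # {'Number'}: the sentence is recognized
print(parse("Number", 1, len(z), z, R))  # the parse tree found in the text above

Note that this sketch yields a tree over the CNF grammar; undoing the effect of the transformation, as discussed in this section, is a separate step.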