Parsing: Understanding Context-Free Grammars and LL(1) Parsing, Study notes of Computer Science

An introduction to parsing, the process of determining whether a sequence of tokens forms a syntactically correct program according to a context-free grammar (CFG). Covers the basics of CFGs, recursive descent parsing, LL(1) parsing, and calculating First and Follow sets. Also discusses removing ambiguity, removing left recursion, and left factoring grammars.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-yfg
Partial preview of the text

CS414-2008-03 Parsing

03-0: Parsing
• Once we have broken an input file into a sequence of tokens, the next step is to determine whether that sequence of tokens forms a syntactically correct program. This step is called parsing.
• Parsing a sequence of tokens means determining whether the string of tokens could be generated by a Context-Free Grammar (CFG).

03-1: CFG Example
    S → print(E);
    S → while (E) S
    S → { L }
    E → identifier
    E → integer literal
    L → S L
    L → ε
Examples / Parse Trees

03-2: Recursive Descent Parser
• Write Java code that repeatedly calls getNextToken() and determines whether the stream of returned tokens can be generated by the CFG
• If so, end normally
• If not, call an error function

03-3: Recursive Descent Parser
A Recursive Descent Parser is implemented as a suite of recursive functions, one for each non-terminal in the grammar:
• ParseS will terminate normally if the next tokens in the input stream can be derived from the non-terminal S
• ParseL will terminate normally if the next tokens in the input stream can be derived from the non-terminal L
• ParseE will terminate normally if the next tokens in the input stream can be derived from the non-terminal E

03-4: Recursive Descent Parser
    S → print(E);
    S → while (E) S
    S → { L }
    E → identifier
    E → integer literal
    L → S L
    L → ε
Code for Parser.java.html on web browser

03-5: LL(1) Parsers
These recursive descent parsers are also known as LL(1) parsers, for Left-to-right, Leftmost derivation, with 1 symbol of lookahead
• The input file is read from left to right (starting with the first symbol in the input stream, and proceeding to the last symbol).
• The parser ensures that a string can be derived by the grammar by building a leftmost derivation.
• Which rule to apply at each step is decided after looking at just 1 symbol.
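The grammar of 03-1 and the parser structure described in 03-2 through 03-5 can be sketched roughly as follows. This is a minimal illustration, not the course's Parser.java: the Token enum, the list-backed getNextToken(), and the accepts() helper are assumptions made here so the example is self-contained, and error() simply throws an exception.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RecursiveDescentSketch {
    // Hypothetical token kinds; a real lexer's token names may differ.
    enum Token { PRINT, WHILE, LPAREN, RPAREN, SEMICOLON, LBRACE, RBRACE,
                 IDENTIFIER, INTEGER_LITERAL, EOF }

    private final Iterator<Token> input;
    private Token currentToken;

    RecursiveDescentSketch(List<Token> tokens) {
        input = tokens.iterator();
        currentToken = getNextToken();
    }

    // Stand-in for the lexer: yields EOF once the token list is exhausted.
    private Token getNextToken() {
        return input.hasNext() ? input.next() : Token.EOF;
    }

    // The "error function" of 03-2; here it aborts parsing with an exception.
    private void error(String message) {
        throw new IllegalStateException("Parse Error: " + message);
    }

    // Consume the current token if it is the expected one, otherwise fail.
    private void checkToken(Token expected) {
        if (currentToken != expected) {
            error("expected " + expected + ", found " + currentToken);
        }
        currentToken = getNextToken();
    }

    void parseS() {
        switch (currentToken) {
            case PRINT:   // S -> print(E);
                checkToken(Token.PRINT);
                checkToken(Token.LPAREN);
                parseE();
                checkToken(Token.RPAREN);
                checkToken(Token.SEMICOLON);
                break;
            case WHILE:   // S -> while (E) S
                checkToken(Token.WHILE);
                checkToken(Token.LPAREN);
                parseE();
                checkToken(Token.RPAREN);
                parseS();
                break;
            case LBRACE:  // S -> { L }
                checkToken(Token.LBRACE);
                parseL();
                checkToken(Token.RBRACE);
                break;
            default:
                error("unexpected " + currentToken + " at start of statement");
        }
    }

    void parseE() {
        switch (currentToken) {
            case IDENTIFIER:       // E -> identifier
                checkToken(Token.IDENTIFIER);
                break;
            case INTEGER_LITERAL:  // E -> integer literal
                checkToken(Token.INTEGER_LITERAL);
                break;
            default:
                error("unexpected " + currentToken + " in expression");
        }
    }

    void parseL() {
        switch (currentToken) {
            case PRINT: case WHILE: case LBRACE:  // L -> S L
                parseS();
                parseL();
                break;
            case RBRACE:  // L -> epsilon (a closing brace may follow L)
                break;
            default:
                error("unexpected " + currentToken + " in statement list");
        }
    }

    // True if the whole token list is a single S followed by end of input.
    static boolean accepts(List<Token> tokens) {
        try {
            RecursiveDescentSketch parser = new RecursiveDescentSketch(tokens);
            parser.parseS();
            parser.checkToken(Token.EOF);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Tokens for: { print(x); }
        List<Token> program = Arrays.asList(Token.LBRACE, Token.PRINT, Token.LPAREN,
                Token.IDENTIFIER, Token.RPAREN, Token.SEMICOLON, Token.RBRACE);
        System.out.println(accepts(program));
    }
}
```

Each switch inspects only currentToken, which is the single symbol of lookahead that makes the parser LL(1).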
03-6: Building LL(1) Parsers
    S′ → S$
    S → AB
    S → Ch
    A → ef
    A → ε
    B → hg
    C → DD
    C → fi
    D → g
ParseS:
• use the rule S → AB on e, h
• use the rule S → Ch on f, g

03-7: First sets
First(S) is the set of all terminals that can start strings derived from S (plus ε, if S can produce ε)
For the grammar of 03-6:
    First(S′) =
    First(S) =
    First(A) =
    First(B) =
    First(C) =
    First(D) =

03-8: First sets
First(S) is the set of all terminals that can start strings derived from S (plus ε, if S can produce ε)
For the grammar of 03-6:
    First(S′) = {e, f, g, h}
    First(S) = {e, f, g, h}
    First(A) = {e, ε}
    First(B) = {h}
    First(C) = {f, g}
    First(D) = {g}

03-9: First sets
We can expand the definition of First sets to include strings of terminals and non-terminals
For the grammar of 03-6:
    First(aB) =
    First(BC) =
    First(AbC) =
    First(AC) =
    First(abS) =
    First(DDA) =

03-10: First sets
We can expand the definition of First sets to include strings of terminals and non-terminals

[...]

    void ParseA() {
        switch (currentToken) {
            case e:
                checkToken(e);
                checkToken(f);
                break;
            case h:
                /* epsilon case */
                break;
            default:
                error("Parse Error");
        }
    }

    void ParseC() {
        switch (currentToken) {
            case f:
                checkToken(f);
                checkToken(i);
                break;
            case g:
                ParseD();
                ParseD();
                break;
            default:
                error("Parse Error");
        }
    }

03-21: LL(1) Parser Example
    Z′ → Z$
    Z → XYZ | d
    X → a | Y
    Y → ε | c
(Initial Symbol = Z′)

03-22: LL(1) Parser Example
    Rule            First
    Z′ → Z$         First(Z′) = {a, c, d}
    Z → XYZ | d     First(Z) = {a, c, d}
    X → a | Y       First(X) = {a, c, ε}
    Y → ε | c       First(Y) = {c, ε}
(Initial Symbol = Z′)

03-23: LL(1) Parser Example
    Rule            First                     Follow
    Z′ → Z$         First(Z′) = {a, c, d}     Follow(Z′) = { }
    Z → XYZ | d     First(Z) = {a, c, d}      Follow(Z) = {$}
    X → a | Y       First(X) = {a, c, ε}      Follow(X) = {a, c, d}
    Y → ε | c       First(Y) = {c, ε}         Follow(Y) = {a, c, d}
(Initial Symbol = Z′)

03-24: LL(1) Parser Example
First and Follow sets as in 03-23. LL(1) parse table:
          a                 c                 d
    Z′    Z′ → Z$           Z′ → Z$           Z′ → Z$
    Z     Z → XYZ           Z → XYZ           Z → XYZ, Z → d
    X     X → a, X → Y      X → Y             X → Y
    Y     Y → ε             Y → c, Y → ε      Y → ε

03-25: non-LL(1) Grammars
• Not all grammars can be parsed by an LL(1) parser
• A grammar is LL(1) if the LL(1) parse table contains no duplicate entries
• The previous CFG is not LL(1)

03-26: non-LL(1) Grammars
• Not all grammars can be parsed by an LL(1) parser
• A grammar is LL(1) if the LL(1) parse table contains no duplicate entries
• The previous CFG is not LL(1)
• No ambiguous grammar is LL(1)

03-27: LL(1) Parser Example
    S′ → S$
    S → if E then S else S
    S → begin L end
    S → print(E)
    L → ε
    L → SL′
    L′ → ; SL′
    L′ → ε
    E → num = num

03-28: LL(1) Parser Example
    Non-terminal   First                     Follow
    S′             {if, begin, print}        { }
    S              {if, begin, print}        {$, else, end, ;}
    L              {ε, if, begin, print}     {end}
    L′             {ε, ;}                    {end}
    E              {num}                     {then, )}

03-29: LL(1) Parser Example
LL(1) parse table over the tokens if, then, else, begin, end, print, (, ), ;, num, = (entries not listed are empty):
    S′    on if, begin, print:    S′ → S$
    S     on if:                  S → if E then S else S
          on begin:               S → begin L end
          on print:               S → print(E)
    L     on if, begin, print:    L → SL′
          on end:                 L → ε
    L′    on ;:                   L′ → ; SL′
          on end:                 L′ → ε
    E     on num:                 E → num = num

03-30: LL(1) Parser Example
    S′ → S$
    S → ABC
    A → a
    A → ε
    B → b
    B → ε
    C → c
    C → ε

03-31: LL(1) Parser Example
    Non-terminal   First            Follow
    S′             {a, b, c, $}     { }
    S              {a, b, c, ε}     {$}
    A              {a, ε}           {b, c, $}
    B              {b, ε}           {c, $}
    C              {c, ε}           {$}

          a           b           c           $
    S′    S′ → S$     S′ → S$     S′ → S$     S′ → S$
    S     S → ABC     S → ABC     S → ABC     S → ABC
    A     A → a       A → ε       A → ε       A → ε
    B                 B → b       B → ε       B → ε
    C                             C → c       C → ε

03-32: Creating an LL(1) CFG
• Not all grammars are LL(1)
• We can often modify a CFG that is not LL(1)
• New CFG generates the same language as the old CFG
• New CFG is LL(1)

03-33: Creating an LL(1) CFG
• Remove Ambiguity
• No ambiguous grammar is LL(1)
• A grammar is ambiguous if some string has two different parse trees (two different leftmost derivations)
• If there are two ways to generate a string α, modify the CFG so that one of the ways to generate the string is removed

03-34: Removing Ambiguity
• Often a grammar is ambiguous when there is a special-case rule whose strings can also be generated by a general rule
• Solution: Remove the special case, let the general case handle it
    S → V = E;
    S → V = identifier;
    E → V
    E → integer literal
    V → identifier
Structured variable definitions commonly have this problem

03-35: Removing Ambiguity
• This grammar, for describing variable accesses, is also ambiguous
• Ambiguous in the same way as the expression CFG
• Can be made unambiguous in a similar fashion
    V → V . V
    V → identifier

03-36: Removing Ambiguity
• Some languages are inherently ambiguous
• A language L is inherently ambiguous if every CFG G that generates L is ambiguous
• No programming languages are inherently ambiguous

[...]

• What about:
    S → Sα1
    S → Sα2
    S → β1
    S → β2
  which becomes:
    S → BA
    B → β1
    B → β2
    A → α1A
    A → α2A
    A → ε
We can use the same method for arbitrarily complex grammars

03-49: Left Factoring
• Consider Fortran DO statements:
    Fortran:
        do var = initial, final
            loop body
        end do
    Java Equivalent:
        for (var = initial; var <= final; var++) {
            loop body
        }

03-50: Left Factoring
• Consider Fortran DO statements:
    Fortran:
        do var = initial, final, inc
            loop body
        end do
    Java Equivalent:
        for (var = initial; var <= final; var += inc) {
            loop body
        }

03-51: Left Factoring
• CFG for Fortran DO statements:
    S → do L S
    L → id = exp, exp
    L → id = exp, exp, exp
• Is this Grammar LL(1)?

03-52: Left Factoring
• CFG for Fortran DO statements:
    S → do L S
    L → id = exp, exp
    L → id = exp, exp, exp
• Is this Grammar LL(1)? No!
• The problem is in the rules for L
• Two rules for L that start out exactly the same
• No way to know which rule to apply when looking at just 1 symbol

03-53: Left Factoring
• Factor out the similar sections from the rules:
    S → do L S
    L → id = exp, exp
    L → id = exp, exp, exp

03-54: Left Factoring
• Factor out the similar sections from the rules:
    S → do L S
    L → id = exp, exp L′
    L′ → , exp
    L′ → ε
• We can also use EBNF:
    S → do L S
    L → id = exp, exp (, exp)?

03-55: Left Factoring
• In general, if we have rules of the form:
    S → α β1
    S → α β2
    . . .
    S → α βn
• We can left factor these rules to get:
    S → α B
    B → β1
    B → β2
    . . .
    B → βn

03-56: Building an LL(1) Parser
• Create a CFG for the language
• Remove ambiguity from the CFG, remove left recursion, and left-factor it
• Find First/Follow sets for all non-terminals
• Build the LL(1) parse table
• Use the parse table to create a suite of mutually recursive functions

03-57: Building an LL(1) Parser
• Create an EBNF for the language
• Remove ambiguity from the EBNF, remove left recursion, and left-factor it
• Find First/Follow sets for all non-terminals
• Build the LL(1) parse table
• Use the parse table to create a suite of mutually recursive functions
• Use a parser generator tool that converts the EBNF into parsing functions

03-58: Structure of JavaCC file foo.jj
    options {
        /* Code to set various options flags */
    }

    PARSER_BEGIN(foo)
    public class foo {
        /* This segment is often empty */
    }
    PARSER_END(foo)

    TOKEN_MGR_DECLS :
    {
        /* Declarations used by lexical analyzer */
    }

    /* Token Rules & Actions */

    /* JavaCC Rules and Actions -- EBNF for language */

03-59: JavaCC Rules
• JavaCC rules correspond to EBNF rules
• JavaCC rules have the form:
    void nonTerminalName() :
    { /* Java Declarations */ }
    { /* Rule definition */ }
• For now, the Java Declarations section will be empty (we will use it later on, when building parse trees)
• Non-terminals in JavaCC rules are followed by ()
• Terminals in JavaCC rules are between < and >

03-60: JavaCC Rules
• For example, the CFG rules:
    S → while (E) S
    S → V = E;

[...]

    void S() :
    {}
    {
        LOOKAHEAD(2) "A" (("B" "C") | ("B" "D"))
    }
• This is not a valid use of lookahead: the grammar will not be parsed correctly. Why not?

03-70: JavaCC & Non-LL(k)
• JavaCC will produce a parser for grammars that are not LL(1) (and even for grammars that are not LL(k), for any k)
• The parser that is produced is not guaranteed to correctly parse the language described by the grammar
• A warning will be issued when JavaCC is run on a non-LL(1) grammar

03-71: JavaCC & Non-LL(k)
• What does JavaCC do for a non-LL(1) grammar?
• The rule that appears first will be used
    void S() :
    {}
    {
        "a" "b" "c"
      | "a" "b" "d"
    }

03-72: JavaCC & Non-LL(k)
• Infamous dangling else
    void statement() :
    {}
    {
        <IF> expression() <THEN> statement()
      | <IF> expression() <THEN> statement() <ELSE> statement()
      | /* Other statement definitions */
    }
• Why doesn't this grammar work?

03-73: JavaCC & Non-LL(k)
    void statement() :
    {}
    {
        <IF> expression() <THEN> statement() optionalelse()
      | /* Other statement definitions */
    }

    void optionalelse() :
    {}
    {
        <ELSE> statement()
      | /* nothing */ { }
    }
Example inputs:
    if <e> then <S>
    if <e> then <S> else <S>
    if <e> then if <e> then <S> else <S>

03-74: JavaCC & Non-LL(k)
    void statement() :
    {}
    {
        <IF> expression() <THEN> statement() optionalelse()
      | /* Other statement definitions */
    }

    void optionalelse() :
    {}
    {
        /* nothing */ { }
      | <ELSE> statement()
    }
• What about this grammar?

03-75: JavaCC & Non-LL(k)
    void statement() :
    {}
    {
        <IF> expression() <THEN> statement() optionalelse()
      | /* Other statement definitions */
    }

    void optionalelse() :
    {}
    {
        /* nothing */ { }
      | <ELSE> statement()
    }
• What about this grammar?
• Doesn't work! (why?)

03-76: JavaCC & Non-LL(k)
    void statement() :
    {}
    {
        <IF> expression() <THEN> statement() (<ELSE> statement())?
      | /* Other statement definitions */
    }
• This grammar will also work correctly
• Also produces a warning

03-77: JavaCC & Non-LL(k)
    void statement() :
    {}
    {
        <IF> expression() <THEN> statement() (LOOKAHEAD(1) <ELSE> statement())?
      | /* Other statement definitions */
    }
• This grammar also works correctly
• Produces no warnings
• (not because it is any safer: if you include a LOOKAHEAD directive, the system assumes you know what you are doing)

03-78: Parsing Project
• For your next project, you will write a parser for simpleJava using JavaCC.
• Provided Files:
    • ParseTest.java: A main program to test your parser. You must use program as your starting non-terminal for ParseTest to work correctly!
    • test*.sjava: Various simpleJava programs to test your parser. Some have parsing errors, some do not. Be sure to test your parser on other test cases, too! These files are not meant to be exhaustive!

03-79: Parsing Project "Gotchas"
• Expressions can be tricky. Read the text for more examples and suggestions
• Structured variable accesses are similar to expressions, and have some of the same issues
• Avoid specific cases that can be handled by a more general case

03-80: Parsing Project "Gotchas"
• Procedure calls and assignment statements can be tricky for LL(1) parsers. You may need to left-factor, and/or use LOOKAHEAD directives
• LOOKAHEAD directives are useful, but can be dangerous (for instance, you will not get warnings for the sections that use LOOKAHEAD). Try left-factoring, or other techniques, before resorting to LOOKAHEAD.
• This project is much more difficult than the lexical analyzer. Start Early!
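
The First sets defined in 03-7, and their extension to strings of symbols in 03-9, can be computed by repeating one pass over the rules until nothing changes. The sketch below is one possible way to do this and is not taken from the course materials: the class name, the map-based grammar representation, and the helper methods are all assumptions. Any symbol with no rules of its own is treated as a terminal, and a rule with an empty right-hand side stands for ε.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSetsSketch {
    static final String EPSILON = "\u03b5";  // marker for the empty string (epsilon)

    // Add the rule lhs -> rhs to the grammar; no rhs symbols means lhs -> epsilon.
    static void rule(Map<String, List<List<String>>> grammar, String lhs, String... rhs) {
        grammar.computeIfAbsent(lhs, k -> new ArrayList<>()).add(Arrays.asList(rhs));
    }

    // The grammar of 03-6.
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        rule(g, "S'", "S", "$");
        rule(g, "S", "A", "B");
        rule(g, "S", "C", "h");
        rule(g, "A", "e", "f");
        rule(g, "A");
        rule(g, "B", "h", "g");
        rule(g, "C", "D", "D");
        rule(g, "C", "f", "i");
        rule(g, "D", "g");
        return g;
    }

    // First of a string of symbols, given the First sets known so far.
    static Set<String> firstOfString(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String symbol : symbols) {
            Set<String> symbolFirst = first.get(symbol);
            if (symbolFirst == null) {       // terminal: it starts the string
                result.add(symbol);
                return result;
            }
            for (String t : symbolFirst) {   // non-terminal: copy all but epsilon
                if (!t.equals(EPSILON)) {
                    result.add(t);
                }
            }
            if (!symbolFirst.contains(EPSILON)) {
                return result;               // this symbol cannot vanish, so stop
            }
        }
        result.add(EPSILON);                 // every symbol can derive epsilon
        return result;
    }

    // Repeat until no First set grows (fixed-point iteration).
    static Map<String, Set<String>> computeFirst(Map<String, List<List<String>>> grammar) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String nonTerminal : grammar.keySet()) {
            first.put(nonTerminal, new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> entry : grammar.entrySet()) {
                Set<String> target = first.get(entry.getKey());
                for (List<String> rhs : entry.getValue()) {
                    changed |= target.addAll(firstOfString(rhs, first));
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> first = computeFirst(exampleGrammar());
        for (Map.Entry<String, Set<String>> entry : first.entrySet()) {
            System.out.println("First(" + entry.getKey() + ") = " + entry.getValue());
        }
        // First of a string of symbols, as in 03-9.
        System.out.println("First(AC) = " + firstOfString(Arrays.asList("A", "C"), first));
    }
}
```

For the grammar of 03-6 this reproduces the sets listed in 03-8, for example First(A) = {e, ε} and First(S) = {e, f, g, h}. Follow sets can be computed with the same repeat-until-stable pattern.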