Use two UNIX utilities, lex and yacc, in program development
Preface
This lab
manual shows you how to use two Unix
utilities, lex and yacc, in program development. Lex and
Yacc tools used to generate lexical analyzers and Parsers. These tools help
programmers build compilers and interpreters, but they also have a wider range
of applications. Any application that looks for patterns in its input or
command language is a good candidate for lex and Yacc.
The introduction part explains the lex and yacc tools, development sequence when using lex and yacc, different sections of the lex & yacc, running lex & yacc. Using lex section explains the regular expressions, examples of regular expressions. Using YACC section explains the Grammar, Yacc parser, compiling and running a simple parser. PART A section contains the list of programs for lab session.
This manual also provides stepwise procedure for each program. The student will expose to develop compilers. This manual gives information about
the cursers, which contains the functions that are used to develop text editor.
In programs with structured input, two
tasks that occur over and over are dividing the input into meaningful units,
and then discovering the relationship among the units. This division into units
(which are usually called tokens) is
known as lexical
analysis, or lexing for short. Lex helps you by taking a set of descriptions
of possible tokens and producing a C routine, which we call a lexical analyzer, or a lexer, or a scanner for
short, that can identify tokens. This set of descriptions we give to lex is
called a Lex specification.
The token description that lex uses are known a regular expression. Lex turns this regular expression into a form that the lexer can use to scan the input text extremely fast independent, of the number of expression that it is trying to match.
+----------------------+
Source -> I Lex I -> yylex
+----------------------+
+----------------------+
input -> yylex I -> output
+----------------------+
An overview of Lex
As the input is divided into token, a program often needs to establish the relationship among the tokens. A C complier needs to find the expression, statement, declaration, blocks, and procedures in the program. This task is known as parsing and the list of rule that defines the relationships that the program understands is a grammar. Yacc takes a concise description of a grammar and produces a C routine that can be parsing that grammar, a parser. The Yacc parser automatically detects whenever a sequence of input token match one of the rules in the grammar and also detects a syntax error whenever its input does not match any of the rules. When a task involves dividing the inputs into units and establishing some relationship among those units, we should think of Lex and Yacc.
Lexical rules grammer rules
| |
|Lex | |yacc|
Input -> |yylex| -> yyparse
-> parsed input
Fig. Lex with yacc
The first section, the definition
section, introduces any initial C program code we want copied into the
final program. This is especially important if, for
example, we have Header file that must be include for code in the file to work. We surround the C code with the special delimiters “%{“ and “%}“. Lex copies the material between “%{“and “%}“ directly to the generated C file, so we may write valid C code here. The %% marks the end of the section.
example, we have Header file that must be include for code in the file to work. We surround the C code with the special delimiters “%{“ and “%}“. Lex copies the material between “%{“and “%}“ directly to the generated C file, so we may write valid C code here. The %% marks the end of the section.
The next section is the rule section. Each rule is made up of two section: a pattern and an action, separated by a white sheet. The lexer that lex generates will execute the action when it recognizes the pattern. These patterns are UNIX STYLE REGULAR EXPRESSION.
For example the first rule is as follows:
[/t]+/ ignore white
space */;.
The square bracket,” [ }“, indicates that any one of the characters within the brackets matches the pattern. For our example, we expect either ‘\t” (a tab character) or” “(a space). The “+” means that the patterns matches one or more consecutive copies of the sub pattern that precedes that plus. Thus, this pattern describes white space (any combination of tab and space.) The second part of the rule, the action, is simply a semicolon, a do nothing C statement. Its effect is to ignore the input.
Yytext():
The array yytext contains the text that matched the pattern.
[a-zA-Z]+
{printf(”%s: is not a verb\n”,yytext);}
.|\n {ECHO; /*normal
default anyway */}
The pattern “[a-zA-z] +” is a common one. It indicates any alphabetical strings with atleast one character. The “-” character has a special meaning when used inside square bracket: it denotes a range of characters beginning with the character to the left of the “-” and ending with the character to its right. The action of one of these patterns is to print the matched token.
A special character “.” (period) matches any single character other than a newline, and “\n” matches a newline character. The end of the rules section is delimited by another %%.
The final section is the user subroutines section, which can consist of any gal C code. Lex copies it to the C file after the end of the Lex generated code. We have included a main() program.
%%
main()
{
yylex();
}
5
main()
{
yylex();
}
5
The lexer produced by lex is a C
routine called yylex(), so we call it. Unless the actions contain explicit
return statement, yylex() won’t return until it has processed the entire input.
To create an executable program on our
UNIX system we enter these commands:
% lex
filename.|
% cc lex.yy.c
—o first —II
Lex translates the lex specification
into a C source file called lex.yy.c which we compiled and linked with lex
library -II.
Grammars
To recognize specific sequences of tokens and perform appropriate actions. Traditionally, a description of such a set action is known as a grammar.
To recognize specific sequences of tokens and perform appropriate actions. Traditionally, a description of such a set action is known as a grammar.
Parser-Lexer Communication
When we use a lex scanner and a yacc
parser together, the parser is the higher level routine. It calls the lexer
yylex() whenever it needs a token from the input. The lexer then scans through
the input recognizing tokens. As soon as it finds a token of interest to the
parser, it returns to the parser returning the token’s code as the value of
yylex().
Yacc can optionally write a C header file containing all of the token definitions. You include this file, called y.tab.h on UNIX systems and ytab.h or yytab.h on MS-DOS, in the lexer and use the preprocessor symbols in your lexer action code.
A Yacc Parser
The structure of a yacc parser is, not
by accident, similar to that of lex lexer. Our first section, the definition section, has a literal
code block enclosed in “%{“ and “%}”. We use it here for a C comment (as with lex,
C comments belong inside C code blocks, at least within the definition section)
and a single include file.
Then come definitions of all the tokens we expect to receive from the lexical analyzer. In this example, they correspond to the eight parts of speech. The name of a token does not have any intrinsic meaning to yacc, although well-chosen token names tell reader what they represent. Although yacc lets you use any valid C identifier name for a yacc symbol, universal custom dictates that token names be all uppercase and other names in the parser mostly or entirely lowercase.
The first %% indicates the beginning
of the rules section. The second %% indicates the end of the rules and the
beginning of the user subroutines section. The most important subroutine is
main() which repeatedly calls yyparse() until the lexer’s input file runs out.
The routine yyparse() is the parser generated by yacc, so our main program
repeatedly tries to parse sentences until the input runs out. (The lexer
returns a zero token whenever it sees a period at the end of a line; that’s the
signal to the parser that the input for the current parse is complete.)
The Rules Section
The rules section describes the actual
grammar as a set of production rules or simply rules. Each rule consist of a
single name on the left-hand side of the “:” operator, a list of symbols and
action code on the right-hand side, and a semicolon indicating the end of the
rule. By default, the first rule is the highest-level rule. That is, the parser
attempts to find a list of tokens, which match this initial rule, or more
commonly, rule found from the initial rule. This expression on the right-hand
side of the rule is a list of zero or more names. A typical simple rule has a
single symbol on the right-hand side as in the object rule, which is defined to
be a NOUN. The symbol on the left-hand side of the rule can then be used like a
token in other rules. From this, we build complex grammars.
The action part of a rule consists of a C block, beginning with “{”and ending with “}”. The parser executes an action at the end of a rule as soon as the rule matches. In our sentence rule, the action reports that we’ve successfully parsed a sentence. Since sentence is the top-level symbol, the entire input must match a sentence. The parser returns to its caller, in this case the main program, when the lexer reports the end of the input. Subsequent calls to yyparse() reset the state and begin processing again. The parser caller yyerror(), which we provide in the user subroutines section, and then recognizes the special rule error. You can provide error recovery code, yyparse() returns to the caller after it finds an error.
The third and final section, the user subroutines section, begins after
the second %%. This section can contain any C code and is copied, verbatim,
into the resulting parser. In our example, we have provided the minimal set of
functions necessary to a yacc-generated parser using lex-generated lexer to
compile; main () and yyerror(). The main routine keeps calling the parser until
it reaches the end- of-file on yylex() which is provided by our exer.
Running Lex and Yacc
We conclude
by describing how we built these tools on our system. We called our various
lexers filename-n.I, where N
corresponded to a particular Lex specification example. Similarly, we called
our parsers’ filename-m.y where
again M is the number of example. Then, to build the output, we did the
following in UNIX;
% lex filename.l
% yacc —d
filename.y
% cc-c
lex.yy.c y.tab.c
% cc-o example-m.n
lex.yy.o y.tab.o-II
The first line runs lex over the lex specification and generates a file Iex.yy.c, which contains C code for the lexer. In the second line, we use yacc to generate both y.tab.c and y.tab.h (the later is the file of token definition created by the –d switch). The next line compiles each of the two C files. The final line links them together and uses the routines in the lex library libl.a. noramally in /usr/lib/lib/.a on most UNIX systems. If you are not using AT & T lex and yacc, but one of the other implementations, you may be able to simply substitute the commands names and little else will change.
USING
LEX
Lex tools for building lexical
analyzers or lexers. A leker takes an arbitrary input stream and tokenizes it,
i.e., divides it up into lexical tokens. This tokenized output can then be
processed further, usually by yacc, or it can be the “end product”. When we
write a lex-specification, we
create a set of patterns which lex matches against the input. Each time one of
the patterns matches, the lex program invokes C code that you provide which
does something with the matched text. In this way a lex program divides the
input into strings, which we call, tokens. Lex itself doesn’t produce an
executable program; instead it translates the lex specification into a file
containing a C routine called yylex().Your program calls yylex() to run the
lexer. Usually your regular C compiler, you compile the file that lex produced
along with any other files and libraries you want.
Regular
Expressions
Regular expressions are widely used
within the UNIX environment, and le uses a rich regular expression language.
A regular expression is a pattern
description using a “meta” language, a language that you use to describe particular
patterns of interest. The characters used in this metalanguage are part of the
standard ASCII character set used in UNIX and MS-DOS, which can sometimes lead
to confusion. The characters that form regular expressions are:
Matches any single character expect
the newline character (“\n”).
* Matches zero or more copies of the
preceding expression.
[] A character
class which matches any character within the brackets. If the first character
is a circumflex (AI) it changes the meaning to match any character except the
ones within the brackets. A dash inside the square brackets indicates a
character range, e.g., “[0-9]” means the same thing as
“[0123456789]”.
“[0123456789]”.
A Matches the beginning of a line as
the first character of a regular
expression.
expression.
Also used for negation within square
brackets.
$ Matches the
end of a line as the last character of a regular expression.
Indicates how many times the pervious
pattern is allowed to many when containing one or two numbers. For example:
A{1,3}
Matches one to three occurrences of
the letter A. If they contain name, they refer to a substitution by that name.
\ Used to escape metacharacters, and as pa
of the usual C escape sequences,
e.g., “\n” is a newline character,
while t\* is a literal aster
+ Matches one or more occurrence of the
preceding regular expression.
For example;
[0-9]+
matches “1”,”111”, or “12345” but not an empty string. (If the plus sign
matches “1”,”111”, or “12345” but not an empty string. (If the plus sign
were an asterisk, it would also match
the empty string.)
? Matches zero or one occurrence of the
preceding regular expression.
For example
-?[0-9]+
matches a signed number including an optional leading minus.
matches a signed number including an optional leading minus.
I Matches either the preceding regular
expression or the following
regular expression. For example;
cow |pig| sheep
matches any of the three words.
”…” Interprets everything within the quotation marks literally-meta-
characters other than C escape
sequences lose their meaning.
/ Matches the
preceding regular expression but only if followed by the
following regular expression. For
example;
0/1
matches “0” in the string “01” but would not match anything in the
matches “0” in the string “01” but would not match anything in the
strings “0” or “02”. The material
matched by the pattern following the
slash is not “consumed” and remains to
be turned into subsequent
tokes. Only one slash is permitted per
pattern.
() Groups a series of regular expressions
together into a new regular
expression. For example;
(01)
represents the character sequence 01. Parentheses are useful when
represents the character sequence 01. Parentheses are useful when
building up complex patterns with *,+ and
|.
Note that some of these operators operate on single characters (e.g., []) while others operate on regular expressions. Usually, complex regular expressions are built up from simple regular expressions.
Examples of Regular Expressions
We are ready for some examples. First,
we’ve already shown you a regular expression for a “digit”:
[0-9]
We can use this to build a regular expression for an integer:
We can use this to build a regular expression for an integer:
[0-9]+
We require at least one digit. This would have allowed no digits at all:
We require at least one digit. This would have allowed no digits at all:
[09]*
Let’s add an optional unary minus:
Let’s add an optional unary minus:
-?[O-9]+
We can then expand this to allow decimal numbers. First we will specify a decimal number (for the moment we insist that the last character always a digit);
We can then expand this to allow decimal numbers. First we will specify a decimal number (for the moment we insist that the last character always a digit);
[09]*\.[09]÷
Notice
the “\” before the period to make it a literal period rather than a wild card
character. This pattern matches “0.0”, “4.5” or “.31415”. But it won’t match
“0” or “2”. We’d like to combine our definitions to match them as well. Leaving
out our unary minus, we could use:
([0-9]+)
| ([09]*\.[09]+)
We
use the grouping symbols “Q” to specify what the regular expression are for the
“I” operation. Now let’s add the unary minus:
-? (([O-9]+) | ([0-9]*\.[0-9]+))
We can expand this further by allowing a float-style exponent to be specified as well. First, let’s write a regular expression for an exponent:
[eE]
[-+]?[0-9]+
This matches an upper or lowercase latter E, then an optional plus or minus sign, then a string of digits. For instance, this will match “e12” or “E-3”. We can then use this expression to build our final expression, one that specifies a real number:
-?(([0-9]+)
| ([0-9]*\. [O-9]+|) ([eE][-+]?[0-9]+)?)
Our expression makes the exponent part optional. Let’s write a real lexer that uses this expression. Nothing fancy, but it examines the input and tells us each time it matches a number according to our regular expression.
Example: Lex specification for decimal numbers
%%
[\n\t];
[\n\t];
—?(([0-9]+) |
([0-9]*\.[0-9]+) ([eE] [-+]?[0-9]+)?)Printf(“number\n”);
ECHO;
%%
%%
Main()
{
yylex();
yylex();
}
Finally, here
is a regular expression for matching quoted strings
\”[^”\n]*[“\n]
It might seem
adequate to use a simpler expression.
\“.*\“
Unfortunately,
this causes lex to match incorrectly if there are two strings on the same input
line. For instance:
“how” to “do”
Would match
as a single pattern since “‘k” matches as much as possible knowing this, we
then might try:
\”[^”]*\”
When yylex() reaches the end of its
input file, it calls yywrap(), which return a value of 0 or 1. if the value is I the program is done and there is no more input, if the
value is 0, on the other hand, the lexer assumes that yywrap() has opened
another file for it to read from yyin. The default yywrap() always returns 1 by providing our own version of yywrap(), we
can have our program read all of the files named on the command line, one at a
time.
USING
YACC
Grammars
Yacc takes a grammar that you specify and writes a parser that recognizes valid “sentences” in that grammar. We use the term “sentence” here in a fairly general way for a C language grammar the sentences are syntactically valid C programs.
Yacc takes a grammar that you specify and writes a parser that recognizes valid “sentences” in that grammar. We use the term “sentence” here in a fairly general way for a C language grammar the sentences are syntactically valid C programs.
A Yacc Parser
A yacc grammar has the same three-part
structure as a lex specification. (Lex copied its structure from yacc). This
first section, the definition section, handles control information for the
yacc-generated parser (from here on we will call it the parser), and generally
sets up the execution environment in which the parser will operate. The second
section contains the rules for the parser, and the third sections is C code
copied verbatim into the generated C program
The definition section
The definition section included
declarations of the tokens used in the grammar, the types of values used on the
parser stack, and other odds and ends. It can also include a literal block, C
code enclosed in %{%} lines. We
start our first parser by declaring two symbolic tokens.
For ex:
%token NAME NUMBER
You can use single quoted characters
as tokens without declaring them, so we don’t need to declare “=”, “+”, or “-”.
The Rules section
The rule section simply consists of a
list of grammar rules in much the same format as we used above. Since ASCII
keyboards don’t have a à key, we use
a colon between the left- and right-hand sides of a rule, and we put a
semicolon at the end of each rule.
For ex:
%token NAME
NUMBER
%%
statement: NAME ‘=’ expression
statement: NAME ‘=’ expression
|expression
expression:
NUMBER ‘÷’ NUMBER
| NUMBER ‘-’ NUMBER
Unlike lex, yacc pays no attention to
line boundaries in the rules section, and you will find that a lot of white
space makes grammars easier to read. We’ve added one new rule to the parser; a
statement can be a plain expression as well as an assignment. If the user
enters a plain expression, we’ll print out its result.
The symbol on the left-hand side of the first rule in the grammar is normally the start, though you can a %start declaration in the definition section to override that.
Symbol Values and Actions
Every symbol in a yacc parser has a
value. The value gives a additional information about a particular instance of
a symbol. If a symbol represents a number, the value would be the particular
number. If it represents a literal text string, the value would probably be a
pointer to a copy of the string. If it represents a variable in a program, the
value would be a pointer to a symbol table entry describing the variable. Some
tokens don’t have a useful value, e.g., a token representing a close
parenthesis, since one close parenthesis is the same as another.
Non-terminal symbols can have any values you want, created by code in the parser. Often the action code builds a parse tree corresponding to the input, so that later code can process a whole statement or even a whole program at a time.
In the current parser, the value of a
NUMBER or an expression is the numerical value of the number or expression, and
the value of a NAME will be a symbol table pointer.
YYSTYPE
In real parsers, the values of different symbols use different data types, e.g., mt and double for numeric symbols, char * for strings and pointers to structures for higher level symbols. If you have multiple value types, you have to list all the value types used in a parser so that yacc can create a C union typedef called YYSTYPE to contain them. (Fortunately, yacc gives you a lot of help ensuring that you use the right value for each symbol.)
Whenever the parser reduces a rule, it executes user C code associated with the rule, known as the rule’s action. The action appears in braces after the end of the rule, before the semicolon or vertical bar. The action code can refer to the values of the right-hand side symbols as $1, $2, and can set the value of the left-hand side by setting $$. In our parser, the value of an expression symbol is the value of the expression it represents. We add some code to evaluate and print expressions, bringing our grammar up to that used in figure:
%token NAME
NUMBER
%%
statement:
NAME ‘=’ expression
|expression {printf (“=%d\n”,$1)}
;
expression:
expression ‘+’ NUMBER {$$ = $1 + $3;}
|expression
‘-’ NUMBER {$$ = $1 - $3;}
|NUMBER
The rules that build an expression
compute the appropriate values, and the rule that recognizes an expression as a
statement prints out the result. In the expression building rules, the first
and second numbers values are $1 and $3, respectively. The operator’s value
would be $2, although in this grammar the operators do not have interesting
values. The action on the last rule is not strictly necessary, since the default
action that yacc performs after every reduction, before running any explicit
action code assigns the value $1 to $$.
The Lexer
To try out our parser, we need a lexer
to1rokens. As we mentioned earlier, the parser is the higher level routine, and
calls the lexer yylex() when ever it need a token from the input. As soon as
the lexer finds a token of interest to the parser, it returns to the parser,
returning the token code as the value. Yacc defines the token names in the
parser as C preprocessor names in y.tab.h (or some similar name on MS-DOS
system so the lexer can use them.)
Sample program on lexer to provide tokens for our parser:
%{
#include
“y.tab.h”
extern mt
yylval;
%}
%%
[O-9]+
{yylval = atoi(yytext); return NUMBER;}
[\t]; 1* ignore white space *1
\n return 0; 1* logical EOF *1
return
yytext[0];
%%
Strings of digits are numbers, whitespace is ignored, and a newline returns an end of input token (number zero) to tell the parser that there is no more to read.
The last rule in the lexer is a very
common catch-all, which says to return any character otherwise not handled as a
single character token to the parser. Character tokens are usually punctuation
such as parentheses, semicolons and single character operators. If the parser receives
a token that it doesn’t know about, it generates a syntax error, so this rule
lets you handle all of the single-character token easily while letting yacc’s
error checking catch and complain about invalid input.
Whenever the lexer returns a token to the parser, if the token has an associated value, the lexer must store the value in yylval before returning. In this example, we explicitly declare yylval. In more complex parsers, yacc defines yylval as a union and puts the definition in y.tab.h.
Compiling and Running a Simple Parser
On UNIX system, Yacc
takes your grammar and creates y.tab.c, the C language parser, and y.tab.h, the
include file with the token number definition. Lex creates lex.yy.c, and the C
language lexer. You need only to compile them together with the yacc and lex
libraries. The libraries contain usable default versions of all of the
supporting routines, including a main() that calls the parser yyparse() and
exits.
% yacc —d
filename.y #makes y.tab.c
and “y.tab.h
% lex filename.l
#mkes lex.yy.c
% cc-0
filename y.tab.c lex.yy.c —ly —II# compile and link C files
% filename
Arithmetic Expression and Ambiguity
Precedence controls which operators to
execute first in an expression. Mathematical and programming tradition (dating
back past the first Fortran compiler in 1950) says that multiplication and
division take care precedence over addition and subtraction, so a+b*c means
a+(b*c) and-d/e-f means (d/e)-f. In any expression grammar, operators are
grouped into levels of precedence from lowest to highest.
The total number of levels depends on
the language. The C language is notorious for having too many precedence
levels, a total of fifteen levels.
Associativity controls the grouping of operators at the same precedence level. Operators may group to the left, e.g., a-b-c in C means (a-b)-c, or to the right, e.g., abc in C means a=(b=c). In some cases operators do not group at all, e.g., in Fortran A.LE.B.LE.C is invalid.
There are two ways to specify precedence and associativity in a grammar, implicitly and explicitly. To specify them implicitly, rewrite the grammar using separate non-terminals symbols for each precedence level. Assuming the usual precedence and left associativity for everything, we could rewrite our expression rules this way:
Expression:
expression ‘+’ mulexp
|expression ‘-’ mulexp
|mulexp
mulexp: mulexp ‘‘ primary
|mulexp ‘/’ primary
|primary
primary: ‘(‘
expression’)‘
|‘-’ primary
|NUMBER
This is a perfectly reasonable way to
write grammar, and if yacc didn’t have explicit precedence rules, it would be
the only way.
But yacc also lets you specify precedence explicitly. We can add these lines to the definition section, resulting in the grammar
%left ‘+’ ‘-’
%left ‘*’ ‘/’
%nonassoc
UMINUS
Each of these declarations defines a
level of precedence. They tell yacc that “+” and “-” are left associative and
at lowest precedence level. “*” And “/”
are left associative and at a higher precedence level and UMINUS, a
pseudo-token standing for unary minus, has no associativity and is at the
highest precedence. (We don’t have any right associative operators here, but if
we did they’d use %right) yacc assigns each rule the precedence of the
rightmost token on the right- hand side; if the rule contains no token with
precedence assigned, the rule has no precedence of its own. When yacc
encounters a shift/reduce conflict due to an ambiguous grammar, it consults the
table of precedence’s, and if all of the rules involved in the conflict include
a token, which appears in a precedence declaration, it uses precedence to
resolve the conflict.
In our grammar, all of the conflicts occur in the rules of the form expression OPERATOR expression, so setting precedence’s for the four operators allows it to resolve all of the conflicts. This parser using precedence is slightly smaller and faster than the one with the extra rules for implicit precedence, since it has fewer rules to reduce.
The Calculator grammar with expressions and precedence
% token NAME
NUMBER
%left ‘-’ ‘+’
%left “*’ ‘/‘
%nonassoc
UMINUS
%%
statement:
NAME ‘=’ expression
expression {
prinif (“=%d\n”,$l);
;
expression:
expression ‘+’ expression {$$ = $1 + $3;}
|expression ‘-’ expression
{$$ = $1 - S3;}
|expression ‘‘expression {$$
= $1 * $3;}
|expression ‘/’ expression
{ if ($3 ==
0)
yyerror (“divide by zero”);
else
$$=$1 /$3;
}
| ‘-’ expression %prec UMINUS {$$=$2;}
| ‘(‘expression’)’ {$$=$2;}
| NUMBER {$$ = $1;}
%%
The rule for negation includes “%prec
Uminus”. The only operator this rule includes is “-”, which has low precedence, but we want unary minus to have
higher precedence than multiplication rather than lower. The %prec tells yacc
to use the precedence of UMINUS for this rule.
1. Lex program to count the numbers of vowels and
consonants in a
given string.
Step 1:
Initialize cv = 0 and cc 0 in the
definition section. The initializations are done in the definition section
because the definitions section is composed of
substitutions, code, and start states. Code in the definitions section is simply copied as-is to the top of the generated C file, and must be bracketed with “%{“and“%}“ markers. Substitutions simplifr pattern-matching rules.
substitutions, code, and start states. Code in the definitions section is simply copied as-is to the top of the generated C file, and must be bracketed with “%{“and“%}“ markers. Substitutions simplifr pattern-matching rules.
Step 2:
The next step consists of pattern matching
that is to be done to find the number of vowels and consonants. This is done in
the rules section and it is bracketed between %% and %%.The first %% is always
required, as there must always be a rules section. However, if we don’t specify
any rules, then the default action is to match everything and copy it to
output. Defaults for input and output are stdin and stdout, respectively. In
rules section regular expression for checking vowel and consonents is written.
[aeiouAEIOU]
(If character class matches aeiou or AEIOU then cv((vowel count) should be
incremented by one
[a-zA-Z&&”AEIOUaeiou] (if character matches any character between a to z or A to Z except aeiou AEIOU then consonant count cc should be incremented by one.
Step 3:
The next step is the actual
implementation of the program and the last section of lex specification is the
user subroutine section. This section contains support routines. In this
section the main routine calls yylex() function and printf statements. Here
yylex( )is a function is 1 to invoke the lexer and it return the matched
pattern. It uses default to read from standard input.
2. Lex
program to count the nuiiber of wotds Lines, aracters and
blanks IN a given input.
Step 1:
Initialize cc, lc, wc, be to zero in
the definition section. The initializations are done in the definition section
because the definitions section is composed of substitutions, code, and start
states. Code in the definitions section is simply copied as-is to the top of
the generated C file, and must be bracketed with “%{“ and “%}“ markers.
Substitutions simplify pattern-matching rules.
Step 2:
The next step consists of pattern
matching that is to be done to find the number of words, characters lines and
blank spaces. This is done in the rules section and it is bracketed between %%
and %%.The first %% is always required, as there must always be a rules
section. Defaults for input and output are stdin and stdout, respectively.
[] bc++ (The regular
expression pattern matching is done and if the blank spaces, are found then bc
are incremented by one
[\n] lc++
(The regular expression pattern matching is done and if the new line are found
then ic is incremented by one.
[A-Za-z]+
(Any character between a-z or A-Z is found then cc and wc are incremented by
one. words are found then their lengths are returned using yyleng. yyleng() is
lex internal variable contains the length of the string our lexer recognized.
Step 3:
The next step is the actual
implementation of the program in the subroutine section. It first calls yylex()
function which is s a call to invoke the lexer and then calls the printf( ) to
print the results of this run.
3. Lex program to count number of Posj,tcnd Negative integers and
Positive and negative fractions.
Step 1:
Initialize pi,ni,pf’,nf to zero in the
definition section. The initializations are done in the definition section
because the definitions section is composed of substitutions, code, and start
states.
Step 2:
The next step consist regular expressions
for pattern matching. This is done in the rules section and it is bracketed
between %% and %%.The first %% is always required, as there must always be a
rules section. Pattern matching is done to find the positive, negative integers
and if found pi and ni are incremented. Pattern matching is done to find the
positive, negative floating point numbers and if found pf and nf are
incremented.
[+]?[0-9]+ (Regular expression for an integer.
-[0-9]+ ( Regular expression for an
signed integer.
[+]?[0-9]+.[0-9]+
(Regular expression for a positive
fraction.
-[0-9]+.[0-9]+
(Regular expression for a signed
negative fraction.
Step 3:
The next step is the actual
implementation Of the program in the subroutine section. Here yylex() function
is a call to invoke the lexer and it return the matched pattern. The printf
statements will print the number of positive and negative integers,and positive
and negative fractions.
4. Lex
program to count the nurn of comments lines in a given C
program
Step 1:
Initialize flag, open1, open2, close1,
close2 to zero in the definition section. The initializations are done in the
definition section because the definitions section is composed of
substitutions, code, and start states and bracketed with “%{“ and “%}“ markers.
Step 2:
The rules section and it is bracketed
between %% and %%.The first %% is always required, as there must always be a
rules section. Pattern matching is done to find the opening and closing braces
if found open 1 and close 1 are incremented. Pattern matching is done to find
the opening and closing brackets if found open2 and close2 are incremented.
(main\(\)) (If pattern matches the main () keyword then
it is a valid C
program set flat to1.
[{] (opening brace for beginning of
program.
[}] (closing brace for end of program.
“1* “[0-9a-zA-Z]
*”*/“(Regular expression for recognizing comment line. Start with /* and end
with */ with in between any character or integer. If pattern matches then
increment comment
Step 3:
The next step is the actual
implementation of the program in the subroutine section. Here yylex() function
is a call to invoke the lexer and it return the matched pattern. In the main
part if the number of opening braces and opening brackets match the closing
braces and closing brackets then the given program is a valid one. Then print
the number of comments.
5. Lex
program to count the number of printf’s and scanf’s
statements in a C program. Replace them
with readf and writef
Step 1:
Initialize flag, open1, open2, close1,
close2, pr((printf count), sc(scanf count) to zero in the definition section.
The initializations are done in the definition section because the definitions
section is composed of substitutions, code, and start states. The definition
section must be bracketed with
“% {“ and “%}“ markers.
Step 2:
The rules section is bracketed between
%% and %% and define the regular expression for C program with some number of
printf and scanf. The first %% is always required, as there must always be a
rules section. Pattern matching is done to find the main () keyword and set
flag to 1. Pattern matching is done to find the opening and c1osin braces if
found open 1 and close 1 are incremented. Pattern matching is done to find the
opening and closing brackets if found open2 and close2 are incremented. Pattern
matching is done to find the printf and scanf statements and if found pr and sc
are incremented and print ‘writef’ and ‘readf’ in place of printf and scanf respectively.
(\printf) -
Regular expression to recognize printf.
(\scanf) -
Regular expression to recognize scanf.
Step 3:
The next step is the actual
implementation of the program in the subroutine section. Here yylex() function is a call to invoke the
lexer and it return the matched pattern. In the main part if the number of
opening braces and opening brackets match the closing braces and closing
brackets then the given program is a valid one. The number of printf and scanf
are printed.
6. Lex
program to recognize a valid arithmetic expression and
identify the identifiers and operators.
Print them separately
Step 1:
Initialize valid to one and index, plus,
minus, mult, div, opnd, op to zero in the definition section. The
initializations are done in the definition section because the definitions
section is composed of substitutions, code, and start states.The definition
section must be bracketed by “%{‘ and “%}“ markers.
Step 2:
The pattern matching is done in the
rules section and it is bracketed between %% and %%.The first %% is always
required, as there must always be a rules section.
Validity of expression is checked by stack.
Index is
incremented by one for opening parenthesis and decremented by one for closing
parenthesis
Expression
parenthesis may be a {or [ or ( or } or I or).
Stack is
pushed with ‘c’ ‘s’ ‘p for ‘{‘ ‘[‘ and (respectively.
Whenever
},j,) is found if stack[index] is not equal to ‘c’ ‘s’ and p then closing
paranthesis is missing. Then it is not a valid expression. Set valid to zero.
Then decrement
index by one.
[0-9a-zA-Z]+ Ã Regular
expression for recognizing operand.+ indicates non empty character or digit. If
operand is found then increment operand by one.
[—!@#$%&_.;”,.?<>\\]
- If any special character is found in a expression then expression is not a
valid.
[\f] Ã If pattern
matches + then increment plus and operators by one.
Similarly for
-,* and / operators
Step 3:
The next step is the actual
implementation of the program in the subroutine section. Here yylex() function
is a call to invoke the lexer and it return the matched pattern. ln the main
part if the expression is valid and the number of operands are number of
operators plus 1 or operands and operators are equal to zero then the
expression is valid. The number of operators (+,-,*,/) are printed. The numbers
of operands are also printed.
7. Lex program to recognize whether a given sentence is a simple or
compound.
Step 1:
Initialize cmp and smp to zero in the
definition section. The definitions section is must be bracketed with
“%{“and”%}” markers.
Step 2:
The rule section is bracketed between
%% and %%. In this program pattern matching is done for checking if any of the
compound words like if, while, switch, for and do are encountered and if
encountered then cmp is set to the value of 1.
Step 3:
The next step is the actual
implementation of the program in the subroutine section. Here yylex() function
is a call to invoke the lexer and it return the matched pattern. In this
program if cmp = 1 then the sentence is compound else simple.
8. Lex
program to recognize and count the no identifiers in a iven
input file.
Step 1:
Initialize i=0, d=0, c=0, f=0 and
count 0 in the definition section. The initializations are done in the
definition section because the definitions section is composed of
substitutions, code, and start states. The definition section is enclosed in %{
and %} markers.
Step 2:
The next step consists of pattern
matching to match the datatype. This is done in the rules section and it is
bracketed between %% and %%.The first %% is always required, as there must
always be a rules section. In rules section regular expression for datatypes
and any valid identifier is written.
(\i’nt) i++; Ã If character matches int keyword followed by any identifier then increment variable i
(\double)d++;
à If character
matches double keyword followed by any identifier then increment variable d
(\char) c++; Ã If character
matches char keyword followed by any identifier
then
increment variable c
(\float) f++;
à If character
matches float keyword followed by any identifier
then
increment variable f
Step 3:
The next step is the actual
implementation of the program and the last section of lex specification is the
user subroutine section. This section contains support routines. In this
section the main routine calls yylex() function and prinif statements. Here
yylex()is a function is Ito invoke the lexer and it return the matched pattern.
The printf will display the number of identifiers based on data type matching.
PART
B
The YACC require both Lex(lexer) and Yacc(parser) code.
The Lexer
In order to run parser we need a lexer
to feed it tokens. The parser is a high level routine and calls lexer yylex()
whenever it needs a token from input. As soon as the lexer finds a token of
interest to the parser, it returns to the parser, returning the token code as
the value. Yacc defines the token names in the parser as C preprocessor names
in y.tab.h so the lexer can use them.
AYacc parser
A yacc grammer has the same three
-part structure as a lex specification. The first section, the definition
section handles control information for the yacc generated parser and generally
sets up the execution environment in which the parser will operate.
The second
section contains the rules for the parser.
The third
section is C code copied verbatim into the generated C program.
1. Yacc program to test the validity of a simple expression involving
operators. +,-,*
and /.
Step 1:
Explicitly
declare yylval
yylval : Whenever
the lexer returns a token to the parser, if the token has an associated value
,the lexer will store the value in yylval before returning.
Include the
header file “y.tab.h” in the
definition section.
The initializations are done in the
definition section because the definitions section is composed of
substitutions, code, and start states. Code in the definitions section is
simply copied as-is to the top of the generated C file, and must be bracketed
with “%{‘ and “%}“ markers. Substitutions simplify pattern-matching rules.
Step 2:
The next step consists of pattern
matching to be done in the lex program. This is done in the rules section and it is bracketed
between %% and %%.The first %% is always required, as there must always be a
rules section. However, if we don’t specify any rules, then the default action
is to match everything and copy it to output. Defaults for input and output are
stdin and stdout, respectively.
{O-9]+ - Regular expression for
strings of digits as numbers.
yytext contains the text that matched the pattern. The string of
digits is taken and converted to integers and the value is stored in yylval. The
value is returned as NUMBER. A newline returns an end of input token (number
zero) to tell the parser that there is no more to read.
Step 3:
The next step is the YACC part of the
program. It consists of definition section in which all the definitions, header
files are stored .The second section is the rules section in which the pattern
matching is done .It resembles the BNF grammar The left-hand side of a
production, or non terminal, is entered left-justified, followed by a colon.
This is followed by the Right-hand side of the production. Actions associated
with a rule are entered in braces.
By utilizing left-recursion, we have specified that a program consists of zero or more expressions. Each expression terminates with a new line. When a new line is detected, we print the value of the expressions. When we apply the rule
By utilizing left-recursion, we have specified that a program consists of zero or more expressions. Each expression terminates with a new line. When a new line is detected, we print the value of the expressions. When we apply the rule
expr: expr’ ‘+’ expr { $$ = $1 + $3;}
$$ Ã Contains the
value on left-hand side.
$1 and $3 Ã Contains the
value of first and second operand
$2 Ã contains the
value of operator.
We replace the right-hand side of the
production in the parse stack with the left-hand side of the same production. In
this case, we pop ‘expr ‘+‘ expr” and push “expr”. We have reduced the stack by
popping three terms off the stack, and pushing back one term. We may reference
positions in the value stack in our C code by specifying “$1” for the first
term on the right-hand Side of the production, “$2” for the second, and so on.
“$$” designates the top of the stack after Reduction has taken place. The above
action adds the value associated with two expressions, Pops three terms off the
value stack, and pushes back a single sum. Thus, the parse and value Stacks
remain synchronized. Numeric values are initially entered on the stack when we
reduce from NUMBER to expr. After NUMBER is shifted to the stack, we apply the
rule
Expr: NUMBER {$$ = $1;}
The NUMBER token is popped off the parse stack, followed by a push of expr. For the value Stack, we pop the integer value off the stack, and then push it back on again. In other words, we do nothing. In fact, this is the default action, and need not be specified. Finally, when a new line is encountered, the value associated with expr is printed
Step 4:
The next step is the actual
implementation of the program. The entry point to a yacc generated parser is
function main() and yyparse()
yyparse():
When program calls yyparse() the parser attempts to parse an input stream. The parser returns a value of zero if the parse succeeds and non-zero if not. Every time you call yyparse() the parser starts parsing a new forgetting whatever state have been in the last time it returned. This is similar to yylex() which picks up where it left off each time you call it.
When program calls yyparse() the parser attempts to parse an input stream. The parser returns a value of zero if the parse succeeds and non-zero if not. Every time you call yyparse() the parser starts parsing a new forgetting whatever state have been in the last time it returned. This is similar to yylex() which picks up where it left off each time you call it.
The function yyparse() parses an input
expression. lf it founds valid then print as a valid expression else execute
yyerror() function with string as parameter to print invalid expression.
yyerror( ):
In the event of syntax errors, yacc
calls the user-supplied function yyerror. If you need to modify the interlace
to yyerror, you can alter the canned file that yacc includes to fit your needs.
2. Yacc
program to recognizes nested IF control statement and
displays the number of levels of nesting
Step 1:
Initialize yylval which is the value
associated with the token and include the header file “y.tab.h” in the
definition section. The initializations are done in the definition section
because the definitions section is composed of substitutions, code, and start
states. Code in the definitions section is simply copied as-is to the top of
the generated C file, and must be bracketed with “%{“ and “%}“ markers.
Substitutions simplify pattern-matching rules.
Step 2:
The next step consists of pattern
matching to be done in the lex program. This is done in the rules section and
it is bracketed between %% and %%.The first %% is always required, as there
must always be a rules section.
Regular expressions for matching valid nested if statement:
[\t\n]+; - If
whitespace and newline are found do not do anything.
If {return IF;}
à In a pattern
lex replaces the name inside the braces { } with substitution. If “if” keyword
is found, the action is return IF to yacc.
Similarly for
“(“return OPP(opening paranthesis)
“)“return CLP (closing
paranthesis)
“{“return OPB (opening brace)
“}“return
CLB( return closing brace)
“<“||”>“
return RELOP(relational operator)
I “W” return
LOGOP(logical operator)
I “-” return
AROP (arithmatic operator)
[0-9]+
{return NUMBER} - matches one or more occurrences of the (nonempty) digits.
“;”à return
SEM(semicolon).
[a-z]+ Ã return ID Ã Any
combination of letters is an identifier.
.:Ã matches any
single character except the newline character
(“\n)ignore
it.
Step 3:
The next step
is the YACC part of the program.
It consists
of definition section in which all the definitions, header files are stored.
The numbers
of levels are initialized to zero.
All the
tokens are defined.
The second section is the rules section in which the pattern matching is done. It resembles the BNF grammar. The left-hand side of a production, or non terminal, is entered left- justified, followed by a colon. This is followed by the right-hand side of the production. Actions associated with a rule are entered in braces. The program the pattern matching is done to find whether the Nested IF stateme1s are encountered and if they are encountered then number of levels are incremented for every if statement found.
If OPP COND CLP OPB explicit
callifcond CLB is found then increment level by 1.
Callifcond is
called as many levels.
Check for
COND matching pattern for ex. COND: exp LOGOP exp
Explist
indicates simply expression followed by simicolon for ex. I++:
Step 4:
The next step is the actual
implementation of the program in user subroutine section.
Call the
function yyparse() and printf to print the number of levels.
In the event of syntax errors, yacc
calls the user-supplied function yyerror. If you need to modify the interface
to yyerror, you can alter the canned file that yacc includes to fit your needs
4. Yacc program to recognize the validity of a variable
which starts
with a letter, followed
by any number of letters of digits.
Step 1:
Explicity declare yylval which is the
value associated with the token and include the header file “y.tab.h” in the
definition section. The definition section is bracketed with “%{“and “%}”
markers.
Step 2:
The next step consists of pattern
matching to be done in the lex program. This is done in the rules section and
it is bracketed between %% and %%. In this section pattern matching is done to
return the letter or digit.
[\t\n]+; ignore white space and
newline
[a-z] Ã If any character is found return a
letter token to yacc.
[0-9] Ã If any digit is found return a digit
token to yacc.
Step 3:
The next step is the YACC part of the
program. It consists of definition section in which all the definitions, header
files are declared. The second section is the rules section in which the
pattern matching is done. It resembles the BNF grammer the left-hand side of a
production or non terminal is entered left justified, followed by a colon. This
is followed by the right-hand side of the production. In this program the
pattern matching to check whether a variable name is always a letter is
followed by number.
Step 4:
The next step is the actual
implementation of the program.
The yyparse() function is called the
parser attempts to parse an input stream. In the event of syntx errors, yacc
calls the user-supplied function yyerror. If you needto modify the interface to
yyerror, you can alter the canned file that yacc includes to fit your needs. In
the main part if the variable name is valid then valid value is returned else a
yyerror is done.
5. Yacc
program to evaluate an arithmetic expression involving
operators +,
and /.
Step 1:
Explicitly declare yylval which is the
value associated with the token and include the header file “y.tab.h” in the
definition section.
Step 2:
The next step consists of pattern
matching to be done in the lex program. This is done in the rules section and
it is bracketed between %% and %%.The first %% is always required, as there
must always be a rules section. In this program all the numbers and the strings
are bracketed in [].When the pattern matching is found then the any string
found should be converted into a number and returned.
All the tokens for (, ), +, -, b, /
are returned to yacc.
Step 3:
The next step is the YACC part of the
program. It consists of definition section in which all the definitions, header
files are stored .The second section is the rules section in which the pattern
matching is done.
All the tokens
are declared.
% left ‘+‘
‘-‘
% right ‘*‘
Each of these declarations defines a
level of precedence. They tell yacc that ‘+’ and are left associative and at
the lowest precedence level. ‘*’ and ‘I’
are right associative and at a higher precedence level.
It resembles the BNF grammar The
left-hand side of a production, or non terminal, is entered left-justified,
followed by a colon. This is followed by the right-hand side of the production.
Actions associated with a rule are entered in braces. By utilizing left-recursion,
we have specified that a program consists of zero or more expressions. Each
expression terminates with a new line. When a new line is detected, we print
the value of the Expression. When we apply the rule
expr:
expr ‘+‘ expr { $$ = $1 + $3;}
We replace the right-hand side of the
production in the parse stack with the left-hand side of the same production.
In this case, we pop ‘expr ‘+‘ expr” and push “expr”. We have reduced the stack
by popping three terms off the stack, and pushing back one term. We may
reference positions in the value stack in our C code by specifying ‘$1’ for the
first term on the right-hand Side of the production, “$2” for the second, and
so on. “$$“ designates the top of the stack after Reduction has taken place.
The above action adds the value associated with two expressions, Pops three
terms off the value stack, and pushes back a single sum. Thus, the parse and
value Stacks remain synchronized. Numeric values are initially entered on the
stack when we reduce from NUMBER to expr.
The NUMBER token is popped off the
parse stack, followed by a push of expr. For the value Stack, we pop the
integer value off the stack, and then push it back on again. In other words, we
do nothing. In fact, this is the default action, and need not be specified.
Finally, when a new line is encountered, the value associated with expr is
printed
Step 4:
The next step is the actual
implementation of the program. yyparse() will parse the input expressions. In
the event of syntax errors, yacc calls the user-supplied function yyerror. If
you need to modify the interface, you can alter the canned file that yacc
includes to fit your needs.
6.
Yaac program to recognize strings ‘aaab ‘,’ abbb’, ‘ab’ and ‘a’.
Using the grammar (a^n b^n, n>=0).
Step 1:
Explicitly declare the yylval variable
which holds the value associated with the token and include the header file
“y.tab.h” in the definition section. The definition section must be bracketed
with “%{“and “%}” markers.
Step 2:
The next step consists of pattern
matching to be done in the lex program. This is done in the rules section and
it is bracketed between %% and %%. The first %% is always required, as there
must always be a rules section. In this section pattern matching is done to
return the letters a or b.
Step 3:
The next step is the YAAC part of the
program. It consists of definition section in which all the definitions, header
files are stored. The second section is the rules section in which the pattern
matching is done. It resembles the BNF grammar. The left-hand side of a
production, or non terminal, is entered left-justified, followed by a colon.
This is followed by the right-hand side of the production. In this program the
pattern matching to check whether the strings matches the given strings. The
production rules for a^nb^n is written in this section.
Step 4:
The next step is the actual
implementation of the program. Yyparse() function is invoked to parse the input
string. In the event of syntax errors, yacc calls the user-supplied function
yyerror. If you need to modify the interface to yyerror, you can alter the
canned file that yacc includes to fit your needs. In the main part if the
string is valid then valid string is displayed else yyerror is called to print
invalid string.
7. Yacc program to recognize the grammar {an .b.
n ³
0}
Step 1:
Initialize yylval which is the value
associated with the token and include the header file “y.tab.h” in the
definition section. The initializations are done in the definition section
because the definitions section is composed of substitutions, code, and start
states. Code in the definitions section is simply copied as-is to the tope of
the generated C file, and must be bracketed with “%{“ and “%}” markers.
Substitutions simplify pattern matching rules.
Step 2:
The next step consists of pattern
matching to be done in the lex program. This is done in the rules section and
it is bracketed between %% and %%. The first %% is always required as there
must always be a rules section. However, if we don’t specify any rules, then
the default action is to match everything and copy it to output. Defaults for
input and output are stdin and stdout, respectively. In this section pattern
matching is done to return the letters a or b.
Step
3:
The next step is the YACC part of the
program. It consist of definition section in which all the definitions, header
files are stored. The second section is the rules section in which the pattern
matching is done. It resembles the BNF grammar. The left-hand side of
production, or non terminal, is entered left-justified, followed by a colon.
This is followed by the right-hand side of the production. In this program the
pattern matching to check whether the given strings matches the given grammar
a^nb. The production rules for a^nb is written in rules section. Whenever the
token values for a and b with input string is passed from lex to yacc, the
input is evaluated as per rules.
Step
4:
The next step is the actual
implementation of the program. Yyparse() parse the input string. In the event
of syntax errors, yacc calls the user-supplied function yyerror. If you need to
modify the interface to yyerrror, you can alter the canned file that yacc
includes to fit your needs. In the main part if the grammer is valid then valid
string is displayed else yyerror is called to print invalid.
Comments
Post a Comment