Use two UNIX utilities, lex and yacc, in program development

Preface

This lab manual shows you how to use two Unix utilities, lex and yacc, in program development. Lex and Yacc tools used to generate lexical analyzers and Parsers. These tools help programmers build compilers and interpreters, but they also have a wider range of applications. Any application that looks for patterns in its input or command language is a good candidate for lex and Yacc.

The introduction part explains the lex and yacc tools, development sequence when using lex and yacc, different sections of the lex & yacc, running lex & yacc. Using lex section explains the regular expressions, examples of regular expressions. Using YACC section explains the Grammar, Yacc parser, compiling and running a simple parser. PART A section contains the list of programs for lab session.

This manual also provides stepwise procedure for each program. The student will expose to develop compilers. This manual gives information about
the cursers, which contains the functions that are used to develop text editor.

In programs with structured input, two tasks that occur over and over are dividing the input into meaningful units, and then discovering the relationship among the units. This division into units (which are usually called tokens) is known as lexical analysis, or lexing for short. Lex helps you by taking a set of descriptions of possible tokens and producing a C routine, which we call a lexical analyzer, or a lexer, or a scanner for short, that can identify tokens. This set of descriptions we give to lex is called a Lex specification.

The token description that lex uses are known a regular expression. Lex turns this regular expression into a form that the lexer can use to scan the input text extremely fast independent, of the number of expression that it is trying to match.

+----------------------+

Source -> I Lex I -> yylex

+----------------------+

input -> yylex I -> output

+----------------------+

An overview of Lex

As the input is divided into token, a program often needs to establish the relationship among the tokens. A C complier needs to find the expression, statement, declaration, blocks, and procedures in the program. This task is known as parsing and the list of rule that defines the relationships that the program understands is a grammar. Yacc takes a concise description of a grammar and produces a C routine that can be parsing that grammar, a parser. The Yacc parser automatically detects whenever a sequence of input token match one of the rules in the grammar and also detects a syntax error whenever its input does not match any of the rules. When a task involves dividing the inputs into units and establishing some relationship among those units, we should think of Lex and Yacc.

Lexical rules grammer rules

| |

|Lex | |yacc|

Input -> |yylex| -> yyparse -> parsed input

Fig. Lex with yacc

The first section, the definition section, introduces any initial C program code we want copied into the final program. This is especially important if, for
example, we have Header file that must be include for code in the file to work. We surround the C code with the special delimiters “%{“ and “%}“. Lex copies the material between “%{“and “%}“ directly to the generated C file, so we may write valid C code here. The %% marks the end of the section.

The next section is the rule section. Each rule is made up of two section: a pattern and an action, separated by a white sheet. The lexer that lex generates will execute the action when it recognizes the pattern. These patterns are UNIX STYLE REGULAR EXPRESSION.

For example the first rule is as follows:

[/t]+/ ignore white space */;.

The square bracket,” [ }“, indicates that any one of the characters within the brackets matches the pattern. For our example, we expect either ‘\t” (a tab character) or” “(a space). The “+” means that the patterns matches one or more consecutive copies of the sub pattern that precedes that plus. Thus, this pattern describes white space (any combination of tab and space.) The second part of the rule, the action, is simply a semicolon, a do nothing C statement. Its effect is to ignore the input.

Yytext():
The array yytext contains the text that matched the pattern.

[a-zA-Z]+ {printf(”%s: is not a verb\n”,yytext);}

.|\n {ECHO; /*normal default anyway */}

The pattern “[a-zA-z] +” is a common one. It indicates any alphabetical strings with atleast one character. The “-” character has a special meaning when used inside square bracket: it denotes a range of characters beginning with the character to the left of the “-” and ending with the character to its right. The action of one of these patterns is to print the matched token.

A special character “.” (period) matches any single character other than a newline, and “\n” matches a newline character. The end of the rules section is delimited by another %%.

The final section is the user subroutines section, which can consist of any gal C code. Lex copies it to the C file after the end of the Lex generated code. We have included a main() program.

%%
main()
{
yylex();
}
5

The lexer produced by lex is a C routine called yylex(), so we call it. Unless the actions contain explicit return statement, yylex() won’t return until it has processed the entire input.

To create an executable program on our UNIX system we enter these commands:

% lex filename.|

% cc lex.yy.c —o first —II

Lex translates the lex specification into a C source file called lex.yy.c which we compiled and linked with lex library -II.

Grammars
To recognize specific sequences of tokens and perform appropriate actions. Traditionally, a description of such a set action is known as a grammar.

Parser-Lexer Communication

When we use a lex scanner and a yacc parser together, the parser is the higher level routine. It calls the lexer yylex() whenever it needs a token from the input. The lexer then scans through the input recognizing tokens. As soon as it finds a token of interest to the parser, it returns to the parser returning the token’s code as the value of yylex().

Yacc can optionally write a C header file containing all of the token definitions. You include this file, called y.tab.h on UNIX systems and ytab.h or yytab.h on MS-DOS, in the lexer and use the preprocessor symbols in your lexer action code.

A Yacc Parser

The structure of a yacc parser is, not by accident, similar to that of lex lexer. Our first section, the definition section, has a literal code block enclosed in “%{“ and “%}”. We use it here for a C comment (as with lex, C comments belong inside C code blocks, at least within the definition section) and a single include file.

Then come definitions of all the tokens we expect to receive from the lexical analyzer. In this example, they correspond to the eight parts of speech. The name of a token does not have any intrinsic meaning to yacc, although well-chosen token names tell reader what they represent. Although yacc lets you use any valid C identifier name for a yacc symbol, universal custom dictates that token names be all uppercase and other names in the parser mostly or entirely lowercase.

The first %% indicates the beginning of the rules section. The second %% indicates the end of the rules and the beginning of the user subroutines section. The most important subroutine is main() which repeatedly calls yyparse() until the lexer’s input file runs out. The routine yyparse() is the parser generated by yacc, so our main program repeatedly tries to parse sentences until the input runs out. (The lexer returns a zero token whenever it sees a period at the end of a line; that’s the signal to the parser that the input for the current parse is complete.)

The Rules Section

The rules section describes the actual grammar as a set of production rules or simply rules. Each rule consist of a single name on the left-hand side of the “:” operator, a list of symbols and action code on the right-hand side, and a semicolon indicating the end of the rule. By default, the first rule is the highest-level rule. That is, the parser attempts to find a list of tokens, which match this initial rule, or more commonly, rule found from the initial rule. This expression on the right-hand side of the rule is a list of zero or more names. A typical simple rule has a single symbol on the right-hand side as in the object rule, which is defined to be a NOUN. The symbol on the left-hand side of the rule can then be used like a token in other rules. From this, we build complex grammars.

The action part of a rule consists of a C block, beginning with “{”and ending with “}”. The parser executes an action at the end of a rule as soon as the rule matches. In our sentence rule, the action reports that we’ve successfully parsed a sentence. Since sentence is the top-level symbol, the entire input must match a sentence. The parser returns to its caller, in this case the main program, when the lexer reports the end of the input. Subsequent calls to yyparse() reset the state and begin processing again. The parser caller yyerror(), which we provide in the user subroutines section, and then recognizes the special rule error. You can provide error recovery code, yyparse() returns to the caller after it finds an error.

The third and final section, the user subroutines section, begins after the second %%. This section can contain any C code and is copied, verbatim, into the resulting parser. In our example, we have provided the minimal set of functions necessary to a yacc-generated parser using lex-generated lexer to compile; main () and yyerror(). The main routine keeps calling the parser until it reaches the end- of-file on yylex() which is provided by our exer.

Running Lex and Yacc

We conclude by describing how we built these tools on our system. We called our various lexers filename-n.I, where N corresponded to a particular Lex specification example. Similarly, we called our parsers’ filename-m.y where again M is the number of example. Then, to build the output, we did the following in UNIX;

% lex filename.l

% yacc —d filename.y

% cc-c lex.yy.c y.tab.c

% cc-o example-m.n lex.yy.o y.tab.o-II

The first line runs lex over the lex specification and generates a file Iex.yy.c, which contains C code for the lexer. In the second line, we use yacc to generate both y.tab.c and y.tab.h (the later is the file of token definition created by the –d switch). The next line compiles each of the two C files. The final line links them together and uses the routines in the lex library libl.a. noramally in /usr/lib/lib/.a on most UNIX systems. If you are not using AT & T lex and yacc, but one of the other implementations, you may be able to simply substitute the commands names and little else will change.

USING LEX

Lex tools for building lexical analyzers or lexers. A leker takes an arbitrary input stream and tokenizes it, i.e., divides it up into lexical tokens. This tokenized output can then be processed further, usually by yacc, or it can be the “end product”. When we write a lex-specification, we create a set of patterns which lex matches against the input. Each time one of the patterns matches, the lex program invokes C code that you provide which does something with the matched text. In this way a lex program divides the input into strings, which we call, tokens. Lex itself doesn’t produce an executable program; instead it translates the lex specification into a file containing a C routine called yylex().Your program calls yylex() to run the lexer. Usually your regular C compiler, you compile the file that lex produced along with any other files and libraries you want.

Regular Expressions

Regular expressions are widely used within the UNIX environment, and le uses a rich regular expression language.

A regular expression is a pattern description using a “meta” language, a language that you use to describe particular patterns of interest. The characters used in this metalanguage are part of the standard ASCII character set used in UNIX and MS-DOS, which can sometimes lead to confusion. The characters that form regular expressions are:

Matches any single character expect the newline character (“\n”).

* Matches zero or more copies of the preceding expression.

[] A character class which matches any character within the brackets. If the first character is a circumflex (AI) it changes the meaning to match any character except the ones within the brackets. A dash inside the square brackets indicates a character range, e.g., “[0-9]” means the same thing as
“[0123456789]”.

A Matches the beginning of a line as the first character of a regular
expression.

Also used for negation within square brackets.

$ Matches the end of a line as the last character of a regular expression.

Indicates how many times the pervious pattern is allowed to many when containing one or two numbers. For example:

A{1,3}

Matches one to three occurrences of the letter A. If they contain name, they refer to a substitution by that name.

\ Used to escape metacharacters, and as pa of the usual C escape sequences,

e.g., “\n” is a newline character, while t\* is a literal aster

+ Matches one or more occurrence of the preceding regular expression.

For example;

[0-9]+
matches “1”,”111”, or “12345” but not an empty string. (If the plus sign

were an asterisk, it would also match the empty string.)

? Matches zero or one occurrence of the preceding regular expression.

For example

-?[0-9]+
matches a signed number including an optional leading minus.

I Matches either the preceding regular expression or the following

regular expression. For example;

cow |pig| sheep

matches any of the three words.

”…” Interprets everything within the quotation marks literally-meta-

characters other than C escape sequences lose their meaning.

/ Matches the preceding regular expression but only if followed by the

following regular expression. For example;

0/1
matches “0” in the string “01” but would not match anything in the

strings “0” or “02”. The material matched by the pattern following the

slash is not “consumed” and remains to be turned into subsequent

tokes. Only one slash is permitted per pattern.

() Groups a series of regular expressions together into a new regular

expression. For example;

(01)
represents the character sequence 01. Parentheses are useful when

building up complex patterns with *,+ and |.

Note that some of these operators operate on single characters (e.g., []) while others operate on regular expressions. Usually, complex regular expressions are built up from simple regular expressions.

Examples of Regular Expressions

We are ready for some examples. First, we’ve already shown you a regular expression for a “digit”:

[0-9]
We can use this to build a regular expression for an integer:

[0-9]+
We require at least one digit. This would have allowed no digits at all:

[09]*
Let’s add an optional unary minus:

-?[O-9]+
We can then expand this to allow decimal numbers. First we will specify a decimal number (for the moment we insist that the last character always a digit);

[09]*\.[09]÷

Notice the “\” before the period to make it a literal period rather than a wild card character. This pattern matches “0.0”, “4.5” or “.31415”. But it won’t match “0” or “2”. We’d like to combine our definitions to match them as well. Leaving out our unary minus, we could use:

([0-9]+) | ([09]*\.[09]+)

We use the grouping symbols “Q” to specify what the regular expression are for the “I” operation. Now let’s add the unary minus:

-? (([O-9]+) | ([0-9]*\.[0-9]+))

We can expand this further by allowing a float-style exponent to be specified as well. First, let’s write a regular expression for an exponent:

[eE] [-+]?[0-9]+

This matches an upper or lowercase latter E, then an optional plus or minus sign, then a string of digits. For instance, this will match “e12” or “E-3”. We can then use this expression to build our final expression, one that specifies a real number:

-?(([0-9]+) | ([0-9]*\. [O-9]+|) ([eE][-+]?[0-9]+)?)

Our expression makes the exponent part optional. Let’s write a real lexer that uses this expression. Nothing fancy, but it examines the input and tells us each time it matches a number according to our regular expression.
Example: Lex specification for decimal numbers

%%
[\n\t];

—?(([0-9]+) | ([0-9]*\.[0-9]+) ([eE] [-+]?[0-9]+)?)Printf(“number\n”);

ECHO;
%%

Main()

{
yylex();

}

Finally, here is a regular expression for matching quoted strings

\”[^”\n]*[“\n]

It might seem adequate to use a simpler expression.

\“.*\“

Unfortunately, this causes lex to match incorrectly if there are two strings on the same input line. For instance:

“how” to “do”

Would match as a single pattern since “‘k” matches as much as possible knowing this, we then might try:

\”[^”]*\”

When yylex() reaches the end of its input file, it calls yywrap(), which return a value of 0 or 1. if the value is I the program is done and there is no more input, if the value is 0, on the other hand, the lexer assumes that yywrap() has opened another file for it to read from yyin. The default yywrap() always returns 1 by providing our own version of yywrap(), we can have our program read all of the files named on the command line, one at a time.

USING YACC

Grammars
Yacc takes a grammar that you specify and writes a parser that recognizes valid “sentences” in that grammar. We use the term “sentence” here in a fairly general way for a C language grammar the sentences are syntactically valid C programs.

A Yacc Parser

A yacc grammar has the same three-part structure as a lex specification. (Lex copied its structure from yacc). This first section, the definition section, handles control information for the yacc-generated parser (from here on we will call it the parser), and generally sets up the execution environment in which the parser will operate. The second section contains the rules for the parser, and the third sections is C code copied verbatim into the generated C program

The definition section

The definition section included declarations of the tokens used in the grammar, the types of values used on the parser stack, and other odds and ends. It can also include a literal block, C code enclosed in %{%} lines. We start our first parser by declaring two symbolic tokens.

For ex:

%token NAME NUMBER

You can use single quoted characters as tokens without declaring them, so we don’t need to declare “=”, “+”, or “-”.

The Rules section

The rule section simply consists of a list of grammar rules in much the same format as we used above. Since ASCII keyboards don’t have a à key, we use a colon between the left- and right-hand sides of a rule, and we put a semicolon at the end of each rule.

For ex:

%token NAME NUMBER

%%
statement: NAME ‘=’ expression

|expression

expression: NUMBER ‘÷’ NUMBER

| NUMBER ‘-’ NUMBER

Unlike lex, yacc pays no attention to line boundaries in the rules section, and you will find that a lot of white space makes grammars easier to read. We’ve added one new rule to the parser; a statement can be a plain expression as well as an assignment. If the user enters a plain expression, we’ll print out its result.

The symbol on the left-hand side of the first rule in the grammar is normally the start, though you can a %start declaration in the definition section to override that.

Symbol Values and Actions

Every symbol in a yacc parser has a value. The value gives a additional information about a particular instance of a symbol. If a symbol represents a number, the value would be the particular number. If it represents a literal text string, the value would probably be a pointer to a copy of the string. If it represents a variable in a program, the value would be a pointer to a symbol table entry describing the variable. Some tokens don’t have a useful value, e.g., a token representing a close parenthesis, since one close parenthesis is the same as another.

Non-terminal symbols can have any values you want, created by code in the parser. Often the action code builds a parse tree corresponding to the input, so that later code can process a whole statement or even a whole program at a time.

In the current parser, the value of a NUMBER or an expression is the numerical value of the number or expression, and the value of a NAME will be a symbol table pointer.

YYSTYPE
In real parsers, the values of different symbols use different data types, e.g., mt and double for numeric symbols, char * for strings and pointers to structures for higher level symbols. If you have multiple value types, you have to list all the value types used in a parser so that yacc can create a C union typedef called YYSTYPE to contain them. (Fortunately, yacc gives you a lot of help ensuring that you use the right value for each symbol.)

Whenever the parser reduces a rule, it executes user C code associated with the rule, known as the rule’s action. The action appears in braces after the end of the rule, before the semicolon or vertical bar. The action code can refer to the values of the right-hand side symbols as $1, $2, and can set the value of the left-hand side by setting $$. In our parser, the value of an expression symbol is the value of the expression it represents. We add some code to evaluate and print expressions, bringing our grammar up to that used in figure:

%token NAME NUMBER

statement: NAME ‘=’ expression

|expression {printf (“=%d\n”,$1)}

;

expression: expression ‘+’ NUMBER {$$ = $1 + $3;}

|expression ‘-’ NUMBER {$$ = $1 - $3;}

|NUMBER

The rules that build an expression compute the appropriate values, and the rule that recognizes an expression as a statement prints out the result. In the expression building rules, the first and second numbers values are $1 and $3, respectively. The operator’s value would be $2, although in this grammar the operators do not have interesting values. The action on the last rule is not strictly necessary, since the default action that yacc performs after every reduction, before running any explicit action code assigns the value $1 to $$.

The Lexer

To try out our parser, we need a lexer to1rokens. As we mentioned earlier, the parser is the higher level routine, and calls the lexer yylex() when ever it need a token from the input. As soon as the lexer finds a token of interest to the parser, it returns to the parser, returning the token code as the value. Yacc defines the token names in the parser as C preprocessor names in y.tab.h (or some similar name on MS-DOS system so the lexer can use them.)

Sample program on lexer to provide tokens for our parser:

#include “y.tab.h”

extern mt yylval;

[O-9]+ {yylval = atoi(yytext); return NUMBER;}

[\t]; 1* ignore white space *1

\n return 0; 1* logical EOF *1

return yytext[0];

Strings of digits are numbers, whitespace is ignored, and a newline returns an end of input token (number zero) to tell the parser that there is no more to read.

The last rule in the lexer is a very common catch-all, which says to return any character otherwise not handled as a single character token to the parser. Character tokens are usually punctuation such as parentheses, semicolons and single character operators. If the parser receives a token that it doesn’t know about, it generates a syntax error, so this rule lets you handle all of the single-character token easily while letting yacc’s error checking catch and complain about invalid input.

Whenever the lexer returns a token to the parser, if the token has an associated value, the lexer must store the value in yylval before returning. In this example, we explicitly declare yylval. In more complex parsers, yacc defines yylval as a union and puts the definition in y.tab.h.

Compiling and Running a Simple Parser

On UNIX system, Yacc takes your grammar and creates y.tab.c, the C language parser, and y.tab.h, the include file with the token number definition. Lex creates lex.yy.c, and the C language lexer. You need only to compile them together with the yacc and lex libraries. The libraries contain usable default versions of all of the supporting routines, including a main() that calls the parser yyparse() and exits.

% yacc —d filename.y #makes y.tab.c and “y.tab.h

% lex filename.l #mkes lex.yy.c

% cc-0 filename y.tab.c lex.yy.c —ly —II# compile and link C files

% filename

Arithmetic Expression and Ambiguity

Precedence controls which operators to execute first in an expression. Mathematical and programming tradition (dating back past the first Fortran compiler in 1950) says that multiplication and division take care precedence over addition and subtraction, so a+b*c means a+(b*c) and-d/e-f means (d/e)-f. In any expression grammar, operators are grouped into levels of precedence from lowest to highest.

The total number of levels depends on the language. The C language is notorious for having too many precedence levels, a total of fifteen levels.

Associativity controls the grouping of operators at the same precedence level. Operators may group to the left, e.g., a-b-c in C means (a-b)-c, or to the right, e.g., abc in C means a=(b=c). In some cases operators do not group at all, e.g., in Fortran A.LE.B.LE.C is invalid.

There are two ways to specify precedence and associativity in a grammar, implicitly and explicitly. To specify them implicitly, rewrite the grammar using separate non-terminals symbols for each precedence level. Assuming the usual precedence and left associativity for everything, we could rewrite our expression rules this way:

Expression: expression ‘+’ mulexp

|expression ‘-’ mulexp

|mulexp

mulexp: mulexp ‘‘ primary

|mulexp ‘/’ primary

|primary

primary: ‘(‘ expression’)‘

|‘-’ primary

|NUMBER

This is a perfectly reasonable way to write grammar, and if yacc didn’t have explicit precedence rules, it would be the only way.

But yacc also lets you specify precedence explicitly. We can add these lines to the definition section, resulting in the grammar

%left ‘+’ ‘-’

%left ‘*’ ‘/’

%nonassoc UMINUS

Each of these declarations defines a level of precedence. They tell yacc that “+” and “-” are left associative and at lowest precedence level. “*” And “/” are left associative and at a higher precedence level and UMINUS, a pseudo-token standing for unary minus, has no associativity and is at the highest precedence. (We don’t have any right associative operators here, but if we did they’d use %right) yacc assigns each rule the precedence of the rightmost token on the right- hand side; if the rule contains no token with precedence assigned, the rule has no precedence of its own. When yacc encounters a shift/reduce conflict due to an ambiguous grammar, it consults the table of precedence’s, and if all of the rules involved in the conflict include a token, which appears in a precedence declaration, it uses precedence to resolve the conflict.

In our grammar, all of the conflicts occur in the rules of the form expression OPERATOR expression, so setting precedence’s for the four operators allows it to resolve all of the conflicts. This parser using precedence is slightly smaller and faster than the one with the extra rules for implicit precedence, since it has fewer rules to reduce.

The Calculator grammar with expressions and precedence

% token NAME NUMBER

%left ‘-’ ‘+’

%left “*’ ‘/‘

%nonassoc UMINUS

statement: NAME ‘=’ expression

expression { prinif (“=%d\n”,$l);

;

expression: expression ‘+’ expression {$$ = $1 + $3;}

|expression ‘-’ expression {$$ = $1 - S3;}

|expression ‘‘expression {$$ = $1 * $3;}

|expression ‘/’ expression

{ if ($3 == 0)

yyerror (“divide by zero”);

else

$$=$1 /$3;

}

| ‘-’ expression %prec UMINUS {$$=$2;}

| ‘(‘expression’)’ {$$=$2;}

| NUMBER {$$ = $1;}

The rule for negation includes “%prec Uminus”. The only operator this rule includes is “-”, which has low precedence, but we want unary minus to have higher precedence than multiplication rather than lower. The %prec tells yacc to use the precedence of UMINUS for this rule.

1. Lex program to count the numbers of vowels and consonants in a

given string.

Step 1:

Initialize cv = 0 and cc 0 in the definition section. The initializations are done in the definition section because the definitions section is composed of
substitutions, code, and start states. Code in the definitions section is simply copied as-is to the top of the generated C file, and must be bracketed with “%{“and“%}“ markers. Substitutions simplifr pattern-matching rules.

Step 2:

The next step consists of pattern matching that is to be done to find the number of vowels and consonants. This is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. However, if we don’t specify any rules, then the default action is to match everything and copy it to output. Defaults for input and output are stdin and stdout, respectively. In rules section regular expression for checking vowel and consonents is written.

[aeiouAEIOU] (If character class matches aeiou or AEIOU then cv((vowel count) should be incremented by one

[a-zA-Z&&”AEIOUaeiou] (if character matches any character between a to z or A to Z except aeiou AEIOU then consonant count cc should be incremented by one.

Step 3:

The next step is the actual implementation of the program and the last section of lex specification is the user subroutine section. This section contains support routines. In this section the main routine calls yylex() function and printf statements. Here yylex( )is a function is 1 to invoke the lexer and it return the matched pattern. It uses default to read from standard input.

2. Lex program to count the nuiiber of wotds Lines, aracters and

blanks IN a given input.

Step 1:

Initialize cc, lc, wc, be to zero in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states. Code in the definitions section is simply copied as-is to the top of the generated C file, and must be bracketed with “%{“ and “%}“ markers. Substitutions simplify pattern-matching rules.

Step 2:

The next step consists of pattern matching that is to be done to find the number of words, characters lines and blank spaces. This is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. Defaults for input and output are stdin and stdout, respectively.

[] bc++ (The regular expression pattern matching is done and if the blank spaces, are found then bc are incremented by one

[\n] lc++ (The regular expression pattern matching is done and if the new line are found then ic is incremented by one.

[A-Za-z]+ (Any character between a-z or A-Z is found then cc and wc are incremented by one. words are found then their lengths are returned using yyleng. yyleng() is lex internal variable contains the length of the string our lexer recognized.

Step 3:

The next step is the actual implementation of the program in the subroutine section. It first calls yylex() function which is s a call to invoke the lexer and then calls the printf( ) to print the results of this run.

3. Lex program to count number of Posj,tcnd Negative integers and

Positive and negative fractions.

Step 1:

Initialize pi,ni,pf’,nf to zero in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states.

Step 2:

The next step consist regular expressions for pattern matching. This is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. Pattern matching is done to find the positive, negative integers and if found pi and ni are incremented. Pattern matching is done to find the positive, negative floating point numbers and if found pf and nf are incremented.

[+]?[0-9]+ (Regular expression for an integer.

-[0-9]+ ( Regular expression for an signed integer.

[+]?[0-9]+.[0-9]+ (Regular expression for a positive fraction.

-[0-9]+.[0-9]+ (Regular expression for a signed negative fraction.

Step 3:

The next step is the actual implementation Of the program in the subroutine section. Here yylex() function is a call to invoke the lexer and it return the matched pattern. The printf statements will print the number of positive and negative integers,and positive and negative fractions.

4. Lex program to count the nurn of comments lines in a given C

program

Step 1:

Initialize flag, open1, open2, close1, close2 to zero in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states and bracketed with “%{“ and “%}“ markers.

Step 2:

The rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. Pattern matching is done to find the opening and closing braces if found open 1 and close 1 are incremented. Pattern matching is done to find the opening and closing brackets if found open2 and close2 are incremented.

(main) (If pattern matches the main () keyword then it is a valid C

program set flat to1.

[{] (opening brace for beginning of program.

[}] (closing brace for end of program.

“1* “[0-9a-zA-Z] *”*/“(Regular expression for recognizing comment line. Start with /* and end with */ with in between any character or integer. If pattern matches then increment comment

Step 3:

The next step is the actual implementation of the program in the subroutine section. Here yylex() function is a call to invoke the lexer and it return the matched pattern. In the main part if the number of opening braces and opening brackets match the closing braces and closing brackets then the given program is a valid one. Then print the number of comments.

5. Lex program to count the number of printf’s and scanf’s

statements in a C program. Replace them with readf and writef

Step 1:

Initialize flag, open1, open2, close1, close2, pr((printf count), sc(scanf count) to zero in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states. The definition section must be bracketed with “% {“ and “%}“ markers.

Step 2:

The rules section is bracketed between %% and %% and define the regular expression for C program with some number of printf and scanf. The first %% is always required, as there must always be a rules section. Pattern matching is done to find the main () keyword and set flag to 1. Pattern matching is done to find the opening and c1osin braces if found open 1 and close 1 are incremented. Pattern matching is done to find the opening and closing brackets if found open2 and close2 are incremented. Pattern matching is done to find the printf and scanf statements and if found pr and sc are incremented and print ‘writef’ and ‘readf’ in place of printf and scanf respectively.

(\printf) - Regular expression to recognize printf.

(\scanf) - Regular expression to recognize scanf.

Step 3:

6. Lex program to recognize a valid arithmetic expression and

identify the identifiers and operators. Print them separately

Step 1:

Initialize valid to one and index, plus, minus, mult, div, opnd, op to zero in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states.The definition section must be bracketed by “%{‘ and “%}“ markers.

Step 2:

The pattern matching is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section.

Validity of expression is checked by stack.

Index is incremented by one for opening parenthesis and decremented by one for closing parenthesis

Expression parenthesis may be a {or [ or ( or } or I or).

Stack is pushed with ‘c’ ‘s’ ‘p for ‘{‘ ‘[‘ and (respectively.

Whenever },j,) is found if stack[index] is not equal to ‘c’ ‘s’ and p then closing paranthesis is missing. Then it is not a valid expression. Set valid to zero.

Then decrement index by one.

[0-9a-zA-Z]+ à Regular expression for recognizing operand.+ indicates non empty character or digit. If operand is found then increment operand by one.

[—!@#$%&_.;”,.?<>\\] - If any special character is found in a expression then expression is not a valid.

[\f] à If pattern matches + then increment plus and operators by one.

Similarly for -,* and / operators

Step 3:

The next step is the actual implementation of the program in the subroutine section. Here yylex() function is a call to invoke the lexer and it return the matched pattern. ln the main part if the expression is valid and the number of operands are number of operators plus 1 or operands and operators are equal to zero then the expression is valid. The number of operators (+,-,*,/) are printed. The numbers of operands are also printed.

7. Lex program to recognize whether a given sentence is a simple or

compound.

Step 1:

Initialize cmp and smp to zero in the definition section. The definitions section is must be bracketed with “%{“and”%}” markers.

Step 2:

The rule section is bracketed between %% and %%. In this program pattern matching is done for checking if any of the compound words like if, while, switch, for and do are encountered and if encountered then cmp is set to the value of 1.

Step 3:

The next step is the actual implementation of the program in the subroutine section. Here yylex() function is a call to invoke the lexer and it return the matched pattern. In this program if cmp = 1 then the sentence is compound else simple.

8. Lex program to recognize and count the no identifiers in a iven

input file.

Step 1:

Initialize i=0, d=0, c=0, f=0 and count 0 in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states. The definition section is enclosed in %{ and %} markers.

Step 2:

The next step consists of pattern matching to match the datatype. This is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. In rules section regular expression for datatypes and any valid identifier is written.

(\i’nt) i++; à If character matches int keyword followed by any identifier then increment variable i

(\double)d++; à If character matches double keyword followed by any identifier then increment variable d

(\char) c++; à If character matches char keyword followed by any identifier

then increment variable c

(\float) f++; à If character matches float keyword followed by any identifier

then increment variable f

Step 3:

The next step is the actual implementation of the program and the last section of lex specification is the user subroutine section. This section contains support routines. In this section the main routine calls yylex() function and prinif statements. Here yylex()is a function is Ito invoke the lexer and it return the matched pattern. The printf will display the number of identifiers based on data type matching.

PART B

The YACC require both Lex(lexer) and Yacc(parser) code.

The Lexer

In order to run parser we need a lexer to feed it tokens. The parser is a high level routine and calls lexer yylex() whenever it needs a token from input. As soon as the lexer finds a token of interest to the parser, it returns to the parser, returning the token code as the value. Yacc defines the token names in the parser as C preprocessor names in y.tab.h so the lexer can use them.

AYacc parser

A yacc grammer has the same three -part structure as a lex specification. The first section, the definition section handles control information for the yacc generated parser and generally sets up the execution environment in which the parser will operate.

The second section contains the rules for the parser.

The third section is C code copied verbatim into the generated C program.

1. Yacc program to test the validity of a simple expression involving

operators. +,-,* and /.

Step 1:

Explicitly declare yylval

yylval : Whenever the lexer returns a token to the parser, if the token has an associated value ,the lexer will store the value in yylval before returning.

Include the header file “y.tab.h” in the definition section.

The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states. Code in the definitions section is simply copied as-is to the top of the generated C file, and must be bracketed with “%{‘ and “%}“ markers. Substitutions simplify pattern-matching rules.

Step 2:

The next step consists of pattern matching to be done in the lex program. This is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. However, if we don’t specify any rules, then the default action is to match everything and copy it to output. Defaults for input and output are stdin and stdout, respectively.

{O-9]+ - Regular expression for strings of digits as numbers.

yytext contains the text that matched the pattern. The string of digits is taken and converted to integers and the value is stored in yylval. The value is returned as NUMBER. A newline returns an end of input token (number zero) to tell the parser that there is no more to read.

Step 3:

The next step is the YACC part of the program. It consists of definition section in which all the definitions, header files are stored .The second section is the rules section in which the pattern matching is done .It resembles the BNF grammar The left-hand side of a production, or non terminal, is entered left-justified, followed by a colon. This is followed by the Right-hand side of the production. Actions associated with a rule are entered in braces.
By utilizing left-recursion, we have specified that a program consists of zero or more expressions. Each expression terminates with a new line. When a new line is detected, we print the value of the expressions. When we apply the rule

expr: expr’ ‘+’ expr { $$ = $1 + $3;}

$$ à Contains the value on left-hand side.

$1 and $3 à Contains the value of first and second operand

$2 à contains the value of operator.

We replace the right-hand side of the production in the parse stack with the left-hand side of the same production. In this case, we pop ‘expr ‘+‘ expr” and push “expr”. We have reduced the stack by popping three terms off the stack, and pushing back one term. We may reference positions in the value stack in our C code by specifying “$1” for the first term on the right-hand Side of the production, “$2” for the second, and so on. “$$” designates the top of the stack after Reduction has taken place. The above action adds the value associated with two expressions, Pops three terms off the value stack, and pushes back a single sum. Thus, the parse and value Stacks remain synchronized. Numeric values are initially entered on the stack when we reduce from NUMBER to expr. After NUMBER is shifted to the stack, we apply the rule

Expr: NUMBER {$$ = $1;}

The NUMBER token is popped off the parse stack, followed by a push of expr. For the value Stack, we pop the integer value off the stack, and then push it back on again. In other words, we do nothing. In fact, this is the default action, and need not be specified. Finally, when a new line is encountered, the value associated with expr is printed

Step 4:

The next step is the actual implementation of the program. The entry point to a yacc generated parser is function main() and yyparse()

yyparse():
When program calls yyparse() the parser attempts to parse an input stream. The parser returns a value of zero if the parse succeeds and non-zero if not. Every time you call yyparse() the parser starts parsing a new forgetting whatever state have been in the last time it returned. This is similar to yylex() which picks up where it left off each time you call it.

The function yyparse() parses an input expression. lf it founds valid then print as a valid expression else execute yyerror() function with string as parameter to print invalid expression.

yyerror( ):

In the event of syntax errors, yacc calls the user-supplied function yyerror. If you need to modify the interlace to yyerror, you can alter the canned file that yacc includes to fit your needs.

2. Yacc program to recognizes nested IF control statement and

displays the number of levels of nesting

Step 1:

Initialize yylval which is the value associated with the token and include the header file “y.tab.h” in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states. Code in the definitions section is simply copied as-is to the top of the generated C file, and must be bracketed with “%{“ and “%}“ markers. Substitutions simplify pattern-matching rules.

Step 2:

Regular expressions for matching valid nested if statement:

[\t\n]+; - If whitespace and newline are found do not do anything.

If {return IF;} à In a pattern lex replaces the name inside the braces { } with substitution. If “if” keyword is found, the action is return IF to yacc.

Similarly for “(“return OPP(opening paranthesis)

“)“return CLP (closing paranthesis)

“{“return OPB (opening brace)

“}“return CLB( return closing brace)

“<“||”>“ return RELOP(relational operator)

I “W” return LOGOP(logical operator)

I “-” return AROP (arithmatic operator)

[0-9]+ {return NUMBER} - matches one or more occurrences of the (nonempty) digits.

“;”à return SEM(semicolon).

[a-z]+ à return ID à Any combination of letters is an identifier.

.:à matches any single character except the newline character

(“\n)ignore it.

Step 3:

The next step is the YACC part of the program.

It consists of definition section in which all the definitions, header files are stored.

The numbers of levels are initialized to zero.

All the tokens are defined.

The second section is the rules section in which the pattern matching is done. It resembles the BNF grammar. The left-hand side of a production, or non terminal, is entered left- justified, followed by a colon. This is followed by the right-hand side of the production. Actions associated with a rule are entered in braces. The program the pattern matching is done to find whether the Nested IF stateme1s are encountered and if they are encountered then number of levels are incremented for every if statement found.

If OPP COND CLP OPB explicit callifcond CLB is found then increment level by 1.

Callifcond is called as many levels.

Check for COND matching pattern for ex. COND: exp LOGOP exp

Explist indicates simply expression followed by simicolon for ex. I++:

Step 4:

The next step is the actual implementation of the program in user subroutine section.

Call the function yyparse() and printf to print the number of levels.

In the event of syntax errors, yacc calls the user-supplied function yyerror. If you need to modify the interface to yyerror, you can alter the canned file that yacc includes to fit your needs

4. Yacc program to recognize the validity of a variable which starts

with a letter, followed by any number of letters of digits.

Step 1:

Explicity declare yylval which is the value associated with the token and include the header file “y.tab.h” in the definition section. The definition section is bracketed with “%{“and “%}” markers.

Step 2:

The next step consists of pattern matching to be done in the lex program. This is done in the rules section and it is bracketed between %% and %%. In this section pattern matching is done to return the letter or digit.

[\t\n]+; ignore white space and newline

[a-z] à If any character is found return a letter token to yacc.

[0-9] à If any digit is found return a digit token to yacc.

Step 3:

The next step is the YACC part of the program. It consists of definition section in which all the definitions, header files are declared. The second section is the rules section in which the pattern matching is done. It resembles the BNF grammer the left-hand side of a production or non terminal is entered left justified, followed by a colon. This is followed by the right-hand side of the production. In this program the pattern matching to check whether a variable name is always a letter is followed by number.

Step 4:

The next step is the actual implementation of the program.

The yyparse() function is called the parser attempts to parse an input stream. In the event of syntx errors, yacc calls the user-supplied function yyerror. If you needto modify the interface to yyerror, you can alter the canned file that yacc includes to fit your needs. In the main part if the variable name is valid then valid value is returned else a yyerror is done.

5. Yacc program to evaluate an arithmetic expression involving

operators +, and /.

Step 1:

Explicitly declare yylval which is the value associated with the token and include the header file “y.tab.h” in the definition section.

Step 2:

The next step consists of pattern matching to be done in the lex program. This is done in the rules section and it is bracketed between %% and %%.The first %% is always required, as there must always be a rules section. In this program all the numbers and the strings are bracketed in [].When the pattern matching is found then the any string found should be converted into a number and returned.

All the tokens for (, ), +, -, b, / are returned to yacc.

Step 3:

All the tokens are declared.

% left ‘+‘ ‘-‘

% right ‘*‘

Each of these declarations defines a level of precedence. They tell yacc that ‘+’ and are left associative and at the lowest precedence level. ‘*’ and ‘I’ are right associative and at a higher precedence level.

It resembles the BNF grammar The left-hand side of a production, or non terminal, is entered left-justified, followed by a colon. This is followed by the right-hand side of the production. Actions associated with a rule are entered in braces. By utilizing left-recursion, we have specified that a program consists of zero or more expressions. Each expression terminates with a new line. When a new line is detected, we print the value of the Expression. When we apply the rule

expr: expr ‘+‘ expr { $$ = $1 + $3;}

We replace the right-hand side of the production in the parse stack with the left-hand side of the same production. In this case, we pop ‘expr ‘+‘ expr” and push “expr”. We have reduced the stack by popping three terms off the stack, and pushing back one term. We may reference positions in the value stack in our C code by specifying ‘$1’ for the first term on the right-hand Side of the production, “$2” for the second, and so on. “$$“ designates the top of the stack after Reduction has taken place. The above action adds the value associated with two expressions, Pops three terms off the value stack, and pushes back a single sum. Thus, the parse and value Stacks remain synchronized. Numeric values are initially entered on the stack when we reduce from NUMBER to expr.

Step 4:

The next step is the actual implementation of the program. yyparse() will parse the input expressions. In the event of syntax errors, yacc calls the user-supplied function yyerror. If you need to modify the interface, you can alter the canned file that yacc includes to fit your needs.

6. Yaac program to recognize strings ‘aaab ‘,’ abbb’, ‘ab’ and ‘a’.

Using the grammar (a^n b^n, n>=0).

Step 1:

Explicitly declare the yylval variable which holds the value associated with the token and include the header file “y.tab.h” in the definition section. The definition section must be bracketed with “%{“and “%}” markers.

Step 2:

The next step consists of pattern matching to be done in the lex program. This is done in the rules section and it is bracketed between %% and %%. The first %% is always required, as there must always be a rules section. In this section pattern matching is done to return the letters a or b.

Step 3:

The next step is the YAAC part of the program. It consists of definition section in which all the definitions, header files are stored. The second section is the rules section in which the pattern matching is done. It resembles the BNF grammar. The left-hand side of a production, or non terminal, is entered left-justified, followed by a colon. This is followed by the right-hand side of the production. In this program the pattern matching to check whether the strings matches the given strings. The production rules for a^nb^n is written in this section.

Step 4:

The next step is the actual implementation of the program. Yyparse() function is invoked to parse the input string. In the event of syntax errors, yacc calls the user-supplied function yyerror. If you need to modify the interface to yyerror, you can alter the canned file that yacc includes to fit your needs. In the main part if the string is valid then valid string is displayed else yyerror is called to print invalid string.

7. Yacc program to recognize the grammar {aⁿ .b. n ³ 0}

Step 1:

Initialize yylval which is the value associated with the token and include the header file “y.tab.h” in the definition section. The initializations are done in the definition section because the definitions section is composed of substitutions, code, and start states. Code in the definitions section is simply copied as-is to the tope of the generated C file, and must be bracketed with “%{“ and “%}” markers. Substitutions simplify pattern matching rules.

Step 2:

The next step consists of pattern matching to be done in the lex program. This is done in the rules section and it is bracketed between %% and %%. The first %% is always required as there must always be a rules section. However, if we don’t specify any rules, then the default action is to match everything and copy it to output. Defaults for input and output are stdin and stdout, respectively. In this section pattern matching is done to return the letters a or b.

Step 3:

The next step is the YACC part of the program. It consist of definition section in which all the definitions, header files are stored. The second section is the rules section in which the pattern matching is done. It resembles the BNF grammar. The left-hand side of production, or non terminal, is entered left-justified, followed by a colon. This is followed by the right-hand side of the production. In this program the pattern matching to check whether the given strings matches the given grammar a^nb. The production rules for a^nb is written in rules section. Whenever the token values for a and b with input string is passed from lex to yacc, the input is evaluated as per rules.

Step 4:

The next step is the actual implementation of the program. Yyparse() parse the input string. In the event of syntax errors, yacc calls the user-supplied function yyerror. If you need to modify the interface to yyerrror, you can alter the canned file that yacc includes to fit your needs. In the main part if the grammer is valid then valid string is displayed else yyerror is called to print invalid.

Search This Blog

LM World

Use two UNIX utilities, lex and yacc, in program development

Comments

Post a Comment

Popular posts from this blog

Chemical test for Tragacanth

BLUE MATCHING & WEDGE MATCHING OF CORE & CAVITY

Chemical test for Benzoin