SE250:HTTB:Parsing:Lexical Analysis

From Marks Wiki
Jump to navigation Jump to search
Previous Page Contents Next Page



While Lexical Analysis is not strictly part of Parsing, the two are quire closely linked, as the output of a Tokeniser(Which performs Lexical Analysis) is what a Parser works on.

Tokens

Tokens are simply blocks of text that are recognized to be important to a program. Tokens can be specific constant combinations of characters, ie, "for", "+", ")", etc, or they can be a combination of characters that conform to a particular defined pattern. Ie, "42" would be recognized as a number, with the "value" 42.

Example Token implimentation in C

Token structure:

typedef struct {
	token_t type;
	union {
		int symval;
		char* strval;
		int intval;
		double fltval;
	} val;
} Token;

Token types:

typedef enum {
T_SYMBOL,
T_IDENT,
T_INTEGER,
T_FLOAT,
T_STRING,
T_END,
T_NOTHING
} token_t;

Tokeniser

A Tokeniser performs the process of converting a sequence of characters (e.g. in a C source file) into a sequence of tokens. This process usually has two components - the scanner and the tokeniser. The scanner function searches for sequences of characters that conform to predefined patterns. The tokeniser then takes the resulting sections of characters, classifies them, and creates their tokens.

The screencast below shows the basic overview of the Tokeniser (Lexical analyser):

<html>

           <object id="csSWF" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="640" height="498" codebase="http://active.macromedia.com/flash7/cabs/ swflash.cab#version=9,0,28,0">
               <param name="src" value="http://www.rajithaonline.com/SE250/httb/parsing/rwan064_1/rwan064_1.swf"/>
               <param name="bgcolor" value="#1a1a1a"/>
               <param name="quality" value="best"/>
               <param name="allowScriptAccess" value="always"/>
               <param name="allowFullScreen" value="true"/>
               <param name="scale" value="showall"/>
               <param name="flashVars" value="autostart=false"/>
               <embed name="csSWF" src="http://www.rajithaonline.com/SE250/httb/parsing/rwan064_1/rwan064_1.swf" width="640" height="498" bgcolor="#1a1a1a" quality="best" allowScriptAccess="always" allowFullScreen="true" scale="showall" flashVars="autostart=false" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed>
           </object>

</html>

Intranet Link

Regular expressions

Regular expressions are used to describe regular grammars. Put simply, a grammar is regular if it's not recursive. Regexs (often also called regular expressions) have an extended syntax, making them a bit more expressive and slower to parse. They are often used in Tokenisers to define what should be tokenised.

Most text in a regular expression is treated as a literal. For example, cat will match any occurrence of the word cat. Here are some fairly standard operators:

<html>

operatorDescription
.Matches any one character
*Matches the preceding element 0 or more times.
[ ]Matches one of the characters between the brackets.

</html>


Previous Page Contents Next Page