SE250:lab-9:rwan064
Tokeniser
I chose to do the first option, the tokeniser. It seemed interesting to me and also seemed the easiest out of all the options.
I first downloaded the tokenise.h file from the web and had a look at the tokeniser interface that I would have to implement. I then brainstormed basic ideas for how I could build the tokeniser, and came up with one idea that I thought would be easy to implement. From that I made the files below:
- Main program for testing
- SE250:lab-9:rwan064:main.c
- SE250:lab-9:rwan064:test.c - Sample file for testing.
- String type - used in the tokeniser
- Scanner interface
- Tokeniser interface
- SE250:lab-9:rwan064:tokenise.c
- SE250:lab-9:rwan064:tokenise.h - Added data types and some function declarations to the given version.
- Makefile
Compiled using GNU make and run using the command:
bin/main.exe test.c output.c
The main file just uses the functions in tokenise.h and string.h to take the input C source file (test.c) and separate the program code into identifiers (including keywords) and operators. Using the above run command, it puts the output into the file (output.c). The main program also prints every different identifier and operator to the screen as shown below.
When I ran my main.exe program I got the output:
word: /
word: *
word:
word: NULL
word: Comments
word: *
word: /
word:
word:
word: int
word: main
word: (
word: NULL
word: void
word: )
word:
word: {
word:
word: NULL
word: int
word: x
word: ;
word:
word: NULL
word: /
word: /
word: NULL
word: comment
word: NULL
word: return
word: 0
word: ;
word:
word: }
word:
word: NULL
Num of lines: 11
This output is unexpected because some of the "words" (the string printed after "word:") contain nothing, e.g. on the third line. According to my code, if the word is "nothing" then "NULL" should be printed, as it has been in some cases above. To see why this weird thing was happening, I changed a line in my code from:
printf( "word: %s\n", word );
To:
printf( "word: %d%s\n", word[0], word );
The output then came out as:
word: 47/
word: 42*
word: 13
word: NULL
word: 67Comments
word: 42*
word: 47/
word: 13
word: 13
word: 105int
word: 109main
word: 40(
word: NULL
word: 118void
word: 41)
word: 13
word: 123{
word: 13
word: NULL
word: 105int
word: 120x
word: 59;
word: 13
word: NULL
word: 47/
word: 47/
word: NULL
word: 99comment
word: NULL
word: 114return
word: 480
word: 59;
word: 13
word: 125}
word: 13
word: NULL
Num of lines: 11
From this I can see that the "nothing" words have an ASCII code of 13. I looked up the ASCII codes on asciitable.com and found that code 13 is a Carriage Return. I remembered that the UNIX format ends each line with only a Line Feed, while the Windows format uses both a Carriage Return AND a Line Feed. I hadn't specified the Carriage Return as a "separator" in my code, so each one was read as a word of its own. This is why I got that unexpected output.
So to solve this problem I just had to convert my test source file from Windows to UNIX format. Then I got the correct output shown below:
word: /
word: *
word: NULL
word: NULL
word: Comments
word: *
word: /
word: NULL
word: NULL
word: int
word: main
word: (
word: NULL
word: void
word: )
word: NULL
word: {
word: NULL
word: NULL
word: int
word: x
word: ;
word: NULL
word: NULL
word: /
word: /
word: NULL
word: comment
word: NULL
word: return
word: 0
word: ;
word: NULL
word: }
word: NULL
word: NULL
Num of lines: 11
And this is the output I want!
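Converting the file is what I did, but another way to handle the same problem would be to treat the Carriage Return as a separator in the code, so Windows-format files work without conversion. The following is only a minimal sketch of that idea: `is_separator` is a hypothetical helper name, not a function from my tokenise.c, and the set of characters it checks is an assumption.

```c
#include <stdbool.h>

/* Hypothetical helper (not from tokenise.c): returns true for characters
   that end a word without being tokens themselves. Including '\r' means a
   Windows line ending (CR LF) is handled like a UNIX one (LF only). */
static bool is_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}
```

With a check like this, a Carriage Return would be handled the same way as a space or a New Line instead of being printed as a word of its own.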
How the Tokeniser works
The basic idea behind the tokeniser is to take program code and split it into identifiers, symbols, constants, etc. A very basic explanation of my method: read one character at a time from the input file until a "separator" (a space, tab, new line, etc.) or an operator (+, -, /, ;, etc.) is found, and copy all the characters read up to that point. Then start reading again from the point where I stopped until another separator or operator is found. Keep doing this until the end of the file and I will have the source code split into all the different types of tokens I need.
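The method described above can be sketched roughly as follows. This is not the code from my tokenise.c: the names `next_word`, `is_separator`, `is_operator` and `MAX_WORD` are made up for illustration, the operator set is an assumption, and the sketch reads from a string rather than a file. Unlike my program, it skips separators silently instead of reporting them as empty ("NULL") words.

```c
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64  /* assumed maximum word length, including the '\0' */

/* Assumed character classes (hypothetical, for illustration only). */
static int is_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

static int is_operator(int c)
{
    return c != '\0' && strchr("+-*/;(){}=<>,", c) != NULL;
}

/* Copies the next word from *src into out (capacity MAX_WORD) and
   advances *src past it. Returns 1 if a word was found, 0 at the end. */
static int next_word(const char **src, char *out)
{
    const char *p = *src;
    size_t n = 0;

    /* Skip any separators in front of the word. */
    while (is_separator((unsigned char)*p))
        p++;

    if (*p == '\0') {
        *src = p;
        return 0;
    }

    if (is_operator((unsigned char)*p)) {
        /* An operator is a one-character word of its own. */
        out[n++] = *p++;
    } else {
        /* Copy characters until a separator, an operator or the end. */
        while (*p != '\0' && !is_separator((unsigned char)*p)
               && !is_operator((unsigned char)*p)) {
            if (n < MAX_WORD - 1)
                out[n++] = *p;  /* extra characters of long words are dropped */
            p++;
        }
    }

    out[n] = '\0';
    *src = p;
    return 1;
}
```

Calling `next_word` repeatedly on the text `int x;` would give the words `int`, `x` and `;`, and then return 0.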
What I then need to do is create a new Token type variable for each of these tokens. Maybe use an array. After doing this I can easily implement the tokeniser interface as given in the tokenise.h file.
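One possible way to store the tokens in an array is sketched below. The `Token` type in tokenise.h is not reproduced here, so `SketchToken`, `SketchKind`, `TokenList` and `token_list_add` are all hypothetical names used only to show the idea of a growing array of tokens.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds and token type; the real ones are in tokenise.h. */
typedef enum { TOK_WORD, TOK_OPERATOR } SketchKind;

typedef struct {
    SketchKind kind;
    char text[32];   /* assumed maximum token length, including the '\0' */
} SketchToken;

/* A growable array of tokens. Initialise with {0} before first use. */
typedef struct {
    SketchToken *items;
    size_t count;
    size_t capacity;
} TokenList;

/* Appends a copy of text to the list.
   Returns 0 on success, -1 if memory could not be allocated. */
static int token_list_add(TokenList *list, SketchKind kind, const char *text)
{
    if (list->count == list->capacity) {
        /* Double the capacity (start at 8) when the array is full. */
        size_t new_cap = list->capacity ? list->capacity * 2 : 8;
        SketchToken *grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        list->items = grown;
        list->capacity = new_cap;
    }

    SketchToken *t = &list->items[list->count++];
    t->kind = kind;
    strncpy(t->text, text, sizeof t->text - 1);
    t->text[sizeof t->text - 1] = '\0';  /* strncpy may not terminate */
    return 0;
}
```

Each word found while reading the file could be added with a call such as `token_list_add(&list, TOK_WORD, "main")`, and the array released with `free(list.items)` once it is no longer needed.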
Diagram
<html> <img src="http://studwww.cs.auckland.ac.nz/~rwan064/lab9/pic.bmp" width="488" height="399" alt="se250-lab9" /> </html>
Conclusions
- I found that planning things before programming really helps when you do bigger projects.
- Also, drawing pictures and separating the program into simpler parts makes coding much easier.
- Writing specifications for what the different functions do and what the data types are used for also helps.
In most of the previous labs, which were shorter, I could pretty much just start implementing functions without much pre-planning. In this lab I found that if I just started implementing things by figuring them out in my head, it did not work. Writing things down and drawing pictures in my lab book made things much easier.