SE250:lab-9:rwan064

Tokeniser

I chose to do the first option, the tokeniser. It seemed interesting to me and also seemed the easiest out of all the options.

I first downloaded the tokenise.h file from the web and had a look at the tokeniser interface that I would have to implement. I then brainstormed basic ideas for how to build the tokeniser and settled on one that I thought would be easy to implement. From that I made the files below:

Main program for testing
- SE250:lab-9:rwan064:main.c
- SE250:lab-9:rwan064:test.c - Sample file for testing.
String type - used in the tokeniser
- SE250:lab-9:rwan064:string.c
- SE250:lab-9:rwan064:string.h
Scanner interface
- SE250:lab-9:rwan064:scanner.c
- SE250:lab-9:rwan064:scanner.h
Tokeniser interface
- SE250:lab-9:rwan064:tokenise.c
- SE250:lab-9:rwan064:tokenise.h - Added data types and some function declarations to the given version.
Makefile
- SE250:lab-9:rwan064:Makefile

Compiled using GNU make and run using the command:

bin/main.exe test.c output.c

The main file just uses the functions in tokenise.h and string.h to take the input C source file (test.c) and separate the program code into identifiers (including keywords) and operators. Using the above run command, it puts the output into the file (output.c). The main program also prints every different identifier and operator to the screen as shown below.

When I ran my main.exe program I got the output:

word: /
word: *
word:
word: NULL
word: Comments
word: *
word: /
word:
word:
word: int
word: main
word: (
word: NULL
word: void
word: )
word:
word: {
word:
word: NULL
word: int
word: x
word: ;
word:
word: NULL
word: /
word: /
word: NULL
word: comment
word: NULL
word: return
word: 0
word: ;
word:
word: }
word:
word: NULL
Num of lines: 11

This output is unexpected because some of the "words" (the string printed after "word:") are empty, e.g. in the third line. According to my code, if the word is "nothing" then "NULL" should be printed, as it has been in some cases above. To see why this was happening, I changed a line in my code from:

printf( "word: %s\n", word );

To:

printf( "word: %d%s\n", word[0], word );

The output then came out as:

word: 47/
word: 42*
word: 13
word: NULL
word: 67Comments
word: 42*
word: 47/
word: 13
word: 13
word: 105int
word: 109main
word: 40(
word: NULL
word: 118void
word: 41)
word: 13
word: 123{
word: 13
word: NULL
word: 105int
word: 120x
word: 59;
word: 13
word: NULL
word: 47/
word: 47/
word: NULL
word: 99comment
word: NULL
word: 114return
word: 480
word: 59;
word: 13
word: 125}
word: 13
word: NULL
Num of lines: 11

From this I can see that the "nothing" words have an ASCII code of 13. I looked up the ASCII codes on asciitable.com and found that code 13 is a Carriage Return. I remembered that the UNIX format ends each line with a Line Feed only, but the Windows format uses both a Carriage Return AND a Line Feed. I hadn't specified Carriage Return as a "separator" in my code, so each one was read as a one-character word, and printing it produced the seemingly empty lines. This is why I got that unexpected output.
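A debugging print that makes invisible characters visible can expose this kind of problem directly. The following is only a sketch: escape_word is a hypothetical helper that is not part of my lab code, and it assumes the word is an ordinary NUL-terminated C string.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical debugging helper (not from the lab code): copies word into
 * out, replacing every non-printable byte with a \xNN escape so that
 * characters such as Carriage Return show up in the output. Stops early if
 * out (capacity cap) would overflow. */
void escape_word(const char *word, char *out, size_t cap)
{
    assert(word != NULL && out != NULL && cap > 0);
    size_t n = 0;
    out[0] = '\0';
    for (; *word != '\0'; word++) {
        unsigned char c = (unsigned char)*word;
        size_t need = isprint(c) ? 1 : 4;   /* "\xNN" takes four characters */
        if (n + need >= cap)
            break;                          /* keep room for the terminator */
        if (isprint(c)) {
            out[n++] = (char)c;
            out[n] = '\0';
        } else {
            n += (size_t)snprintf(out + n, cap - n, "\\x%02X", (unsigned)c);
        }
    }
}
```

With this, a call such as printf( "word: [%s]\n", buf ) after escape_word( word, buf, sizeof buf ) would show a lone Carriage Return as [\x0D] instead of an apparently empty line.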

So to solve this problem I just had to convert my test source file from Windows to UNIX format. Then I got the correct output shown below:

word: /
word: *
word: NULL
word: NULL
word: Comments
word: *
word: /
word: NULL
word: NULL
word: int
word: main
word: (
word: NULL
word: void
word: )
word: NULL
word: {
word: NULL
word: NULL
word: int
word: x
word: ;
word: NULL
word: NULL
word: /
word: /
word: NULL
word: comment
word: NULL
word: return
word: 0
word: ;
word: NULL
word: }
word: NULL
word: NULL
Num of lines: 11

And this is the output I want!
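Converting the file fixes this one input, but another option would be to make the tokeniser itself tolerate Windows line endings. The sketch below shows two possible ways of doing that. Both function names are hypothetical and are not taken from my scanner.c or tokenise.c: is_separator stands in for whatever separator test the scanner uses, and strip_carriage_returns is a pre-processing step that could be applied to a buffer before scanning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical separator test: Carriage Return ('\r') is listed alongside
 * space, tab and Line Feed, so Windows line endings end a word cleanly. */
bool is_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Alternative: remove every '\r' from a NUL-terminated buffer in place.
 * Returns the number of characters removed. */
size_t strip_carriage_returns(char *s)
{
    assert(s != NULL);
    size_t removed = 0;
    char *dst = s;
    for (const char *src = s; *src != '\0'; src++) {
        if (*src == '\r')
            removed++;          /* drop the Carriage Return */
        else
            *dst++ = *src;      /* keep everything else */
    }
    *dst = '\0';
    return removed;
}
```

Either approach would let the same test file work in both Windows and UNIX format without converting it first.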

How the Tokeniser works

The basic idea behind the tokeniser is to take program code and split it into identifiers, symbols, constants, etc. A very basic explanation of my method is to read one character at a time from the input file until a "separator" (a space, tab, new line, etc.) or an operator (+, -, /, ;, etc.) is found, and copy all the characters read up to that point. Then start reading again from the point where I stopped until another separator or operator is found. Keep doing this until the end of the file and I will have the source code split into all the different types of tokens I need.
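The reading loop described above can be sketched roughly as follows. This is not my actual scanner code: it works on an in-memory string instead of a file, the function name next_token and the separator and operator sets are assumptions made for the example, and unlike my program it skips empty words between consecutive separators rather than reporting them as NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed character classes; the real lab code may use different sets. */
static bool is_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

static bool is_operator(int c)
{
    return c != '\0' && strchr("+-*/=;(){}", c) != NULL;
}

/* Copies the next token starting at *cursor into out (capacity cap) and
 * advances *cursor past it. Returns false once the input is exhausted. */
bool next_token(const char **cursor, char *out, size_t cap)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && cap >= 2);
    const char *p = *cursor;

    while (is_separator((unsigned char)*p))
        p++;                            /* skip separators between tokens */

    if (*p == '\0') {
        out[0] = '\0';
        *cursor = p;
        return false;
    }

    size_t n = 0;
    if (is_operator((unsigned char)*p)) {
        out[n++] = *p++;                /* an operator is a one-character token */
    } else {
        while (*p != '\0' && !is_separator((unsigned char)*p)
                          && !is_operator((unsigned char)*p)) {
            if (n + 1 < cap)
                out[n++] = *p;          /* over-long words are truncated */
            p++;
        }
    }
    out[n] = '\0';
    *cursor = p;
    return true;
}
```

Calling next_token repeatedly on the same cursor until it returns false yields the words and operators in order, which is the same splitting my main program prints.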

What I then need to do is create a new Token type variable for each of these tokens. Maybe use an array. After doing this I can easily implement the tokeniser interface as given in the tokenise.h file.
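One way the array idea could look is a growable list of tokens. The sketch below is only an illustration: the Token, TokenKind and TokenList types and the token_list_push and token_list_free functions are hypothetical, and the real Token type has to match whatever the given tokenise.h declares.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token representation, not the one from tokenise.h. */
typedef enum { TOK_WORD, TOK_OPERATOR } TokenKind;

typedef struct {
    TokenKind kind;
    char *text;             /* heap-allocated copy of the token text */
} Token;

typedef struct {
    Token *items;
    size_t count;
    size_t capacity;
} TokenList;

/* Appends a copy of text to the list, growing the array when it is full.
 * Returns 0 on success and -1 if memory allocation fails. */
int token_list_push(TokenList *list, TokenKind kind, const char *text)
{
    assert(list != NULL && text != NULL);
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 8;
        Token *grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        list->items = grown;
        list->capacity = new_cap;
    }
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, text, len);
    list->items[list->count].kind = kind;
    list->items[list->count].text = copy;
    list->count++;
    return 0;
}

/* Releases every token and resets the list to empty. */
void token_list_free(TokenList *list)
{
    assert(list != NULL);
    for (size_t i = 0; i < list->count; i++)
        free(list->items[i].text);
    free(list->items);
    list->items = NULL;
    list->count = 0;
    list->capacity = 0;
}
```

A list declared as TokenList list = {0}; starts empty, and the tokeniser interface functions could then walk list.items from index 0 to list.count - 1.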

Diagram

Conclusions

I found that planning things before programming really helps when you do bigger projects.
Also, drawing pictures and separating the program into simpler parts makes coding much easier.
Writing specifications for what the different functions do and what the data types are used for also helps.

In most of the previous labs, which were shorter, I could pretty much just start implementing functions without much pre-planning. In this lab I found that just starting to implement things by figuring them out in my head does not work. Writing things down and drawing pictures in my lab book made things much easier.
