I built a tokenizer in Python in 90 lines of code, including comments.
It is based on Python's regular expression (re) module.
I started with the scanner code example in the Python Library Reference, section 6.2 (re), "Writing a Tokenizer".
First I studied the example carefully, read about regular expressions, and tried to figure out how the example code works. The nice thing about this particular code example is that it takes you in one big leap to an intermediate level of working with regular expressions. You can change the code very easily and EXPERIMENT with it.
I hacked the code example step by step into a tokenizer for the Jack language.
After every step I had something like a working tokenizer for an expanding part of the set of Jack lexical elements.
For testing, I just scribbled down random words, symbols, numbers, and strings and watched the tokenizer happily spit out the corresponding tokens. The syntax of statements, expressions, etc. plays no role at all in tokenizing. That ran completely against my intuition; my head whirled.
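To illustrate why syntax plays no role, here is a minimal sketch in the style of the re documentation example: one named group per token kind, joined into a single pattern. The names KEYWORDS, TOKEN_SPEC and tokenize are illustrative assumptions, not the original 90-line tokenizer, and comments are not handled in this sketch.

```python
import re

# The 21 reserved words of the Jack language.
KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return",
}

# One (name, pattern) pair per lexical element; order matters.
TOKEN_SPEC = [
    ("integerConstant", r"\d+"),
    ("stringConstant", r'"[^"\n]*"'),
    ("identifier", r"[A-Za-z_]\w*"),
    ("symbol", r"[{}()\[\].,;+\-*/&|<>=~]"),
    ("skip", r"\s+"),
    ("mismatch", r"."),  # anything else is an error
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Yield (kind, value) pairs; no knowledge of Jack syntax is needed."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "skip":
            continue
        if kind == "mismatch":
            raise ValueError(f"unexpected character {value!r}")
        if kind == "identifier" and value in KEYWORDS:
            kind = "keyword"
        elif kind == "stringConstant":
            value = value[1:-1]  # drop the surrounding quotes
        yield kind, value
```

Feeding it a syntactically meaningless line such as `while } 42 "hi there" foo ;` still yields six well-formed tokens, because each token is recognized purely from its own characters.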
My steps were:
1. Get the code in a class
2. Get it to work together with the compilation engine, or in the beginning just a skeleton of it
3. Figure out how to skip over whitespace and comments in an elegant way
4. Figure out how to go back 2 steps in the token stream to facilitate parsing of Jack terms in the compilation engine
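For step 3, one possible way to skip whitespace and comments is a single pattern that is applied between tokens. This is only a sketch under assumptions: the names IGNORABLE and skip_ignorable are hypothetical, and the original tokenizer may well do it differently.

```python
import re

# Any run of whitespace, line comments and block comments.
IGNORABLE = re.compile(
    r"""
    (?: \s+            # whitespace, including newlines
      | //[^\n]*       # line comment up to the end of the line
      | /\*.*?\*/      # block comment, non-greedy so it stops at the first */
    )*
    """,
    re.VERBOSE | re.DOTALL,
)


def skip_ignorable(text, pos=0):
    """Return the index of the first character at or after pos that can start a token."""
    # The pattern can match the empty string, so match() never returns None.
    return IGNORABLE.match(text, pos).end()
```

Because the function is only called at a token boundary, a `//` inside a string constant is never mistaken for a comment: the string is consumed as one token before skipping resumes. An unterminated block comment is not skipped, so its leading `/` would surface as an ordinary symbol.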
The exact division of work between the JackTokenizer methods "has_more_tokens" and "advance" was not obvious.
You cannot check for a token without "getting" that token, so by checking, "has_more_tokens" already does half the work assigned to "advance". The Tokenizer API text is too loose to settle this issue.
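One common way to settle the division of work is to fetch one token ahead: "has_more_tokens" only inspects the prefetched token, and "advance" promotes it and fetches the next one. The sketch below assumes the tokens come from any Python iterable; the class name LookaheadTokens and the attribute names are hypothetical.

```python
class LookaheadTokens:
    """Wrap a token source so that checking for a token never consumes it."""

    _END = object()  # sentinel marking the end of the source

    def __init__(self, tokens):
        self._source = iter(tokens)
        self.current = None
        # Prefetch once, so has_more_tokens() has something to look at.
        self._next = next(self._source, self._END)

    def has_more_tokens(self):
        # Pure check: the real fetching was already done.
        return self._next is not self._END

    def advance(self):
        if not self.has_more_tokens():
            raise EOFError("no more tokens")
        self.current = self._next
        self._next = next(self._source, self._END)
        return self.current
```

With this split, calling "has_more_tokens" any number of times has no side effects, and all fetching happens in the constructor and in "advance".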
Stepping back in the token stream safely, so that the earlier state is restored completely in all cases, is not that easy to figure out, although there is a simple solution.
The tokenizer's ability to put tokens back into the token stream makes the compilation of Jack terms in the engine much more straightforward.
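One possible way to make stepping back safe is to keep every token in a list and move only an index, so that restoring an earlier state is just a subtraction. This is a sketch under that assumption, not necessarily the simple solution meant above; TokenStream, back and term_kind are hypothetical names.

```python
class TokenStream:
    """Token list plus an index; stepping back never loses information."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._index = -1  # positioned before the first token

    @property
    def current(self):
        return self._tokens[self._index] if self._index >= 0 else None

    def has_more_tokens(self):
        return self._index + 1 < len(self._tokens)

    def advance(self):
        if not self.has_more_tokens():
            raise EOFError("no more tokens")
        self._index += 1
        return self.current

    def back(self, steps=1):
        """Undo the last `steps` calls to advance()."""
        if steps < 0 or steps > self._index + 1:
            raise ValueError("cannot step back that far")
        self._index -= steps


def term_kind(stream):
    """Peek at a term that starts with an identifier, then restore the stream."""
    stream.advance()  # the identifier itself
    consumed = 1
    following = None
    if stream.has_more_tokens():
        following = stream.advance()
        consumed = 2
    stream.back(consumed)  # put both tokens back
    if following == "[":
        return "array entry"
    if following in ("(", "."):
        return "subroutine call"
    return "variable"
```

The compilation engine can call a helper like term_kind to decide which kind of term follows and then compile it from an untouched stream, going back two steps exactly as in step 4 of the list.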
The result is a general tokenizer with one token of lookahead beyond the current token, usable as a starting point for tokenizing other languages.