Jack compiler written in Jack compiles itself!

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Jack compiler written in Jack compiles itself!

cadet1620
Administrator
This post was updated on .
[1-14-2015 -- New version 1.1 here]

My goal when I wrote the host file system class for the VM emulator was to write all the tools -- Assembler, VM translator and Jack compiler -- in Jack and be able to run them on the VM emulator.

I just got my Jack Jack compiler running. Here some screen shots of it running on the VME, followed by some shell screen shots showing that it produced identical output to my Python Jack compiler.

Using the FilePicker to select the JackCompiler subdirectory

Compiling the Compiler source files

 

Here's what it looks like from a command line shell on the host:

Before compiling, the only files in the directory are the compiler sources. (And Memory.vm which is my Memory class that has some instrumentation in it to help me find memory leaks. It snuck in there when I copied the sources from my development directory into this test directory; it's not required for this test.)

After compiling, all the expected VM and XML files are there. (My compilers can optionally generate the Chapter 10 XML files -- very handy for regression testing.)

The directory listing shows it took about 2 1/2 minutes for the Jack Jack compiler to compile itself. It's significantly faster if it's not generating the XML files.

I made a 'cmp' subdirectory and moved all the VM and XML files that the Jack Jack compiler generated into it.

Next I compiled the same sources with my Python Jack compiler.

Once again, all the VM and XML files, but this time it only took about 2 seconds to compile everything.

Comparing the current directory with the 'cmp' directory that has the Jack Jack compiled files show that all the VM and XML files are identical. (Ignore that wayward Memory.vm file!)

 

I'll get the Jack Jack compiler's VM files uploaded here in the next day or two so that you can play with them. It's been a long couple of days getting this debugged.

 

--Mark

Reply | Threaded
Open this post in threaded view
|

Re: Jack compiler written in Jack compiles itself!

ybakos
WOW!
Reply | Threaded
Open this post in threaded view
|

Re: Jack compiler written in Jack compiles itself!

ivant
ybakos wrote
WOW!
Exactly!
Reply | Threaded
Open this post in threaded view
|

Re: Jack compiler written in Jack compiles itself!

cadet1620
Administrator
In reply to this post by cadet1620
This zip has the VM files and the new File built in class that you need to run the Jack Jack Compiler.
JackCompiler-1-0.zip

From the README.TXT


The compiler accesses the host file system using a new VM Emulator built-in class, "File.class". File.class must be copied into the VME's built-in directory, "nand2tetris/tools/builtInVMCode".

This will allow the VME to access files in ".../builtInVMCode/VmeFileSystem". See this forum post for more about File.class:
  File class for VMEmulator

The compiler can generate both .vm and .xml files, so you can test it against any of the Chapter 10 or 11 files.

Alas, it will never run in the CPU Emulator, even if I figure out how to add built-in I/O devices so that it could have a virtual hard disk. My optimizing VM translator generated 68K of code for the compiler's VM files, without including the OS .vm files (83K with the OS)!

The compiler is rather slow. On my somewhat long-in-the-tooth laptop it runs at about 33 lines per second generating just .vm files. If you turn on the .xml option, it drops to about 13 lines per second.

There's an amazing amount of memory allocation/deallocation and function calling going on.


Symbolic Constants

One of the first challenges writing this code was how to deal with symbolic constants. For instance, the tokenizer defines values for return values like TK_KEYWORD and KW_WHILE and the VM writer defines symbols like SEG_LOCAL and OP_NEG. I didn't want to scatter integer constants like 1, 16, 2 and 5 throughout the code; that could be a debugging nightmare!

So how can one Jack class get to constants defined in another class? For that matter, how can a Jack class define constants in the first place (no fair extending the language).

The only way one class can get to another class's data is by calling one of its subroutines. There need to be functions that return the constants. For example, from VmWriter.jack:

    function int SEG_TEMP()     { return  5; }
    function int SEG_THAT()     { return  6; }
    function int SEG_THIS()     { return  7; }      // MUST be last
(SEG_THIS must be last because it is used to define an array size.)

These constants can then be use like this example from compileSubroutine():

    if (subroutineType = JackTokenizer.KW_METHOD()) {
        // 'this' is hidden argument 0
        let str = "~this~";
        do symbolTable.define (str, className, SymbolTable.ARG());
        do str.dispose();
    }

String Management

Consider the movement of an identifier name from the tokenizer to the compiler to the symbol table when a new variable is defined.

The tokenizer allocates a new String to hold the identifier when it is parsed. The Tokenizer.identifier() method may be called multiple times for the same identifier, so the returned String must be considered owned by the tokenizer. The compiler gets this String from the tokenizer, passes it to the Symbol table which then creates a new String copied from the tokenizer's String. The copies string is then added to the hash table.

The next time that the tokenizer parses an identifier, the existing identifier String will be destroyed, and a new one will be allocated for the new identifier.

I have a StrUtil class that is a collection of useful functions like duplicate() that does the allocation and copy used in the above example.

Another interesting issue is string concatenation. For instance a subroutine name is formed as full_name = class_name + '.' + subroutine_name.

More StrUtil functions to the rescue:

    // Target name = objectName.subroutineName
    let objectName = StrUtil.appendCharGrow(objectName, 46);    // '.'
    let objectName = StrUtil.appendGrow(objectName, subroutineName);
    do vmWriter.writeCall(objectName, numArgs);
The reason for assigning the return value back to objectName is that if objectName is too small to hold the result, the appendGrow() functions will allocate a new String that is large enough to hold the result and destroy the original objectName String. The new String has a different address than the original objectName, so the local variable must be updated.

(This sort of thing goes on automatically behind the scenes when you use strings in a language like Java or Python. Your string variables are pointers to pointers so that library routines can change the indirect pointers when they need to reallocate the string.)

There are lots of printed error messages in the code like "Expected something, got whatever" that need to print lots of string constants. OutUtil has printStringConst() that automatically destroys its argument. Here's another example from the compiler:

/**
 *  Expect the current token to be a symbol.  Print error message
 *  and abort compile if this is not so.
 *
 *  Returns true if a matching symbol is found.
 *  Returns false to abort compile.
 */
    method boolean _expectSymbol(char symbol) {
        if ((token.tokenType() = JackTokenizer.TK_SYMBOL()) & (token.symbol() = symbol)) {
            return true; }

        do _printErrorLine();
        do OutUtil.printStringConst("Expected ");
        do _printSymbol(symbol, false);
        do OutUtil.printStringConst(", got ");
        do _printToken();
        do Output.println();
        return false;
    }
[Enough for now. Next installment will be dealing with error returns without memory leaks.]

--Mark

Reply | Threaded
Open this post in threaded view
|

Re: Jack compiler written in Jack compiles itself!

cadet1620
Administrator

Preventing memory leaks in abnormal returns

The grammars for the statements can be fairly complex. Consider the if statement
  ' if '  ' ( '  expression  ' ) '  ' { '  statements  ' } '  (  ' else '  ' { '  statements  ' } '  ) ?
An error can occur if any symbol is missing or if any of the sub-elements contain an error.

When an error occurs, a message needs to be printed, the current file being compiled needs to be aborted, and the next file in the directory needs to be compiled.

The normal way to handle this in a language like Java or Python is to use exceptions. Objects that have been allocated on the stack will be automatically destroyed as the stack is unwound up to the exception handler. The problem is that Jack has neither exceptions nor automatic variables.

In Jack, it makes sense to use nested if/then/else to handle this if the nesting isn't too deep. In the if statement compiler it looks like there are about a dozen places parsing could go wrong; that's going to be pretty deep nesting.

I originally wrote the compiler by transliterating the Python version to Jack, replacing the exceptions with cleanup code and returns. The tricky bit was ensuring that all the Strings that are allocated are deallocated in the error returns.

let trueLabel = _uniqueLabel();     // Returns a new String
let elseLabel = _uniqueLabel();
...
do xml.writeToken(token);
if ( ~ _expectSymbol(41)) {     // ')'
    do trueLabel.dispose();
    do elseLabel.dispose();
    return false;
}
do token.advance();

do xml.writeToken(token);
if ( ~ _expectSymbol(123)) {    // '{'
    do trueLabel.dispose();
    do elseLabel.dispose();
    return false;
}
do token.advance();
...
let skipLabel = _uniqueLabel();
...
if ( ~ _expectKeyword(JackTokenizer.KW_ELSE())) {
    do trueLabel.dispose();
    do elseLabel.dispose();
    do skipLabel.dispose();
    return false;
}
...
do trueLabel.dispose();
do elseLabel.dispose();
do skipLabel.dispose();
return true;
But there's a problem in the above code. skipLabel is only allocated if there is an "else" clause. My Memory.Jack doesn't care if this is NULL, but that is not a specified behavior for the Hack OS. The dispose() call should be protected.
do trueLabel.dispose();
do elseLabel.dispose();
if (~(skipLabel = nul)) {
    do skipLabel.dispose(); }
return true;
While I was looking through the parsing code checking for other places I might be deallocating unallocated Strings, I found multiple places where Strings were not deallocated in error returns, and one place where I could double deallocate a String in an error path. (Double deallocation is very bad; it tends to result in corrupted heap structure!)

As I was thinking about how to inspect the code and ensure that all error cases result in correct deallocation, I came to the conclusion that the only effective way to ensure this in Jack was to rewrite the parsing routines using single exit point code structure.

The new code is fairly clean:

method boolean compileIf() {
    var boolean ok, hasElse;
    var String trueLabel;   // = null
    var String elseLabel;   // = null
    var String skipLabel;   // = null

    do xml.start("ifStatement");

    do xml.writeToken(token);
    let ok = _expectKeyword(JackTokenizer.KW_IF());

    if (ok) {
        do token.advance();

        do xml.writeToken(token);
        let ok = _expectSymbol(40);     // '('

    } if (ok) {
        do token.advance();

        let ok = compileExpression();

    } if (ok) {
        let trueLabel = _uniqueLabel();
        let elseLabel = _uniqueLabel();
// trueLabel allocated
// elseLabel allocated
...
    if (~(trueLabel = null)) {
        do trueLabel.dispose(); }
    if (~(elseLabel = null)) {
        do elseLabel.dispose(); }
    if (~(skipLabel = null)) {
        do skipLabel.dispose(); }
    return ok;
}
There are a couple of things to watch out for using this style.
 • While loops need to include ok in their condition so that they break on errors.
 • While loops that contain multiple calls that can fail need to wrap succeeding calls with if(ok) because there is no way to break in the middle of a loop.
while (ok & ( ~ break)) {
    let ok = _expectIdentifier();
    if (ok) {
        // Add class variable to symbol table
        let ok = symbolTable.define(token.identifier(), type, kind);
    } if (ok) {
...
        // End of list
        let break = true;
    }
}
 • When a String is moved from one variable to another, the source variable needs to be set to NULL to prevent double deallocation.
// subroutineName allocated
...
    if (sym = 46) { // '.'
        // Explicit object/class name. Subroutine name will follow '.'.
        let objectName = subroutineName;
        let subroutineName = null;
// objectName allocated
// subroutineName deallocated
...
        let subroutineName = StrUtil.duplicate(token.identifier());
// subroutineName allocated

Conclusions

Jack encourages single exit point code structure to manage string allocation and deallocation.

Jack encourages small subroutines to keep error handling code manageable.

  --Mark

Reply | Threaded
Open this post in threaded view
|

Re: Jack compiler written in Jack compiles itself! (Version 1.1 files)

cadet1620
Administrator
Version 1.1 with [hopefully] leak-proof CompilationEngine, and a couple other bug fixes:

    Fix missing abort after undefined symbol error.
    Add duplicate symbol error.
    Stay in same directory after compiling a single file.
    Fix 'more' after compiling a single file.

JackCompiler-1-1.zip

If you want to see the source code, email me.

--Mark
Reply | Threaded
Open this post in threaded view
|

Re: Jack compiler written in Jack compiles itself! (Version 1.1 files)

tungsten
This is incredible!