Hi, I want to share with you my progress on compiler construction. Right now, I am at the stage where my compiler produces the following image from the definition of the function Math.multiply(), written in the Jack language.
To this end, I made up another intermediate representation, inspired by the one used by the LLVM compiler. It is basically a list of instructions. I have implemented several procedures that operate on this list. One of them translates the IR to minimal SSA; others perform dead code elimination and copy propagation. I am somewhat able to translate out of SSA form, producing correct code, but one containing lots of unnecessary copies. Also, I have not attempted any register allocation or code generation yet.
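To give you a feeling for the general idea, here is a minimal Python sketch of such a list-of-instructions IR together with a trivial dead code elimination pass. All names and the instruction format are illustrative, not my actual implementation:

```python
# Each instruction: (dest, op, operands). dest=None marks a side effect
# (e.g. a return), which must never be removed.
ir = [
    ("%1", "param", (0,)),        # first argument
    ("%2", "param", (1,)),        # second argument
    ("%3", "add", ("%1", "%2")),  # result never used -> dead
    ("%4", "mul", ("%1", "%2")),
    (None, "ret", ("%4",)),
]

def dead_code_elimination(ir):
    """Drop instructions whose result is never used (assumes no
    side effects other than dest=None instructions)."""
    used = set()
    for dest, op, operands in ir:
        used.update(o for o in operands if isinstance(o, str))
    return [(d, op, ops) for d, op, ops in ir
            if d is None or d in used]

ir = dead_code_elimination(ir)
# the unused add into %3 is removed; the other four instructions survive
```

A single pass suffices here only because the example has no chains of dead instructions; a real pass would iterate until a fixed point.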
I have experimented with subroutine inlining, strength reduction and redundant load elimination, but the results are not correct so far.
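For readers unfamiliar with strength reduction: the idea is to replace an expensive operation with a cheaper equivalent. One rewrite that fits the HACK machine (whose ALU has addition but no multiplier) is turning multiplication by 2 into an addition. A hedged sketch on the illustrative IR format, not my actual code:

```python
def strength_reduce(ir):
    """Rewrite x * 2 into x + x; everything else passes through.
    Constants are assumed to appear as literal ints in the operands."""
    out = []
    for dest, op, operands in ir:
        if op == "mul" and len(operands) == 2 and 2 in operands:
            other = operands[0] if operands[1] == 2 else operands[1]
            out.append((dest, "add", (other, other)))  # x * 2 -> x + x
        else:
            out.append((dest, op, operands))
    return out

print(strength_reduce([("%5", "mul", ("%4", 2))]))
# -> [('%5', 'add', ('%4', '%4'))]
```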
I will try to keep you updated as I implement more features, which may take me several years :)
Little by little, I have made some progress and I want to share it. We start where we left off in the previous post and apply out-of-SSA translation to the control flow graph (CFG).
You can see that my compiler generated some unnecessary copies at the end of blocks L7, L9 and L10. These could be coalesced by a better algorithm, but I do not want to spend too much time on that right now. Also note that empty nodes are removed (L6 is not empty; it contains a branching instruction).
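Those copies at block ends are exactly what the conventional out-of-SSA step produces: each phi instruction is deleted and replaced by a copy appended in every predecessor block. A rough Python sketch of that step, with made-up block and register names (this is the textbook approach, not necessarily my exact implementation):

```python
def eliminate_phis(blocks):
    """blocks: {name: {"phis": [(dest, {pred: src})], "body": [...]}}.
    Each body is assumed to end with its terminator instruction."""
    for block in blocks.values():
        for dest, sources in block["phis"]:
            for pred, src in sources.items():
                # place the copy just before the predecessor's terminator
                blocks[pred]["body"].insert(-1, (dest, "copy", (src,)))
        block["phis"] = []
    return blocks

cfg = {
    "L7":  {"phis": [], "body": [("%1", "const", (0,)), (None, "jmp", ("L10",))]},
    "L9":  {"phis": [], "body": [("%2", "const", (1,)), (None, "jmp", ("L10",))]},
    "L10": {"phis": [("%3", {"L7": "%1", "L9": "%2"})],
            "body": [(None, "ret", ("%3",))]},
}
eliminate_phis(cfg)
# L7 now copies %1 into %3 before its jump, L9 copies %2 into %3
```

This naive placement is precisely what creates the redundant copies mentioned above; a smarter algorithm would coalesce most of them away.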
Without PHI instructions in the way, we are left with code that might actually be executable on real hardware... if only it did not use so many registers. Yup, 30 registers are too many, even though the HACK computer can "emulate" registers by using fixed memory locations. However, those registers must be pushed/popped at subroutine boundaries, so we clearly have to reduce their count.
This is where register allocation kicks in. First, let me show you the results of register allocation with the register count limited to 7 and to 5, respectively.
You can see that 7 real registers are enough to hold all 30 virtual registers without interference. If the compiler is constrained to use only 5 real registers, it is forced to "spill" some registers, i.e. generate load & store instructions. My register allocator is pretty dumb at the moment: it does not calculate spill costs (partly because my basic blocks do not know whether they are part of a loop) and does not attempt to split live ranges. In particular, spilling virtual register %8 would be a much better choice. I won't bother with those issues right now and will rather dive into generating assembly code.
For those interested in the register allocator, I attach the file interference.pdf showing how it works internally. Connected virtual registers cannot share the same real register. White registers are undecided; gray registers get spilled. The algorithm iterates until all registers are decided.
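The iteration in the picture follows the classic simplify-and-spill scheme (Chaitin-style graph colouring): any node with fewer than K neighbours can always be coloured, so such nodes are removed first; when none remain, some node is marked for spilling. A hedged sketch, with the same dumb "no spill cost" choice my allocator currently makes; the graph below is illustrative:

```python
def color(interference, K):
    """interference: {vreg: iterable of interfering vregs}.
    Returns ({vreg: color in 0..K-1}, set of spilled vregs)."""
    graph = {n: set(adj) for n, adj in interference.items()}
    stack, spilled = [], set()
    while graph:
        # simplify: pick a node with degree < K if possible
        node = next((n for n in graph if len(graph[n]) < K), None)
        if node is None:
            # spill: dumb choice, no spill-cost heuristic
            node = max(graph, key=lambda n: len(graph[n]))
            spilled.add(node)
        stack.append(node)
        for m in graph.pop(node):
            graph[m].discard(node)
    colors = {}
    for node in reversed(stack):  # colour in reverse removal order
        if node in spilled:
            continue
        taken = {colors[m] for m in interference[node] if m in colors}
        colors[node] = next(c for c in range(K) if c not in taken)
    return colors, spilled

clique4 = {"%1": ["%2", "%3", "%4"], "%2": ["%1", "%3", "%4"],
           "%3": ["%1", "%2", "%4"], "%4": ["%1", "%2", "%3"]}
colors, spilled = color(clique4, 3)  # one node spilled, three coloured
```

On a 4-clique with K = 3, no node can be simplified, so exactly one register is spilled and the remaining three get three distinct colours.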
As for runtime speed, both files run at a surprisingly similar pace despite their difference in size: the SSA version is larger by about 20%. The similar runtime performance is thus a good sign for me, as I have not yet employed any serious optimizations in the jack2ssa2asm translator.
The graph shows differences in instruction distribution. You can see that the current SSA version uses more labels than the VM version. The reason is an optimization in my VM-based translator, which is sometimes able to coalesce a comparison with the following branch.
Another difference lies in the ratio between C-instructions and A-instructions. "Optimal" code is expected to have a ratio around 1:1, alternating between loading data and performing computations on it. I am thus quite satisfied with the SSA-based translator so far.
The next graph shows the frequencies of C-instructions on a logarithmic scale. I divided the graph into 4 parts.
The first part shows branching instructions. The count of unconditional jumps is directly proportional to the count of labels, as discussed before. The VM-based translator is sometimes able to coalesce a negation with a comparison, so it generates jumps that the SSA-based translator does not. The difference in relative jumps is due to my VM-based compiler generating different versions of the "call helper" for different argument counts.
The second part of the graph shows non-jump instructions generated by both translators; the last two parts show instructions generated solely by the SSA-based or VM-based translator, respectively. A striking feature is that the SSA-based translator does not utilize instructions containing the constants zero and one as much as the VM-based translator does. Zeros and ones are instead loaded through an A-instruction. This is intentional: for my VM-based translator, I included ad-hoc generation of special code for various constants. In my new SSA-based translator, I decided to generate straightforward code and have a second pass run over it, which should propagate simple constants. This second pass is not implemented yet.
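Sketched at the IR level, that planned second pass would fold constant loads into their use sites, so that 0 and 1 can later be emitted via the HACK ALU's built-in constants instead of an A-instruction. Again illustrative names only, and the sketch assumes straight-line code where every use follows its definition:

```python
def propagate_constants(ir):
    """Replace uses of const-defined registers with the constant
    itself and drop the now-unnecessary const instructions."""
    consts = {}
    out = []
    for dest, op, operands in ir:
        if op == "const":
            consts[dest] = operands[0]
            continue  # the explicit load becomes unnecessary
        operands = tuple(consts.get(o, o) for o in operands)
        out.append((dest, op, operands))
    return out

print(propagate_constants([
    ("%1", "const", (1,)),
    ("%2", "add", ("%0", "%1")),
]))
# -> [('%2', 'add', ('%0', 1))]
```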
Instructions with 100 or more instances are most likely involved in transferring data between subroutines on the stack. I am not going to describe my calling convention yet, as I would like to change it later: I intend to pass some arguments and the return value in dedicated registers.
To sum up, I am quite happy with the results so far. The runtime speed is comparable and I already see lots of improvements to implement. Plus, it is great to see the code finally running on "real" hardware -- the CPU emulator.