My optimizing VM translator

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

My optimizing VM translator

After writing a post that collects links to posts about VM optimization, I thought it might be worthwhile to share the results of the optimizations I've added to my VM translator over the years. (I'm up to version 3 of the translator. Most of this work has been done when I've been between contracts and needed something mentally stimulating.)

My motivation for adding better size optimization to my VM translator was to be able to run the demonstration program for my floating point class in the CPU emulator. My original VM translator with only call, return, gt, lt and eq was good enough to get my project 9 game (minus the intro text page) running on the CPU emulator, but Float was too big, about 44K. Dead code removal was the answer to get Float.asm below 32K.

I later wrote a Trigonometry class that uses Float, and it has become my test program of choice for my VM translator optimization. My original translator generates about 47K for Trig. My current translator generates 26K for speed optimization and 22K for size optimization. Here's a quick review of my optimization journey.

46832 words — The journey begins.
This is my current translator with all optimization turned off. Call, return, and comparisons use simple ASM subroutines.

39803 words — Add dead code removal.
15% size reduction

Dead code removal works by scanning the VM source code before any ASM code is generated. The scan builds a list of all functions in the program and all the functions that they call. By traversing this list starting with Sys.init all the called functions can be identified, and the uncalled functions can be discarded. (Forum discussion)

34394 words — Improve call code.
14% size reduction
1% execution speed penalty

Just like all return VM commands generate identical ASM code, all call function N commands generate identical code for identical values of function and N. Repeated calls to the same function can be replaced with a jump. (Forum discussion)

26016 words — Optimize VM to use memory-to-memory operations.
24% size reduction
24% execution speed improvement

My implementation does a pass on all the source VM files translating them to a meta-VM language that includes commands like move argument 0 pointer 0 and add local 3 constant 1. (Forum discussion)

This is the fastest executing code since the ASM code for the meta-VM commands is still inline, just shorter. No new ASM subroutines are introduced by this optimization.

25314 ...  24170 ...  22037 — Add automatic common ASM code detection.
2.7% ...  7.1% ...  15%size reduction
10% ...  25% ...  35%execution speed penalty

All CodeWriter routines that generate sequences longer than 4 instructions can generate common code if there are enough instance of the command to result in shorter total code. To make the inline/common decision, the optimizer needs to know the count for every command and its argument variations. (Forum discussion)

An N instruction sequence can be replaced with a 4 instruction call to an N+5 common sequence. A large number of a specific 5-instruction sequences are required before commoning results in smaller code, the execution speed is greatly reduced since that command is now executing 14 instructions instead of 5. The size numbers shown above are the results achieved by varying the minimum size of sequences that are allowed to use common code. This allows for size/speed tradeoff optimization settings.

What's next?
I've thought about replacing my current optimizing CodeWriter with an interpreting CodeWriter.

This new code writer would handle all commands (VM and meta-VM) using parameterized ASM routines in a similar manner to the way I currently handle call commands using multi-level common tail code. This would result in a bit more than 4 instructions per VM/meta-VM command on average. The "bit more" being the amortized size of the parameter setting and interpreter code. The memory-to-memory meta-command interpreter will be rather long since it has 6 parameters (operator, source and destination segments and indices, and return address).

Also, the additional jumps to get into the common code are expected to substantially slow execution speed.

My current translator achieves about 4.75 instructions per command. If I achieve 4.25 IPC that will result in an additional 10% size savings. This does not seem worth the effort.

It might be time to set the VM translator aside and look at compiler optimization.