Recently I completed part II of the NAND2Tetris course. What a trip! As a computer professional for over 30 years this course has been a chance to reconnect with what I love about computers, the basic fundamentals. It has been a fun and challenging 12 weeks and I was kind of sad when it all ended. But here's hoping part III comes along so I can get into some FPGA.
In the meantime I have taken Shimon's excellent advice and revisited the hardware and am in the process of extending the functionality. Today, I completed the Mult16 Chip and tested it using the Hardware Emulator. I have uploaded the code if anybody is interested to grab it and have a play. You will need to also build the Add15 to Add2 chips, but these are just cut down versions of the Add16 chip we did in week 2. The reason for using narrowed Add chips was to lower the overall hardware cost of the Mult16 chip. I have uploaded the source for these chips to save you some typing.
Anyhow, I play to move onto Div16 next and n-way left and right shift chips following that. Ultimately I'd like to extend the ALU and the machine code of the HACK computer to utilize these new hardware units. Obviously I would been to modify the CPUEmulator to make proper use of them down the track. Lots of fun awaits.
Modern hardware compilers are good at optimizing away unused and unnecessary logic. For example Mux16(a=false, b=whatever, ...) will most likely be reduced to 16 And gates. Unused bits in an adder will also be eliminated. This means that you can use Add16(a[0..3]=..., b[0..3]=..., out[0..3]=...); when you need a 4 bit adder and assume that the logic for the remaining 12 bits will not be generated. This will let you focus on the overall structure of your multiplier rather than the busy-work of writing a bunch of various width adders.
Also note that there is duplication in each stage of the multiplier. You can combine the Ands and the Adder into a new part and use that new part in each stage.
Note that most hardware multipliers generate the double length result. For Hack, that would mean you would need to add a register to hold the high result and instructions to access it, or have 2 instructions: one that returned the low word and one that returned the high word.
Another application of parallel addition similar to that used in multiplication is bit counting, sometimes called POPCNT (population count). This circuit/instruction computes the number of one bits in its input.
The obvious implementation is to build a tree of adders. For 16-bit count:
8 1-bit adders generate 8 2-bit numbers that need to be summed.
4 2-bit adders generate 3 3-bit numbers that need to be summed.
2 3-bit adders generate 2 4-bit numbers that need to be summed.
1 4-bit adder generates the final 5-bit population count.
That requires 11 full-adders and 15 half-adders
My size-optimized 16-bit popcnt uses 11 full-adders and 4 half-adders.