I went ahead and implemented a carry lookahead 4bit adder and I wanted to know if I am on the right track. I tried to include helpful comments, for example the boolean calculations for each carry bit calculation. The last calculation is quite verbose, and I find it hard to understand why this would be quicker than the ripple chip containing "less code".
First I created a chip called a partial full adder that will take an input of x, y, and a carry bit. It will output a sum bit, a generate bit, and a propagate bit. CHIP PFA { IN a, b, c; OUT s, g, p; PARTS: //Put your code here. //Calculate the Generate Bit And(a=a, b=b, out=g); //Calculate the Propagation Bit Or(a=a, b=b, out=p); //Calculate the SumBit Xor(a=a, b=c, out=s1); Xor(a=s1, b=b, out=s); } Next I created the 4bit look forward adder which i called "Add4" for simplicities sake. Maybe I should rename it to Add4CLA. After each partial full add, I went ahead and calculated the carry bit, including the boolean logic in a comment for your convenience. It is that step that truly baffles me. I would love some further explanation as to why, although this seems to involve more calculation, mathematically it should result in fewer gates and generation in a constant time (rather than linear as ripple is done). CHIP Add4 { IN a[4], b[4], c0; OUT out[4], carry; PARTS: //Put your code here. PFA(a=a[0], b=b[0], c=c0, s=out[0], g=g0, p=p0); //C1 = g0+p0c0 And(a=p0, b=c0, out=pand1); Or(a=pand1, b=g0, out=c1); PFA(a=a[1], b=b[1], c=c1, s=out[1], g=g1, p=p1); //C2 = g1 + p1g0 + p1p0c0 And(a=p1, b=pand1, out=pand2); And(a=p1, b=g0, out=gand1); Or(a=g1, b=pand2, out=or1); Or(a=or1, b=pand2, out=c2); PFA(a=a[2], b=b[2], c=c2, s=out[2], g=g2, p=p2); //C3 = g2 + p2g1 + p2p1g0 + p2p1p0c0 And(a=p2, b=pand2, out=pand3); And(a=p2, b=gand1, out=gand2); And(a=p2, b=g1, out=ggand1); Or(a=pand3, b=gand2, out=or2); Or(a=or2, b=ggand1, out=or3); Or(a=or3, b=g2, out=c3); PFA(a=a[3], b=b[3], c=c3, s=out[3], g=g3, p=p3); //c4 = g3 + p3g2 + p2p3g1 + p1p2p3g0 + p0p1p2p3c0 And(a=p3, b=pand3, out=pand4); And(a=p3, b=gand2, out=gand3); And(a=p3, b=ggand1, out=ggand2); And(a=p3, b=g2, out=ppand1); Or(a=g3, b=ppand1, out=or4); Or(a=or4, b=ggand2, out=or5); Or(a=or5, b=gand3, out=or6); Or(a=or6, b=pand4, out=carry); } Finally I have my Add16 chip. Since it is difficult to string long chains of carry look ahead together, I am wondering if the speedup will be found not in the number of adders strung together, but rather it will be realized when thousands of iterations have been run over each adder type. (E.G.  Ripple versus Carry Lookahead). Your thoughts are appreciated. Note: I left my old implementation commented out in the Add16 for comparison's sake. CHIP Add16 { IN a[16], b[16]; OUT out[16]; PARTS: //Second version with 4bit CLA and ripple carry between 4bit adders; Add4(a=a[0..3], b=b[0..3], c0=false, out=out[0..3], carry=c1); Add4(a=a[4..7], b=b[4..7], c0=c1, out=out[4..7], carry=c2); Add4(a=a[8..11], b=b[8..11], c0=c2, out=out[8..11], carry=c3); Add4(a=a[12..15], b=b[12..15], c0=c3, out=out[12..15], carry=c4); // Put your code here. //First Version with ripple carry /* HalfAdder(a=a[0], b=b[0], sum=out[0], carry=c1); FullAdder(a=a[1], b=b[1], c=c1, sum=out[1], carry=c2); FullAdder(a=a[2], b=b[2], c=c2, sum=out[2], carry=c3); FullAdder(a=a[3], b=b[3], c=c3, sum=out[3], carry=c4); FullAdder(a=a[4], b=b[4], c=c4, sum=out[4], carry=c5); FullAdder(a=a[5], b=b[5], c=c5, sum=out[5], carry=c6); FullAdder(a=a[6], b=b[6], c=c6, sum=out[6], carry=c7); FullAdder(a=a[7], b=b[7], c=c7, sum=out[7], carry=c8); FullAdder(a=a[8], b=b[8], c=c8, sum=out[8], carry=c9); FullAdder(a=a[9], b=b[9], c=c9, sum=out[9], carry=c10); FullAdder(a=a[10], b=b[10], c=c10, sum=out[10], carry=c11); FullAdder(a=a[11], b=b[11], c=c11, sum=out[11], carry=c12); FullAdder(a=a[12], b=b[12], c=c12, sum=out[12], carry=c13); FullAdder(a=a[13], b=b[13], c=c13, sum=out[13], carry=c14); FullAdder(a=a[14], b=b[14], c=c14, sum=out[14], carry=c15); FullAdder(a=a[15], b=b[15], c=c15, sum=out[15], carry=c16);*/ } 
Administrator

The speed improvement in the Carry Lookahead Adder comes into play as the adder gets wider.
For a Ripple Carry Adder, the worst case delay for an Nbit adder is from C_{IN} in to Sum_{N} through all the bittobit carries. The delay time is k_{1} N for some constant k_{1}. For a CLA Adder, the worst case delay path gets from C_{IN} in to Sum_{N} by going up the tree of Lookahead Generators and back down to the Sum_{N} adder. The delay time is k_{2} log_{2}N. k_{2} is typically larger than k_{1} so there is a tradeoff point between the two. Often, the smallest building blocks of the CLA adder, usually 4 bits wide, are implemented with ripple carry add and parallel G, P computation. Mark 
Ok, this makes a lot more sense. Thank you for that clarification.
Oh by the way I just realized I can simplify the Add4 program above drastically. I worked some more on the algebra and then laughed at what I was doing. I actually recalculate previous carry pins when they are already available. With your reply and my corrected math I can now understand the elegance of this method. I'll post an update version. CHIP Add4 { IN a[4], b[4], c0; OUT out[4], carry; PARTS: //Put your code here. PFA(a=a[0], b=b[0], c=c0, s=out[0], g=g0, p=p0); //C1 = g0+p0c0 And(a=p0, b=c0, out=pand1); Or(a=pand1, b=g0, out=c1); PFA(a=a[1], b=b[1], c=c1, s=out[1], g=g1, p=p1); //C2 = g1 + p1g0 + p1p0c0 > g1 + p1(g0+p0c0) > g1 + p1c1 And(a=p1, b=c1, out=pand2); Or(a=g1, b=pand2, out=c2); PFA(a=a[2], b=b[2], c=c2, s=out[2], g=g2, p=p2); //C3 = g2 + p2g1 + p2p1g0 + p2p1p0c0 > g2+p2(g1+p1(g0+p0c0)) > g2+p2c2 And(a=p2, b=c2, out=pand3); Or(a=g2, b=pand3, out=c3); PFA(a=a[3], b=b[3], c=c3, s=out[3], g=g3, p=p3); //c4 = g3 + p3g2 + p2p3g1 + p1p2p3g0 + p0p1p2p3c0 And(a=p3, b=c3, out=pand4); Or(a=g3, b=pand4, out=carry); } 
Administrator

In reply to this post by Shjanzey
If you are interested in other types of adder optimization, check out this post on Carry Select Adders
Mark 
Has anyone had any success implementing a full 16 bit carry lookahead adder? I've successfully implemented the 4bit carry lookahead adder in .hdl code as visualized here: http://en.wikipedia.org/wiki/Lookahead_Carry_Unit but when I attempt to scale the chip up to a 16bit lookahead carry unit the Nand2Tetris hardware simulator gives me the error that I have a "circle in my parts connection".
It's basically the group propagate and group generate outputs from four 4bit carry lookahead chips piped into a 5th carry lookahead chip (therefore accepting 16 bits as input). I've narrowed down the error to the carry bits from the 5th carry lookahead chip, since they are fed in to the four upper level carry lookahead chips. I'm at a bit of a loss since I don't believe that my logic diagrams nor implementation are wrong (I've built/rebuilt it in it's entirety twice). Does any have any experience or suggestions for this? Thanks in advance! 
Administrator

I'm assuming that you are trying to implement the second figure on the Wiki page you referenced. You are running into a limitation of the Hardware Simulator. It sees that there are outputs p4 and g4 from the 4bit CLA adder that connect to the 16bit CLA chip, and the C4 output from the 16bit CLA that connects back to the same 4bit CLA adder. It doesn't analyze the internals of the CLA adder so it doesn't know that the C4 input doesn't affect the p4 and g4 outputs. A quick and dirty kludge is to use two CLA adders for each 4bit group. One to generate the p and g and the other to add A, B and C. From my CLAadd16.hdl: ClaAdd4(a=a[0..3], b=b[0..3], c=c, sum=out[0..3]); ClaAdd4(a=a[0..3], b=b[0..3], g=g0, p=p0);(There's a comment in my HDL explaining this kludge. The comment is longer than the implementation!) If you want to explore more complicated logic I highly recommend Logisim, a fairly powerful schematicbased logic simulator. Mark 
Thanks for the advice! I'm not sure why I didn't occur to me before, but I never tried to make a modified 1Bit Full Adder to handle the propagate/generate bits internally and then plug them directly in to the ClaAdd4 chip.
As it stands right now, my 4Bit CLA chip accepts only a C0 (carryin bit) and two 4 bit numbers to add. It outputs the 4Bit sum, the C4 bit to shift to the next group, and the GP and GG bits. I just scaled it up to a 16Bit chip and it works! Working on the test scripts for it now. I really appreciate the help and I'm excited to check out Logisim. Jason 
Free forum by Nabble  Edit this page 