Thursday 27 March 2014

Semaphores Practice

Problem 1. The following is a set of three interacting processes that can access two shared semaphores:
semaphore U = 3;
semaphore V = 0;

[Process 1]     [Process 2]    [Process 3]

L1:wait(U)      L2:wait(V)     L3:wait(V)
   type("C")       type("A")      type("D")
   signal(V)       type("B")      goto L3
   goto L1         signal(V)
                   goto L2
Within each process the statements are executed sequentially, but statements from different processes can be interleaved in any order that's consistent with the constraints imposed by the semaphores. When answering the questions below assume that once execution begins, the processes will be allowed to run until all 3 processes are stuck in a wait() statement, at which point execution is halted.
  1. Discussed in section Assuming execution is eventually halted, how many C's are printed when the set of processes runs?
    Exactly 3. Each time Process 1 executes the "wait(U)" statement, the value of semaphore U is decremented by 1. Since there are no "signal(U)" statements, the loop in Process 1 will execute only 3 times (ie, the initial value of U) and then stall the fourth time "wait(U)" is executed.
  1. Discussed in section Assuming execution is eventually halted, how many D's are printed when this set of processes runs?
    Exactly 3. Process 1 will execute its loop three times (see the answer to the previous question), incrementing "signal(V)" each time through the loop. This will permit "wait(V)" to complete three times. For every "wait(V)" Process 2 executes, it also executes a "signal(V)" so there is no net change in the value of semaphore V caused by Process 2. Process 3 does decrement the value of semaphore V, typing out "D" each time it does so. So Process 3 will eventually loop as many times as Process 1.
  1. Discussed in section What is the smallest number of A's that might be printed when this set of processes runs?
    0. If Process 3 is scheduled immediately after Process 1 executes "signal(V)", then Process 2 might continue being stalled at its "wait(V)" statement and hence never execute its "type" statements.
  1. Discussed in section Is CABABDDCABCABD a possible output sequence when this set of processes runs?
    No. Here are the events implied by the sequence above:
    start: U=3 V=0
    type C: U=2 V=1
    type A: U=2 V=0
    type B: U=2 V=1
    type A: U=2 V=0
    type B: U=2 V=1
    type D: U=2 V=0
    type D: oops, impossible since V=0
    
  1. Discussed in section Is CABACDBCABDD a possible output sequence when this set of processes runs?
    Yes:
    start: U=3 V=0
    type C: U=2 V=1
    type A: U=2 V=0
    type B: U=2 V=1
    type A: U=2 V=0
    type C: U=1 V=1
    type D: U=1 V=0
    type B: U=1 V=1
    type C: U=0 V=2
    type A: U=0 V=1
    type B: U=0 V=2
    type D: U=0 V=1
    type D: U=0 V=0
    
  1. Discussed in section Is it possible for execution to be halted with either U or V having a non-zero value?
    No. If U has a non-zero value, Process 1 will be able to run. If V has a non-zero value, Process 3 will be able to run.


Problem 2. The following pair of processes share a common variable X:
Process A         Process B
    int Y;             int Z;
A1: Y = X*2;      B1:  Z = X+1;
A2: X = Y;        B2:  X = Z;
X is set to 5 before either process begins execution. As usual, statements within a process are executed sequentially, but statements in process A may execute in any order with respect to statements in process B.
  1. Discussed in section How many different values of X are possible after both processes finish executing?
    There are four possible values for X. Here are the possible ways in which statements from A and B can be interleaved.
    A1 A2 B1 B2: X = 11
    A1 B1 A2 B2: X = 6
    A1 B1 B2 A2: X = 10
    B1 A1 B2 A2: X = 10
    B1 A1 A2 B2: X = 6
    B1 B2 A1 A2: X = 12
    
  1. Discussed in section Suppose the programs are modified as follows to use a shared binary semaphore S:
    Process A        Process B
        int Y;           int Z;
        wait(S);         wait(S);
    A1: Y = X*2;     B1: Z = X+1;
    A2: X = Y;       B2: X = Z;
        signal(S);       signal(S);
    
    S is set to 1 before either process begins execution and, as before, X is set to 5.
    Now, how many different values of X are possible after both processes finish executing?
    The semaphore S ensures that, once begun, the statements from either process execute without interrupts. So now the possible ways in which statements from A and B can be interleaved are:
    A1 A2 B1 B2: X = 11
    B1 B2 A1 A2: X = 12
    
  1. Discussed in section Finally, suppose the programs are modified as follows to use a shared binary semaphore T:
    Process A         Process B
        int Y;            int Z;
    A1: Y = X*2;      B1: wait(T);
    A2: X = Y;        B2: Z = X+1;
        signal(T);        X = Z;
    
    T is set to 0 before either process begins execution and, as before, X is set to 5.
    Now, how many different values of X are possible after both processes finish executing?
    The semaphore T ensures that all the statements from A finish execution before B begins. So now there is only one way in which statements from A and B can be interleaved:
    A1 A2 B1 B2: X = 11
    


Problem 3. The following pair of processes share a common set of variables: "counter", "tempA" and "tempB":
Process A                            Process B
...                                  ...
A1: tempA = counter + 1;             B1: tempB = counter + 2;
A2: counter = tempA;                 B2: counter = tempB;
...                                  ...
The variable "counter" initially has the value 10 before either process begins to execute.
  1. Discussed in section What different values of "counter" are possible when both processes have finished executing? Give an order of execution of statements from processes A and B that would yield each of the values you give. For example, execution order A1, A2, B1, B2 would yield the value 13.
    There are three possible values for X. Here are the possible ways in which statements from A and B can be interleaved.
    A1 A2 B1 B2: X = 13
    A1 B1 A2 B2: X = 12
    A1 B1 B2 A2: X = 11
    B1 A1 B2 A2: X = 11
    B1 A1 A2 B2: X = 12
    B1 B2 A1 A2: X = 13
    
  1. Discussed in section Modify the above programs for processes A and B by adding appropriate signal and wait operations on the binary semaphore "sync" such that the only possible final value of "counter" is 13. Indicate what should be the initial value of the semaphore "sync".
    We need to ensure that A and B run uniterrupted, but it doesn't matter which runs first.
    semaphore sync = 1;
    
    Process A                            Process B
        wait(sync);                          wait(sync);
    A1: tempA = counter + 1;             B1: tempB = counter + 2;
    A2: counter = tempA;                 B2: counter = tempB;
        signal(sync);                        signal(sync);
    
  1. Discussed in section Draw a precedence graph that describes all the possible orderings of executions of statements A1, A2, B1 and B2 that yield the a final value of 11 for "counter".
       A1     B1
         \  /
          B2
          |
          A2
    
  1. Discussed in section Modify the original programs for processes A and B by adding binary semaphores and signal and wait operations to guarantee that the final result of executing the two processes will be "counter" = 11. Give the initial values for every semaphore you introduce. Try to put the minimum number of constraints on the ordering of statements. In other words, don't just pick one ordering that will yield 11 and enforce that one by means of semaphores; instead, enforce only the essential precedence constraints marked in your solution to question 3.
    semaphore s1 = 0;
    semaphore s2 = 0;
    
    Process A                            Process B
    A1: tempA = counter + 1;             B1: tempB = counter + 2;
        signal(s1);                          wait(s1);
        wait(s2);                        B2: counter = tempB;
    A2: counter = tempA;                     signal(s2);
    


Problem 4. The figure below shows two processes that must cooperate in computing N2 by taking the sum of the first N odd integers.
  Process P  Process Q
  N = 5;
  Sqr = 0;
 loopP: loopQ:

  if (N==0)  Sqr = Sqr + 2*N + 1;
    goto endP;
  N = N - 1;  goto loopQ;
  goto loopP;
 endP:
  print(Sqr);
  1. Add appropriate semaphore declarations and signal and wait statements to these programs so that the proper value of Sqr (i.e., 25) will be printed out. Indicate the initial value of every semaphore you add. Insert the semaphore operations so as to preserve the maximum degree of concurrency between the two processes; do not put any nonessential constraints on the ordering of operations. Hint: Two semaphores suffice for a simple and elegant solution.
    Looking at the code for process Q, we see that we want it to execute with N = 4, 3, 2, 1, and 0. So we want the decrement of N to happen before the first execution of Q. We can use two semaphores to control the production and consumption of "N values" by the two loops. To achieve maximum concurrency, we'll signal the availability of a new N value as soon as it's ready and then do as much as possible (i.e., branch back to the beginning of the loop and check the end condition) before waiting for the value to be consumed:
      Semaphore P = 1;  // P gets first shot at execution
      Semaphore Q = 0;  // Q has to wait for first N value
      int N, Sqr;
    
      Process P  Process Q
      N = 5;
      Sqr = 0;
     loopP: loopQ:
      if (N==0)  wait(Q);
        goto endP;  Sqr = Sqr + 2*N + 1;
      wait(P);                signal(P);
      N = N - 1;  goto loopQ;
      signal(Q);
      goto loopP;
     endP:
      wait(P);  // wait for last Q iteration!
      print(Sqr);
    
    Optionally, you could also split Sqr = Sqr + 2*N + 1 into Sqr = Sqr + 1 and Sqr = Sqr + 2*N to slightly improve concurrency.


Problem 5. A computer has three commonly used resources designated A, B and C. Up to three processes designated X, Y and Z run on the computer and each makes periodic use of two of the three resources.
  • Process X acquires A, then B, uses both and then releases both.
  • Process Y acquires B, then C, uses both and then releases both.
  • Process Z acquires C, then A, uses both and then releases both.
  1. If two of these processes are running simultaneously on the machine, can a deadlock occur? If so, describe the deadlock scenario.
    No deadlock is possible since one of the two processes will always be able to run to completion.
  1. Describe a scenario in which deadlock occurs if all three processes are running simultaneously on the machine.
    All three processes make their first acquisition then hang waiting for a resource that will never become available.
  1. Modify the algorithm for acquiring resources so that deadlock cannot occur with three processes running.
    If we change Z to acquire A then C, no deadlock can occur.


Problem 6. The following is a question about dining computer scientists. There are 6 computer scientists seated at a circular table. There are 3 knives at the table and 3 forks. The knives and forks are placed alternately between the computer scientists. A large bowl of food is placed at the center of the table. The computer scientists are quite hungry, but require both a fork and a knife to eat.
Consider the following policies for eating and indicate for each policy if it can result in deadlock.
    • Attempt to grab the fork that sits between you and your neighbor until you are successful.
    • Attempt to grab the knife that sits between you and your neighbor until you are successful.
    • Eat
    • Return the fork.
    • Return the knife.
    No deadlock is possible since the number of available forks limits the number of knives that can be acquired, i.e., if you have a fork, a knife is guaranteed to be available.
    • Attempt to grab any fork on the table until you are successful (if there are many forks grab the closest one).
    • Attempt to grab any knife on the table until you are succesful (if there are many forks grab the closest one).
    • Eat
    • Return the knife.
    • Return the fork.
    No deadlock is possible for the same reason as above.
    • Flip a coin to decide if you are going to first try for a fork or a knife.
    • Attempt to grab your choice until you are successful (if there are many of that utensil grab the closest one).
    • Attempt to grab the other type of utensil until you are successful (if there are many of that utensil grab the closest one).
    • Eat
    • Return the knife.
    • Return the fork.
    Deadlock is possible since now it is possible all the philosophers to acquire a utensil and then stall indefinately waiting for the other utensil to appear.


Problem 7. Gerbitrail is a manufacturer of modular gerbil cages. A Gerbitrail cage is assembled using a catalog of modules, including gerbil "rooms" of various sizes and shapes as well as sections of tubing whose diameter neatly accommodates a single gerbil. A typical cage contains several intercon- nected rooms, and may house a community of gerbils:

The Gerbitrail cages are immensely successful, except for one tragic flaw: the dreaded GERBILOCK. Gerbilock is a situation that arises when multiple gerbils enter a tube going in opposite directions, meeting within the tube as shown below:

Since each gerbil is determined to move forward and there is insufficient room to pass, both gerbils remain gerbilocked forever.
There is, however, hope on the horizon. Gerbitrail has developed little mechanical ggates that can be placed at the ends of each tube, and which can both sense and lock out gerbil crossings. All ggates are controlled by a single computer. Each ggate X has two routines, X_Enter and X_Leave, which are called when a gerbil tries to enter or leave respectively the tube via that ggate; the gerbil is not allowed to procede (ie, to enter or leave) until the Enter or Leave call returns. Gerbitrail engi- neers speculate that these routines can solve Gerbilock using semaphores.
They perform the following experiment:

where each of the ggates A and B are controlled by the following code:
semaphore S=???;        /* Shared semaphore. */

A_Enter()              /* Handle gerbil entering tube via A */
{ wait(S); }
A_Leave()              /* Handle gerbil leaving tube via A */
{ signal(S); }
B_Enter()              /* Handle gerbil entering tube via B */
{ wait(S); }
B_Leave()              /* Handle gerbil leaving tube via B */
{ signal(S); }
  1. What is the proper initial value of the semaphore S?
    S = 1 since one gerbil is allowed in the tube.
  1. An argument among the Gerbitrail technical staff develops about how the above solution should be extended to more complex cages with multiple tubes. The question under debate is whether separate semaphores should be allocated to each ggate, to each tube, or whether a single semaphore should be shared for all gates. Help resolve the argument. For each proposal, indicate OK if it works, SLOW if it works but becomes burdensome in complex cages, and BAD if it doesn't prevent gerbilock.Single semaphore shared among all ggates: OK -- SLOW -- BAD
    Semaphore for each tube: OK -- SLOW -- BAD
    Semaphore for each ggate: OK -- SLOW -- BAD
    Single semaphore shared among all ggates: SLOW
    Semaphore for each tube: OK
    Semaphore for each ggate: BAD
  1. Gerbitrail management decides to invest heavily in the revolutionary technology which promises to wipe out the gerbilock threat forever. They hire noted computer expert Nickles Worth to evaluate their ggate approach and suggest improvements. Nickles looks at the simple demonstration above, involving only 2 gerbils and a single tube, and immediately objects. "You are enforcing non-essential constraints on the behavior of these two gerbils", he exclaims.What non-essential constraint is imposed by the ggate solution involving only 2 gerbils and one tube? Give a specific scenario.
    Since only 1 gerbil is allowed in the tube at a time, both gerbils can't go through the tube simultaneously in the same direction even though that would work out okay.
  1. Nickles proposes that a new synchronization mechanism, the gerbiphore, be defined to handle the management of Gerbitrail tubes. His proposal involves the implementation of a gerbiphore as a C data structure, and the allocation of a single gerbiphore to each tube in a Gerbitrail cage configuration.Nickles's proposed implementation is:
    semaphore mutex=1;
    
    struct gerbiphore {          /* definition of "gerbiphore" structure*/
     int dir;                    /* Direction: 0 means unspecified. */
     int count;                  /* Number of Gerbils in tube */
    } A, B, ...;                 /* one gerbiphore for each tube */
    
    int Fwd=1, Bkwd=2;           /* Direction codes for tube travel */
    
    /* genter(g, dir) called with gerbiphore pointer g and direction dir
       whenever a gerbil wants to enter the tube attached to g
       in direction dir */
    
    genter(struct gerbiphore *g, int dir) {
     loop:
       wait(mutex);
       if (g->dir == 0)
         g->dir = dir;    /* If g->dir unassigned, grab it! */
       if (g->dir == dir) {
         g->count = 1 + g->count;   /* One more gerbil in tube. */
         *********;                 /* MISSING LINE! */
         return;
       }
       signal(mutex);
       goto loop;
    }
    
    /* gleave(g, dir) is called whenever a gerbil leaves the tube
       attached to gerbiphore g in direction dir. */
    
    gleave(struct gerbiphore *g, int dir) {
       wait(mutex);
       g->count = g->count - 1;
       if (g->count == 0) g->dir = 0;
       signal(mutex);
    }
    
    Unfortunately a blotch of red wine obscures one line of Nickles's code. The team of stain analysts hired to decode the blotch eventually respond with a sizable bill and the report that "it appears to be Ch. Petrus, 1981". Again, your services are needed.
    What belongs in place of ********* at the line commented "MISSING LINE"?
    signal(mutex). We have to release the mutual exclusion lock before returning from genter.
  1. Ben Bitdiddle has been moonlighting at Gerbitrail Research Laboratories, home of the worlds largest gerbil population. Ben's duties include computer related research and shoveling. He is anxious to impress his boss with his research skills, in the hopes that demonstrating such talent will allow him to spend less of his time with a shovel.Ben observes that the GRL Gerbitrail cage, housing millions of gerbils and involving tens of millions of tubes, is a little sluggish despite its use of Nickles Worth's fancy gerbiphore implementation. Ben studies the problem, and finds that each gerbil spends a surprisingly large amount of time in genter and gleave calls, despite the fact that tubes are infrequently used due to their large number. Ben suspects some inefficiency in the code. Ben focuses on the gleave code, and collects statistics about which line of code in the gleave definition is taking the most CPU time.
    Which line of the (4-line) gleave body do you expect to be the most time consuming?
    wait(mutex) since that may hang while waiting for other instances of genter or gleave to complete.
  1. Identify a class of inessential precedence constraints whose imposition by Nickles's code causes the performance bottleneck observed by Ben.
    The mutex semaphore implements a global lock, so you can't manipulate more than one 1 semaphore at a time.
  1. Briefly sketch a change to Nickles's code which mitigates the performance bottleneck.
    If we use a separate mutex for each gerbifore than the mutual exclusion each mutex provides will only apply to operations on that gerbifore.
  1. Encouraged by his success (due to your help), Ben decides to try more aggressive performance improvements. He edits the code to gleave, eliminating entirely the calls to wait and signal on the mutex semaphore. He then tries the new code on a 2-gerbil, 1-tube cage.Will Ben's change work on a 2-gerbil, 1-tube cage? Choose the best answer.

    1. It still works fine. Nice work, Ben!
    2. Gerbilock may be caused by two nearly simultaneous genter calls.
    3. Gerbilock may be caused by two nearly simultaneous gleave calls.
    4. Gerbilock may be caused by nearly simultaneous gleave and genter calls, followed by another genter call.
    5. Gerbilock can't happen, although the system may fail in other ways.
    D.


Problem 8. Similar Software is a software startup whose 24 employees include you and 23 lawyers. Your job is to finish the initial product: the Unicks timesharing system for a single-processor Beta system. Unicks is very similar to the OS code we've seen in lecture, and to a popular workstation OS. To avoid legal entanglements, it incorporates a distinguishing feature: an inter-process communication mechanism call the tube.
A tube provides a flow-controlled communication channel among processes. The system supports at most 100 different tubes, each identified by a unique integer between 0 and 99. The system primitives for communicating via tubes are the Beta SVCs WTube, which writes the nonzero integer in R1 to the tube whose number is in R0, and RTube, which reads a nonzero datum from the tube whose number is passed in R0. Note that tubes can only be used to pass nonzero integers between processes.
The Unicks handlers for WTube and RTube are shown below:
int Tubes[100]; /* max of 100 tubes */

WTube_handler() {
  int TubeNumber = User.R0; /* which tube to write */
  int Datum = User.R1; /* the data to write */

  if (Tubes[TubeNumber] != 0) { /* wait until Tube is empty */
    User.XP = User.XP - 4;
    return;
  } else { 
    Tubes[TubeNumber] = Datum; /* tube empty, fill it up! */
  }
}

RTube_handler() {
  int TubeNumber = User.R0; /* which tube to read */

  if (Tubes[TubeNumber] == 0) { /* wait until there's data */
    User.XP = User.XP - 4;
    return;
  } else {
    User.R1 = Tubes[TubeNumber]; /* read the datum */
    Tubes[TubeNumber] = 0; /* mark the tube as empty */
  }
}
The handlers run as part of the Unicks kernel and will not be interrupted, i.e., handling of interrupts is postponed until the current process returns to user mode. Note that the initial values in the Tubes array are zero, and keep in mind that only nonzero data is to be written to (and read from) each tube.
  1. Let Wi be the ith write on a tube, and Rj be the jth read. What precedence constraint(s) does the above implementation enforce between completion of the Wi and Rj?
    The code requires that the With write must precede the Rith read, that the Rith read must precede the Wi+1th write. All other precedence relationships can be derived from these.
  1. You observe that a process that waits once will waste the remainder of its quantum looping. Suggest a one-line improvement to each of the handlers which will waste less time synchronizing the communication processes.
    Each handler could call the scheduler before returning on a read/write fail.
  1. Assume, for the remaining questions, that your improvement HAS NOT been implemented; the original code, as shown, in being used.Since tubes are advertised as a general mechanism for communication among a set of Unix processes, it is important that they work reliably when several processes attempt to read and/or write the simultaneously. S. Quire, the ex-lawyer CEO, has been unable to figure out just what the semantics of tubes are under these circumstances. He finally asks you for help.
    Describe what will happen if a process writes an empty tube while multiple processes are waiting to read it. How many processes will read the new value?
    Since RTube_handler() is not interruptible, exactly one process will get the new value.
  1. Describe what will happen if multiple processes write different values to a tube at about the same time that another process is doing successive reads from that tube. Will each value be read once? Will at least one value be read, but some may be lost? Will garbage (values other than those written) be read?
    Since WTube_handler() is not interruptible, no values will be lost. Each value will be read correctly.
  1. S. Quire suggests that the interrupt hardware be modified so that timer interupts (which mark the end of the quantum for the currently running process) are allowed to happen in both user and kernel mode. How would this modification change your answer to parts (C) and (D)?
    If the handlers are interruptible, multiple processes may read the same value or garbage (zero). Multiple processes may write over other values before they are read, but will not be read twice if there is only a single reader.
  1. Customers have observed that Unicks seems to favor certain processes under some circumstances. In particular, when one process writes data to a tube while processes A, B, and C are waiting to read it, it is typically the case that the amount of data read by A, B, and C will be dramatically different.Briefly explain the cause for this phenomenon.
    If process D writes data and the scheduler calls A, B, C, and D round-robin (in that order), process A has the best chance of having its Rtube call succeed since it runs right after process D has finished calling Wtube.
  1. Sketch, in a sentence or two, a plausible strategy for treating the processes more equitably.
    If processes have numbers, each tube could remember the last process that read it. When a process writes to the tube again, it could reorder scheduling so that the waiting process with the next higher process number is called next. Another approach would be to use a randomized scheduler, but this creates the possibility that some processes will not be run for a long period of time.
  1. Nickles Worth, a consultant to Similar Software, claims that tubes can be used to implement mutual exclusion-that is, to guarantee that at most one process is executing code within a critical section at all times. He offers an example template:
    int TubeNumber = 37; /* tube to be used for mutual exclusion */
    
    WTube(TubeNumber,1); /* one-time-only initialization */
    
    while () { /* loop with critical section */
       
        /* LOCK ACCESS */
       
       <critical section>
       
        /* UNLOCK ACCESS */
       
    }
    
    where the regions marked LOCK ACCESS and UNLOCK ACCESS use the tube TubeNumber to ensure exclusive access to the code marked <critical section>. These two regions may contain the forms datum=RTube(TubeNumber) and Wtube(TubeNumber,datum) as a C interface to the tube SVCs.Fill in the LOCK and UNLOCK code above.
    LOCK: datum=RTube(TubeNumber)
    UNLOCK: Wtube(TubeNumber,datum)

C Operator Precedence Table

This page lists C operators in order of precedence (highest to lowest). Their associativity indicates in what order operators of equal precedence in an expression are applied.

Operator
Description
Associativity
( )
[ ]
.
->
++ --
Parentheses (function call) (see Note 1)
Brackets (array subscript)
Member selection via object name
Member selection via pointer
Postfix increment/decrement (see Note 2)
left-to-right
++ --
+ -
! ~
(type)
*
&
sizeof
Prefix increment/decrement
Unary plus/minus
Logical negation/bitwise complement
Cast (convert value to temporary value of type)
Dereference
Address (of operand)
Determine size in bytes on this implementation
right-to-left
*  /  % Multiplication/division/modulus left-to-right
+  - Addition/subtraction left-to-right
<<  >> Bitwise shift left, Bitwise shift right left-to-right
<  <=
>  >=
Relational less than/less than or equal to
Relational greater than/greater than or equal to
left-to-right
==  != Relational is equal to/is not equal to left-to-right
& Bitwise AND left-to-right
^ Bitwise exclusive OR left-to-right
| Bitwise inclusive OR left-to-right
&& Logical AND left-to-right
| | Logical OR left-to-right
? : Ternary conditional right-to-left
=
+=  -=
*=  /=
%=  &=
^=  |=
<<=  >>=
Assignment
Addition/subtraction assignment
Multiplication/division assignment
Modulus/bitwise AND assignment
Bitwise exclusive/inclusive OR assignment
Bitwise shift left/right assignment
right-to-left
,
Comma (separate expressions) left-to-right
Note 1:
Parentheses are also used to group sub-expressions to force a different precedence; such parenthetical expressions can be nested and are evaluated from inner to outer.
Note 2:
Postfix increment/decrement have high precedence, but the actual increment or decrement of the operand is delayed (to be accomplished sometime before the statement completes execution). So in the statement y = x * z++; the current value of z is used to evaluate the expression (i.e., z++ evaluates to z) and z only incremented after all else is done.

Wednesday 26 March 2014

Pipelining Explained with some examples

Problem 1. Consider the following combinational logic circuit constructed from 6 modules. In the diagram below, each combinational component is marked with its propagation delay in seconds; contamination delays are zero for each component.

  1. Discussed in section What is the latency and throughput of this combinational circuit?
    latency = longest path from X to C(X) = 1 + 30 + 20 + 2 = 53
    throughput = 1/latency for combinational circuits = 1/53
  1. Discussed in section Place the smallest number of ideal (zero delay, zero setup/hold time) pipeline registers in the circuit above so as to maximize its throughput. Remember to place a register on the output.
    We need 4 registers:
  1. Discussed in section What is the latency and throughput of your pipelined circuit?
    throughput = 1/(max pipeline stage delay) = 1/30
    latency = (1/throughput)*(number of pipeline stages) = 30 * 3 = 90
  1. Discussed in section We can simulate a pipelined version of a slow component by replicating the critical element and alternating inputs between the various copies. Complete the circuit diagram below to create a 2-way interleaved version of the "30" component:
  1. Discussed in section Substituting the interleaved implementation for the original "30" module as shown in the diagram below, place the smallest number of ideal (zero delay, zero setup/hold time) pipeline registers in the circuit below so as to maximize its throughput. Remember to place a register on the output. (Draw the pipeline registers in the diagram below.)

  1. Discussed in section What is the latency and throughput of your newly pipelined circuit?
    throughput = 1/(max pipeline stage delay) = 1/20
    latency = (1/throughput)*(number of pipeline stages) = 20 * 4 = 80
  1. Discussed in section In general, if we take a combinational circuit and pipeline it using ideal (zero delay, zero setup/hold time) registers, which one of the following statements best describes the resulting change in the circuit's latency and throughput:
      A. The throughput may improve but the latency definitely does not.
      B. Both the throughput and latency may improve.
      C. The throughput definitely improves and the latency definitely gets worse.
      D. The latency may improve but the throughput definitely does not.
      E. The throughput definitely improves and the latency definitely does not.
    A.


Problem 2. Partial Products, Inc., has hired you as its vice president in charge of marketing. Your immediate task is to determine the sale prices of three newly announced multiplier modules. The top-of-the-line Cooker is a pipelined multiplier. The Sizzler is a combinational multiplier. The Grunter is a slower sequential multiplier. Their performance figures are as follows (T is some constant time interval):
        Throughput    Latency
Cooker    1/T            5T
Sizzler   1/4T           4T
Grunter   1/32T         32T
Customers follow a single principle: Buy the cheapest combination of hardware that meets my performance requirements. These requirements may be specified as a maximum allowable latency time, a minimum acceptable throughput, or some combination of these. Customers are willing to try any paralleled or pipelined configuration of multipliers in an attempt to achieve the requisite performance. You may neglect the cost (both financial and as a decrease in performance) of any routing, latching, or other hardware needed to construct a configuration. Concentrate only on the inherent capabilities of the arrangement of multipliers itself.It has been decided that the Cooker will sell for $1000. The following questions deal with determining the selling prices of Sizzlers and Grunters.
  1. Discussed in section How much can you charge for Sizzlers and still sell any? That is, is there some price for Sizzlers above which any performance demands that could be met by a Sizzler could also be met by some combination of Cookers costing less? If there is no such maximum price, indicate a performance requirement that could be met by a Sizzler but not by any combination of Cookers. If there is a maximum selling price, give the price and explain your reasoning.
    If there is a performance requirement for the latency to be <= 4T, then there is no combination of Cookers that will meet this performance requirement. So it is in theory possible to sell some Sizzlers at any price. Using multiple Cookers can further improve the overall multiplier throughput, but their latency cannot be shortened.
  1. Discussed in section How little can you charge for Sizzlers and still sell any Cookers? In other words, is there a price for the Sizzler below which every customer would prefer to buy Sizzlers rather than a Cooker? Give and explain your answer, as above.
    The minimum price for a Sizzler is $250.01 if we want to continue to sell Cookers. If the price of a Sizzler is less than that, 4 Sizzlers could be used in parallel to achieve the same throughput as a Cooker with a better latency in the bargain.
  1. Discussed in section Is there a maximum price for the Grunter above which every customer would prefer to buy Cookers instead? As before, give the price, if it exists, and explain your reasoning in either case.
    The maximum price for the Grunter is $999.99 since for applications that can accept long latencies (>= 32T) it's worth buying a Grunter if it saves any money at all.
  1. Discussed in section Is there a minimum price for the Grunter below which every customer would prefer to buy Grunters rather than a Cooker? Once again, give the price, if it exists, and explain your reasoning in either case.
    There is no minimum price for a Grunter that would cause every customer to buy Grunters instead of Cookers. The latency of the Grunter will always be 32T, so when performance requirements demand latencies < 32T, Grunters won't do the job.
  1. Suppose that, as a customer, you have an application in which 64 pairs of numbers appear all at once, and their 64 products must be generated in as short a time as practicable. You have $1000 to spend. At what price would you consider using Sizzlers? At what price would you consider using Grunters?
    Sizzlers will be considered when they cost $250 or less. Grunters may be considered when they cost $124.93 or less. To see this, consider the case when Sizzlers cost $125.01. Buying seven Sizzlers would yield a latency of 40T at a cost of $875.07. The customer cannot afford another Sizzler, but adding a single Grunter for $124.93 will reduce the latency to 36T. All optimal configurations are explored below:


Problem 3. Peculiar Peripherals, Inc. Builds a combinational encryption device constructed of nine modules as follows:

The device takes an integer value X and computes an encrypted version C(X). In the diagram above each combinational component is marked with its propagation delay in microseconds; contamination delays are zero for each component.
  1. What is the latency and throughput of the combinational encryption device?
    Latency = 5 + 3 + 1 + 5 + 3 + 3 + 5 = 25us. Throughput = 1/25us.
  1. Redraw the diagram marking the locations for ideal (zero-delay) registers that will pipeline the device for maximal throughput. Ensure a register at the output and use the minimum number of registers necessary.
    We can remove some of the registers implied by contours (eg, those shown with dotted lines) without decreasing the throughput. There are several equivalent variations of this diagram.
  1. Give the latency and throughput of your pipelined version. Again assume ideal registers.
    There a six registers in each input-output path and the clock period is 5, so latency = 30 and throughput = 1/5.


Problem 4. Consider the following combinational encryption device constructed from six modules:

The device takes an integer value, X, and computes an encrypted version C(X). In the diagram above, each combinational component is marked with its propagation delay in seconds; contamination delays are zero for each component.
In answering the following questions assume that registers added to the circuit introduce no additional delays (i.e., the registers have a contamination and propagation delay of zero, as well as zero setup and hold times). Any modifications must result in a circuit that obeys our rules for a well-formed pipeline and that computes the same results as the combinational circuit above. Remember that our pipeline convention requires that every pipeline stage has a register on its output.
When answering the questions below, if you add a register to one of the arrows in the diagram, count it as a single register. For example, it takes two registers to pipeline both inputs to the rightmost module (the one with latency 4).
  1. What is the latency of the combinational encryption device?
    Latency = delay along longest path from input to output = 2 + 3 + 4 = 9.
  1. If we want to increase the throughput of the encryption device, what is the minimum number of registers we need to add?
    Three. Playing by our pipelining rules, we always add a register to the output. The increase the throughput we need to add other register that bisect the circuit. The cheapest place to do this is just before the "4" module, requiring two additional registers.
  1. If we are required to add exactly 5 registers, what is the best throughput we can achieve?
    The best throughput we can achieve with 5 registers is 1/5: place 3 (!) registers on the output and two registers on the arcs leading to the "4" module. If we use 4 registers to divide the circuit between the "2" and "3" modules, the resulting throughput is 1/7.
  1. If we can add as many registers as we like, is there an upper bound on the throughput we can achieve?
    Yes: 1/4, because the best we can do by just adding registers is to segregate the "4" module into its own pipeline stage.
  1. If we can add as many registers as we like, is there a lower bound on the latency we can achieve?
    Lower bound on latency = 9. We can never make the latency less by adding pipeline registers; usually the latency increases.


Problem 5. Consider the following pipelined circuit: The number written on top of each combinational element indicates its propagation delay in nanoseconds. Assume that the pipeline registers shown are ideal (they have a propagation delay, contamination delay, hold-time and a set-up time of 0 ns).

  1. What is the minimum clock period for which we can expect the given circuit to operate correctly?
    8ns since we have to leave time for the logic between registers A and C to do its stuff.
  1. What is the minimum latency of the circuit as shown?
    32ns = 4 pipeline stages at 8ns clock period.
  1. If the registers labeled F and G are removed, describe the resulting circuit's behavior.
    Removing F and G combines the last two pipeline stages into a single pipeline stage. The latency improves to 24ns and the throughput stays 1/8ns.
  1. Assume you were to redesign the pipelining of the given circuit to achieve the maximum possible throughput with minimum latency. What is the minimum number of pipeline registers required (including register H)?
    We can do it with four registers if we allow ourselves to use only a single register on values that go to multiple inputs:
  1. If the pipeline registers in the given circuit were all replaced with non-ideal pipeline registers having propagation delays of 2 ns, set-up times of 1 ns, and hold times of 0 ns, what would be the maximum throughput achievable with the supplied six combinational modules?
    The clock period would be set by the delay of the pipeline stage containing the "8" module: tCLK = tPD,REG + 8 + tS,REG = 11ns. So the throughput would be 1/11ns.
  1. If the pipeline registers in the given circuit were all replaced with non-ideal pipeline registers having propagation delays of 2 ns, set-up times of 1 ns, and hold times of 0 ns, how long before the system clock must the input x be set-up to assure that the pipeline registers A and B do not go into a metastable state?
    tS,X = 7 + tS,REG = 8ns.
  1. Suppose that a second output, g(x), is desired from the given circuit. It provides the partial result present at the output of the pipeline register labeled C. If we wish the outputs f(x) and g(x) to correspond to the same input after each clock, how many new pipeline registers should be added to the circuit shown?
    We need to add 2 new registers:


Problem 6. You have the task of pipelining a combinational circuit consisting entirely of 2-input NAND gates with 1ns propagation delay by adding registers having tS=1ns, tH=1 ns, tPD=2 ns and tCD=1 ns. The propagation delay of the original combinational circuit is 20 ns.
  1. Assuming you add the minimum number of registers to pipeline the circuit for maximal throughput, what is the latency of the resulting pipelined circuit?
    If the combinational circuit has a tPD of 20ns when built from 1ns components, there must be an input-output path involving 20 components. To get maximal throughput, we'd place each component in its own pipeline stage for a total of 20 stages. Each stage requires tPD,REG + tPD,NAND + tS,REG = 2 + 1 + 1 = 4ns to do its job. So the latency of the circuit pipelined for maximal throughput is 80ns.


Problem 7. Circuits Maximus, Inc. makes circuits which compute the maximum of two unsigned binary numbers. They are constructed using combinational 1-bit Maximizes modules which are cascaded to deal with longer words, as shown below:

This diagram show a 4-bit Maximizer chain which computes at the M outputs the larger of the A or B input operands. Each Maximizer module takes the Ith bit of each of two binary operands, A and B, as well as comparison outputs from the next higher-order Maximizer module in a chain, as shown below:

A "1" on either of the inputs AGin and BGin from the next higher-order module signals that A or B, respectively, is greater; both inputs are zero if the higher-order bits are identical. The M module computes the output values AGout and BGout from AGin, BGin, Ai and Bi and sends these outputs values to the next lower-order M module. It also passes either Ai or Bi as the Mi output, denoting the Ith bit of the maximum of A and B.
An implementation has been developed for the M module that has 10ns propagation delay and a 2ns contamination delay.
  1. Assuming that use of ideal registers, mark the previous diagram to show a 4-bit Maximizer pipelined for maximum throughput.
  1. To compute the maximum value of N inputs (N > 2), the following structure is proposed:
    In this circuit, the maximum of four 4-bit unsigned numbers is computed and appears at the output M3..M0. What is the latency and throughput of this combinational circuit, assuming that each M module has a propagation delay of 10ns?
    The longest path from inputs to outputs passes through 6 M modules, so the latency is 60 and the throughput is 1/60.
  1. Show how this circuit can be pipelined from maximum throughput using a minimum number of pipeline stages. Remember to include a register at each output.
    The solution below uses a different technique for pipelining a circuit. Start by labeling each module with its maximum "distance" from the inputs, i.e., the largest number of components that have to be traversed on some path from the inputs to the module in question. Label all outputs with the length (in modules) of the longest path from input to output label each input with "0". The number of pipeline registers required on each wire is the difference between the label at the start of the arrow and the end of the arrow.A common mistake: forgetting to add the necessary pipeline registers on the input and output arrows.


Problem 8. The following combinational circuit takes a single input and produces a visual output by lighting the light on the center component module.

Consider the result of pipelining the above circuit for maximum throughput, using the minimum number of registers necessary. The result would be a pipeline such that input asserted during clock period I produces the proper output on the light during clock period I+K (we want minimal K which gives maximal throughput).
  1. How many registers should appear on the bold wire (marked X) in the pipelined version of the above circuit?
    Using the pipelining technique described in the previous problem, we can see from the labels that 7 registers would be required on the wire marked X:

Synthesis of combinational logic

Problem 1. A certain function F has the following truth table:

A  B  C | F
========|===
0  0  0 | 1
0  0  1 | 0
0  1  0 | 0
0  1  1 | 1
1  0  0 | 1
1  0  1 | 1
1  1  0 | 0
1  1  1 | 1
  1. Discussed in section Write a sum-of-products expression for F.
        _ _ _   _         _ _     _
    F = A*B*C + A*B*C + A*B*C + A*B*C + A*B*C
    
  1. Discussed in section Write a minimal sum-of-products expression for F. Show a combinational circuit that implements F using only INV and NAND gates.
    First construct a Karnaugh map for F:
    To cover all the 1's in the map we have to use 3 of the 4 patches:
        _ _     _    
    F = B*C + A*B + B*C
    
    One possible schematic diagram is shown below. Note that the final 3-input NAND gate has been drawn in it's Demorganized form, i.e., an OR gate with inverting inputs.
  1. Discussed in section Implement F using one 4-input MUX and inverter.
    If we use A and B as the select inputs for the MUX then the four data inputs of the MUX should be tied to one of "0" (ground), "1" (Vdd), "C" or "not C". For this function the following is the correct schematic. Note that by changing the connections on the data inputs we could implement any function of A, B and C.
  1. Discussed in section Write a minimal sum-of-products expression for NOT(F).
    We can use the Karnaugh map above to answer this question too: just write equations for patches that cover the 0's:
Problem 2.
  1. Give a minimal sum-of-products expression that is equivalent to the follow logic expressions:
        _____
    (A) A + B
    
                _       _ _     _   _     _         _
    (B) A*B*C + A*B*C + A*B*C + A*B*C + A*B*C + A*B*C
        _ _
    (A) A*B
    
    (B) B + C
    
  1. Multiple choice: A Karnaugh map can't represent more than 2 variables along a single dimension because
    1. Gray code counts can't go beyond 2 bits.
    2. Each value v along a dimension must be adjacent to all values that are Hamming distance 1 from v.
    3. Three is not a power of two.
    4. No reason. You can represent 3 variables along a dimension. You couldn't make 5-variable K-maps otherwise.
    2. The 2-bit gray code cycle 00-01-11-10-... handles two variables just fine, but can't be extended to 3 bits. There are 3-bit gray codes but they involve non-adjacent encodings that are Hamming distance 1 apart.
  1. Discussed in section What is the maximum number of product terms in a minimal sum-of-products expressions with three variables?
    The most complicated K-map that doesn't permit any sort of simplification is a checkerboard of 1's where 50% of the K-map cells can be 1. So in a 3-input K-map, there are eight cells of which four can be 1 in a checkerboard pattern. Each 1 in the checkboard corresponds to a separate product term since it can't be combined with it's neighbors and herefore the maximal number of product terms is 4.
  1. Discussed in section True or false: A boolean function of N variables with greater than 2N-1 product terms canalways be simplified to an expression using fewer product terms.
    True. If there are more than 2N-1 product terms, the K-map of the function would be more than half full and there would be 1's in at least two adjacent cells, leading to a possible simplification.
  1. Discussed in section Suppose the stock room is very low on components and has only five NAND gates on hand. Would we be able to build an implementation of any arbitary 2-input boolean function?
    Yes. 3 of the NAND gates could be arranged in a tree, computing the sum-of-products of 2 product terms. (We know from the previous problem that any function with 2 inputs can be simplified down to just 2 product terms.) The other two NAND gates could be used as inverters to produce negations of the inputs which may be needed for the product terms.


Problem 3. In the Karnaugh maps below the use of "X" in a cell indicates a "don't care" situation where the value of the function for those inputs can be chosen to minimize the size of the overall expression.


  1. Circle the prime implicants in the Karnaugh maps and write a minimal sum-of-products expression for each of the maps.


Problem 4. The following diagram shows the pulldown circuitry for a static CMOS logic gate that implements the function F(A,B,C,D):

  1. Draw a circuit diagram for the pullup circuitry that would complete the static CMOS implementation of F(A,B,C,D).
  1. Assuming the correct pullup circuitry is added to the diagram above, give a minimal sum-of-products expression for F(A,B,C,D).
         _   _   _   _ _    _ _   _ _   _ _   _ _ _   _ _   _ _   _ _
    F = (C + D)*(A + B*C) = A*C + A*D + B*C + B*C*D = A*C + A*D + B*C
    
  1. Which of the following changes will decrease the propagation time of the above circuit?
      (A) decreasing the power supply voltage
      (B) increasing the power supply voltage
      (C) increasing the operating temperature
      (D) redefining VOL to provide increased noise margins
    (B). From Lab #1 we know that gates run faster (ie, have decreased propagation times) when the power supply is increased or the operating temperature is lowered. Redefining VOL to provide increased noise margins would require giving a lower value, making the output have to make a longer transition before it reached VOL.


Problem 5. The following PLA implements H(A,B,C):

  1. Discussed in section Write a minimal sum-of-products expression for H.
    The two horizontal signals in the PLA are product terms:
                                 _                        
    top term is pulled down when A = 1 => product term is A
                                         _    _                         _
    bottom term is pulled down when A or B or C is 1 => product term is A*B*C
    
    The rightmost column is just a distributed 2-input NOR gate that combines the product terms, and the final inverter converts this into a OR of the product terms. So:
            _
    H = A + A*B*C = A + B*C
    


Problem 6. The following diagram shows a ROM implementation of a 3-input Boolean function: Give the Boolean function represented at the output, Z.

  1. For a particular set of input values, how many of the horizontal term lines carry a logical one?
    Each horizontal term decodes one of the eight possible product terms formed from A, B and C. For example:
                                 _    _    _
    top term is pulled down when A or B or C = 1 => product term is A*B*C
                                  _    _                                  _
    next term is pulled down when A or B or C is 1 => product term is A*B*C
    
    ...
    
    So, for a particular set of input values only one of these products will be true.
  1. Discussed in section Give the Boolean function represented at the output, Z.
    The rightmost column is a distributed NOR gate that combines product terms which are connected to one of the pulldowns of that NOR gate. The final inverter converts the NOR into an OR. So working from the top down the five product terms are:
                    _     _     _       _   _ 
    Z = A*B*C + A*B*C + A*B*C + A*B*C + A*B*C
    
  1. Discussed in section Is there an example involving two valid inputs that demonstrates that the above device is not lenient?
    Yes. When A = B = 1 and C changes, both the topmost horizontal signal and the next-to-topmost horizontal signal will change value. Depending on the relative timing of these changes there may be a moment when both signals are 0 and the rightmost column momentarily ceases to be pulled down, causing the output of the final inverter to momentarily go to zero. This output glitch means the device is not lenient.
  1. Suppose you undertake to reduce the preceding ROM implementation by eliminating any components (pulldowns, wires, and inverters) that are unused or redundant. Components are eliminated -- replaced by open circuits -- until no further components can be removed without changing the function computed by the circuit. How many word lines (horizontal lines that consitute outputs of the original decoder section of the ROM) are left after this reduction has been performed?
    The function computed by this ROM reduces to B+AC. Eliminating selected pulldowns from the top two horizontal (word) lines yields two word lines corresponding to each term of this result; all the others are redundant and can be eliminated. We're left with two word lines, each of which pulls down the bit line that drives the output inverter.
  1. Which of the following is the most accurate comparison of the reduced ROM circuit with a direct, full-complementary CMOS implementation of the same function?
    1. The reduced ROM implementation uses fewer transistors.
    2. The reduced ROM implementation is faster.
    3. The reduced ROM implementation has a faster output rise time.
    4. The reduced ROM implementation is essentially identical to the direct CMOS implementation.
    3. Since the ROM output is driven by an inverter (which has a single pullup PFET) it will transition faster than the output of a complex gate (which has multiple PFETs connected in series in the pullup circuitry). In terms of size and speed, the ROM is generally slower and larger than the equivalent gate.


Problem 7.
  1. Add the necessary pulldown NFETs to the F and G circuitry in the ROM circuit diagram below to implement the indicated logic functions for F and G.
    First we need to figure out the equations for each of the product terms. From top to bottom the terms are:
            _                 _               _
    term 1: A*C     term 2: B*C     term 3: A*B     term 4: A*B
    
    G is simply the OR of terms 1 and 3. With a little thought we can rewrite F as:
                    _           _    _             _   _
    F = A + C = A + A*C = A(B + B) + A*C = A*B + A*B + A*C
    
    which is the OR of terms 1, 3 and 4. We can now fill in the necessary pulldowns:


Problem 8. A certain 3-input function G(A,B,C) has the following implementation:

  1. Give a minimal sum-of-products expression for G.
        _ _         _
    G = A*C + A*C + B
    
  1. Design a ROM implementation for G.


Problem 9. Digital Widgets Co. has introduced a new logic IC consisting of two comparator cells in a 14-pin package. A comparator cell, as drawn below, has four inputs and two outputs.

The inputs are labeled An, Bn, Gn+1, and Ln+1, and the outputs are labeled Gn and Ln. The G and L signals have the meanings "A greater than B" and "A less than B," respectively. If both G and L are false, the meaning is A = B. G and L are never both true. Two k-bit numbers A and B may be compared using a circuit such as the following:

The most significant bits are supplied as Ak-1 and Bk-1, and the least significant bits are A0 and B0.
The output of a comparison is taken from the G and L outputs of the lowest-order cell (G0 and L0). Gn+1 and Ln+1 of the highest-order cell are connected to logical 0 to indicate that the numbers are assumed to be equal until some difference is found between a pair of bits Ai and Bi.
If the Gn+1 and Ln+1 inputs indicate that higher-order bits have established A > B or A < B, then cell n must propagate that result to Gn and Ln. However, if Gn+1 and Ln+1 indicate that the higher-order bits are equal, then cell n must compare its bit of A and B to determine if A > B, A < B, or A = B and must signal that result appropriately at Gn and Ln.
  1. Draw a logic diagram for an implementation of the Digital Widgets comparator cell.
    The equations for Gn and Ln are
                ____    __
    Gn = Gn+1 + Ln+1*An*Bn
                ____ __
    Ln = Ln+1 + Gn+1*An*Bn
    
    If we construct a schematic using INV, AND and OR, the resulting circuit would have a Tpd of 3 gate delays.
  1. Since there is delay associated with the propagation of the G and L signals through each cell, we could make the comparator work faster by redesigning the basic cell to compare two bits at a time, halving the number of stages through which the G and L signals will need to propagate.
    Work out expressions for Gn and Ln as functions of Gn+2, Ln+2, An+1, Bn+1, An, and Bn. Express your answers in the form
                ____       ____                ____ ____     __
    Gn = Gn+2 + Ln+2*(An+1*Bn+1 + (An+1*Bn+1 + An+1*Bn+1)*An*Bn)
                ____  ____                     ____ ____  __
    Ln = Ln+2 + Gn+2*(An+1*Bn+1 + (An+1*Bn+1 + An+1*Bn+1)*An*Bn)
    
  1. Given a reasonable implementation of the equations for Gn and Ln derived in part B, how does the delay from a change in Gn+2 and Ln+2 to the appearance of correct outputs at Gn and Ln compare with the corresponding delay for a circuit composed of a cascade of two of the cells developed in part A? Assume that all A and B inputs remain unchanged throughout.Note: The reason for our interest in the propagation delay of the G and L signals, specifically, is that in a chain of N comparators, every extra gate delay in the G--L path will penalize total performance by N gate delays. The time it takes for a change in an A or B input to be reflected in the corresponding G or L output is also important, but improvements here can at best result in decreasing total delay by some constant amount.
    If we expand out either equation for the two-bit-at-a-time cell, we end up with 4 product terms, two of which involve 5 inputs. If we have 5-input gates in our library, we can implement the two-bit cell with the same Tpd as the one-bit cell. So Gn and Ln are produced by a two-bit cell with half the delay as from a cascade of two one-bit cells.If we restrict ourselves to 4-input gates, this adds one gate delay to Tpd for the two-bit cell (Tpd = 4), still an improvement over two one-bit cells (Tpd = 6).