Different instructions utilize different hardware blocks in the basic single-cycle implementation. The next three problems in this exercise refer to the following instruction:
4.1.1 [5] <4.1> What are the values of control signals generated by the control in Figure 4.2 for this instruction?
4.1.2 [5] <4.1> Which resources (blocks) perform a useful function for this instruction?
4.1.3 [10] <4.1> Which resources (blocks) produce outputs, but their outputs are not used for this instruction? Which resources produce no outputs for this instruction?
Different execution units and blocks of digital logic have different latencies (time needed to do their work). In Figure 4.2 there are seven kinds of major blocks. Latencies of blocks along the critical (longest-latency) path for an instruction determine the minimum latency of that instruction. For the remaining three problems in this exercise, assume the following resource latencies:
Get 4.1.3 exercise solution
4.1.4 [5] <4.1> What is the critical path for an MIPS AND instruction?
Exercise 4.2
The basic single-cycle MIPS implementation in Figure 4.2 can only implement some instructions. New instructions can be added to an existing ISA, but the decision whether or not to do that depends, among other things, on the cost and complexity such an addition introduces into the processor datapath and control. The first three problems in this exercise refer to this new instruction:
4.2.1 [10] <4.1> Which existing blocks (if any) can be used for this instruction?
When processor designers consider a possible improvement to the processor datapath, the decision usually depends on the cost/performance trade-off. In the following three problems, assume that we are starting with a datapath from Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem, and Control blocks have latencies of 400ps, 100ps, 30ps, 120ps, 200ps, 350ps, and 100ps, respectively, and costs of 1000, 30, 10, 100, 200, 2000, and 500, respectively. The remaining three problems in this exercise refer to the following processor improvement:
Exercise 4.3
Problems in this exercise refer to the following logic block:
4.3.1 [5] <4.1, 4.2> Does this block contain logic only, flip-flops only, or both?
4.3.3 [10] <4.1, 4.2> Repeat Problem 4.3.2, but the AND and OR gates you use must all be 2-input gates.
Cost and latency of digital logic depends on the kinds of basic logic elements (gates) that are available and on the properties of these gates. The remaining three problems in this exercise refer to these gates, latencies, and costs:
Compare the cost and latency of these two optimized designs.
Exercise 4.4
When implementing a logic expression in digital logic, one must use the available logic gates to implement an operator for which a gate is not available. Problems in this exercise refer to the following logic expressions:
4.4.1 [5] <4.2> Implement the logic for the Control signal 1. Your circuit should directly implement the given expression (do not reorganize the expression to “optimize” it), using NOT
gates and 2-input AND, OR, and XOR gates.
For the remaining three problems in this exercise, we assume that the following basic digital logic elements are available, and that their latency and cost are as follows:
4.4.5 [10] <4.2> What is the cost of your circuit from 4.4.3?
4.4.6 [10] <4.2> What fraction of the cost was saved in your circuit from 4.4.3 by implementing these two control signals together instead of separately?
Exercise 4.5
The goal of this exercise is to help you familiarize yourself with the design and operation of sequential logical circuits. Problems in this exercise refer to this ALU operation:
ALU Operation
a. Add (X+Y)
b. Subtract-one (X–1) in 2’s complement
4.5.1 [20] <4.2> Design a circuit with 1-bit data inputs and a 1-bit data output that accomplishes this operation serially, starting with the least-significant bit. In a serial implementation, the circuit is processing input operands bit by bit, generating output bits one by one. For example, a serial AND circuit is simply an AND gate; in cycle N we give it the Nth bit from each of the operands and we get the Nth bit of the result. In addition to data inputs, the circuit has a Clk (clock) input and a “Start” input that is set to 1 only in the very first cycle of the operation. In your design, you can use D Flip-Flops and NOT, AND, OR, and XOR gates.
Get 4.5.1 exercise solution
The time given for a D-element is its setup time. The data input of a flip-flop must have the correct value one setup-time before the clock edge (end of clock cycle) that stores that value into the flip-flop.
4.5.3 [10] <4.2> What is the cycle time for the circuit you designed in 4.5.1? How long does it take to perform the 32-bit operation?
4.5.5 [10] <4.2> Compute the cost for the circuit you designed in 4.5.1, and then for the circuit you designed in 4.5.2.
Exercise 4.6 Problems in this exercise assume that logic blocks needed to implement a processor’s datapath have the following latencies:
4.6.1 [10] <4.3> If the only thing we need to do in a processor is fetch consecutive instructions (Figure 4.6),
what would the cycle time be?
4.6.2 [10] <4.3> Consider a datapath similar to the one in Figure 4.11, but for a processor that only has one type of instruction: unconditional PC-relative branch. What would the cycle time be for this datapath?
The remaining three problems in this exercise refer to the following logic block (resource) in the datapath:
given latency of this resource affect the cycle time of the processor.
Assume that the latencies of other resources do not change.
Exercise 4.7 In this exercise we examine how latencies of individual components of the datapath affect the clock cycle time of the entire datapath, and how these components are utilized by instructions. For problems in this exercise, assume the following latencies for logic blocks in the datapath:
4.7.1 [10] <4.3> What is the clock cycle time if the only types of instructions we need to support are ALU instructions (ADD, AND, etc.)?
For the remaining problems in this exercise, assume that there are no pipeline stalls and that the breakdown of executed instructions is as follows:
4.7.4 [10] <4.3> In what fraction of all cycles is the data memory used?
When silicon chips are fabricated, defects in materials (e.g., silicon) and manufacturing errors can result in defective circuits. A very common defect is for one wire to affect the signal in another. This is called a cross-talk fault. A special class of cross-talk faults is when a signal is connected to a wire that has a constant logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a stuck-at-1 fault, and the affected signal always has a logical value of 0 or 1, respectively.
The following problems refer to the following signal from Figure 4.24:
Signal
a. Registers, input Write Register, bit 0
b. Add unit in upper right corner, ALU result, bit 0
4.8.1 [10] <4.3, 4.4> Let us assume that processor testing is done by filling the PC, registers, and data and instruction memories with some values (you can choose which values), letting a single instruction execute, then reading the PC, memories, and registers. These values are then examined to determine if a particular fault is present. Can you design a test (values for PC, memories, and registers) that would determine if there is a stuck-at-0 fault on this signal?
4.8.2 [10] <4.3, 4.4> Repeat 4.8.1 for a stuck-at-1 fault. Can you use a single test for both stuck-at-0 and stuck-at-1? If yes, explain how; if no, explain why not.
4.8.3 [60] <4.3, 4.4> If we know that the processor has a stuck-at-1 fault on this signal, is the processor still usable? To be usable, we must be able to convert any program that executes on a normal MIPS processor into a program that works on this processor. You can assume that there is enough free instruction memory and data memory to let you make the program longer and store additional data. Hint: the processor is usable if every instruction “broken” by this fault can be replaced with a sequence of “working” instructions that achieve the same effect.
The following problems refer to the following fault:
Fault
a. Stuck-at-0
b. Becomes 0 if RegDst control signal is 0, no fault otherwise
4.8.4 [10] <4.3, 4.4> Repeat 4.8.1, but now the fault to test for is whether the “MemRead” control signal has this fault.
4.8.5 [10] <4.3, 4.4> Repeat 4.8.1, but now the fault to test for is whether the “Jump” control signal has this fault.
4.8.6 [40] <4.3, 4.4> Using a single test described in 4.8.1, we can test for faults in several different signals, but typically not all of them. Describe a series of tests to look for this fault in all Mux outputs (every output bit from each of the five Muxes). Try to do this with as few single-instruction tests as possible.
Exercise 4.9
In this exercise we examine the operation of the single-cycle datapath for a particular instruction. Problems in this exercise refer to the following MIPS instruction:
Instruction
a. SW R4,–100(R16)
b. SLT R1,R2,R3
4.9.1 [10] <4.4> What is the value of the instruction word?
4.9.2 [10] <4.4> What is the register number supplied to the register file’s “Read register 1” input? Is this register actually read? How about “Read register 2”?
4.9.3 [10] <4.4> What is the register number supplied to the register file’s “Write register” input? Is this register actually written?
Different instructions require different control signals to be asserted in the datapath. The remaining problems in this exercise refer to the following two control signals from Figure 4.24:
4.9.4 [20] <4.4> What is the value of these two signals for this instruction?
4.9.5 [20] <4.4> For the datapath from Figure 4.24, draw the logic diagram for the part of the control unit that implements just the first signal. Assume that we only need to support LW, SW, BEQ, ADD, and J (jump) instructions.
4.9.6 [20] <4.4> Repeat 4.9.5, but now implement both of these signals.
Exercise
4.10 In this exercise we examine how the clock cycle time of the processor affects the design of the control unit, and vice versa. Problems in this exercise assume that the logic blocks used to implement the datapath have the following latencies:
4.10.1 [10] <4.2, 4.4> To avoid lengthening the critical path of the datapath shown in Figure 4.24, how much time can the control unit take to generate the MemWrite signal?
The remaining problems in this exercise assume that the time needed by the control unit to generate individual control signals is as follows
slowing it down?
Exercise 4.11
In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor fetches the following instruction word:
Instruction word
a. 10101100011000100000000000010100
b. 00000000100000100000100000101010
4.11.1 [5] <4.4> What are the outputs of the sign-extend and the jump “Shift left 2” unit (near the top of Figure 4.24) for this instruction word?
4.11.2 [10] <4.4> What are the values of the ALU control unit’s inputs for this instruction?
4.11.3 [10] <4.4> What is the new PC address after this instruction is executed? Highlight the path through which this value is determined.
The remaining problems in this exercise assume that data memory is all zeros and that the processor’s registers have the following values at the beginning of the cycle in which the above instruction word is fetched:
Exercise 4.12
In this exercise, we examine how pipelining affects the clock cycle time of the processor. Problems in this exercise assume that individual stages of the datapath have the following latencies:
4.12.1 [5] <4.5> What is the clock cycle time in a pipelined and non-pipelined processor?
Get 4.12.1 exercise solution
The remaining problems in this exercise assume that instructions executed by the processor are broken down as follows:
Get 4.12.3 exercise solution
Exercise 4.13
In this exercise, we examine how data dependences affect execution in the basic 5-stage pipeline described in Section 4.5. Problems in this exercise refer to the following sequence of instructions:
Instruction Sequence
a.
SW R16,–100(R6)
LW R4,8(R16)
ADD R5,R4,R4
b.
OR R1,R2,R3
OR R2,R1,R4
OR R1,R1,R2
4.13.1 [10] <4.5> Indicate dependences and their type.
4.13.6 [10] <4.5> What is the total execution time of this instruction sequence with only ALU-ALU forwarding? What is the speedup over a no-forwarding pipeline?
Exercise 4.14
In this exercise, we examine how resource hazards, control hazards, and ISA design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:
4.14.1 [10] <4.5> For this problem, assume that all branches are perfectly predicted (this eliminates all control
hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding NOPs to the code. Can you do the same with this structural hazard? Why?
The remaining problems in this exercise assume that individual pipeline stages have the following latencies:
Exercise 4.15
In this exercise, we examine how the ISA affects pipeline design. Problems in this exercise refer to the following new instruction:
4.15.1 [20] <4.5> What must be changed in the pipelined datapath to add this instruction to the MIPS ISA?
Are stalls due to existing hazards made worse?
4.15.4 [10] <4.5, 4.13> Give an example of where this instruction might be useful and a sequence of existing MIPS instructions that are replaced by this instruction.
4.15.5 [10] <4.5, 4.11, 4.13> If this instruction already exists in a legacy ISA, explain how it would be executed in a modern processor like AMD Barcelona.
The last problem in this exercise assumes that each use of the new instruction replaces the given number of original instructions, that the replacement can be made once in the given number of original instructions, and that each time the new instruction is executed the given number of extra stall cycles is added to the program’s execution time:
Exercise 4.16
The first three problems in this exercise refer to the following MIPS instruction:
Instruction
a. SW R16,–100(R6)
b. OR R2,R1,R0
4.16.1 [5] <4.6> As this instruction executes, what is kept in each register located between two pipeline stages?
4.16.2 [5] <4.6> Which registers need to be read, and which registers are actually read?
4.16.3 [5] <4.6> What does this instruction do in the EX and MEM stages?
The remaining three problems in this exercise refer to the following loop. Assume that perfect branch prediction is used (no stalls due to control hazards), that there are no delay slots, and that the pipeline has full forwarding support. Also assume that many iterations of this loop are executed before the loop exits.
Exercise 4.17 Problems in this exercise assume that instructions executed by a pipelined processor are broken down as follows:
4.17.1 [5] <4.6> Assuming there are no stalls and that 60% of all conditional branches are taken, in what percentage of clock cycles does the branch adder in the EX stage generate a value that is actually used?
Get 4.17.2 exercise solution
Each pipeline stage in Figure 4.33 has some latency. Additionally, pipelining introduces registers between stages (Figure 4.35), and each of these adds an additional latency. The remaining problems in this exercise assume the following latencies for logic within each pipeline stage and for each register between two stages:
Exercise 4.18
The first three problems in this exercise refer to the execution of the following instruction in the pipelined datapath from Figure 4.51, and assume the following clock cycle time, ALU latency, and Mux latency:
4.18.1 [10] <4.6> For each stage of the pipeline, what are the values of the control signals asserted by this instruction in that pipeline stage?
Exercise 4.19
This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined datapaths from Figure 4.45. These problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependences are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.
4.19.1 [10] <4.7> If we use no forwarding, what fraction of cycles are we stalling due to data hazards?
are we staling due to data hazards?
The remaining three problems in this exercise refer to the following latencies for individual pipeline stages. For the EX stage, latencies are given separately for a processor without forwarding and for a processor with different kinds of forwarding.
Exercise 4.20
Problems in this exercise refer to the following instruction sequences:
4.20.1 [5] <4.7> Find all data dependences in this instruction sequence.
The remaining three problems in this exercise assume that, before any of the above is executed, all values in data memory are zeroes and that registers R0 through R3 have the following initial values:
4.20.4 [5] <4.7> Which value is the first one to be forwarded and what is the value it overrides?
4.20.6 [10] <4.7> For the design described in 4.20.5, add NOPs to this instruction sequence to ensure correct execution in spite of missing support for forwarding.
Exercise 4.21
This exercise is intended to help you understand the relationship between forwarding, hazard detection, and ISA design. Problems in this exercise refer to the following sequences of instructions, and assume that it is executed on a 5-stage pipelined datapath:
4.21.1 [5] <4.7> If there is no forwarding or hazard detection, insert NOPs to ensure correct execution.
which signals are asserted in each cycle by hazard detection and forwarding units in Figure 4.60.
Exercise 4.22
This exercise is intended to help you understand the relationship between delay slots, control hazards, and branch execution in a pipelined processor. In this exercise, we assume that the following MIPS code is executed on a pipelined processor with a 5-stage pipeline, full forwarding, and a predict-taken branch predictor:
4.22.1 [10] <4.8> Draw the pipeline execution diagram for this code, assuming there are no delay slots and that branches execute in the EX stage.
4.22.2 [10] <4.8> Repeat 4.22.1, but assume that delay slots are used. In the given code, the instruction that follows the branch is now the delay slot instruction for that branch.
4.22.3 [20] <4.8> One way to move the branch resolution one stage earlier is to not need an ALU operation in conditional branches. The branch instructions would be “BEZ Rd,Label” and “BNEZ Rd,Label”, and it would branch if the register has and does not have a zero value, respectively. Change this code to use these branch instructions instead of BEQ. You can assume that register R8 is available for you to use as a temporary register, and that an SEQ (set if equal) R-type instruction can be used.
Section 4.8 describes how the severity of control hazards can be reduced by moving branch execution into the ID stage. This approach involves a dedicated comparator in the ID stage, as shown in Figure 4.62. However, this approach potentially adds to the latency of the ID stage, and requires additional forwarding logic and hazard detection.
4.22.4 [10] <4.8> Using the first branch instruction in the given code as an example, describe the hazard detection logic needed to support branch execution in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed to detect?
4.22.5 [10] <4.8> For the given code, what is the speedup achieved by moving branch execution into the ID stage? Explain your answer. In your speedup calculation, assume that the additional comparison in the ID stage does not affect clock cycle time.
4.22.6 [10] <4.8> Using the first branch instruction in the given code as an example, describe the forwarding support that must be added to support branch execution in the ID stage. Compare the complexity of this new forwarding unit to the complexity of the existing forwarding unit in Figure 4.62.
Exercise 4.23 The importance of having a good branch predictor depends on how often conditional branches are executed. Together with branch predictor accuracy, this will determine how much time is spent stalling due to mispredicted branches. In this exercise, assume that the breakdown of dynamic instructions into various instruction categories is as follows:
4.23.1 [10] <4.8> Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are determined in the EX stage, that there are no data hazards, and that no delay slots are used.
4.23.2 [10] <4.8> Repeat 4.23.1 for the “always-not-taken” predictor.
4.23.3 [10] <4.8> Repeat 4.23.1 for the 2-bit predictor.
4.23.4 [10] <4.8> With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaces a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.
4.23.5 [10] <4.8> With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaced each branch instruction with two ALU instructions? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.
4.23.6 [10] <4.8> Some branch instructions are much more predictable than others. If we know that 80% of all executed branch instructions are easy-to-predict loop-back branches that are always predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the branch instructions?
Exercise 4.24
This exercise examines the accuracy of various branch predictors for the following repeating pattern (e.g., in a loop) of branch outcomes:
4.24.1 [5] <4.8> What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes?
4.24.2 [5] <4.8> What is the accuracy of the two-bit predictor for the first 4 branches in this pattern, assuming that the predictor starts off in the bottom left state from Figure 4.63 (predict not taken)?
4.24.3 [10] <4.8> What is the accuracy of the two-bit predictor if this pattern is repeated forever?
4.24.4 [30] <4.8> Design a predictor that would achieve a perfect accuracy if this pattern is repeated forever. You predictor should be a sequential circuit with one output that provides a prediction (1 for taken, 0 for not taken) and no inputs other than the clock and the control signal that indicates that the instruction is a conditional branch.
4.24.5 [10] <4.8> What is the accuracy of your predictor from 4.24.4 if it is given a repeating pattern that is the exact opposite of this one?
Get 4.24.5 exercise solution
4.24.6 [20] <4.8> Repeat 4.24.4, but now your predictor should be able to eventually (after a warm-up period during which it can make wrong predictions) start perfectly predicting both this pattern and its opposite. Your predictor should have an input that tells it what the real outcome was. Hint: this input lets your predictor determine which of the two repeating patterns it is given.
Exercise 4.25
This exercise explores how exception handling affects pipeline design. The first three problems in this exercise refer to the following two instructions:
4.25.1 [5] <4.9> Which exceptions can each of these instructions trigger? For each of these exceptions, specify the pipeline stage in which it is detected.
4.25.2 [10] <4.9> If there is a separate handler address for each exception, show how the pipeline organization must be changed to be able to handle this exception. You can assume that the addresses of these handlers are known when the processor is designed.
4.25.3 [10] <4.9> If the second instruction from this table is fetched right after the instruction from the first table, describe what happens in the pipeline when the first instruction causes the first exception you listed in 4.25.1. Show the pipeline execution diagram from the time the first instruction is fetched until the time the first instruction of the exception handler is completed.
The remaining three problems in this exercise assume that exception handlers are located at the following addresses:
4.25.4 [5] <4.9> What is the address of the exception handler in 4.25.3? What happens if there is an invalid instruction at that address in instruction memory?
4.25.5 [20] <4.9> In vectored exception handling, the table of exception handler addresses is in data memory at a known (fixed) address. Change the pipeline to implement this exception handling mechanism. Repeat 4.25.3 using this modified pipeline and vectored exception handling.
4.25.6 [15] <4.9> We want to emulate vectored exception handling (described in 4.25.5) on a machine that has only one fixed handler address. Write the code that should be at that fixed address. Hint: this code should identify the exception, get the right address from the exception vector table, and transfer execution to that handler.
Exercise 4.26 This exercise explores how exception handling affects control unit design and processor clock cycle time. The first three problems in this exercise refer to the following MIPS instruction that triggers an exception:
4.26.1 [10] <4.9> For each stage of the pipeline, determine the values of exception-related control signals from Figure 4.66 as this instruction passes through that pipeline stage.
4.26.2 [5] <4.9> Some of the control signals generated in the ID stage are stored into the ID/EX pipeline register, and some go directly into the EX stage. Explain why, using this instruction as an example.
4.26.3 [10] <4.9> We can make the EX stage faster if we check for exceptions in the stage after the one in which the exceptional condition occurs. Using this instruction as an example, describe the main disadvantage of this approach.
The remaining three problems in this exercise assume that pipeline stages have the following latencies:
4.26.5 [20] <4.9> Can we generate exception control signals in EX instead of in ID? Explain how this will work or why it will not work, using the “BNE R4,R5,Label” instruction and these pipeline stage latencies as an example.
4.26.6 [10] <4.9> Assuming that each Mux has a latency of 40ps, determine how much time does the control unit have to generate the flush signals? Which signal is the most critical?
Exercise 4.27
This exercise examines how exception handling interacts with branch and load/ store instructions. Problems in this exercise refer to the following branch instruction and the corresponding delay slot instruction:
4.27.1 [20] <4.9> Assume that this branch is correctly predicted as taken, but then the instruction at “Label” is an undefined instruction. Describe what is done in each pipeline stage for each cycle, starting with the cycle in which the branch is decoded up to the cycle in which the first instruction of the exception handler is fetched.
4.27.2 [10] <4.9> Repeat 4.27.1, but this time assume that the instruction in the delay slot also causes a hardware error exception when it is in MEM stage.
4.27.3 [10] <4.9> What is the value in the EPC if the branch is taken but the delay slot causes an exception? What happens after the execution of the exception handler is completed?
The remaining three problems in this exercise also refer to the following store instruction:
Get 4.27.4 exercise solution
4.27.5 [10] <4.9> If LD/ST address computation can overflow, can we delay overflow exception detection into the MEM stage? Use the given store instruction to explain what happens.
4.27.6 [10] <4.9> For debugging, it is useful to be able to detect when a particular value is written to a particular memory address. We want to add two new registers, WADDR and WVAL. The processor should trigger an exception when the value equal to WVAL is about to be written to address WADDR. How would you change the pipeline to implement this? How would this SW instruction be handled by your modified datapath?
Exercise 4.28
In this exercise we compare the performance of 1-issue and 2-issue processors, taking into account program transformations that can be made to optimize for 2-issue execution. Problems in this exercise refer to the following loop (written in C):
When writing MIPS code, assume that variables are kept in registers as follows, and that all registers except those indicated as Free are used to keep various variables, so they cannot be used for anything else.
4.28.1 [10] <4.10> Translate this C code into MIPS instructions. Your translation should be direct, without rearranging instructions to achieve better performance.
4.28.2 [10] <4.10> If the loop exits after executing only two iterations, draw a pipeline diagram for your MIPS code from 4.28.1 executed on a 2-issue processor shown in Figure 4.69. Assume the processor has perfect branch prediction and can fetch any two instructions (not just consecutive instructions) in the same cycle.
4.28.3 [10] <4.10> Rearrange your code from 4.28.1 to achieve better performance on a 2-issue statically scheduled processor from Figure 4.69.
4.28.4 [10] <4.10> Repeat 4.28.2, but this time use your MIPS code from 4.28.3.
4.28.5 [10] <4.10> What is the speedup of going from a 1-issue processor to a 2-issue processor from Figure 4.69? Use your code from 4.28.1 for both 1-issue and 2-issue, and assume that 1,000,000 iterations of the loop are executed. As in 4.28.2, assume that the processor has perfect branch predictions, and that a 2-issue processor can fetch any two instructions in the same cycle.
4.28.6 [10] <4.10> Repeat 4.28.5, but this time assume that in the 2-issue processor one of the instructions to be executed in a cycle can be of any kind, and the other must be a non-memory instruction.
Exercise 4.29 In this exercise, we consider the execution of a loop in a statically scheduled superscalar processor. To simplify the exercise, assume that any combination of instruction types can execute in the same cycle, e.g., in a 3-issue superscalar, the three instructions can be 3 ALU operations, 3 branches, 3 load/store instructions, or any combination of these instructions. Note that this only removes a resource constraint, but data and control dependences must still be handled correctly. Problems in this exercise refer to the following loop:
4.29.1 [10] <4.10> If many (e.g., 1,000,000) iterations of this loop are executed, determine the fraction of all register reads that are useful in a 2-issue static superscalar processor.
4.29.2 [10] <4.10> If many (e.g., 1,000,000) iterations of this loop are executed, determine the fraction of all register reads that are useful in a 3-issue static superscalar processor. Compare this to your result for a 2-issue processor from 4.29.1.
4.29.3 [10] <4.10> If many (e.g., 1,000,000) iterations of this loop are executed, determine the fraction of cycles in which two or three register write ports are used in a 3-issue static superscalar processor.
4.29.4 [20] <4.10> Unroll this loop once and schedule it for a 2-issue static superscalar processor. Assume that the loop always executes an even number of iterations. You can use registers R10 through R20 when changing the code to eliminate dependences.
4.29.5 [20] <4.10> What is the speedup of using your code from 4.29.4 instead of the original code with a 2-issue static superscalar processor? Assume that the loop has many (e.g., 1,000,000) iterations.
4.29.6 [10] <4.10> What is the speedup of using your code from 4.29.4 instead of the original code with a pipelined (1-issue) processor? Assume that the loop has many (e.g., 1,000,000) iterations.
Exercise 4.30
In this exercise, we make several assumptions. First, we assume that an N-issue superscalar processor can execute any N instructions in the same cycle, regardless of their types. Second, we assume that every instruction is independently chosen, without regard for the instruction that precedes or follows it. Third, we assume that there are no stalls due to data dependences, that no delay slots are used, and that branches execute in the EX stage of the pipeline. Finally, we assume that instructions executed in the program are distributed as follows:
4.30.1 [5] <4.10> What is the CPI achieved by a 2-issue static superscalar processor on this program?
4.30.2 [10] <4.10> In a 2-issue static superscalar whose predictor can only handle one branch per cycle, what speedup is achieved by adding the ability to predict two branches per cycle? Assume a stall-on-branch policy for branches that the predictor cannot handle.
4.30.3 [10] <4.10> In a 2-issue static superscalar processor that only has one register write port, what speedup is achieved by adding a second register write port?
4.30.4 [5] <4.10> For a 2-issue static superscalar processor with a classic 5-stage pipeline, what speedup is achieved by making the branch prediction perfect?
4.30.5 [10] <4.10> Repeat 4.30.4, but for a 4-issue processor. What conclusion can you draw about the importance of good branch prediction when the issue width of the processor is increased?
4.30.6 <4.10> Repeat 4.30.5, but now assume that the 4-issue processor has 50 pipeline stages. Assume that each of the original 5 stages is broken into 10 new stages, and that branches are executed in the first of ten new EX stages. What conclusion can you draw about the importance of good branch prediction when the pipeline depth of the processor is increased?
Exercise 4.31
Problems in this exercise refer to the following loop, which is given as x86 code and also as an MIPS translation of that code. You can assume that this loop executes many iterations before it exits. When determining performance, this means that you only need to determine what the performance would be in the “steady state,” not for the first few and the last few iterations of the loop. Also, you can assume full forwarding support and perfect branch prediction without delay slots, so the only hazards you have to worry about are resource hazards and data hazards. Note that most x86 instructions in this problem have two operands each. The last (usually second) operand of the instruction indicates both the first source data value and the destination. If the operation needs a second source data value, it is indicated by the other operand of the instruction. For example, “sub (edx),eax” reads the memory location pointed by register edx, subtracts that value from register eax, and puts the result back in register eax.
4.31.1 [20] <4.11> What CPI would be achieved if the MIPS version of this loop is executed on a 1-issue processor with static scheduling and a 5-stage pipeline?
4.31.2 [20] <4.11> What CPI would be achieved if the X86 version of this loop is executed on a 1-issue processor with static scheduling and a 7-stage pipeline? The stages of the pipeline are IF, ID, ARD, MRD, EXE, and WB. Stages IF and ID are similar to those in the 5-stage MIPS pipeline. ARD computes the address of the memory location to be read, MRD performs the memory read, EXE executes the operation, and WB writes the result to register or memory. The data memory has a read port (for instructions in the MRD stage) and a separate write port (for instructions in the WB stage).
4.31.3 [20] <4.11> What CPI would be achieved if the X86 version of this loop is executed on a processor that internally translates these instructions into MIPS-like micro-operations, then executes these micro-operations on a 1-issue 5-stage pipeline with static scheduling. Note that the instruction count used in CPI computation for this processor is the X86 instruction count.
4.31.4 [20] <4.11> What CPI would be achieved if the MIPS version of this loop is executed on a 1-issue processor with dynamic scheduling? Assume that our processor is not doing register renaming, so you can only reorder instructions that have no data dependences.
4.31.5 [30] <4.10, 4.11> Assuming that there are many free registers available, rename the MIPS version of this loop to eliminate as many data dependences as possible between instructions in the same iteration of the loop. Now repeat 4.31.4, using your new renamed code.
4.31.6 [20] <4.10, 4.11> Repeat 4.31.4, but this time assume that the processor assigns a new name to the result of each instruction as that instruction is decoded, and then renames registers used by subsequent instructions to use correct register values.
Exercise 4.32
Problems in this exercise assume that branches represent the following fraction of all executed instructions, and the following branch predictor accuracy. Assume that the processor is never stalled by data and resource dependences, i.e., the processor always fetches and executes the maximum number of instructions per cycle if there are no control hazards. For control dependences, the processor uses branch prediction and continues fetching from the predicted path. If the branch has been mispredicted, when the branch outcome is resolved the instructions fetched after the mispredicted branch are discarded, and in the next cycle the processor starts fetching from the correct path.
4.32.1 [5] <4.11> How many instructions are expected to be executed between the time one branch misprediction is detected and the time the next branch misprediction is detected?
The remaining problems in this exercise assume the following pipeline depth and that the branch outcome is determined in the following pipeline stage (counting from stage 1):
4.32.3 [5] <4.11> How many instructions are fetched from the wrong path for each branch misprediction in a 4-issue processor?
4.32.4 [10] <4.11> What is the speedup achieved by changing the processor from 4-issue to 8-issue? Assume that the 8-issue and the 4-issue processor differ only in the number of instructions per cycle, and are otherwise identical (pipeline depth, branch resolution stage, etc.).
4.32.5 [10] <4.11> What is the speedup of executing branches 1 stage earlier in a 4-issue processor?
4.32.6 [10] <4.11> What is the speedup of executing branches 1 stage earlier in an 8-issue processor? Discuss the difference between this result and the result from 4.32.5.
Exercise 4.33 This exercise explores how branch prediction affects performance of a deeply pipelined multiple-issue processor. Problems in this exercise refer to a processor with the following number of pipeline stages and instructions issued per cycle:
4.33.1 [10] <4.11> How many register read ports should the processor have to avoid any resource hazards due to register reads?
4.33.2 [10] <4.11> If there are no branch mispredictions and no data dependences, what is the expected performance improvement over a 1-issue processor with the classical 5-stage pipeline? Assume that the clock cycle time decreases in proportion to the number of pipeline stages
4.33.3 [10] <4.11> Repeat 4.33.2, but this time every executed instruction has a RAW data dependence to the instruction that executes right after it. You can assume that no stall cycles are needed, i.e., forwarding allows consecutive instructions to execute in back-to-back cycles.
For the remaining three problems in this exercise, unless the problem specifies otherwise, assume the following statistics about what percentage of instructions are branches, predictor accuracy, and performance loss due to branch mispredictions:
4.33.5 [20] <4.11> If we want to limit stalls due to mispredicted branches to no more than the given percentage of the ideal (no stalls) execution time, what should
be our branch prediction accuracy? Ignore the given predictor accuracy number.
4.33.6 [10] <4.11> What should the branch prediction accuracy be if we are willing to have a speedup of 0.5 (one half) relative to the same processor with an ideal branch predictor?
Exercise 4.34
This exercise is designed to help you understand the discussion of the “Pipelining is easy” fallacy from Section 4.13. The first four problems in this exercise refer to the following MIPS instruction:
4.34.1 [10] <4.13> Describe a pipelined datapath needed to support only this instruction. Your datapath should be designed with the assumption that the only instructions that will ever be executed are instances of this instruction.
4.34.2 [10] <4.13> Describe the requirements of forwarding and hazard detection units for your datapath from 4.34.1.
4.34.5 [10] <4.13> Repeat 4.34.2 for your extended datapath from 4.34.4.
4.34.6 [10] <4.13> Repeat 4.34.3 for your extended datapath from 4.34.4.
Exercise 4.35
This exercise is intended to help you better understand the relationship between ISA design and pipelining. Problems in this exercise assume that we have a multiple-issue pipelined processor with the following number of pipeline stages, instructions issued per cycle, stage in which branch outcomes are resolved, and branch predictor accuracy:
4.35.1 [5] <4.8, 4.13> Control hazards can be eliminated by adding branch delay slots. How many delay slots must follow each branch if we want to eliminate all control hazards in this processor?
4.35.2 [10] <4.8, 4.13> What is the speedup that would be achieved by using four branch delay slots to reduce control hazards in this processor? Assume that there are no data dependences between instructions and that all four delay slots can be filled with useful instructions without increasing the number of executed instructions. To make your computations easier, you can also assume that the mispredicted branch instruction is always the last instruction to be fetched in a cycle, i.e., no instructions that are in the same pipeline stage as the branch are fetched from the wrong path.
4.35.3 [10] <4.8, 4.13> Repeat 4.35.2, but now assume that 10% of executed branches have all four delay slots filled with useful instruction, 20% have only three useful instructions in delay slots (the fourth delay slot is an NOP), 30% have only two useful instructions in delay slots, and 40% have no useful instructions in their delay slots.
The remaining four problems in this exercise refer to the following C loop:
4.35.4 [10] <4.8, 4.13> Translate this C loop into MIPS instructions, assuming that our ISA requires one delay slot for every branch. Try to fill delay slots with non-NOP instructions when possible. You can assume that variables a, b, c, i, and j are kept in registers r1, r2, r3, r4, and r5.
4.35.5 [10] <4.7, 4.13> Repeat 4.35.4 for a processor that has two delay slots for every branch.
4.35.6 [10] <4.10, 4.13> How many iterations of your loop from 4.35.4 can be “in flight” within this processor’s pipeline? We say that an iteration is “in flight” when at least one of its instructions has been fetched and has not yet been committed.
Exercise 4.36
This exercise is intended to help you better understand the last pitfall from Section 4.13—failure to consider pipelining in instruction set design. The first four problems in this exercise refer to the following new MIPS instruction:
4.36.1 [10] <4.11, 4.13> Translate this instruction into MIPS micro-operations.
4.36.2 [10] <4.11, 4.13> How would you change the 5-stage MIPS pipeline to add support for micro-op translation needed to support this new instruction?
4.36.3 [20] <4.13> If we want to add this instruction to the MIPS ISA, discuss the changes to the pipeline (which stages, which structures in which stage) that are needed to directly (without micro-ops) support this instruction.
4.36.4 [10] <4.13> How often do you expect this instruction can be used? Do you think that we would be justified if we added this instruction to the MIPS ISA?
The remaining two problems in this exercise are about adding a new ADDM instruction to the ISA. In a processor to which ADDM has been added, these problems assume the following breakdown of clock cycles according to which instruction is completed in that cycle (or which stall is preventing an instruction from completing):
4.36.6 [10] <4.13> Repeat 4.36.5, but now assume that ADDM was supported by adding a pipeline stage. When ADDM is translated, this extra stage can be removed and, as a result, half of the existing data stalls are eliminated. Note that the data stall elimination applies only to stalls that existed before ADDM translation, not to stalls added by the ADDM translation itself.
Exercise 4.37
This exercise explores some of the tradeoffs involved in pipelining, such as clock cycle time and utilization of hardware resources. The first three problems in this exercise refer to the following MIPS code. The code is written with an assumption that the processor does not use delay slots.
4.37.1 [5] <4.3, 4.14> Which parts of the basic single-cycle datapath are used by all of these instructions? Which parts are the least utilized?
4.37.2 [10] <4.6, 4.14> What is the utilization for the read and for the write port of the data memory unit?
4.37.3 [10] <4.6, 4.14> Assume that we already have a single-cycle design. How many bits in total do we need for pipeline registers to implement the pipelined design?
The remaining three problems in this exercise assume that components of the datapath have the following latencies:
4.37.5 [10] <4.3, 4.5, 4.14> Repeat 4.37.4, but now assume that we only want to support ADD instructions.
4.37.6 [20] <4.3, 4.5, 4.14> If it costs $1 to reduce the latency of a single component of the datapath by 1ps, what would it cost to reduce the clock cycle time by 20% in the single-cycle and in the pipelined design?
Exercise 4.38
This exercise explores energy efficiency and its relationship with performance. Problems in this exercise assume the following energy consumption for activity in Instruction memory, Registers, and Data memory. You can assume that the other components of the datapath spend a negligible amount of energy.
4.38.1 [10] <4.3, 4.6, 4.14> How much energy is spent to execute an ADD instruction in a single-cycle design and in the 5-stage pipelined design?
4.38.2 [10] <4.6, 4.14> What is the worst-case MIPS instruction in terms of energy consumption, and what is the energy spent to execute it?
4.38.3 [10] <4.6, 4.14> If energy reduction is paramount, how would you change the pipelined design? What is the percentage reduction in the energy spent by an LW instruction after this change?
The remaining three problems in this exercise assume that components in the datapath have the following latencies. You can assume that the other components of the datapath have negligible latencies.
4.38.4 [10] <4.6, 4.14> What is the performance impact of your changes from 4.38.3?
4.38.5 [10] <4.6, 4.14> We can eliminate the MemRead control signal and have the data memory be read in every cycle, i.e., we can permanently have MemRead=1. Explain why the processor still functions correctly after this change. What is the effect of this change on clock frequency and energy consumption?
4.38.6 [10] <4.6, 4.14> If an idle unit spends 10% of the power it would spend if it were active, what is the energy spent by the instruction memory in each cycle? What percentage of the overall energy spent by the instruction memory does this idle energy represent?
Exercise 4.39 Problems in this exercise assume that, during an execution of the program, processor cycles are spent in the following way. A cycle is “spent” on an instruction if the processor completes that type of instruction in that cycle; a cycle is “spent” on a stall if the processor could not complete an instruction in that cycle because of a stall.
Problems in this exercise also assume that individual pipeline stages have the following latency and energy consumption. The stage expends this energy in order
to do its work within the given latency. Note that no energy is spent in the MEM stage during a cycle in which there is no memory access. Similarly, no energy is spent in the WB stage in a cycle in which there is no register write. In several of the following problems, we make assumptions about how energy consumption changes if a stage performs its work slower or faster than this.
4.39.1 [10] <4.14> What is the performance (in instructions per second)?
4.39.2 [10] <4.14> What is the power dissipated in watts (joules per second)?
4.39.3 [10] <4.6, 4.14> Which pipeline stages can you slow down and by how much, without affecting the clock cycle time?
4.39.4 [20] <4.6, 4.14> It is often possible to sacrifice some speed in a circuit in order to reduce its energy consumption. Assume that we can reduce energy consumption by a factor of X (new energy is 1/X times the old energy) when we increase the latency by a factor of X (new latency is X times the old latency). Using this tradeoff, we can adjust latencies of pipeline stages to minimize energy consumption without sacrificing any performance. Repeat 4.39.2 for this adjusted processor.
4.39.5 [10] <4.6, 4.14> Repeat 4.39.4, but this time the goal is to minimize energy spent per instruction while increasing the clock cycle time by no more than 10%.
4.39.6 [10] <4.6, 4.14> Repeat 4.39.5, but now assume that energy consumption is reduced by a factor of X2 when latency is made X times longer. What are the power savings compared to what you computed for 4.39.2?