Computer Organization and Design - Chapter 7 - Book solutions - 4th edition - Patterson, Hennessy

Exercise 7.1

First, write down a list of your daily activities that you typically do on a weekday.  For instance, you might get out of bed, take a shower, get dressed, eat breakfast, dry  your hair, brush your teeth, etc. Make sure to break down your list so you have a  minimum of 10 activities.


7.1.1 [5] <7.2> Now consider which of these activities is already exploiting some form of parallelism (e.g., brushing multiple teeth at the same time, versus one at a time; carrying one book at a time to school, versus loading them all into your backpack and then carrying them “in parallel”). For each of your activities, discuss whether it is already being performed in parallel and, if not, why it is not.


7.1.2 [5] <7.2> Next, consider which of the activities could be carried out concurrently (e.g., eating breakfast and listening to the news). For each of your activities, describe which other activity could be paired with this activity.


7.1.3 [5] <7.2> For 7.1.2, what could we change about current systems (e.g.,  showers, clothes, TVs, cars) so that we could perform more tasks in parallel? 


7.1.4 [5] <7.2> Estimate how much less time it would take to carry out these activities if you tried to carry out as many tasks in parallel as possible.



Exercise 7.2
Many computer applications involve searching through a set of data and sorting  the data. A number of efficient searching and sorting algorithms have been devised  in order to reduce the runtime of these tedious tasks. In this problem we will  consider how best to parallelize these tasks. 


7.2.1 [10] <7.2> Consider the following binary search algorithm (a classic divide-and-conquer algorithm) that searches for a value X in a sorted N-element array A and returns the index of the matched entry:
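A minimal C sketch of one conventional iterative formulation (the book’s original listing may differ in details such as index conventions) is shown below:

int BinarySearch(int A[], int N, int X)
{
    int lo = 0, hi = N - 1;            /* search window within A[0..N-1] */
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  /* midpoint; avoids (lo+hi) overflow */
        if (A[mid] == X)
            return mid;                /* index of the matched entry */
        else if (A[mid] < X)
            lo = mid + 1;              /* discard the lower half */
        else
            hi = mid - 1;              /* discard the upper half */
    }
    return -1;                         /* X is not present in A */
}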




Assume that you have Y cores on a multi-core processor to run BinarySearch. Assuming that Y is much smaller than N, express the speedup factor you might expect to obtain for values of Y and N. Plot these on a graph.


7.2.2 [5] <7.2> Next, assume that Y is equal to N. How would this affect your  conclusions in your previous answer? If you were tasked with obtaining the best  speedup factor possible (i.e., strong scaling), explain how you might change this  code to obtain it.




Exercise 7.3

Consider the following piece of C code: 
for (j=2;j<1000;j++)
   D[j] = D[j-1]+D[j-2];

The MIPS code corresponding to the above fragment is: 
       DADDIU  r2, r2, 999
loop:  L.D     f1, -16(r1)
       L.D     f2, -8(r1)
       ADD.D   f3, f1, f2
       S.D     f3, 0(r1)
       DADDIU  r1, r1, 8
       BNE     r1, r2, loop

Instructions have the following associated latencies (in cycles):


7.3.1 [10] <7.2> How many cycles does it take for all instructions in a single  iteration of the above loop to execute? 


7.3.2 [10] <7.2> When an instruction in a later iteration of a loop depends upon a data value produced in an earlier iteration of the same loop, we say that there is a loop-carried dependence between iterations of the loop. Identify the loop-carried dependences in the above code. Identify the dependent program variable and assembly-level registers. You can ignore the loop induction variable j.


7.3.3 [10] <7.2> Loop unrolling was described in Chapter 4. Apply loop unrolling to this loop and then consider running this code on a 2-node distributed memory message passing system. Assume that we are going to use message passing as described in Section 7.4, where we introduce a new operation send(x, y) that sends to node x the value y, and an operation receive() that waits for the value being sent to it. Assume that send operations take a cycle to issue (i.e., later instructions on the same node can proceed on the next cycle), but take 10 cycles to be received on the receiving node. Receive instructions stall execution on the node where they are executed until they receive a message. Produce a schedule for the two nodes assuming an unroll factor of 4 for the loop body (i.e., the loop body will appear 4 times). Compute the number of cycles it will take for the loop to run on the message passing system.
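To make the send/receive semantics concrete, here is a toy single-process C emulation of these two hypothetical primitives (the one-slot mailbox standing in for the interconnect is our own device; the 1-cycle issue and 10-cycle delivery latencies exist only as comments here):

#include <stdio.h>

static double mailbox;            /* stands in for the interconnect */
static int    mailbox_full = 0;

static void send(int node, double value)   /* issues in 1 cycle */
{
    (void)node;                   /* only one receiver in this toy model */
    mailbox = value;              /* on real hardware the value would    */
    mailbox_full = 1;             /* arrive 10 cycles later              */
}

static double receive(void)       /* stalls until a message arrives */
{
    while (!mailbox_full)
        ;                         /* busy-wait models the receive stall */
    mailbox_full = 0;
    return mailbox;
}

int main(void)
{
    /* "Node 0" produces a partial result and ships it to "node 1". */
    double partial = 1.0 + 2.0;   /* e.g., D[j-1] + D[j-2] */
    send(1, partial);

    /* "Node 1" blocks until the value arrives, then uses it. */
    printf("received %f\n", receive());
    return 0;
}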


7.3.4 [10] <7.2> The latency of the interconnect network plays a large role in the  efficiency of message passing systems. How fast does the interconnect need to be in  order to obtain any speedup from using the distributed system described in 7.3.3?



Exercise 7.4
Consider the following recursive mergesort algorithm (another classic divide-and-conquer algorithm). Mergesort was first described by John von Neumann in 1945. The basic idea is to divide an unsorted list x of m elements into two sublists of about half the size of the original list. Repeat this operation on each sublist, and continue until we have lists of length 1. Then, starting with sublists of length 1, “merge” the two sublists into a single sorted list.
Mergesort(m)
    var list left, right, result
    if length(m) ≤ 1
        return m
    else
        var middle = length(m) / 2
        for each x in m up to middle
            add x to left
        for each x in m after middle
            add x to right
        left = Mergesort(left)
        right = Mergesort(right)
        result = Merge(left, right)
        return result


The merge step is carried out by the following code:
Merge(left, right)
    var list result
    while length(left) > 0 and length(right) > 0
        if first(left) ≤ first(right)
            append first(left) to result
            left = rest(left)
        else
            append first(right) to result
            right = rest(right)
    if length(left) > 0
        append left to result
    if length(right) > 0
        append right to result
    return result
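For experimentation (e.g., as a sequential baseline when reasoning about the parallel variants below), here is a compact C rendering of the pseudocode above; the array-based formulation and helper names are ours:

#include <string.h>

/* Merge two sorted runs a[0..na-1] and b[0..nb-1] into out[]. */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];    /* drain whichever run */
    while (j < nb) out[k++] = b[j++];    /* is still non-empty  */
}

/* Recursively sort m[0..n-1], using scratch[] of at least n ints. */
void mergesort_rec(int *m, int n, int *scratch)
{
    if (n <= 1)
        return;                          /* a list of length 1 is sorted */
    int middle = n / 2;
    mergesort_rec(m, middle, scratch);
    mergesort_rec(m + middle, n - middle, scratch);
    merge(m, middle, m + middle, n - middle, scratch);
    memcpy(m, scratch, n * sizeof(int)); /* copy merged result back */
}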

7.4.1 [10] <7.2> Assume that you have Y cores on a multi-core processor to run MergeSort. Assuming that Y is much smaller than length(m), express the speedup factor you might expect to obtain for values of Y and length(m). Plot these on a graph.


7.4.2 [10] <7.2> Next, assume that Y is equal to length(m). How would this affect your conclusions in your previous answer? If you were tasked with obtaining the best speedup factor possible (i.e., strong scaling), explain how you might change this code to obtain it.



Exercise 7.5

You are trying to bake 3 blueberry pound cakes. Cake ingredients are as follows:
1 cup butter, softened
1 cup sugar
4 large eggs 
1 teaspoon vanilla extract
1/2 teaspoon salt
1/4 teaspoon nutmeg 
1 1/2 cups flour 
1 cup blueberries

The recipe for a single cake is as follows:
Step 1: Preheat oven to 325°F (160°C). Grease and flour your cake pan.
Step 2: In a large bowl, beat butter and sugar together with a mixer at medium speed until light and fluffy. Add eggs, vanilla, salt, and nutmeg. Beat until thoroughly blended. Reduce mixer speed to low and add flour, 1/2 cup at a time, beating just until blended.
Step 3: Gently fold in blueberries. Spread evenly in prepared baking pan.  Bake for 60 minutes.

7.5.1 [5] <7.2> Your job is to cook 3 cakes as efficiently as possible. Assuming  that you only have one oven large enough to hold one cake, one large bowl, one  cake pan, and one mixer, come up with a schedule to make three cakes as quickly as  possible. Identify the bottlenecks in completing this task.


7.5.2 [5] <7.2> Assume now that you have 3 bowls, 3 cake pans, and 3 mixers. How much faster is the process now that you have these additional resources?


7.5.3 [5] <7.2> Assume now that you have two friends that will help you cook,  and that you have a large oven that can accommodate all three cakes. How will this  change the schedule you arrived at in 7.5.1 above?


7.5.4 [5] <7.2> Compare the cake-making task to computing 3 iterations of a loop on a parallel computer. Identify data-level parallelism and task-level parallelism in the cake-making loop.
 


Exercise 7.6
Matrix multiplication plays an important role in a number of applications. Two matrices can only be multiplied if the number of columns of the first matrix is equal to the number of rows in the second. Let’s assume we have an m × n matrix A and we want to multiply it by an n × p matrix B. We can express their product as an m × p matrix denoted by AB (or A · B). If we assign C = AB, and c_{i,j} denotes the entry in C at position (i, j), then

    c_{i,j} = Σ_{k=1}^{n} a_{i,k} · b_{k,j}

for each element i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can parallelize the computation of C. Assume that matrices are laid out in memory sequentially (i.e., column-major) as follows: a_{1,1}, a_{2,1}, a_{3,1}, a_{4,1}, …, etc.
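As a concrete sequential baseline for the questions below, computing C is the familiar triple loop, sketched here in C (0-indexed; the function name is ours):

/* C = A * B, where A is m x n, B is n x p, and C is m x p.
   Note: these arrays are indexed [row][column]; the exercise's
   memory-layout assumption above is column-major. */
void matmul(int m, int n, int p,
            const double A[m][n], const double B[n][p], double C[m][p])
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < p; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];   /* c_{i,j} accumulation */
            C[i][j] = sum;
        }
}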



7.6.1 [10] <7.3> Assume that we are going to compute C on both a single-core shared-memory machine and a 4-core shared-memory machine. Compute the speedup we would expect to obtain on the 4-core machine, ignoring any memory issues.


7.6.2 [10] <7.3> Repeat 7.6.1, assuming that updates to C incur a cache miss due to false sharing when consecutive elements in a row (i.e., index i) are updated.


7.6.3 [10] <7.3> How would you fix the false sharing issue that can occur?



Exercise 7.7
Consider the following portions of two different programs running at the same time on four processors in a symmetric multi-core processor (SMP). Assume that before this code is run, both x and y are 0.

Core 1: x = 2;
Core 2: y = 2;
Core 3: w = x + y + 1;
Core 4: z = x + y;
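One way to observe the nondeterminism directly is to run the four statements on four threads; a minimal pthreads sketch (the thread scaffolding is ours) follows:

#include <pthread.h>
#include <stdio.h>

volatile int x = 0, y = 0, w = 0, z = 0;

static void *core1(void *a) { (void)a; x = 2;         return NULL; }
static void *core2(void *a) { (void)a; y = 2;         return NULL; }
static void *core3(void *a) { (void)a; w = x + y + 1; return NULL; }
static void *core4(void *a) { (void)a; z = x + y;     return NULL; }

int main(void)
{
    void *(*fn[4])(void *) = { core1, core2, core3, core4 };
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, fn[i], NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    /* Different interleavings produce different (w, z) pairs. */
    printf("w=%d x=%d y=%d z=%d\n", w, x, y, z);
    return 0;
}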


7.7.1 [10] <7.3> What are all the possible resulting values of w, x, y, and z? For  each possible outcome, explain how we might arrive at those values. You will need  to examine all possible interleavings of instructions.


7.7.2 [5] <7.3> How could you make the execution more deterministic so that  only one set of values is possible?
 


Exercise 7.8
In a CC-NUMA shared memory system, CPUs and physical memory are divided across compute nodes. Each CPU has local caches. To maintain the coherency of memory, we can add status bits to each cache block, or we can introduce dedicated memory directories. Using directories, each node provides a dedicated hardware table for managing the status of every block of memory that is “local” to that node. The size of each directory is a function of the size of the CC-NUMA shared space (an entry is provided for each block of memory local to a node). If we store coherency information in the cache, we add this information to every cache in the system (i.e., the amount of storage space is a function of the number of cache lines available in all caches).

In the following problems, assume that all nodes have the same number of CPUs and the same amount of memory (i.e., CPUs and memory are evenly divided between the nodes of the CC-NUMA machine).


7.8.1 [15] <7.3> If we have P CPUs in the system, with T nodes in the CC-NUMA system, and each CPU caches C memory blocks, and we maintain a byte of coherency information for each cache line, provide an equation that expresses the amount of memory that will be present in the caches of a single node of the system to maintain coherency. Do not include the actual data storage space consumed in this equation; only account for space used to store coherency information.


7.8.2 [15] <7.3> If each directory entry maintains a byte of information for each CPU, our CC-NUMA system has S memory blocks, and the system has T nodes, provide an equation that expresses the amount of memory that will be present in each directory.



Exercise 7.9
Consider the CC-NUMA system described in Exercise 7.8, and assume that the system has 4 nodes, each with a single-core CPU (each CPU has its own L1 and L2 data caches). The L1 data cache is write-through, while the L2 data cache is write-back. Assume that the system has a workload where one CPU writes to an address, and the other CPUs all read the data that is written. Also assume that the address written to is initially only in memory and not in any local cache. Also, after the write, assume that the updated block is only present in the L1 cache of the core performing the write.


7.9.1 [10] <7.3> For a system that maintains coherency using cache-based block status, describe the inter-node traffic that will be generated as each of the 4 cores writes to a unique address, after which each address written to is read by each of the remaining 3 cores.


7.9.2 [10] <7.3> For a directory-based coherency mechanism, describe the inter-node traffic generated when executing the same code pattern.



7.9.3 [20] <7.3> Repeat 7.9.1 and 7.9.2 assuming that each CPU is now a multicore CPU, with 4 cores per CPU, each maintaining an L1 data cache, but provided  with a shared L2 data cache across the 4 cores. Each core will perform the write,  followed by reads by each of the 15 other cores.


7.9.4 [10] <7.3> Consider the system described in 7.9.3, now assuming that each core writes to a byte stored in the same cache block. How does this impact bus traffic? Explain.




Exercise 7.10
On a CC-NUMA system, the cost of accessing non-local memory can limit our ability to utilize multiprocessing effectively. The following table shows the costs associated with accessing data in local memory versus non-local memory, and the locality of our application expressed as the proportion of accesses that are local.



Answer the following questions. Assume that memory accesses are evenly distributed through the application. Also, assume that only a single memory operation can be active during any cycle. State all assumptions about the ordering of local versus non-local memory operations.


7.10.1 [10] <7.3> If on average we need to access memory once every 75 cycles, what is the impact on our application?


7.10.2 [10] <7.3> If on average we need to access memory once every 50 cycles, what is the impact on our application?


7.10.3 [10] <7.3> If on average we need to access memory once every 100 cycles, what is the impact on our application?




Exercise 7.11
The dining philosophers’ problem is a classic problem of synchronization and concurrency. The general problem is stated as philosophers sitting at a round table doing one of two things: eating or thinking. When they are eating, they are not thinking, and when they are thinking, they are not eating. There is a bowl of pasta in the center. A fork is placed in between each pair of philosophers. The result is that each philosopher has one fork to her left and one fork to her right. Given the nature of eating pasta, the philosopher needs two forks to eat, and can only use the forks on her immediate left and right. The philosophers do not speak to one another.
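For reference, the naive “pick up left, then right” protocol, encoded in pthreads with one mutex per fork (the encoding is ours), is the version whose symmetric lock ordering makes it unsafe:

#include <pthread.h>
#include <stdio.h>

#define N 5
static pthread_mutex_t fork_lock[N];             /* one mutex per fork */

static void *philosopher(void *arg)
{
    int id    = (int)(long)arg;
    int left  = id;                              /* fork on the left  */
    int right = (id + 1) % N;                    /* fork on the right */
    for (;;) {
        /* think ... */
        pthread_mutex_lock(&fork_lock[left]);    /* pick up left  */
        pthread_mutex_lock(&fork_lock[right]);   /* pick up right */
        printf("philosopher %d eats\n", id);     /* eat ... */
        pthread_mutex_unlock(&fork_lock[right]);
        pthread_mutex_unlock(&fork_lock[left]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&fork_lock[i], NULL);
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);   /* never reached: philosophers loop */
    return 0;
}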


7.11.1 [10] <7.4> Describe the scenario where none of the philosophers ever eats (i.e., starvation). What is the sequence of events that leads up to this problem?


7.11.2 [10] <7.4> Describe how we can solve this problem by introducing the concept of priority. Can we guarantee that we will treat all the philosophers fairly? Explain.
Now assume we hire a waiter who is in charge of assigning forks to philosophers.  Nobody can pick up a fork until the waiter says they can. The waiter has global knowledge of all forks. Further, if we impose the policy that philosophers will  always request to pick up their left fork before requesting to pick up their right  fork, then we can guarantee to avoid deadlock.


7.11.3 [10] <7.4>  We can implement requests to the waiter as either a queue of  requests or as a periodic retry of a request. With a queue, requests are handled in  the order they are received. The problem with using the queue is that we may not  always be able to service the philosopher whose request is at the head of the queue  (due to the unavailability of resources). Describe a scenario with 5 philosophers  where a queue is provided, but service is not granted even though there are forks  available for another philosopher (whose request is deeper in the queue) to eat.
 

7.11.4 [10] <7.4> If we implement requests to the waiter by periodically repeating our request until the resources become available, will this solve the problem  described in 7.11.3? Explain. 



Exercise 7.12
Consider the following three CPU organizations:

CPU SS: A 2-core superscalar microprocessor that provides out-of-order issue capabilities on 2 function units (FUs). Only a single thread can run on each core at a time.

CPU MT: A fine-grained multithreaded processor that allows instructions from 2 threads to be run concurrently (i.e., there are two functional units), though only instructions from a single thread can be issued on any cycle.

CPU SMT: An SMT processor that allows instructions from 2 threads to be run concurrently (i.e., there are two functional units), and instructions from either or both threads can be issued to run on any cycle.

Assume we have two threads X and Y to run on these CPUs that include the following operations:



Assume all instructions take a single cycle to execute unless noted otherwise or they  encounter a hazard.



7.12.1 [10] <7.5> Assume that you have 1 SS CPU. How many cycles will it take  to execute these two threads? How many issue slots are wasted due to hazards?


7.12.2 [10] <7.5> Now assume you have 2 SS CPUs. How many cycles will it take  to execute these two threads? How many issue slots are wasted due to hazards?


7.12.3 [10] <7.5> Assume that you have 1 MT CPU. How many cycles will it take  to execute these two threads? How many issue slots are wasted due to hazards?




Exercise 7.13

Virtualization software is being aggressively deployed to reduce the costs of managing today’s high-performance servers. Companies like VMware, Microsoft, and IBM have all developed a range of virtualization products. The general concept, described in Chapter 5, is that a hypervisor layer can be introduced between the hardware and the operating system to allow multiple operating systems to share the same physical hardware. The hypervisor layer is then responsible for allocating CPU and memory resources, as well as handling services typically handled by the operating system (e.g., I/O).

Virtualization provides an abstract view of the underlying hardware to the hosted operating system and application software. This will require us to rethink how multi-core and multiprocessor systems will be designed in the future to support the sharing of CPUs and memories by a number of operating systems concurrently.


7.13.1 [30] <7.5> Select two hypervisors on the market today, and compare  and contrast how they virtualize and manage the underlying hardware (CPUs and  memory).  


7.13.2 [15] <7.5> Discuss what changes may be necessary in future multi-core CPU platforms in order to better match the resource demands placed on these systems. For instance, can multi-threading play an effective role in alleviating the competition for computing resources?



Exercise 7.14
We would like to execute the loop below as efficiently as possible. We have two different machines, a MIMD machine and a SIMD machine. 
for (i=0; i < 2000; i++)
  for (j=0; j<3000; j++)
      X_array[i][j] = Y_array[j][i] + 200;



7.14.1 [10] <7.6> For a 4-CPU MIMD machine, show the sequence of MIPS instructions that you would execute on each CPU. What is the speedup for this MIMD machine?



7.14.2 [20] <7.6> For an 8-wide SIMD machine (i.e., 8 parallel SIMD functional units), write an assembly program using your own SIMD extensions to MIPS to execute the loop. Compare the number of instructions executed on the SIMD machine to the MIMD machine.
 


Exercise 7.15
A systolic array is an example of an MISD machine. A systolic array is a pipelined network, or “wavefront,” of data processing elements. Each of these elements does not need a program counter since execution is triggered by the arrival of data. Clocked systolic arrays compute in “lock-step,” with each processor undertaking alternate compute and communication phases.


7.15.1 [10] <7.6> Consider proposed implementations of a systolic array (you can find these on the Internet or in technical publications). Then attempt to program the loop provided in Exercise 7.14 using this MISD model. Discuss any difficulties you encounter.


7.15.2 [10] <7.6> Discuss the similarities and differences between an MISD and a SIMD machine. Answer this question in terms of data-level parallelism.




Exercise 7.16
Assume we want to execute the DAXPY loop shown on page 651 in MIPS assembly on the NVIDIA 8800 GTX GPU described in this chapter. In this problem, we will assume that all math operations are performed on single-precision floating-point numbers (we will rename the loop SAXPY). Assume that instructions take the following number of cycles to execute.
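For reference, SAXPY is the single-precision scaled vector accumulation Y = a·X + Y; in C the loop is simply (standard formulation; the listing on page 651 may differ in loop bounds):

/* SAXPY: y[i] = a * x[i] + y[i] over n single-precision elements. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}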



7.16.1 [20] <7.7> Describe how you will construct warps for the SAXPY loop to exploit the 8 cores provided in a single multiprocessor.




Exercise 7.17

Download the CUDA Toolkit and SDK from http://www.nvidia.com/object/cuda_get.html. Make sure to use the “emurelease” (Emulation Mode) version of the code (you will not need actual NVIDIA hardware for this assignment). Build the example programs provided in the SDK, and confirm that they run on the emulator.

7.17.1 [90] <7.7> Using the “template” SDK sample as a starting point, write a  CUDA program to perform the following vector operations: 
1) a − b (vector-vector subtraction)
2) a · b (vector dot product)

The dot product of two vectors a = [a_1, a_2, …, a_n] and b = [b_1, b_2, …, b_n] is defined as:

    a · b = Σ_{i=1}^{n} a_i · b_i = a_1b_1 + a_2b_2 + … + a_nb_n
Submit code for each program that demonstrates each operation and verifies the  correctness of the results.
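To verify correctness, it is convenient to compare the CUDA output against plain-C reference versions of the two operations (the function names are ours):

/* Element-wise vector subtraction: c = a - b. */
void vec_sub_ref(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] - b[i];
}

/* Dot product: returns the sum of a_i * b_i. */
float dot_ref(int n, const float *a, const float *b)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}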



7.17.2 [90] <7.7> If you have GPU hardware available, complete a performance analysis of your program, examining the computation time for the GPU and a CPU version of your program for a range of vector sizes. Explain any results you see.




Exercise 7.18

AMD has recently announced that they will be integrating a graphics processing unit with their x86 cores in a single package, though with different clocks for each of the cores. This is an example of a heterogeneous multiprocessor system which we expect to see produced commercially in the near future. One of the key design points will be to allow for fast data communication between the CPU and the GPU. Presently communications must be performed between discrete CPU and GPU chips, but this is changing in AMD’s Fusion architecture. Currently the plan is to use multiple (at least 16) PCI Express channels to facilitate intercommunication. Intel is also jumping into this arena with their Larrabee chip. Intel is considering using their QuickPath interconnect technology.


7.18.1 [25] <7.7> Compare the bandwidth and latency associated with these two  interconnect technologies. 




Exercise 7.19
Refer to Figure 7.7b, which shows an n-cube interconnect topology of order 3 that interconnects 8 nodes. One attractive feature of an n-cube interconnection network topology is its ability to sustain broken links and still provide connectivity.
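For the comparison asked for in 7.19.2, it helps to recall the standard counts: an n-cube has 2^n nodes, each of degree n, hence n · 2^(n-1) links, while a fully-connected network over the same 2^n nodes has 2^(n-1) · (2^n - 1) links. In C:

/* Link counts for the two topologies over 2^n nodes. */
unsigned long ncube_links(unsigned n)
{
    return n == 0 ? 0 : (unsigned long)n << (n - 1);   /* n * 2^(n-1) */
}

unsigned long full_links(unsigned n)
{
    unsigned long nodes = 1UL << n;
    return nodes * (nodes - 1) / 2;    /* every pair directly connected */
}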



7.19.1 [10] <7.8> Develop an equation that computes how many links in the n-cube (where n is the order of the cube) can fail while we can still guarantee that an unbroken path will exist to connect any two nodes in the n-cube.



7.19.2 [10] <7.8> Compare the resiliency to failure of the n-cube to that of a fully-connected interconnection network. Plot a comparison of reliability as a function of the number of added links for the two topologies.



Exercise 7.20
Benchmarking is a field of study that involves identifying representative workloads to run on specific computing platforms in order to be able to objectively compare the performance of one system to another. In this exercise we will compare two classes of benchmarks: the Whetstone CPU benchmark and the PARSEC Benchmark suite. Select one program from PARSEC. All programs should be freely available on the Internet. Consider running multiple copies of Whetstone versus running the PARSEC Benchmark on any of the systems described in Section 7.11.


7.20.1 [60] <7.9> What is inherently different between these two classes of workload when run on these multi-core systems?


7.20.2 [60] <7.9, 7.10> In terms of the Roofline Model, how dependent will the  results you obtain when running these benchmarks be on the amount of sharing  and synchronization present in the workload used?


Exercise 7.21
When performing computations on sparse matrices, latency in the memory hierarchy becomes much more of a factor. Sparse matrices lack the spatial locality in the data stream typically found in dense matrix operations. As a result, new matrix representations have been proposed.

One of the earliest sparse matrix representations is the Yale Sparse Matrix Format. It stores an initial sparse m × n matrix M in row form using three one-dimensional arrays. Let R be the number of nonzero entries in M. We construct an array A of length R that contains all nonzero entries of M (in left-to-right, top-to-bottom order). We also construct a second array IA of length m + 1 (i.e., one entry per row, plus one). IA(i) contains the index in A of the first nonzero element of row i. Row i of the original matrix extends from A(IA(i)) to A(IA(i+1)−1). The third array, JA, contains the column index of each element of A, so it also is of length R.
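To make the format concrete, here is the Yale representation of a small 3 × 4 example (deliberately not the exercise’s matrix X; indices are 1-based, as in the description above):

M = [ 1 0 2 0 ]
    [ 0 0 0 0 ]
    [ 0 3 0 4 ]

R  = 4                      (number of nonzero entries)
A  = [1, 2, 3, 4]           (nonzeros, left-to-right, top-to-bottom)
IA = [1, 3, 3, 5]           (row i spans A(IA(i)) .. A(IA(i+1)-1);
                             row 2 is empty, so IA(2) = IA(3))
JA = [1, 3, 2, 4]           (column index of each entry of A)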


7.21.1 [15] <7.9> Consider the sparse matrix X below and write C code that would store this matrix in Yale Sparse Matrix Format.


Row 1 [1, 2, 0, 0, 0, 0]
Row 2 [0, 0, 1, 1, 0, 0]
Row 3 [0, 0, 0, 0, 9, 0]
Row 4 [2, 0, 0, 0, 0, 2]
Row 5 [0, 0, 3, 3, 0, 7]
Row 6 [1, 3, 0, 0, 0, 1]
 

7.21.2 [10] <7.9> In terms of storage space, assuming that each element in matrix X is a single-precision floating-point number, compute the amount of storage used to store the matrix above in Yale Sparse Matrix Format.


7.21.3 [15] <7.9> Perform matrix multiplication of Matrix X by Matrix Y shown  below. 
[2, 4, 1, 99, 7, 2]
Put this computation in a loop, and time its execution. Make sure to increase the number of times this loop is executed to get good resolution in your timing measurement. Compare the runtime using a naïve representation of the matrix with the runtime using the Yale Sparse Matrix Format.


7.21.4 [15] <7.9> Can you find a more efficient sparse matrix representation (in  terms of space and computational overhead)?




Exercise 7.22
In future systems, we expect to see heterogeneous computing platforms constructed out of heterogeneous CPUs. We have begun to see some appear in the embedded processing market in systems that contain both floating point DSPs and a microcontroller CPU in a multichip module package. Assume that you have three classes of CPU:

CPU A: A moderate-speed multi-core CPU (with a floating point unit) that can execute multiple instructions per cycle.

CPU B: A fast single-core integer CPU (i.e., no floating point unit) that can execute a single instruction per cycle.

CPU C: A slow vector CPU (with floating point capability) that can execute multiple copies of the same instruction per cycle.

Assume that our processors run at the following frequencies:


CPU A can execute 2 instructions per cycle, CPU B can execute 1 instruction per  cycle, and CPU C can execute 8 instructions (though the same instruction) per  cycle. Assume all operations can complete execution in a single cycle of latency  without any hazards.


All three CPUs have the ability to perform integer arithmetic, though CPU B cannot perform floating point arithmetic. CPUs A and B have an instruction set similar to a MIPS processor. CPU C can only perform floating point add and subtract operations, as well as memory loads and stores. Assume all CPUs have access to shared memory and that synchronization has zero cost.

The task at hand is to compare two matrices X and Y that each contain 1024 × 1024 floating point elements. The output should be a count of the number of indices where the value in X was larger than or equal to the value in Y.
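As a scalar reference point for the partitioning question below, the whole task is a single counting loop, sketched in C (the names are ours):

/* Count indices where X[i][j] >= Y[i][j] over 1024 x 1024 floats. */
long count_ge(const float X[1024][1024], const float Y[1024][1024])
{
    long count = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            if (X[i][j] >= Y[i][j])
                count++;
    return count;
}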


7.22.1 [10] <7.11> Describe how you would partition the problem on the 3 different CPUs to obtain the best performance.


7.22.2 [10] <7.11> What kind of instruction would you add to the vector CPU C  to obtain better performance?



Exercise 7.23
Assume a quad-core computer system can process database queries at a steady-state rate of requests per second. Also assume that each transaction takes, on average, a fixed amount of time to process. The following table shows pairs of transaction latency and processing rate.



For each of the pairs in the table, answer the following questions:

7.23.1 [10] <7.11> On average, how many requests are being processed at any given instant?


7.23.2 [10] <7.11> If we move to an 8-core system, ideally, what will happen to the system throughput (i.e., how many queries/second will the computer process)?


7.23.3 [10] <7.11> Discuss why we rarely obtain this kind of speedup by simply increasing the number of cores.