Computer Organization and Design - Chapter 7 - Book solutions - 4th edition - Hennessy, Patterson

Exercise 7.1

First, write down a list of your daily activities that you typically do on a weekday.  For instance, you might get out of bed, take a shower, get dressed, eat breakfast, dry  your hair, brush your teeth, etc. Make sure to break down your list so you have a  minimum of 10 activities.


7.1.1 [5] <7.2> Now consider which of these activities is already exploiting some  form of parallelism (e.g., brushing multiple teeth at the same time, versus one at  a time, carrying one book at a time to school, versus loading them all into your  backpack and then carry them “in parallel”). For each of your activities, discuss if  they are already working in parallel, but if not, why they are not.


7.1.2 [5] <7.2> Next, consider which of the activities could be carried out concurrently (e.g., eating breakfast and listening to the news). For each of your activities, describe which other activity could be paired with this activity.


7.1.3 [5] <7.2> For 7.1.2, what could we change about current systems (e.g.,  showers, clothes, TVs, cars) so that we could perform more tasks in parallel? 


7.1.4 [5] <7.2> Estimate how much shorter time it would take to carry out these  activities if you tried to carry out as many tasks in parallel as possible.



Exercise 7.2
Many computer applications involve searching through a set of data and sorting  the data. A number of efficient searching and sorting algorithms have been devised  in order to reduce the runtime of these tedious tasks. In this problem we will  consider how best to parallelize these tasks. 


7.2.1 [10] <7.2> Consider the following binary search algorithm (a classic divide  and conquer algorithm) that searches for a value X in an sorted N­element array A  and returns the index of matched entry: 




Assume that you have Y cores on a multi­core processor to run BinarySearch.  Assuming that Y is much smaller than N, express the speedup factor you might  expect to obtain for values of Y and N. Plot these on a graph.


7.2.2 [5] <7.2> Next, assume that Y is equal to N. How would this affect your  conclusions in your previous answer? If you were tasked with obtaining the best  speedup factor possible (i.e., strong scaling), explain how you might change this  code to obtain it.




Exercise 7.3

Consider the following piece of C code: 
for (j=2;j<1000;j++)
   D[j] = D[j-1]+D[j-2];

The MIPS code corresponding to the above fragment is: 
       DADDIU  r2,r2,999
 loop:  L.D     f1, -16(f1)
       L.D     f2, -8(f1)
       ADD.D   f3, f1, f2
        S.D     f3, 0(r1)
       DADDIU  r1, r1, 8
       BNE     r1, r2, loop

Instructions have the following associated latencies (in cycles):


7.3.1 [10] <7.2> How many cycles does it take for all instructions in a single  iteration of the above loop to execute? 


7.3.2 [10] <7.2> When an instruction in a later iteration of a loop depends upon  a data value produced in an earlier iteration of the same loop, we say that there is  a loop carried dependence between iterations of the loop. Identify the loop­ carried  dependences in the above code. Identify the dependent program variable and  assembly­level registers. You can ignore the loop induction variable j.


7.3.3 [10] <7.2> Loop unrolling was described in Chapter 4. Apply loop unrolling to this loop and then consider running this code on a 2­node distributed memory message passing system. Assume that we are going to use message passing as  described in Section 7.4, where we introduce a new operation send (x, y) that sends to  node x the value y, and an operation receive( ) that waits for the value being sent to it.  Assume that send operations take a cycle to issue (i.e., later instructions on the same  node can proceed on the next cycle), but take 10 cycles be received on the receiving  node. Receive instructions stall execution on the node where they are executed until  they receive a message. Produce a schedule for the two nodes assuming an unroll  factor of 4 for the loop body (i.e., the loop body will appear 4 times). Compute the  number of cycles it will take for the loop to run on the message passing system.


7.3.4 [10] <7.2> The latency of the interconnect network plays a large role in the  efficiency of message passing systems. How fast does the interconnect need to be in  order to obtain any speedup from using the distributed system described in 7.3.3?



Exercise 7.4
Consider the following recursive mergesort algorithm (another classic divide and  conquer algorithm). Mergesort was first described by John Von Neumann in 1945.  The basic idea is to divide an unsorted list x of m elements into two sublists of  about half the size of the original list. Repeat this operation on each sublist, and  continue until we have lists of size 1 in length. Then starting with sublists of length  1, “merge” the two sublists into a single sorted list.
Mergesort(m)
    var list left, right, result
    if length(m) ≤ 1
        return m
 else      
var middle = length(m) / 2
        for each x in m up to middle
            add x to left
        for each x in m after middle
            add x to right
        left = Mergesort(left)
        right = Mergesort(right)
        result = Merge(left, right)
        return result


The merge step is carried out by the following code:
 Merge(left,right)
    var list result
    while length(left) > 0 and length(right) > 0
        if first(left) ≤ first(right)
            append first(left) to result
            left = rest(left)
else          
append first(right) to result
            right = rest(right) 
 if length(left) > 0
         append rest(left) to result  
if length(right) > 0
         append rest(right) to result
    return result

7.4.1 [10] <7.2> Assume that you have Y cores on a multi­core processor to run  MergeSort. Assuming that Y is much smaller than length(m), express the speedup  factor you might expect to obtain for values of Y and length(m). Plot these on a  graph.


7.4.2 [10] <7.2> Next, assume that Y is equal to length(m). How would this affect  your conclusions your previous answer? If you were tasked with obtaining the best  speedup factor possible (i.e., strong scaling), explain how you might change this  code to obtain it.



Exercise 7.5

You are trying to bake 3 blueberry pound cakes. Cake ingredients are as follows:
1 cup butter, softened
1 cup sugar
4 large eggs 
1 teaspoon vanilla extract
1/2 teaspoon salt
1/4 teaspoon nutmeg 
1 1/2 cups flour 
1 cup blueberries

The recipe for a single cake is as follows:
 Step 1: Preheat oven to 325°F (160°C). Grease and flour your cake pan.     
Step 2: In large bowl, beat together with a mixer butter and sugar at  medium speed until light and fluffy. Add eggs, vanilla, salt and nutmeg.  Beat until thoroughly blended. Reduce mixer speed to low and add flour,  1/2 cup at a time, beating just until blended.     
Step 3: Gently fold in blueberries. Spread evenly in prepared baking pan.  Bake for 60 minutes.

7.5.1 [5] <7.2> Your job is to cook 3 cakes as efficiently as possible. Assuming  that you only have one oven large enough to hold one cake, one large bowl, one  cake pan, and one mixer, come up with a schedule to make three cakes as quickly as  possible. Identify the bottlenecks in completing this task.


7.5.2 [5] <7.2> Assume now that you have three bowls, 3 cake pans and 3 mixers.  How much faster is the process now that you have additional resources?


7.5.3 [5] <7.2> Assume now that you have two friends that will help you cook,  and that you have a large oven that can accommodate all three cakes. How will this  change the schedule you arrived at in 7.5.1 above?


7.5.4 [5] <7.2> Compare the cake­making task to computing 3 iterations of a loop  on a parallel computer. Identify data­level parallelism and task­level  parallelism in  the cake­making loop. 
 


Exercise 7.6
Matrix multiplication plays an important role in a number of applications. Two  matrices can only be multiplied if the number of columns of the first matrix is  equal to the number of rows in the second.  Let’s assume we have an m × n matrix A and we want to multiply it by an n × p  matrix B. We can express their product as an m × p matrix denoted by AB (or A · B).  If we assign C = AB, and ci,j denotes the entry in C at position (i, j), then



for each element i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can  parallelize the computation of C. Assume that matrices are laid out in memory  sequentially as follows: a1,1, a2,1, a3,1, a4,1, …, etc..



7.6.1 [10] <7.3> Assume that we are going to compute C on both a single core  shared memory machine and a 4­core shared­memory machine. Compute the  speedup we would expect to obtain on the 4­core machine, ignoring any memory  issues.


7.6.2 [10] <7.3> Repeat 7.6.1, assuming that updates to C incur a cache miss due  to false sharing when consecutive elements are in a row (i.e., index i) are updated. 


7.6.3 [10] <7.3> How would you fix the false sharing issue that can occur?



Exercise 7.7
Consider the following portions of two different programs running at the same  time on four processors in a symmetric multi­core processor (SMP). Assume that  before this code is run, both x and y are 0. 

Core 1: x = 2;
Core 2: y = 2;
Core 3: w = x + y + 1;
Core 4: z = x + y;


7.7.1 [10] <7.3> What are all the possible resulting values of w, x, y, and z? For  each possible outcome, explain how we might arrive at those values. You will need  to examine all possible interleavings of instructions.


7.7.2 [5] <7.3> How could you make the execution more deterministic so that  only one set of values is possible?
 


Exercise 7.8
In a CC­NUMA shared memory system, CPUs and physical memory are divided  across compute nodes. Each CPU has local caches. To maintain the coherency of  memory, we can add status bits into each cache block, or we can introduce dedicated memory directories. Using directories, each node provides a dedicated hardware table for managing the status of every block of memory that is “local” to that  node. The size of each directory is a function of the size of the CC­NUMA shared  space (an entry is provided for each block of memory local to a node). If we store  coherency information in the cache, we add this information to every cache in  every system (i.e., the amount of storage space is a function of the number of cache  lines available in all caches).

In the following proplems, assume that all nodes have the same number of CPUs  and the same amount memory (i.e., CPUs and memory are evenly divided between  the nodes of the CC­NUMA machine).


7.8.1 [15] <7.3> If we have P CPU in the system, with T nodes in the CCNUMA system, with each CPU having C memory blocks stored in it, and we  maintain a byte of coherency information in each cache line, provide an equation  that expresses the amount of memory that will be present in the caches in a single  node of the system to maintain coherency. Do not include the actual data storage  space consumed in this equation, only account for space used to store coherency  information.


7.8.2 [15] <7.3>If each directory entry maintains a byte of information for each  CPU, if our CC­NUMA system has S memory blocks, and the system has T nodes,  provide an equation that expresses the amount of memory that will be present in  each directory. 



Exercise 7.9
Considering the CC­NUMA system described in the Exercise 7.8, assume that the  system has 4 nodes, each with a single­core CPU (each CPU has its own L1 data  cache and L2 data cache). The L1 data cache is store­through, though the L2 data  cache is write­back. Assume that system has a workload where one CPU writes to  an address, and the other CPUs all read that data that is written. Also assume that  the address written to is initially only in memory and not in any local cache. Also,  after the write, assume that the updated block is only present in the L1 cache of the  core performing the write.


7.9.1 [10] <7.3>  For a system that maintains coherency using cache­based block  status, describe the inter­node traffic that will be generated as each of the 4 cores  writes to a unique address, after which each address written to is read from by each  of the remaining 3 cores.


7.9.2 [10] <7.3>  For a directory­based coherency mechanism, describe the internode traffic generated when executing the same code pattern. 



7.9.3 [20] <7.3> Repeat 7.9.1 and 7.9.2 assuming that each CPU is now a multicore CPU, with 4 cores per CPU, each maintaining an L1 data cache, but provided  with a shared L2 data cache across the 4 cores. Each core will perform the write,  followed by reads by each of the 15 other cores.


7.9.4 [10] <7.3> Consider the system described in 7.9.3, now assuming that  each core writes to byte stored in the same cache block. How does this impact bus traffic? Explain.




Exercise 7.10
On a CC­NUMA system, the cost of accessing non­local memory can limit our  ability to utilize multiprocessing effectively. The following table shows the costs  associated with access data in local memory versus non­local memory and the  locality of our application expresses as the proportion of access that are local.



Answer the following questions. Assume that memory accesses are evenly distributed through the application. Also, assume that only a single memory operation  can be active during any cycle. State all assumptions about the ordering of local  versus non­local memory operations.


7.10.1 [10] <7.3> If on average we need to access memory once every 75 cycles,  what is impact on our application? 


7.10.2 [10] <7.3> If on average we need to access memory once every 50 cycles,  what is impact on our application?


7.10.3 [10] <7.3> If on average we need to access memory once every 100 cycles,  what is impact on our application?




Exercise 7.11
The dining philosopher’s problem is a classic problem of synchronization and concurrency. The general problem is stated as philosophers sitting at a round table  doing one of two things: eating or thinking. When they are eating, they are not  thinking, and when they are thinking, they are not eating. There is a bowl of pasta  in the center. A fork is placed in between each philosopher. The result is that each  philosopher has one fork to her left and one fork to her right. Given the nature of  eating pasta, the philosopher needs two forks to eat, and can only use the forks on  her immediate left and right. The philosophers do not speak to one another.


7.11.1 [10] <7.4> Describe the scenario where none of philosophers ever eats (i.e.,  starvation). What is the sequence of events that happen that lead up to this problem?


7.11.2 [10] <7.4> Describe how we can solve this problem by introducing the  concept of a priority? But can we guarantee that we will treat all the philosophers  fairly? Explain.
Now assume we hire a waiter who is in charge of assigning forks to philosophers.  Nobody can pick up a fork until the waiter says they can. The waiter has global knowledge of all forks. Further, if we impose the policy that philosophers will  always request to pick up their left fork before requesting to pick up their right  fork, then we can guarantee to avoid deadlock.


7.11.3 [10] <7.4>  We can implement requests to the waiter as either a queue of  requests or as a periodic retry of a request. With a queue, requests are handled in  the order they are received. The problem with using the queue is that we may not  always be able to service the philosopher whose request is at the head of the queue  (due to the unavailability of resources). Describe a scenario with 5 philosophers  where a queue is provided, but service is not granted even though there are forks  available for another philosopher (whose request is deeper in the queue) to eat.
 

7.11.4 [10] <7.4> If we implement requests to the waiter by periodically repeating our request until the resources become available, will this solve the problem  described in 7.11.3? Explain. 



Exercise 7.12
Consider the following three CPU organizations: CPU SS: A 2­core superscalar microprocessor that provides out­of­order issue  capabilities on 2 function units (FUs).  Only a single thread can run on each core  at a time. CPU MT: A fine­grained multithreaded processor that allows instructions from 2  threads to be run concurrently (i.e., there are two functional units), though only  instructions from a single thread can be issued on any cycle. CPU SMT: An SMT processor that allows instructions from 2 threads to be run  concurrently (i.e., there are two functional units), and instructions from either or  both threads can be issued to run on any cycle. Assume we have two threads X and Y to run on these CPUs that include the    following operations:



Assume all instructions take a single cycle to execute unless noted otherwise or they  encounter a hazard.



7.12.1 [10] <7.5> Assume that you have 1 SS CPU. How many cycles will it take  to execute these two threads? How many issue slots are wasted due to hazards?


7.12.2 [10] <7.5> Now assume you have 2 SS CPUs. How many cycles will it take  to execute these two threads? How many issue slots are wasted due to hazards?


7.12.3 [10] <7.5> Assume that you have 1 MT CPU. How many cycles will it take  to execute these two threads? How many issue slots are wasted due to hazards?




Exercise 7.13

Virtualization software is being aggressively deployed to reduce the costs of managing today’s high performance servers. Companies like VMWare, Microsoft and  IBM have all developed a range of virtualization products. The general concept,  described in Chapter 5, is that a hypervisor layer can be introduced between the  hardware and the operating system to allow multiple operating systems to share  the same physical hardware. The hypervisor layer is then responsible for allocating  CPU and memory resources, as well as handling services typically handled by the  operating system (e.g., I/O). Virtualization provides an abstract view of the underlying hardware to the hosted  operating system and application software. This will require us to rethink how  multi­core and multiprocessor systems will be designed in the future to support  the sharing of CPUs and memories by a number of operating systems concurrently.


7.13.1 [30] <7.5> Select two hypervisors on the market today, and compare  and contrast how they virtualize and manage the underlying hardware (CPUs and  memory).  


7.13.2 [15] <7.5> Discuss what changes may be necessary in future multi­core  CPU platforms in order to better match the resource demands placed on these  systems. For instance, can multi­threading play an effective role in alleviating the  competition for computing resources?



Exercise 7.14
We would like to execute the loop below as efficiently as possible. We have two different machines, a MIMD machine and a SIMD machine. 
for (i=0; i < 2000; i++)
  for (j=0; j<3000; j++)
      X_array[i][j] = Y_array[j][i] + 200;



7.14.1 [10] <7.6> For a 4 CPU MIMD machine, show the sequence of MIPS  instructions that you would execute on each CPU. What is the speedup for this  MIMD machine?



7.14.2 [20] <7.6> For an 8­wide SIMD machine (i.e., 8 parallel SIMD functional  units), write an assembly program in using your own SIMD extensions to MIPS  to execute the loop. Compare the number of instructions executed on the SIMD  machine to the MIMD machine.
 


Exercise 7.15
A systolic array is an example of an MISD machine. A systolic array is a pipeline  network or “wavefront” of data processing elements. Each of these elements does  not need a program counter since execution is triggered by the arrival of data.  Clocked systolic arrays compute in “lock­step” with each processor undertaking  alternate compute and communication phases. 


7.15.1 [10] <7.6> Consider proposed implementations of a systolic array (you  can find these in on the Internet or in technical publications). Then attempt to  program the loop provided in Exercise 7.14 using this MISD model. Discuss any  difficulties you encounter.


7.15.2 [10] <7.6> Discuss the similarities and differences between an MISD and  SIMD machine. Answer this question in terms of data­level parallelism.




Exercise 7.16
Assume we want to execute the DAXP loop show on page 651 in MIPS assembly  on the NVIDIA 8800 GTX GPU described in this Chapter. In this problem, we will  assume that all math operations are performed on single­precision floating­point  numbers (we will rename the loop SAXP). Assume that instructions take the following number of cycles to execute.



7.16.1 [20] <7.7> Describe how you will constructs warps for the SAXP loop to  exploit the 8 cores provided in a single multiprocessor.




Exercise 7.17

Download the CUDA Toolkit and SDK from http://www.nvidia.com/object/cuda_ get.html. Make sure to use the “emurelease” (Emulation Mode) version of the code  (you will not need actual NVIDIA hardware for this assignment). Build the example programs provided in the SDK, and confirm that they run on the emulator.

7.17.1 [90] <7.7> Using the “template” SDK sample as a starting point, write a  CUDA program to perform the following vector operations: 
1)  a - b (vector­vector subtraction) 2)    a · b (vector dot product) The dot product of two vectors a = [a1, a2, … , an] and b = [b1, b2, … , bn] is defined as:




Submit code for each program that demonstrates each operation and verifies the  correctness of the results.



7.17.2 [90] <7.7> If you have GPU hardware available, complete a performance  analysis your program, examining the computation time for the GPU and a CPU  version of your program for a range of vector sizes. Explain any results you see.




Exercise 7.18

AMD has recently announced that they will be integrating a graphics processing  unit with their X86 cores in a single package, though with different clocks for each  of the cores. This is an example of a heterogeneous multiprocessor system which  we expect to see produced commericially in the near future. One of the key design  points will be to allow for fast data communication between the CPU and the GPU.  Presently communications must be performed between discrete CPU and GPU  chips. But this is changing in AMDs Fusion architecture. Presently the plan is to use  multiple (at least 16) PCI express channels for facilitate intercommunication. Intel  is also jumping into this arena with their Larrabee chip. Intel is considering to use  their QuickPath interconnect technology.


7.18.1 [25] <7.7> Compare the bandwidth and latency associated with these two  interconnect technologies. 




Exercise 7.19
Refer to Figure 7.7b that shows an n­cube interconnect topology of order 3 that  interconnects 8 nodes. One attractive feature of an n­cube interconnection network topology is its ability to sustain broken links and still provide connectivity. 



7.19.1 [10] <7.8> Develop an equation that computes how many links in the  n­cube (where n is the order of the cube) can fail and we can still guarantee an  unbroken link will exist to connect any node in the n­cube.



7.19.2 [10] <7.8> Compare the resiliency to failure of n­cube to a fully­ connected  interconnection network. Plot a comparison of reliability as a function of the added  number of links for the two topologies.



Exercise 7.20
Benchmarking is field of study that involves identifying representative workloads  to run on specific computing platforms in order to be able to objectively compare  performance of one system to another. In this exercise we will compare two classes  of benchmarks: the Whetstone CPU benchmark and the PARSEC Benchmark  suite. Select one program from PARSEC. All programs should be freely available  on the Internet. Consider running multiple copies of Whetstone versus running  the PARSEC Benchmark on any of systems described in Section 7.11. 


7.20.1 [60] <7.9> What is inherently different between these two classes of workload when run on these multi­core systems?


7.20.2 [60] <7.9, 7.10> In terms of the Roofline Model, how dependent will the  results you obtain when running these benchmarks be on the amount of sharing  and synchronization present in the workload used?


Exercise 7.21
When performing computations on sparse matrices, latency in the memory hierarchy becomes much more of a factor. Sparse matrices lack the spatial locality in the  data stream typically found in matrix operations. As a result, new matrix representations have been proposed.  One the earliest sparse matrix representations is the Yale Sparse Matrix Format. It  stores an initial sparse m×n matrix, M in row form using three one­dimensional  arrays. Let R be the number of nonzero entries in M. We construct an array A  of length R that contains all nonzero entries of M (in left­to­right top­to­bottom  order). We also construct a second array IA of length m + 1 (i.e., one entry per row,  plus one). IA(i) contains the index in A of the first nonzero element of row i. Row  i of the original matrix extends from A(IA(i)) to A(IA(i+1)–1). The third array, JA,  contains the column index of each element of A, so it also is of length R.


7.21.1 [15] <7.9> Consider the sparse matrix X below and write C code that  would store this code in Yale Sparse Matrix Format.


Row 1 [1, 2, 0, 0, 0, 0]
Row 2 [0, 0, 1, 1, 0, 0]
Row 3 [0, 0, 0, 0, 9, 0]
Row 4 [2, 0, 0, 0, 0, 2]
Row 5 [0, 0, 3, 3, 0, 7]
Row 6 [1, 3, 0, 0, 0, 1]
 

7.21.2 [10] <7.9> In terms of storage space, assuming that each element in  matrix X is single precision floating point, compute the amount of storage used to  store the Matrix above in Yale Sparse Matrix Format.


7.21.3 [15] <7.9> Perform matrix multiplication of Matrix X by Matrix Y shown  below. 
[2, 4, 1, 99, 7, 2]
Put this computation in a loop, and time its execution. Make sure to increase the  number of times this loop is executed to get good resolution in your timing measurement.  Compare the runtime of using a naïve representation of the matrix, and  the Yale Sparse Matrix Format.


7.21.4 [15] <7.9> Can you find a more efficient sparse matrix representation (in  terms of space and computational overhead)?




Exercise 7.22
In  future  systems,  we  expect  to  see  heterogeneous  computing  platforms  constructed out of heterogeneous CPUs. We have begun to see some appear in the  embedded processing market in systems that contain both floating point DSPs and  a microcontroller CPUs in a multichip module package. Assume that you have three classes of CPU: CPU A—A moderate speed multi­core CPU (with a floating point unit) that can  execute multiple instructions per cycle. CPU B—A fast single­core integer CPU (i.e., no floating point unit) that can execute a single instruction per cycle. CPU C—A slow vector CPU (with floating point capability) that can execute multiple copies of the same instruction per cycle. Assume that our processors run at the following frequencies:


CPU A can execute 2 instructions per cycle, CPU B can execute 1 instruction per  cycle, and CPU C can execute 8 instructions (though the same instruction) per  cycle. Assume all operations can complete execution in a single cycle of latency  without any hazards.


All three CPUs have the ability to perform integer arithmetic, though CPU B cannot perform floating point arithmetic. CPU A and B have an instruction set similar  to a MIPS processor. CPU C can only perform floating point add and subtract  operations, as well as memory loads and stores. Assume all CPUs have access to  shared memory and that synchronization has zero cost. The task at hand is to compare two matrices X and Y that each contain 1024 × 1024  floating point elements. The output should be a count of the number indices where  the value in X was larger or equal to the value in Y. 


7.22.1 [10] <7.11> Describe how you would partition the problem on the 3 different CPUs to obtain the best performance.


7.22.2 [10] <7.11> What kind of instruction would you add to the vector CPU C  to obtain better performance?



Exercise 7.23
Assume a quad­core computer system can process database queries at a steady state  rate of requests per second. Also assume that each transaction takes, on average, a  fixed amount of time to process. The following table shows pairs of transaction  latency and processing rate.



For each of the pairs in the table, answer the following questions: 7.23.1 [10] <7.11> On average, how many requests are being processed at any  given instant? 


7.23.2 [10] <7.11> If move to an 8­core system, ideally, what will happen to the  system throughput (i.e., how many queries/second will the computer process)?


7.22.3 [10] <7.11> Discuss why we rarely obtain this kind of speedup by simply  increasing the number of cores
 
 


Computer Organization and Design - Chapter 6 - Book solutions - 4th edition - Hennessy, Patterson

Exercise 6.1
 Figure 6.2 describes numerous I/O devices in terms of their behavior, partner, and data rate. However, these classifications often do not provide a complete picture of data flow within a system. Explore device classifications for the following devices.
a. Auto Pilot
 b. Automated Thermostat

6.1.1 [5] <6.1> For the devices listed in the table, identify I/O interfaces and  classify them in terms of their behavior and partner.


6.1.2 [5] <6.1> For the interfaces identified in the previous problem, estimate their data rate.


6.1.3 [5] <6.1> For the interfaces identified in the previous problem, determine whether data rate or operation rate is the best performance measurement.



Exercise 6.2
Mean Time Between Failures (MTBF), Mean Time To Replacement (MTTR), and Mean Time To Failure (MTTF) are useful metrics for evaluating the reliability and availability of a storage resource. Explore these concepts by answering the questions about devices with the following metrics.







6.2.1 [5] <6.1, 6.2> Calculate the MTBF for each of the devices in the table.


6.2.2 [5] <6.1, 6.2> Calculate the availability for each of the devices in the table.


6.2.3 [5] <6.1, 6.2> What happens to availability as the MTTR approaches 0? Is this a realistic situation?
 
 
6.2.4 [5] <6.1, 6.2> What happens to availability as the MTTR gets very high, i.e., a device is difficult to repair? Does this imply the device has low availability?
 


Exercise 6.3
Average and minimum times for reading and writing to storage devices are  common measurements used to compare devices. Using techniques from Chapter 6, calculate values related to read and write time for disks with the following  characteristics.



6.3.1 [10] <6.2, 6.3> Calculate the average time to read or write a 1024­byte  sector for each disk listed in the table.


6.3.2 [10] <6.2, 6.3> Calculate the minimum time to read or write a 2048­byte sector for each disk listed in the table.


6.3.3 [10] <6.2, 6.3> For each disk in the table, determine the dominant factor for performance. Specifically, if you could make an improvement to any aspect of the disk, what would you choose? If there is no dominant factor, explain why.



Exercise 6.4
Ultimately, storage system design requires consideration of usage scenarios as well as disk parameters. Different situations require different metrics. Let’s try to systematically evaluate disk systems. Explore differences in how storage systems should be evaluated by answering the questions about the following applications.
a. Aircraft Control System
b. Phone Switch


6.4.1 [5] <6.2, 6.3> For each application, would decreasing the sector size during reads and writes improve performance? Explain your answer.


6.4.2 [5] <6.2, 6.3> For each application, would increasing disk rotation speed improve performance? Explain your answer.


6.4.3 [5] <6.2, 6.3> For each application, would increasing disk rotation speed improve system performance given that MTTF is decreased? Explain your answer.



Exercise 6.5
FLASH memory is one of the first true competitors for traditional disk drives. Explore the implications of FLASH memory by answering questions about the  following applications.
a. Aircraft Control System
b. Phone Switch

6.5.1 [5] <6.2, 6.3, 6.4> As we move towards solid state drives constructed from FLASH memory, what will change about disk read times assuming that the data transfer rate stays constant?


6.5.2 [10] <6.2, 6.3, 6.4> Would each application benefit from a solid state FLASH drive given that cost is a design factor?


6.5.3 [10] <6.2, 6.3, 6.4> Would each application be inappropriate for a solid state FLASH drive given that cost is NOT a design factor?



Exercise 6.6
Explore the nature of FLASH memory by answering the questions related to performance for FLASH memories with the following characteristics.


6.6.1 [10] <6.2, 6.3, 6.4> Calculate the average time to read or write a 1024­byte sector for each FLASH memory listed in the table.


6.6.2 [10] <6.2, 6.3, 6.4> Calculate the minimum time to read or write a 512­byte sector for each FLASH memory listed in the table.


6.6.3 [5] <6.2, 6.3, 6.4> Figure 6.6 shows that FLASH memory read and write access times increase as FLASH memory gets larger. Is this unexpected? What factors cause this?
 


Exercise 6.7
I/O can be performed either synchronously or asynchronously. Explore the differences by answering performance questions about the following peripherals.
a. Printer
b. Scanner

6.7.1 [5] <6.5> What would be the most appropriate bus type (synchronous or asynchronous) for handling communications between a CPU and the peripherals listed in the table?


6.7.2 [5] <6.5> What problems would long, synchronous busses cause for connections between a CPU and the peripherals listed in the table?


6.7.3 [5] <6.5> What problems would asynchronous busses cause for connections between a CPU and the peripherals listed in the table?



Exercise 6.8
Among the most common bus types used in practice today are FireWire (IEEE 1394), USB, PCI, and SATA. Although all four are asynchronous, they are implemented in different ways giving them different characteristics. Explore different bus structures by answering questions about the busses and the following peripherals.
a. Mouse
b. Graphics Coprocessor

6.8.1 [5] <6.5> Select an appropriate bus (FireWire, USB, PCI, or SATA) for the peripherals listed in the table. Explain why the bus selected is appropriate. (See Figure 6.8 for key characteristics of each bus.)


6.8.2 [20] <6.5> Use online or library resources and summarize the communication structure for each bus type. Identify what the bus controller does and where the control physically is.


6.8.3 [15] <6.5> Outline limitations of each of the bus types. Explain why those limitations must be taken into consideration when using the bus.




Exercise 6.9
Communicating with I/O devices is achieved using combinations of polling, interrupt handling, memory mapping, and special I/O commands. Answer the  questions about communicating with I/O subsystems for the following applications using combinations of these techniques.
a. Auto Pilot
b. Automated Thermostat

6.9.1 [5] <6.6> Describe device polling. Would each application in the table be appropriate for communication using polling techniques? Explain.


6.9.2 [5] <6.6> Describe interrupt driven communication. For each application in the table, if polling is inappropriate, explain how interrupt driven techniques could be used.


6.9.3 [10] <6.6> For the applications listed in the table, outline a design for memory mapped communication. Identify reserved memory locations and outline their contents.


6.9.4 [10] <6.6> For the applications listed in the table, outline a design for commands implementing command driven communication. Identify commands and their interaction with the device.


6.9.5 [5] <6.6> Does it make sense to define I/O subsystems that use a combination of memory mapping and command driven communication? Explain your answer.




Exercise 6.10
Section 6.6 defines an eight­step process for handling interrupts. The Cause and Status registers together provide information on the cause of the interrupt and the status of the interrupt handling system. Explore interrupt handling by answering the questions about the following combinations of interrupts.
a. Ethernet Controller Data Mouse Controller Reboot
b. Mouse Controller Power Down Overheat


6.10.1 [5] <6.6> When an interrupt is detected, the Status register is saved and all but the highest priority interrupt is disabled. Why are low­priority interrupts disabled? Why is the status register saved prior to disabling interrupts?


6.10.2 [10] <6.6> Prioritize interrupts from the devices listed in each table row.


6.10.3 [10] <6.6> Outline how an interrupt from each of the devices listed in the table would be handled


6.10.4 [5] <6.6> What happens if the interrupt enable bit of the Cause register is not set when handling an interrupt? What value could the interrupt mask value take to accomplish the same thing?


6.10.5 [5] <6.6> Most interrupt handling systems are implemented in the operating system. What hardware support could be added to make interrupt handling more efficient? Contrast your solution to potential hardware support for function calls.


6.10.6 [5] <6.6> In some interrupt handling implementations, an interrupt causes an immediate jump to an interrupt vector. Instead of a Cause register where each interrupt sets a bit, each interrupt has its own interrupt vector. Can the same priority interrupt system be implemented using this approach? Is there any advantage to this approach?



Exercise 6.11
Direct Memory Access (DMA) allows devices to access memory directly rather than working through the CPU. This can dramatically speed up the performance of peripherals, but adds complexity to memory system implementations. Explore DMA implications by answering the questions about the following peripherals.
a. Mouse Controller
b. Ethernet Controller

6.11.1 [5] <6.6> Does the CPU relinquish control of memory when DMA is active? For example, can a peripheral simply communicate with memory directly, avoiding the CPU completely?


6.11.2 [10] <6.6> Of the peripherals listed in the table, which would benefit from DMA? What criteria determine if DMA is appropriate?


6.11.3 [10] <6.6> Of the peripherals listed in the table, which could cause coherency problems with cache contents? What criteria determine if coherency issues must be addressed?


6.11.4 [5] <6.6> Describe what problems could occur when mixing DMA and virtual memory. Which of the peripherals in the table could introduce such problems? How can they be avoided?
 



Exercise 6.12
Metrics for I/O performance may vary dramatically from application to application. Where the number of transactions processed dominates performance in some situations, data throughput dominates in others. Explore I/O performance evaluation by answering the questions for the following applications.
a. Mathematical Computations
b. Online Chat


6.12.1 [10] <6.7> For each application in the table, does I/O performance dominate system performance?


6.12.2 [10] <6.7> For each application in the table, is I/O performance best measured using raw data throughput?


6.12.3 [5] <6.7> For each application in the table, is I/O performance best measured using the number of transactions processed?


6.12.4 [5] <6.7> Is there a relationship between the performance measures from the previous two problems and choosing whether to use polling or interrupt driven communication? What about the choice of using memory mapped or command driven I/O?
 



Exercise 6.13
Benchmarks play an important role in evaluating and selecting peripheral devices. For benchmarks to be useful, they must exhibit properties similar to those experienced by a device under normal use. Explore benchmarks and device selection by answering questions about the following applications.
a. Mathematical Computations
b. Online Chat


6.13.1 [5] <6.7> For each application in the table, define characteristics that a set of benchmarks should exhibit when evaluating an I/O subsystem.


6.13.2 [15] <6.7> Using online or library resources, identify a set of standard benchmarks for applications in the table. Why do standard benchmarks help?


6.13.3 [5] <6.7> Does it make sense to evaluate an I/O subsystem outside the larger system it is a part of? How about evaluating a CPU?
 



Exercise 6.14
RAID is among the most popular approaches to parallelism and redundancy in storage systems. The name Redundant Arrays of Inexpensive Disks implies  several things about RAID arrays that we will explore in the context of the following  activities.
a. High-Performance Mathematical Computations
b. Online Video Services

6.14.1 [10] <6.9> RAID 0 uses striping to force parallel access among many disks. Why does striping improve disk performance? For each of the activities listed in the table, will striping help better achieve their goals?


6.14.2 [5] <6.9> RAID 1 mirrors data among several disks. Assuming that inexpensive disks have lower MTBF than expensive disks, how can redundancy using inexpensive disks result in a system with lower MTBF? Use the mathematical definition of MTBF to explain your answer. For each of the activities listed in the table, will RAID 1 help better achieve their goals?


6.14.3 [5] <6.9> Like RAID 1, RAID 3 provides higher data availability. Explain the trade­off between RAID 1 and RAID 3. Would each of the applications listed in the table benefit from RAID 3 over RAID 1?



Exercise 6.15
RAID 3, RAID 4, and RAID 5 all use parity system to protect blocks of data.  Specifically, a parity block is associated with a collection of data blocks. Each row in the following table shows the values of the data and parity blocks, as described in Figure 6.13.



6.15.1 [10] <6.9> Calculate the new RAID 3 parity value P’ for data in lines a and b in the table.


6.15.2 [10] <6.9> Calculate the new RAID 4 parity value P’ for data in lines a and b in the table.


6.15.3 [5] <6.9> Is RAID 3 or RAID 4 more efficient? Are there reasons why RAID 3 would be preferable to RAID 4?


6.15.4 [5] <6.9> RAID 4 and RAID 5 use roughly the same mechanism to calculate and store parity for data blocks. How does RAID 5 differ from RAID 4 and for what applications would RAID 5 be more efficient?


6.15.5 [5] <6.9> RAID 4 and RAID 5 speed improvements grow with respect to RAID 3 as the size of the protected block grows. Why is this the case? Is there a situation where RAID 4 and RAID 5 would be no more efficient than RAID 3?



Exercise 6.16
The emergence of web servers for ecommerce, online storage, and communication has made disk servers critical applications. Availability and speed are well­known metrics for disk servers, but power consumption is becoming increasingly important. Answer the questions about configuration and evaluation of disk servers with the following parameters.


6.16.1 [10] <6.8, 6.10> Find the maximum sustained I/O rate for random reads and writes. Ignore disk conflicts and assume the RAID controller is not the bottleneck. Follow the same approach as outlined in Section 6.10 making similar assumptions where necessary.


6.16.2 [10] <6.8, 6.10> Assume we are configuring a Sun Fire x4150 server as described in Section
 6.10. Determine if a configuration of 8 disks presents an I/O bottleneck. Repeat for configurations of 16, 4, and 2 disks.


6.16.3 [10] <6.8, 6.10> Determine if the PCI bus, DIMM, or the Front Side Bus presents an I/O bottleneck. Use the same parameters and assumptions used in Section 6.10.


6.16.4 [5] <6.8, 6.10> Explain why real systems tend to use benchmarks or real applications to assess actual performance.



Exercise 6.17
Determining the performance of a single server with relatively complete data is an easy task. However, when comparing servers from different vendors providing different data, choosing among alternatives can be difficult. Explore the process of finding and evaluating servers by answering questions about the following application.
Database Server

6.17.1 [15] <6.8, 6.10> For the application listed above, identify runtime characteristics for an operational system. Choose characteristics that will support evaluation similar to that performed for Exercise 6.16



6.17.2 [15] <6.8, 6.10> For the application listed above, find a server available in the marketplace that you feel would be appropriate for running the application. Before evaluating the server, identify reasons why it was selected.


6.17.3 [20] <6.8, 6.10> Using metrics similar to those used in Chapter 6 and Exercise 6.16, assess the server you identified in 6.17.2 in comparison to the Sun Fire x4150 server evaluated in Exercise 6.16. Which would you choose? Did the results of your analysis surprise you? Specifically, would you choose differently?


6.17.4 [15] <6.8, 6.10> Identify a standard benchmark set that would be useful for comparing the server you identified in 6.17.2 with the Sun Fire x4150.



Exercise 6.18
Measurements and statistics provided by storage vendors must be carefully interpreted to gain meaningful predictions about their system behavior. The following table provides data for various disk drives.



6.18.1 [10] <6.12> Calculate annual failure rate (AFR) for disks in the table.


6.18.2 [10] <6.12> Assume that annual failure rate varies over the lifetime of disks in the previous table. Specifically, assume that AFR is three times as high in the 1st month of operation and doubles every year starting in the 5th year. How many disks would be replaced after 7 years of operation? What about 10 years?


6.18.3 [10] <6.12> Assume that disks with lower failure rates are more expensive. Specifically, disks are available at a higher cost that will start doubling their failure rate in year 8 rather than year 5. How much more would you pay for disks if your intent is to keep them for 7 years? What about 10 years?



Exercise 6.19
For disks in the table in Exercise 6.18, assume that your vendor offers a RAID 0 configuration that will increase storage system throughput by 70% and a RAID 1 configuration that will drop AFR of disk pairs by 2. Assume that the cost of each solution is 1.6 times the original solution cost. 6.19.1 [5] <6.9, 6.12> Given only the original problem parameters, would you recommend upgrading to either RAID 0 or RAID 1 assuming individual disk parameters remain the same in the previous table?


6.19.2 [5] <6.9, 6.12> Given that your company operates a global search engine with a large disk farm, does upgrading to either RAID 0 or RAID 1 make economic sense given that your income model is based on the number of advertisements served?


6.19.3 [5] <6.9, 6.12> Repeat 6.19.2 for a large disk farm operated by an online backup company. Does upgrading to either RAID 0 or RAID 1 make economic sense given that your income model is based on the availability of your server?



Exercise 6.20
Day­to­day evaluation and maintenance of operational computer systems involves many of the concepts discussed in Chapter 6. Explore the intricacies of evaluating systems by exploring the following questions.

6.20.1 [20] <6.10, 6.12> Configure the Sun Fire x4150 to provide 10 terabytes of storage for a processor array of 1000 processors running bioinformatics simulations. Your configuration should minimize power consumption while addressing throughput and availability concerns for the disk array. Make sure you consider the properties of large simulations when performing your configuration.


6.20.2 [20] <6.10, 6.12> Recommend a backup and data archiving system for the disk array from
 6.20.1. Compare and contrast disk, tape, and online backup capabilities. Use Internet and library resources to identify potential servers. Assess cost and suitability for the application using parameters described in Chapter 6. Select parameters for comparison using properties of the application as well as specified requirements.


6.20.3 [15] <6.10, 6.12> Competing vendors for the systems you identified in 6.20.2 have offered to allow you to evaluate their systems on site. Identify the benchmarks you will use to determine which system is best for your application. Determine how long it will take you to gather enough data to make your determination.