Homework 3

CMSC 411 / Olano, Fall 2015

Turn in this assignment

Computation Time (20 points)

Assume multiplies take 32 cycles, branches take 2 cycles, and all other instructions take 1 cycle. At 2 GHz, how long does this program take to execute? Show the intermediate steps as well as the final number.

        addi $3, $0, 0
        addi $1, $0, 10
loop:   mul  $2, $1, $1
        add  $3, $3, $2
        addi $1, $1, -1
        bnez $1, loop
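
The bookkeeping for this kind of problem can be sketched in a few lines of Python. This is only an illustration: the instruction counts, cycle costs, and clock rate below are made-up values that deliberately differ from the problem above, and the names execution_time, counts, and cycles are hypothetical. The counts are dynamic counts, so an instruction inside a loop is counted once per iteration.

```python
def execution_time(counts, cycles, clock_hz):
    """Return (total_cycles, seconds) for a dynamic instruction mix.

    counts: executed instructions of each kind (loop bodies counted per iteration)
    cycles: cycles per instruction of each kind
    """
    total_cycles = sum(counts[kind] * cycles[kind] for kind in counts)
    return total_cycles, total_cycles / clock_hz


# Hypothetical numbers, NOT the ones from the problem above.
counts = {"mul": 3, "branch": 3, "other": 8}
cycles = {"mul": 4, "branch": 2, "other": 1}

total, seconds = execution_time(counts, cycles, clock_hz=1e9)
print(total, "cycles,", seconds * 1e9, "ns")
```

The intermediate steps asked for above correspond to the per-kind products inside the sum.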

Instruction Speedup (20 points)

Array indexing operations often take the form base + index*stride, where stride is the spacing between array elements (row size, struct size, etc.). In the MIPS instruction set, reading an array element takes at least three instructions:

muli $offset, $index, #stride
add  $addr, $base, $offset
lw   $element, 0($addr)

Consider adding new R-type load and store instructions, for strides up to 32, that use the shift field for the stride. The form of the load instruction would be:

lwa  $element, $base, $index, #stride

encoded as

op (6b) rs (5b) rt (5b) rd (5b) sh (5b) func (6b)
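
To make the field layout concrete, here is a minimal Python sketch that packs the six R-type fields into a 32-bit word. The function name encode_rtype is hypothetical. The check uses the standard MIPS add instruction, since the opcode and function code for lwa are not specified here, and which register goes in which field for lwa is likewise left open.

```python
def encode_rtype(op, rs, rt, rd, sh, func):
    """Pack R-type fields into one 32-bit word: op(6) rs(5) rt(5) rd(5) sh(5) func(6)."""
    fields = (("op", op, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("sh", sh, 5), ("func", func, 6))
    for name, value, width in fields:
        # A 5-bit field holds 0..31 and a 6-bit field holds 0..63.
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sh << 6) | func


# add $3, $1, $2 in standard MIPS: op=0, rs=1, rt=2, rd=3, sh=0, func=0x20
word = encode_rtype(0, 1, 2, 3, 0, 0x20)
print(f"{word:#010x}")  # 0x00221820
```

For ordinary arithmetic instructions such as add, the sh field is zero, which is why it is available to carry a stride.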

If all instructions on this processor take one cycle (including the multiplies) and 5% are lw and sw instructions that can be replaced with the new instruction, what is the expected speedup?

Combined Speedup (20 points)

Consider a processor where all instructions take one cycle at 500 MHz, limited by the time for the multiply operation. If you break the multiply into five cycles, you can increase the clock speed to 2 GHz. If 4% of the instructions are multiplies, what is the expected speedup?

Amdahl's Law (20 points)

Consider the processor from the end of the previous problem (2 GHz clock rate, 4% of the instructions are multiplies, which take 5 cycles, all other instructions take one cycle). If you reduce the multiply to just four cycles, what is the speedup for multiply instructions? What fraction of the time is spent in multiply instructions? Using Amdahl's law, what is the expected overall speedup?
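
As a reminder of the pieces involved, here is a small Python sketch of Amdahl's law, plus a helper that converts a fraction of instructions into a fraction of time. The function names and the example numbers are hypothetical and are not the values from this problem. Note that Amdahl's law needs the fraction of time, not the fraction of instructions.

```python
def time_fraction(instr_fraction, cycles_enhanced, cycles_other=1):
    """Fraction of execution time spent in one instruction class."""
    enhanced = instr_fraction * cycles_enhanced
    other = (1.0 - instr_fraction) * cycles_other
    return enhanced / (enhanced + other)


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the original time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical example: 10% of instructions take 3 cycles each, the rest 1 cycle.
f = time_fraction(0.10, 3)            # 0.3 / (0.3 + 0.9) = 0.25 of the time
overall = amdahl_speedup(f, 1.5)      # that 25% is made 1.5 times faster
print(f, overall)
```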

Other Applications (20 points)

A naive ray tracing algorithm spends 95% of its time doing ray intersection computation, 5% on I/O, and negligible time traversing data structures. A smarter ray tracer may avoid unnecessary ray intersection computations by using more complex data structures, spending 60% of its time traversing data structures, 10% on ray intersections, and 30% on I/O. Identify the unchanged time between the two cases. If that time is the same for both, what is the speedup of the smarter ray tracer relative to the naive one?

Extra credit (20 points)

The CDC 6600 was a supercomputer designed in 1965 by Seymour Cray. It had 60-bit words. Each instruction was either 30 bits (1/2 word) or 15 bits (1/4 word). The 30-bit instructions were not allowed to be split across a word boundary, so if there were only 15 bits left in a word when a 30-bit instruction was needed, you would encode a 15-bit no-op so the 30-bit instruction could start in the next word. Given a mix of 50% 15-bit instructions and 50% 30-bit instructions, we want to figure out the expected percentage of no-op instructions. Write a program to simulate this by randomly inserting either A for a 15-bit instruction, BB for a 30-bit instruction, or N for a no-op into a string. From the results of this simulation, what is the expected percentage of N instructions? What would the expected speedup be if you were able to avoid the no-op instructions? Include the percentage and speedup in your readme.txt.

Solve for this percentage analytically. Warning: I had to enumerate the possible cases, use conditional probabilities, and apply a few probability identities to do this. It is not that easy, and probably not worth the extra points over what you'd get by just writing the simulation.
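
The word-boundary rule itself can be sketched as follows. This is only an illustration of the packing rule on a fixed, hand-chosen input; the random instruction mix and the statistics are not included. The names pack and WORD_SLOTS are hypothetical, and the sketch assumes each 60-bit word is treated as four 15-bit slots.

```python
WORD_SLOTS = 4  # assumption: a 60-bit word is four 15-bit slots


def pack(instructions):
    """Lay out a sequence of instruction sizes (15 or 30 bits) as a string.

    A = 15-bit instruction, BB = 30-bit instruction, N = padding no-op.
    """
    out = []
    slot = 0  # next free slot in the current word, 0..3
    for bits in instructions:
        if bits == 15:
            out.append("A")
            slot = (slot + 1) % WORD_SLOTS
        elif bits == 30:
            if slot == WORD_SLOTS - 1:
                # Only 15 bits remain in this word: pad, then start a new word.
                out.append("N")
                slot = 0
            out.append("BB")
            slot = (slot + 2) % WORD_SLOTS
        else:
            raise ValueError("instruction size must be 15 or 30 bits")
    return "".join(out)


print(pack([15, 15, 15, 30]))  # AAANBB
print(pack([15, 30, 15]))      # ABBA
```

In the first call the 30-bit instruction arrives with one slot left, so a no-op is inserted; in the second it fits within the word and no padding is needed.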


Submit on paper at the beginning of class. If you do the simulation portion of the extra credit, submit that electronically in your assn3 git directory with a short "readme.txt". You do not need to submit the readme.txt if you do not attempt the extra credit.