Question 1

Consider the following loop, which implements a vector dot product:

loop: L.D   F2, X(R1)
      L.D   F4, Y(R1)
      MUL.D F6, F4, F2
      ADD.D F0, F0, F6
      SUBUI R1, R1, #8
      BNEZ  R1, loop

Assume a 5-stage integer pipeline with successful branch prediction, so that the BNEZ can be considered to take a single cycle. Integer instructions have a single execution stage. Floating-point instructions have variable-length execution; all instructions are fully pipelined with an initiation interval of 1 and have the following latencies:

Producing Instruction    Instruction Using Result    Latency in cycles
Integer ALU              Integer ALU                 0
Integer ALU              Branch                      1
Float Load               Float ALU                   1
Float Add                Float ALU                   3
Float Multiply           Float ALU                   4

Using this data:

  1. Show the possible stalls for one loop iteration. How many clock cycles does one loop iteration take?
  2. Schedule the loop by reordering instructions to reduce the execution time. Show the stalls and give the number of clock cycles for one iteration.
  3. Unroll the loop enough times to minimize the number of stalls. Show the stalls for one pass through the unrolled loop, and give the number of clock cycles per original iteration, that is, the number of cycles for the unrolled loop divided by the number of original iterations it contains.
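
A hand-drawn schedule can be cross-checked with a small script. The following is a minimal sketch in Python, not part of the original question: the function name issue_cycles, the class labels, and the tuple encoding of instructions are all hypothetical. It assumes single-issue, in-order execution, reads "latency" as the number of stall cycles needed between a producer and a dependent consumer, treats any producer/consumer pair missing from the table as latency 0, and ignores dependences carried from a previous iteration.

```python
# Latencies copied from the table in the question:
# (producer class, consumer class) -> cycles that must separate the two.
LATENCY = {
    ("int_alu", "int_alu"): 0,
    ("int_alu", "branch"): 1,
    ("fp_load", "fp_alu"): 1,
    ("fp_add", "fp_alu"): 3,
    ("fp_mul", "fp_alu"): 4,
}

# Hypothetical classification of each opcode as a producer and as a consumer.
PRODUCES = {"L.D": "fp_load", "MUL.D": "fp_mul", "ADD.D": "fp_add",
            "SUBUI": "int_alu", "BNEZ": None}
CONSUMES = {"L.D": "load", "MUL.D": "fp_alu", "ADD.D": "fp_alu",
            "SUBUI": "int_alu", "BNEZ": "branch"}

# One iteration of the loop as (opcode, destination, sources).
LOOP = [
    ("L.D", "F2", ["R1"]),
    ("L.D", "F4", ["R1"]),
    ("MUL.D", "F6", ["F4", "F2"]),
    ("ADD.D", "F0", ["F0", "F6"]),
    ("SUBUI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
]


def issue_cycles(instrs):
    """Return the 1-based issue cycle of each instruction, in program order."""
    last_writer = {}  # register -> (issue cycle, producer class)
    cycles = []
    cycle = 0
    for op, dest, srcs in instrs:
        earliest = cycle + 1  # in order, at most one instruction per cycle
        for reg in srcs:
            if reg in last_writer:
                t, pclass = last_writer[reg]
                # Pairs not listed in the table are assumed to need no stall.
                lat = LATENCY.get((pclass, CONSUMES[op]), 0)
                earliest = max(earliest, t + 1 + lat)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, PRODUCES[op])
    return cycles


# A load followed directly by a dependent float add needs one stall cycle.
print(issue_cycles([("L.D", "F2", ["R1"]), ("ADD.D", "F8", ["F2", "F2"])]))  # [1, 3]
```

Calling issue_cycles(LOOP), or the same function on a reordered or unrolled instruction list, gives the issue cycle of every instruction. Gaps between consecutive issue cycles are the stalls, and the last issue cycle is the cycle count for the sequence under the assumptions above.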

Question 2

Consider the original version of the code using Tomasulo's algorithm for scheduling, with the following assumptions:

Functional Unit    Reservation Stations    Number of Units    Cycles in EX
Integer            2                       1                  1
Load/Store         2                       2                  2
Float Add          2                       2                  4
Float Multiply     2                       1                  5

Complete this table for the first two iterations:

Instruction          Issue    Execution Complete    Write Result
L.D   F2, X(R1)
L.D   F4, Y(R1)
MUL.D F6, F4, F2
ADD.D F0, F0, F6
SUBUI R1, R1, #8              X                     X
BNEZ  R1, loop                X                     X
L.D   F2, X(R1)
L.D   F4, Y(R1)
MUL.D F6, F4, F2
ADD.D F0, F0, F6
SUBUI R1, R1, #8              X                     X
BNEZ  R1, loop                X                     X
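
Once the table is filled in, each row can be sanity-checked against the functional-unit data. The sketch below is only an illustration and not part of the original question: check_row, check_issue_order, and the unit labels are hypothetical names. The timing conventions are also assumptions, since courses differ on them: execution starts no earlier than the cycle after issue, the result is written no earlier than the cycle after execution completes, and issue is in order with at most one instruction per cycle. Passing these checks is necessary for a correct answer but not sufficient.

```python
# "Cycles in EX" per functional unit, copied from the table in the question.
EX_CYCLES = {"integer": 1, "load_store": 2, "fp_add": 4, "fp_mul": 5}


def check_row(unit, issue, exec_complete, write_result):
    """Return a list of problems found in one filled-in row (empty if none)."""
    problems = []
    # Assumption: execution begins no earlier than the cycle after issue.
    if exec_complete < issue + EX_CYCLES[unit]:
        problems.append("execution completes too early for this unit")
    # Assumption: the result is written after execution has completed.
    if write_result <= exec_complete:
        problems.append("write result must follow execution complete")
    return problems


def check_issue_order(issue_column):
    """Assumption: in-order issue, at most one instruction per cycle."""
    return all(a < b for a, b in zip(issue_column, issue_column[1:]))


# A load issued in cycle 1 cannot finish its 2 EX cycles before cycle 3.
print(check_row("load_store", 1, 3, 4))  # []
print(check_row("fp_mul", 3, 6, 7))      # ['execution completes too early for this unit']
```

These checks do not model operand dependences, reservation-station occupancy, or contention for functional units, so they catch only arithmetic slips in a row, not scheduling errors across rows.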