Pipelining

The MIPS R4000 processor has an 8-stage pipeline, with the stages shown below:

R4000 pipeline stages: IF IS RF EX DF DS TC WB

Draw a timing diagram for the following sequence of instructions, with cycles on the horizontal axis and instructions issued on the vertical axis. Show any necessary stalls and draw arrows between stages for any necessary forwarding.

LW    $s1, 0($s1)
LW    $s2, 0($s2)
ADD   $s3, $s1, $s2
ADDI  $s3, #1
LW    $s4, 0($s3)
ADDI  $s4, #5
SW    0($s3), $s4

Branching

The R4000 branch delay is 3 cycles. Assuming branch prediction with a branch target buffer (BTB) in the IF stage, consider the following code:

        BNEZ  $s5, Target
        SUB   $s1, $s2, $s3
Target: ADD   $s1, $s1, $s4
        SUB   $s3, $s3, $s1
        ADD   $s2, $s1, $s4
        ADD   $s2, $s2, $s3

For each of the following, draw a timing diagram with cycles on the horizontal axis and instructions issued on the vertical axis. Clearly indicate any instructions that are issued speculatively and then abandoned.

Correct Prediction of Not Taken

Draw the diagram if the branch is predicted to not be taken, and is, in fact, not taken.

Correct Prediction of Taken

Draw the diagram if the branch is predicted to be taken, and is, in fact, taken.

Incorrect Prediction of Not Taken

Draw the diagram if the branch is predicted to not be taken, but is actually taken.

Incorrect Prediction of Taken

Draw the diagram if the branch is predicted to be taken, but is actually not taken.

Instruction Set Architecture

The Intel 4004 was a 4-bit microprocessor, and the first processor created by Intel. For this question, refer to the Intel MCS-4 Assembly Language Programming Manual.

The SUB instruction performs A = A + ~R + ~C for accumulator A, register R, and carry bit C. The SBM instruction performs A = A + ~M + ~C for a value M in memory.

A 2's complement negation is -R = ~R + 1, so these instructions perform a 2's complement subtract if C is initialized to 0. As seen in section 4.6 of the 4004 manual, you can chain these instructions to subtract values wider than four bits, but you need to complement the carry between each 4-bit digit of a multi-digit subtract.
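The carry behavior described above can be checked with a small model. The following is a minimal Python sketch, not 4004 code and not the listing from the manual: it models only the formula A = A + ~R + ~C on 4-bit values, and the names `sub4`, `sub_multi`, `to_digits`, and `from_digits` are hypothetical helpers introduced for illustration.

```python
MASK = 0xF  # the accumulator and registers are 4 bits wide


def sub4(a, r, c):
    """Model of SUB as stated above: A = A + ~R + ~C on 4 bits.

    Returns (new_accumulator, carry_out). A carry out of 1 means
    that no borrow occurred.
    """
    total = a + (~r & MASK) + (~c & 1)
    return total & MASK, total >> 4


def to_digits(n, count):
    """Split n into `count` 4-bit digits, least significant first."""
    return [(n >> (4 * i)) & MASK for i in range(count)]


def from_digits(digits):
    """Inverse of to_digits."""
    return sum(d << (4 * i) for i, d in enumerate(digits))


def sub_multi(x, y, count):
    """Multi-digit subtract x - y, complementing the carry between digits."""
    carry = 0  # C = 0 before the first digit, so ~C supplies the +1
    out = []
    for a, r in zip(to_digits(x, count), to_digits(y, count)):
        digit, carry = sub4(a, r, carry)
        out.append(digit)
        carry ^= 1  # complement the carry before the next digit
    return from_digits(out)


print(hex(sub_multi(0x52, 0x1F, 2)))  # 0x33
```

In this model, removing the `carry ^= 1` line makes the second digit of 0x52 - 0x1F come out one too high, which is the effect the carry complement is there to prevent.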

Non-inverted Carry

If the subtract instructions instead performed A = A + ~R + C (without inverting the carry), you would need to set the carry to 1 before a single four-bit subtract or before the first of a sequence, but you would not need to do anything to the carry between digits of a multi-digit subtract. Rewrite the code on page 4-17 for this new version of the subtract.

Local speedup

What is the expected speedup of just this subtract code, as a function of the number of loop iterations?

Overall speedup

Assuming 32-bit subtract operations are 5% of the total run time of a program, what is the overall speedup?
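One common way to relate a local speedup to an overall speedup is Amdahl's law, sketched below on the assumption that it is the intended model here. The symbols are introduced for illustration only: $f$ is the fraction of run time spent in the improved code (0.05 in this question) and $S_{\text{local}}$ is the speedup of that code alone.

```latex
S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{S_{\text{local}}}}
```

Because the remaining $1 - f$ of the run time is unchanged, $S_{\text{overall}}$ can never exceed $1 / (1 - f)$, no matter how large $S_{\text{local}}$ becomes.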