CMSC 611, Spring 2018 Homework 2

SGI Origin 2000

The SGI Origin 2000 parallel computer included a HUB ASIC (Application-Specific Integrated Circuit). It connected a pair of CPUs to the network and implemented the cache coherence protocol. This ASIC consisted of about 900K gates, running at 100 MHz. (Eiríksson et al., "Origin system design methodology and experience: 1M-gate ASICs and beyond", Proceedings of IEEE Compcon '97)

Partitioned Testing

The Hub was broken into five modules for Verilog testing. The modules were further divided into a total of 20 “chiplets”. Why not simulate the whole thing at once?

Simulation speed

Full-chip tests combined a process running an RTL-level version of the Hub with a separate RTL simulation process for each R10000 CPU, simulating at 6 cycles per second (using a total of 49 processes across 32 processors for a simulation of a 16-node system). What is the speedup of the 195 MHz R10000 real system over the simulation?
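The speedup asked for here is a ratio of cycle rates. A minimal sketch in Python, assuming the 6 cycles per second of the simulation can be compared cycle for cycle against the 195 MHz CPU clock (the variable names are illustrative, not from the paper):

```python
real_hz = 195e6   # R10000 clock rate from the problem statement
sim_hz = 6.0      # simulated cycles per wall-clock second

# Speedup of the real system over the simulation (assumes equal work per cycle).
speedup = real_hz / sim_hz
print(f"speedup = {speedup:.3g}x")

# Equivalent view: wall-clock time needed to simulate one second of real execution.
seconds_per_year = 365 * 24 * 3600
years = speedup / seconds_per_year
print(f"one real second takes about {years:.2f} years to simulate")
```

Expressing the ratio as simulation time per real second makes it easier to see why only very short tests could be run at full-chip level.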

Chiplet Interface

Timing simulations of chiplets and the inter-chiplet cross-chip timing drove the layout and determined the clock speed. Eiríksson et al. say, “wherever possible, outputs were launched out of the chiplets from a register”. What is the benefit to doing this? What did they potentially lose by using registers for chiplet outputs?

Lingering Bugs

There were still bugs that did not appear until testing with the actual chip. Why were these not caught earlier?

CDC 6600

The CDC 6600 was a computer designed by Seymour Cray (of supercomputer fame) in 1965. Among other things, this computer used 60-bit words. Instructions using only registers (R) took 15 bits, while those including an address (A) took 30 bits. Each word could hold multiple instructions (e.g. up to four R-type), but instructions were not allowed to be split across a word boundary. So, if you wanted to place an A-type instruction with only 15 bits left in the current word, you would fill the remaining space with a 15-bit NO-OP (N) instruction and start the A-type at the beginning of the next word.
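The alignment rule above can be illustrated on a small, fixed instruction sequence. This is a sketch only: the function name pack and the list-of-lists word representation are assumptions made for illustration, and a partially filled final word is left unpadded.

```python
def pack(stream):
    """Pack a list of 'R' (15-bit) and 'A' (30-bit) instructions into 60-bit words.

    Returns a list of words, each a list of 'R', 'A', or 'N' (padding NO-OP).
    """
    size = {"R": 15, "A": 30}
    words = [[]]
    used = 0  # bits already occupied in the current word
    for ins in stream:
        if used + size[ins] > 60:
            # Only an A-type with 15 bits left gets here: pad, then start a new word.
            words[-1].append("N")
            words.append([])
            used = 0
        words[-1].append(ins)
        used += size[ins]
        if used == 60:
            # Word is exactly full; open the next one.
            words.append([])
            used = 0
    if not words[-1]:
        words.pop()  # drop the empty word opened after a full one
    return words

print(pack(["R", "R", "R", "A", "R"]))
```

In this example the A-type arrives with one 15-bit slot left, so the first word ends in N and the A-type starts the second word.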

Simulation

Write a program to empirically estimate the percentage of NO-OP instructions if the initial instruction mix (before adding NO-OPs) was 35% R-type and 65% A-type. You can do this by generating a random stream of instructions of both types, then padding with NO-OPs as necessary to obey the alignment constraints. Report the percentage of R and A in the original instruction stream, the percentage of words that have an R or A as their first instruction, the percentage of words that have an R, A or N as their last instruction, and the final percentages of R, A, and N per final instruction.
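One possible shape for such a program is sketched below. It is not the reference solution: it tracks only the next free 15-bit slot in the current word and reports only the overall NO-OP fraction, leaving the per-word first and last instruction statistics out. The function name simulate, the seed, and the stream length are all arbitrary choices.

```python
import random

def simulate(n_instructions, p_r=0.35, seed=0):
    """Return the fraction of NO-OPs in the padded instruction stream."""
    rng = random.Random(seed)
    slot = 0    # next free 15-bit slot in the current word (0..3)
    noops = 0
    for _ in range(n_instructions):
        if rng.random() < p_r:   # R-type: takes one slot
            slot = (slot + 1) % 4
        elif slot == 3:          # A-type with one slot left: pad, place in next word
            noops += 1
            slot = 2
        else:                    # A-type that fits: takes two slots
            slot = (slot + 2) % 4
    return noops / (n_instructions + noops)

print(f"NO-OP fraction: {simulate(200_000):.4f}")
```

Tracking a slot counter instead of building the words keeps the loop short, but the requested first and last instruction percentages would need extra counters updated whenever slot is 0 or a word is completed.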

Also turn in a printout of your program. It shouldn't be too long (mine is 72 lines, 18 of which are comments or blanks, and 13 are printing the final statistics). Obviously, we won't run it, but I want to see your approach.

Analytical

Derive the percentage analytically. This is trickier than it may seem. Hint: the first instruction of the word has a different probability of occurrence than later instructions in the word, with A-type instructions more likely since they could be pushed into the first spot in the word from the previous word. This affects the probability of each of the other sequences that can occur in a word. Solving for the first-instruction probabilities is the key to finding the other statistics. If your program and analytic solution do not match, there's a problem with at least one of them.

Many of the details of the CDC 6600 for this problem were taken from Blaauw and Brooks, Computer Architecture: Concepts and Evolution. According to that source, the CDC 6600 could also only branch to addresses on a word boundary, resulting in still more NO-OPs (which I am not asking you to model). With these two sources (branch target alignment and instruction alignment), apparently 23% of compiled code for this machine was NO-OP instructions!

x86 CPI

Consider this code performing a dot product between two 16-bit integer vectors:

do {
	r = r + a[i] * b[i];
	++i;
} while (--c != 0);

That can be translated into this code for the Intel 8086 (yes, the one from 1978), assuming r is in the BX register, c is in the CX register, i is in the SI register, and the result is generated in AX:

loop:	MOV   AX, a[SI]	; AX = a[SI]
		IMUL  b[SI]		; AX = AX * b[SI]
		ADD   BX, AX		; r = r + AX
		ADD   SI, 2		; SI = SI + 2 (for 2-byte data): this is the ++i
		DEC   CX		; CX = CX - 1: this is the --c
		JNZ   loop		; this is the while test
		;; assume this is followed by a register/register ALU instruction

8086 CPI and execution time

Using the timing data for this processor, compute the CPI for this code. You will need to use the "Effective Address (EA)" timing table as well as the individual instruction tables. In this and later questions, if the instruction gives a range of timings, your answer should also be a range. Given a 10 MHz clock speed and 100 iterations, what is the expected execution time of this code?
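The bookkeeping for a CPI range can be organized as in the sketch below. The cycle counts in it are placeholders and are NOT taken from any 8086 or 286 timing table; substitute the per-instruction ranges (including any EA cycles) that you look up. The function name cpi_and_time is hypothetical, and the sketch assumes every loop iteration costs the same number of cycles.

```python
def cpi_and_time(cycles, clock_hz, iterations):
    """cycles: list of (min, max) clock counts, one pair per instruction in the loop body.

    Returns ((cpi_min, cpi_max), (seconds_min, seconds_max)).
    """
    n = len(cycles)
    lo = sum(c[0] for c in cycles)   # best-case cycles per iteration
    hi = sum(c[1] for c in cycles)   # worst-case cycles per iteration
    cpi = (lo / n, hi / n)
    seconds = (lo * iterations / clock_hz, hi * iterations / clock_hz)
    return cpi, seconds

# PLACEHOLDER cycle counts for a made-up three-instruction loop, not real timing data.
placeholder = [(4, 4), (10, 12), (2, 2)]
cpi, seconds = cpi_and_time(placeholder, clock_hz=10e6, iterations=100)
print(cpi, seconds)
```

Carrying the minimum and maximum through separately keeps the answer a range, as the question asks.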

286 Processor

What are the CPI and expected execution time for the 286 processor running at the same clock speed? What is the speedup for this code of the 286 as compared to the 8086?

Amdahl

What percentage of the total execution time is spent in the IMUL instruction? If you could make this instruction 3x faster, how much faster would that make the whole loop? Answer for both the 8086 and 286, using Amdahl's Law to solve it in both cases.
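As a reminder of the form of Amdahl's Law, here is a minimal sketch. The fraction used in the example call is hypothetical and chosen only to show the shape of the calculation; it is not the IMUL fraction for either processor.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the original time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Hypothetical: half of the time is in the enhanced part, which becomes 3x faster.
print(amdahl_speedup(0.5, 3.0))
```

When the instruction timing is a range, the fraction is a range too, so the function would be evaluated at both ends.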

Evaluating Changes

For each processor, what speedup in the IMUL instruction would be worth reducing the clock speed from 10 MHz to 5 MHz for this code?
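One way to read "worth" is as a break-even point: the enhanced loop at the slower clock takes the same time as the original loop at the faster clock. Under that assumption, with f the fraction of cycles spent in the enhanced instruction and the clock slowed by a factor k, the required speedup s satisfies (1 - f) + f / s = 1 / k. The sketch below solves for s; the function name and the example fraction are hypothetical.

```python
import math

def break_even_speedup(fraction, clock_ratio=2.0):
    """Speedup of the enhanced part needed to offset a clock slowed by clock_ratio.

    Solves (1 - f) + f / s = 1 / clock_ratio for s.
    """
    # Share of the original cycle count left over for the enhanced part.
    budget = 1.0 / clock_ratio - (1.0 - fraction)
    if budget <= 0:
        return math.inf   # no finite speedup can compensate
    return fraction / budget

# Hypothetical fraction, for illustration only.
print(break_even_speedup(0.8))
```

Note that when the fraction is at or below 1 - 1/k the unenhanced part alone already exceeds the time budget, so the function returns infinity.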