ARM Shared Memory

The ARM Cortex-A57 uses a modification of the ESI protocol we cover in class, called the MOESI protocol. It adds "Modified" and "Owned" states that allow data to be shared directly from one core's L1 cache to another core's, without having to go through the L2 cache. If a block in the Owned state holds dirty data, that data is written back to L2 when the block becomes Invalid (whether from eviction due to a cache conflict or from a write request by another core).

Here's a summary of the states:

  Modified: this core holds the only copy of the block, and it is dirty.
  Owned: this core holds a dirty copy and is responsible for supplying it to other cores and eventually writing it back; other cores may hold Shared copies.
  Exclusive: this core holds the only copy of the block, and it is clean.
  Shared: this core holds a valid copy that other caches may also hold; memory may be stale if some core Owns the block.
  Invalid: the block holds no valid data.

Draw a state diagram for this protocol. To avoid too much state-transition spaghetti, I suggest arranging the states in a circle. Include state transitions for CR = this core's CPU requests a read, CW = this core's CPU requests a write, BR = a read from another core seen on the bus, and BW = a write from another core seen on the bus. On each transition, include what, if anything, the core sends on the bus: W = send a Write notification, R = send a Read notification, or D = send Data.
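One way to keep the transitions straight before drawing the circle is to tabulate them as (state, event) → (next state, bus action) entries. Here is a partial Python sketch covering only the Invalid state; the encoding and function names are mine, and it assumes a read miss lands in Shared (whether it lands in Shared or Exclusive depends on whether another core already holds the block). The remaining states are left for the exercise.

```python
# Partial MOESI transition table: (state, event) -> (next_state, bus_action).
# Events use the CR/CW/BR/BW labels from the question; bus actions use W/R/D,
# or None when nothing is sent.
transitions = {
    ("Invalid", "CR"): ("Shared", "R"),    # read miss: announce a Read, load the data
    ("Invalid", "CW"): ("Modified", "W"),  # write miss: announce a Write, take ownership
    ("Invalid", "BR"): ("Invalid", None),  # another core's traffic: nothing to do
    ("Invalid", "BW"): ("Invalid", None),
}

def next_state(state, event):
    """Look up the next state and bus action for a (state, event) pair."""
    return transitions[(state, event)]
```

A table like this makes it easy to check that every state has an entry for all four events before you commit the diagram to paper.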

Profiling Data

The Core i7 has roughly these cache and memory latencies:

L1 latency                       4 cycles
L2 total latency                10 cycles
L2 penalty                       6 cycles
L3 unshared latency             40 cycles
L3 shared in another core       64 cycles
L3 modified in another core     75 cycles
Remote L3                  100-300 cycles
Local DRAM                      60 ns
Remote DRAM                    100 ns

On a 2.6 GHz Core i7, I recorded the following stats for a test program. This is the same ray-tracing program used for the valgrind demo in the first couple of weeks of class, but this time recorded using a sampling profiler and the i7's hardware counters.

Cycles                  47,926,025,373
Instructions            96,604,699,195
Branches                 9,791,557,458
Mispredicted Branches       10,399,323
L1 Hits                 29,788,371,721
L1 Misses                  236,895,473
L2 Misses                   58,501,962
L3 Misses                   20,018,953

The most expensive single function in the program, Sphere::intersect(), was responsible for this subset of the totals:

Cycles                   1,463,862,906
Instructions             3,191,721,155
Branches                   322,707,628
Mispredicted Branches          290,988
L1 Hits                    994,868,818
L1 Misses                    6,764,568
L2 Misses                    1,689,718
L3 Misses                      568,123

Basic Amdahl

  1. How long does this program take to run?
  2. How much of that time is in the Sphere::intersect() function?
  3. What percentage is Sphere::intersect of the total program execution time?
  4. If Sphere::intersect() could be made 1.15x faster, what would the overall speedup be?
  5. What would the new total execution time be?
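These questions boil down to converting cycle counts to seconds and applying Amdahl's Law. A minimal Python sketch for checking your arithmetic (the function names are mine):

```python
def seconds(cycles, hz):
    """Convert a cycle count to wall-clock time at clock rate `hz`."""
    return cycles / hz

def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the run is sped up by `local_speedup`.

    Amdahl's Law: the untouched (1 - fraction) runs at the old speed,
    while the improved fraction shrinks by `local_speedup`.
    """
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)
```

For example, a function that is 25% of the runtime made 2x faster gives amdahl_speedup(0.25, 2.0), an overall speedup of about 1.14x.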

CPI

For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.

  1. What is the CPI?
  2. This processor should be able to achieve four instructions per cycle. What would the expected execution time be if we could achieve that rate?
  3. What would the speedup be of achieving 0.25 CPI vs. the actual CPI?

Branch Prediction

For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.

  1. What is the branch misprediction rate?
  2. Assuming a 17 cycle branch misprediction penalty, how many cycles are spent on branch mispredictions?
  3. What percentage of the execution time is spent on branch misprediction?
  4. What would the expected speedup and execution time be if you could eliminate the branch mispredictions?
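The misprediction rate and its cycle cost fall straight out of the counters. A quick Python sketch (the function names are mine; the 17-cycle penalty is the figure given in question 2):

```python
def mispredict_rate(mispredicted, branches):
    """Fraction of branches that were mispredicted."""
    return mispredicted / branches

def mispredict_cycles(mispredicted, penalty=17):
    """Total cycles lost to mispredictions at a fixed per-miss penalty."""
    return mispredicted * penalty
```

For question 4, subtract the misprediction cycles from the total to get the hypothetical execution time, then take the ratio for the speedup.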

Cache Misses

For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.

  1. What is the average memory access time? Since this is a single program on a single core, use the unshared L3 time and the local DRAM time.
  2. How many cycles are spent on cache misses?
  3. What percentage of the execution time is spent on cache misses?
  4. What would the expected speedup and execution time be if you could eliminate the cache misses?
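Average memory access time for a multi-level hierarchy is the L1 hit time plus each level's penalty weighted by the product of the miss rates above it. Here is a generic helper (the function name and the miss-rate/penalty framing are mine; which entries of the latency table to use as penalties, and converting the 60 ns local DRAM figure to cycles at 2.6 GHz, are for you to work out from the question):

```python
def amat(l1_time, miss_rates, penalties):
    """Average memory access time, in the same units as the inputs.

    miss_rates[i] is the local miss rate at level i+1, and penalties[i]
    is the extra cost paid when that level must be consulted, so the
    result is l1_time + m1*p2 + m1*m2*p3 + ...
    """
    t = l1_time
    scale = 1.0
    for m, p in zip(miss_rates, penalties):
        scale *= m
        t += scale * p
    return t
```

Multiplying AMAT by the number of accesses (and comparing against an ideal hierarchy where every access hits in L1) gives the cycle counts needed for questions 2 through 4.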