For this assignment, we will estimate the performance of the following linear and binary search functions over a sorted array, using the methods from this introduction to data flow graphs for scheduling and optimization.

// Key and pointer to data used for search array
struct KeyValue {
        unsigned int key;
        Data *data;
};

// sequentially search array: O(len)
Data *linearSearch(unsigned int target, KeyValue *list, unsigned int len)
{
        for (unsigned int i=0; i < len && list[i].key <= target; ++i) {	// cache hit, predictable branch
                if (list[i].key == target)							// cache hit, predictable branch
                        return list[i].data;							// cache hit
        }

        return nullptr;
}

// binary search through array: O(lg(len))
Data *binarySearch(unsigned int target, KeyValue *list, unsigned int len)
{
        unsigned int lo = 0, hi = len;
        unsigned int mid = lo + (hi - lo)/2;

        while (lo < hi && list[mid].key != target) {				// cache miss, predictable branch
                if (list[mid].key < target)						// cache hit, unpredictable branch
                        lo = mid+1;
                else
                        hi = mid;

                mid = lo + (hi - lo)/2;
        }

        if (mid < len && list[mid].key == target)				// cache hit, but outside loop; mid < len guards a target above every key
                return list[mid].data;

        return nullptr;
}

We will use the following assumptions: up to four independent instructions can be issued per cycle. Memory address computation and the memory access itself issue separately, with a dependency between them. ALU operations take one cycle; memory operations take four cycles for an L1 cache hit and sixteen cycles for an L1 cache miss. Unconditional branches take zero cycles (the branch target can issue in the same cycle as the branch). Predictable conditional branches take one cycle, and unpredictable ones take sixteen cycles.

Linear Array Search

Single iteration data flow

Using the data flow graph notation from the link above, draw the single-iteration timing for a middle iteration of the linear search. Except for the final iteration when the value is found, the branches always branch the same way, so are predictable. The memory access pattern through the list is linear, and can be easily recognized by hardware prefetch, so assume all memory accesses are L1 cache hits.

Linear search timing

What is the estimated average number of cycles for a linear search in an array of length len, assuming the keys in the list and the target to search for are all uniformly distributed? Ignore the timing differences for the first and last iterations, as well as the timing for any code outside the loop.

Binary Array Search

Single iteration data flow

Using the data flow graph notation from the link above, draw a single iteration timing for a middle iteration of the binary search. Since the binary search jumps around in the array, the first access to the list array each iteration can be assumed to be an L1 cache miss. Also, the branches for the while condition are predictable, but the conditional inside the loop is unpredictable.

Binary search timing

What is the estimated average number of cycles for a binary search in an array of length len, assuming the keys in the list and the target to search for are all uniformly distributed? Ignore the timing differences for the beginning and ending iterations, as well as the timing for any code outside the loop.

Comparison

The linear search will win for small lists, due to the cache misses and branch mispredictions of the binary search, but eventually the better asymptotic behavior will let the binary search pull ahead. Based on your results from questions 1 and 2, how long would the list need to be for the binary search to be faster? As in the previous problems, ignore the timing differences for any beginning or ending iterations when the cache or branch behavior might differ, and the timing for the extra code outside the loops. Including those would be more accurate, but the difference is likely within the margin of error of our rough assumptions anyway.

Static Scheduling

Note that this question has completely different memory latency and branching assumptions than the previous questions.

Unoptimized code

Identify the reason for each nop instruction in this linear array search, compiled for MIPS. Some of these nops are not necessary, but represent cycles that would be lost to a stall even without the explicit nop. In each of those cases, identify the register that would be responsible for the stall.

Optimized code

This optimized version has fewer nop instructions, but the compiler missed some opportunities. Rewrite the code to eliminate as many of the remaining nop instructions as you can.