  1. Terminology
    1. Physical
      1. Node: hardware unit with single memory
        • Primarily in clusters & supercomputers
      2. CPU: historically a single processor; now usually a package of multiple cores sharing cache
      3. Core: Distinct processor
      4. SIMD Lane: one SIMD processing "core"
    2. Computational
      1. Thread: distinct flow of control
        1. May have several on a core & swap between them
        2. OS thread: OS decides swaps at large granularity
        3. Hyperthread: multiple threads on one core
          1. Share at least some core resources (cache, execution units)
          2. Own copy of others (registers, program counter)
          3. Switch on an instruction-by-instruction basis
          4. Hides latency from instruction dependencies, cache misses, branch penalties, ...
        4. GPU thread: may interleave several on each SIMD lane
          1. Own registers, share other SIMD resources just like another lane
          2. Allows latency hiding
          3. Warp: NVIDIA's term for a set of threads executed together in SIMD (AMD: wavefront)
      2. Bottleneck
        1. Thing that is limiting performance
        2. Changes away from the bottleneck often have no perf impact
        3. Locate and improve bottleneck to have biggest perf impact
        4. Additional work on non-bottlenecked resources is essentially free
  2. Classifications of parallelism
    1. Flynn: SISD, MIMD, SIMD, (MISD)
    2. Task parallel vs. data parallel
      1. Data parallel
        1. see SPMD / streaming
        2. Object decomposition
        3. Spatial decomposition
        4. Finite elements/volumes
      2. Task parallel
        1. Do different tasks on different threads
        2. Pipeline
          1. Performer: App, Cull, Draw
          2. OSG: Update, Cull, Sort, Render
        3. Unrelated tasks
          1. AI, physics, graphics
        4. Producer/consumer
    3. Parallel programming styles
      1. OpenMP (compiler directives, #pragma regions in code)
      2. MPI (message passing)
      3. Spawn/Fork / Join (PRAM)
      4. Parallel job manager
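The fork/join style applied to a data-parallel sum can be sketched as follows. This is only an illustration, assuming C++11 `std::async`; the function name `parallel_sum` and the `grain` cutoff are hypothetical and not from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums v[begin, end) by recursively forking the left half as a new task
// and joining on its result. Ranges of at most `grain` elements are summed serially.
long parallel_sum(const std::vector<int>& v, std::size_t begin,
                  std::size_t end, std::size_t grain) {
    assert(begin <= end && end <= v.size() && grain > 0);
    if (end - begin <= grain)
        return std::accumulate(v.begin() + begin, v.begin() + end, 0L);

    const std::size_t mid = begin + (end - begin) / 2;
    // Fork: the left half runs on another thread.
    auto left = std::async(std::launch::async, [&v, begin, mid, grain] {
        return parallel_sum(v, begin, mid, grain);
    });
    // This thread handles the right half in the meantime.
    const long right = parallel_sum(v, mid, end, grain);
    // Join: block until the forked task finishes.
    return left.get() + right;
}
```

Calling `parallel_sum(v, 0, v.size(), 100)` splits the vector until each piece has at most 100 elements. The two halves share no written data, so no locking is needed.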
  3. Ordering
    1. Order between threads is defined only by observed data/synchronization
      1. Given x=0; T1: x=x+1; T2: x=x+2; what is x?
      2. 1, 2, or 3 (each update is a separate read and write, so one thread's write can overwrite the other's)
    2. Draw order as graph
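To see where 1, 2, and 3 come from in the x=0 example, the sketch below enumerates every interleaving of the two threads' read and write steps in a single thread. It simulates the schedules rather than running real threads, since a real unsynchronized version would be a data race. The function name `possible_results` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Returns every final value of x reachable when T1 runs x = x + 1 and
// T2 runs x = x + 2, with each statement split into a read step and a write step.
std::set<int> possible_results() {
    std::set<int> results;
    // Each entry is the thread that takes the next step. A thread's first
    // appearance is its read of x, its second appearance is its write.
    std::vector<int> schedule = {1, 1, 2, 2};
    do {
        int x = 0;
        int local[3] = {0, 0, 0};                  // per-thread copy of x
        bool has_read[3] = {false, false, false};
        for (int t : schedule) {
            assert(t == 1 || t == 2);
            if (!has_read[t]) {                    // read step
                local[t] = x;
                has_read[t] = true;
            } else {                               // write step
                x = local[t] + t;                  // T1 adds 1, T2 adds 2
            }
        }
        results.insert(x);
    } while (std::next_permutation(schedule.begin(), schedule.end()));
    return results;
}
```

The result 3 appears only in schedules where one thread finishes its write before the other reads. In every other schedule both threads read 0 and the last write wins.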
  4. Thread safety
    1. Behave correctly given multiple threads
      1. Private data
      2. Lock or manage shared access
      3. Manage work through queues
    2. Mutex: mutual exclusion
      1. One thread at a time in mutex region
      2. Block if a thread is already in region
      3. Acquire (or acquire on create), release (or release on destroy)
      4. Contention: too much time blocking for access
    3. Lock: only one thread can hold lock
      1. Lock, do something, unlock
      2. Blocks if another thread has lock
      3. Watch for deadlock!
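A minimal sketch of "acquire on create, release on destroy", assuming C++11 `std::mutex` and `std::lock_guard`. The `Counter` type and `run_counter` function are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Shared counter whose update is a mutex region.
struct Counter {
    std::mutex m;
    long value = 0;

    void add(long n) {
        std::lock_guard<std::mutex> guard(m);  // acquire on create; blocks if held
        value += n;                            // one thread at a time in here
    }                                          // release on destroy
};

// Runs num_threads threads that each add 1 to a shared counter `iterations` times.
long run_counter(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    Counter c;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&c, iterations] {
            for (int i = 0; i < iterations; ++i) c.add(1);
        });
    for (auto& th : threads) th.join();
    return c.value;
}
```

Every thread hits the same mutex on every iteration, so this is also a small example of contention: the result is correct, but most of the time is spent waiting for access.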
    4. Semaphores
      1. Wait (P)
        • block until >0, then decrement
      2. Signal (V)
        • non-blocking increment
      3. Queue: signal when work is added, wait until work is available
      4. Lock: Signal to unlock, wait to lock
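Both uses above (queue and lock) can be sketched with a small semaphore built from a mutex and condition variable. This is an illustrative sketch, not a production design; the `Semaphore` class and `produce_consume` function are hypothetical names, and C++20's `std::counting_semaphore` could replace the hand-written class.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Minimal counting semaphore.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;
public:
    explicit Semaphore(int initial) : count(initial) { assert(initial >= 0); }
    void wait() {                        // P: block until > 0, then decrement
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }
    void signal() {                      // V: increment without waiting on count
        { std::lock_guard<std::mutex> lk(m); ++count; }
        cv.notify_one();
    }
};

// Producer pushes 1..n onto a queue; consumer pops and sums them.
long produce_consume(int n) {
    std::queue<int> work;
    Semaphore items(0);   // queue use: signal for work, wait until work
    Semaphore lock(1);    // lock use: wait to lock, signal to unlock
    long sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < n; ++i) {
            items.wait();                // blocks while the queue is empty
            lock.wait();
            int v = work.front();
            work.pop();
            lock.signal();
            sum += v;                    // only the consumer writes sum
        }
    });
    for (int i = 1; i <= n; ++i) {
        lock.wait();
        work.push(i);
        lock.signal();
        items.signal();                  // announce one unit of work
    }
    consumer.join();
    return sum;
}
```

The lock semaphore starts at 1 so the first `wait` succeeds. The items semaphore starts at 0 so the consumer blocks until the producer signals.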
    5. Atomic operations
      1. Atomic ALU op (add, etc.)
      2. Test and set
        1. write, return old value
        2. Lock: while(test_and_set(x,1)==1); Unlock: x=0
      3. Compare and swap
        1. If comparison passes, write new value; return old value either way
      4. Barrier/fence
        1. Reads/writes issued before the fence must complete before proceeding
        2. CPU read/write can occur out of order
        3. Can result in code after the lock acquire completing before the lock is actually acquired
        4. Fence enforces ordering
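The test-and-set lock and the ordering concern above can be sketched together with C++11 `std::atomic`. Here `exchange` plays the role of test-and-set, and the acquire/release memory orders stand in for explicit fences. The names `SpinLock` and `spin_count` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set spinlock.
class SpinLock {
    std::atomic<int> x{0};
public:
    void lock() {
        // exchange writes 1 and returns the old value; old == 1 means another
        // thread holds the lock. Acquire ordering keeps the protected code
        // from being reordered before this point.
        while (x.exchange(1, std::memory_order_acquire) == 1) {
            std::this_thread::yield();   // be polite while spinning
        }
    }
    void unlock() {
        // Release ordering makes the protected writes visible before x reads 0.
        x.store(0, std::memory_order_release);
    }
};

// Each thread increments a plain (non-atomic) counter under the spinlock.
long spin_count(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : threads) th.join();
    return counter;
}
```

With relaxed ordering on both operations the counter update could, on some hardware, be reordered outside the locked region. The acquire/release pair is what enforces the ordering.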
  5. Problems
    1. Cache block sharing: MESI = Modified, Exclusive, Shared, Invalid
      1. Thomadakis 2011 graph
        1. Write a bunch of memory by CPU 0 (size = x axis)
        2. Read by CPU N
        3. Red = same CPU, see L1, L2, L3, memory
        4. Green, Blue, Magenta = other core
          1. L3/mem cost same as CPU0
          2. L1/L2 cost higher, near memory cost
          3. Must write back the modified line, then read
        5. Cyan = core on 2nd CPU
      2. Lessons
        1. Avoid read/write conflicts, avoid write/write conflicts
        2. This case is one write then one read; worse if the line ping-pongs between cores
        3. Group shared data by use
        4. False sharing = unrelated data on the same cache line as shared data
    2. Poor synchronization
      1. Thread has to wait on another for work
      2. Thread has to wait on another for shared resource
        1. Output buffer
        2. Common data (e.g. memory allocator)
      3. Lessons
        1. Decoupled, not just thread safe
        2. Separate output buffer per thread (e.g. Vulkan)
        3. Pre-allocate data, use pools, use delete queue
        4. Messaging helps decoupling
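Two of the lessons above, a separate output per thread and keeping it off shared cache lines, can be sketched as follows. The 64-byte line size is an assumption (common on x86, not universal), and `PaddedCount` and `count_even` are hypothetical names. C++17 is assumed so that `std::vector` honors the over-alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;   // assumed cache line size in bytes

// One counter per thread, padded so that no two counters share a cache line.
struct alignas(kCacheLine) PaddedCount {
    long value = 0;
};

// Counts even numbers; each thread scans one chunk and writes only its own slot.
long count_even(const std::vector<int>& data, int num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedCount> partial(num_threads);
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                if (data[i] % 2 == 0) ++partial[t].value;   // no shared writes
        });
    for (auto& th : threads) th.join();
    long total = 0;                       // combine once, after all threads finish
    for (const auto& p : partial) total += p.value;
    return total;
}
```

Without the `alignas`, adjacent `long` counters would sit on the same line and every increment would invalidate the other threads' copies, even though no thread reads another's counter.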
  6. Lock-free/wait-free data structures
    1. Avoid mutex over large regions
    2. Data structure works with atomic operations
    3. Wait free: stronger constraint; every thread finishes in a bounded number of steps
    4. Hash example
      1. Non-zero integer keys, non-zero integer values
      2. Insert, search, no erase; fixed size table
      3. Open addressing (place collisions elsewhere in table)
      4. Linear probing
        • More cache friendly than quadratic or multiple hash
      5. Cache performance
        1. Multiple overlapping reads OK
        2. Writes/reads of different keys likely in different cache lines
      6. Insert
        1. Linear probe using atomic compare (w/ 0) and swap (w/ key)
        2. If return == 0, successfully inserted key
        3. If return == key, found existing entry
        4. Otherwise slot holds a different key, keep probing
        5. Once found, update value
      7. Search
        1. Linear probe, reading the key in each slot
        2. If slot key == 0, done, not found
        3. If slot key is a different key, keep looking
        4. If slot key == key and value == 0, not found
          1. Incomplete update in progress
          2. Consistent with probe prior to insert
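The insert and search steps above can be sketched with C++11 atomics. This is one possible reading of the outline, not the lecture's code: the class name `LockFreeHash`, the table size, and the multiplicative hash are all assumptions made for illustration. `compare_exchange_strong` leaves the slot's previous contents in `old`, which matches the "return" value the outline refers to.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-size, insert/search-only table. Key 0 marks an empty slot and
// value 0 marks "no value yet", so both keys and values must be non-zero.
class LockFreeHash {
    static constexpr std::size_t kSize = 1024;   // assumed capacity, power of two
    std::atomic<std::uint32_t> keys[kSize];
    std::atomic<std::uint32_t> values[kSize];

    static std::size_t slot(std::uint32_t key) { // assumed hash function
        return (key * 2654435761u) & (kSize - 1);
    }
public:
    LockFreeHash() {
        for (std::size_t i = 0; i < kSize; ++i) {
            keys[i].store(0);
            values[i].store(0);
        }
    }

    // Returns false only if the table is full.
    bool insert(std::uint32_t key, std::uint32_t value) {
        assert(key != 0 && value != 0);
        std::size_t i = slot(key);
        for (std::size_t n = 0; n < kSize; ++n, i = (i + 1) & (kSize - 1)) {
            std::uint32_t old = 0;
            // Compare with 0, swap in key. On failure, old receives the stored key.
            keys[i].compare_exchange_strong(old, key);
            if (old == 0 || old == key) {        // inserted, or found existing entry
                values[i].store(value);          // once found, update value
                return true;
            }
            // Slot holds a different key: keep probing.
        }
        return false;
    }

    // Returns the value, or 0 if not found.
    std::uint32_t search(std::uint32_t key) const {
        assert(key != 0);
        std::size_t i = slot(key);
        for (std::size_t n = 0; n < kSize; ++n, i = (i + 1) & (kSize - 1)) {
            const std::uint32_t k = keys[i].load();
            if (k == 0) return 0;                // empty slot: done, not found
            if (k == key) return values[i].load();  // 0 if insert is still in progress
        }
        return 0;
    }
};
```

A search that sees the key but a zero value reports "not found", which is the same answer it would have given had it probed just before the insert started.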
    5. See also: lock free queue
    6. Warning: Lock free is hard to get right!
      1. Example: GPU lock
        while(atomicCAS(lock,0,id) != 0) {}  // wait to acquire lock
        // ... do single thread code ...
        atomicExch(lock,0);                  // release lock
        1. SIMD GPU keeps executing the loop until all threads in the warp exit it
        2. One thread gets lock, but doesn't continue because others are still in loop
        3. Lock is never released
      2. If this class is your only background, use existing lock-free structures rather than writing your own
      3. Take a parallel programming class
      4. Look at lots of lock-free data structures
  7. Civ V example
    1. Demo: 25:25
    2. Mutex: 36:36