Concurrent Computing

General

mutex = mutual exclusion = a mechanism that prevents the simultaneous use of a common resource, such as a global variable, by pieces of code called critical sections.
critical section = a piece of code that accesses a shared resource (data structure or device) that must not be accessed concurrently by more than one thread of execution.
semaphore = protected variable or abstract data type; the classic method for restricting access to shared resources, such as shared memory, in a parallel programming environment.
atomic instructions = instructions such as "test-and-set", "fetch-and-add" or "compare-and-swap". They allow a single process to test if the lock is free and, if free, acquire the lock in a single atomic operation.
OpenMP = Open Multi-Processing = an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms.
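
A minimal sketch of the first few definitions, in Python's standard threading module (the language and all names here, such as run_counter and run_limited, are choices made for illustration, not part of these notes). It shows a mutex around a critical section, a semaphore that caps how many threads use a resource at once, and a non-blocking acquire as a library-level analog of "test if free and acquire in one step". It is not the hardware instruction itself.

```python
import threading
import time

# mutex: only one thread at a time may run the critical section
counter = 0                        # shared resource (a global variable)
counter_lock = threading.Lock()    # the mutex guarding it

def add_many(n):
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with counter_lock:         # critical section starts here
            counter += 1           # read-modify-write on shared data

def run_counter(num_threads=4, n=1000):
    """Reset the counter, run the workers, return the final value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

# semaphore: at most `limit` threads may hold the resource at once
def run_limited(num_threads=6, limit=2):
    """Return the peak number of threads that held the resource together."""
    slots = threading.Semaphore(limit)
    state_lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def worker():
        with slots:                # blocks while `limit` threads are inside
            with state_lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)       # pretend to use the shared resource
            with state_lock:
                state["active"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]

# test-and-acquire in one step: a non-blocking acquire either takes the lock or fails
def try_acquire_twice():
    lock = threading.Lock()
    first = lock.acquire(blocking=False)    # lock was free: acquired
    second = lock.acquire(blocking=False)   # lock already held: fails at once
    lock.release()
    return first, second
```

With the lock in place, run_counter(4, 1000) always ends at 4000. Without it, the read-modify-write in add_many could interleave between threads and lose updates.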

lock - wikipedia article
granularity = a measure of the amount of data the lock is protecting. In general, choosing a coarse granularity (a small number of locks, each protecting a large segment of data) results in less lock overhead when a single process is accessing the protected data, but worse performance when multiple processes are running concurrently. This is because of increased lock contention. The more coarse the lock, the higher the likelihood that the lock will stop an unrelated process from proceeding. Conversely, using a fine granularity (a larger number of locks, each protecting a fairly small amount of data) increases the overhead of the locks themselves but reduces lock contention. More locks also increase the risk of deadlock.
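
The two ends of the granularity tradeoff can be sketched as follows, assuming a simple array of counters as the protected data (the class names CoarseCounters and FineCounters and the helper hammer are hypothetical). The coarse version uses one lock for everything; the fine version uses one lock per element. In CPython the global interpreter lock hides most of the speed difference, so this only illustrates the structure, not the performance.

```python
import threading

class CoarseCounters:
    """One lock protects every counter (coarse granularity)."""
    def __init__(self, n):
        self.values = [0] * n
        self.lock = threading.Lock()

    def increment(self, i):
        with self.lock:            # also blocks threads touching other indices
            self.values[i] += 1

class FineCounters:
    """One lock per counter (fine granularity)."""
    def __init__(self, n):
        self.values = [0] * n
        self.locks = [threading.Lock() for _ in range(n)]

    def increment(self, i):
        with self.locks[i]:        # contends only with threads using index i
            self.values[i] += 1

def hammer(counters, num_threads=4, per_thread=1000):
    """Have several threads increment the counters in rotating order."""
    n_slots = len(counters.values)

    def worker(tid):
        for k in range(per_thread):
            counters.increment((tid + k) % n_slots)

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters.values
```

Both versions give the same totals. Each increment here takes only one lock, so the fine-grained version cannot deadlock; the deadlock risk mentioned above appears once an operation needs to hold several locks at the same time.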

make code good before parallelizing
start with high-level
beware tradeoffs: execution speed vs development time, hardware and software dependencies

load the data once, all the threads work on same shared data
OpenMP
- same code, just use pragma
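- Fortran: !$omp parallel do reduction( +:sum) &
- C equivalent: #pragma omp parallel for reduction(+:sum)
- compiler will just ignore the pragmas if OpenMP is not available or not enabled, so the same code still builds as a serial program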
- gcc simple.c -std=c99 -fopenmp
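
What a sum reduction does can be sketched outside OpenMP: each worker accumulates into its own private total over a slice of the shared, read-only data, and the partial totals are added at the end. The Python below is only an analog of that pattern, not OpenMP; parallel_sum and its parameters are hypothetical names. Threads in CPython do not speed up this kind of pure-Python loop, so treat it as a picture of the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, num_workers=4):
    """Sum `data` by giving each worker a slice and adding the partial sums."""
    # ceiling division, at least 1 so the range step below is never zero
    chunk = max(1, (len(data) + num_workers - 1) // num_workers)
    ranges = [(start, min(start + chunk, len(data)))
              for start in range(0, len(data), chunk)]

    def partial(bounds):
        lo, hi = bounds
        total = 0                  # private accumulator, one per task
        for i in range(lo, hi):
            total += data[i]       # read-only access to the shared data
        return total

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(partial, ranges))   # combine the partial sums
```

The data is loaded once and shared by all workers; only the accumulators are private, which is why no lock is needed.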
R
- mclapply = multicore
- lapply vs library(parallel); mclapply(input_vector, FUN = function(x) {})
- basic optimization before you jump to parallelization
- avoid needless copy
- prefer apply over for loops
C++
- tbb::parallel_for( )
- class std::vector is not thread safe. OK if threads operate on non-overlapping indices and nobody resizes it. Otherwise use tbb::concurrent_vector.
- C++11 thread library (std::thread)

CUDA = Compute Unified Device Architecture
- new language construct <<< >>> (kernel launch, e.g. add<<< >>>) that the preprocessor converts to low-level calls; memory is allocated with cudaMallocManaged
R
- library(gpuR), detectGPUs(), vclMatrix
- transfer result from GPU back to regular RAM: as.matrix(result)

spark start -t 120 3   # launch cluster: 3 nodes, 120 minutes
spark list -d          # obtain details

txt.map( lambda l: len(l) ).reduce ...
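
The map-then-reduce shape of that line can be tried locally without a cluster. This sketch uses plain Python's map and functools.reduce on a list, not the Spark API. The reduce step is elided in the note, so the addition used here is an assumption, and total_length is a hypothetical name.

```python
from functools import reduce

def total_length(lines):
    """Map each line to its length, then reduce the lengths to one number."""
    lengths = map(lambda l: len(l), lines)          # same lambda as in the note
    return reduce(lambda a, b: a + b, lengths, 0)   # assumed reducer: addition
```

For example, total_length(["ab", "cde", ""]) maps the lines to 2, 3, 0 and reduces them to 5.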