Concurrent Computing


General

  • mutex = mutual exclusion = a mechanism that prevents the simultaneous use of a common resource, such as a global variable, by pieces of code called critical sections (see the C sketch after this list).
  • critical section = a piece of code that accesses a shared resource (data structure or device) and must not be executed concurrently by more than one thread of execution.
  • semaphore = protected variable or abstract data type which constitutes the classic method for restricting access to shared resources such as shared memory in a parallel programming environment.
  • Atomic instructions such as "test-and-set", "fetch-and-add" or "compare-and-swap". These instructions allow a single process to test if the lock is free, and if free, acquire the lock in a single atomic operation.
  • OpenMP = Open Multi-Processing - an application programming interface (API) that supports multi-platform shared-memory multiprocessing in C, C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms.
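
A minimal sketch of a mutex guarding a critical section, assuming POSIX threads (compile with gcc mutex_demo.c -pthread); the counter variable and worker function are illustrative names:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                              /* shared resource (global variable) */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);            /* enter critical section */
        counter++;                                    /* only one thread at a time runs this */
        pthread_mutex_unlock(&counter_lock);          /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);               /* 200000 with the lock; unpredictable without it */
    return 0;
}

Without the lock the two increments race and updates are lost; a semaphore initialized to 1 would serve the same purpose here.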

Locks

  • lock - wikipedia article
  • granularity = a measure of the amount of data the lock is protecting. In general, choosing a coarse granularity (a small number of locks, each protecting a large segment of data) results in less lock overhead when a single process is accessing the protected data, but worse performance when multiple processes are running concurrently, because of increased lock contention. The coarser the lock, the higher the likelihood that it will stop an unrelated process from proceeding. Conversely, a fine granularity (a larger number of locks, each protecting a fairly small amount of data) increases the overhead of the locks themselves but reduces lock contention. More locks also increase the risk of deadlock (see the C sketch below).
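
A sketch of the tradeoff in C, using a hypothetical 64-bucket hash table; the struct and field names are illustrative:

#include <pthread.h>

#define NBUCKETS 64

struct node { int key; struct node *next; };

struct coarse_table {                       /* coarse granularity: one lock for everything */
    pthread_mutex_t lock;                   /* cheap to manage, but any two threads contend, */
    struct node *buckets[NBUCKETS];         /* even when they touch unrelated buckets */
};

struct fine_table {                         /* fine granularity: one lock per bucket */
    pthread_mutex_t locks[NBUCKETS];        /* more locks to manage, but threads working on */
    struct node *buckets[NBUCKETS];         /* different buckets never block each other */
};

The deadlock risk comes from holding several fine-grained locks at once: two threads that take the same pair of bucket locks in opposite orders can wait on each other forever.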

Aug 2 2018 presentation

  • make the code good before parallelizing it
  • start at a high level
  • beware of tradeoffs: execution speed vs. development time, hardware and software dependencies

Threading

  • load the data once; all the threads work on the same shared data
  • OpenMP (see the C sketch after this list)
    • same serial code, just annotated with directives
    • Fortran: !$omp parallel do reduction( +:sum ) &; the C/C++ equivalent is #pragma omp parallel for reduction(+:sum)
    • the compiler simply ignores the directives if OpenMP is not enabled, so the same source still builds serially
    • gcc simple.c -std=c99 -fopenmp
  • R (see the R sketch after this list)
    • mclapply = the multicore version of lapply, from the parallel package
    • lapply( X, FUN ) vs library(parallel); mclapply( X, FUN, mc.cores = n )
    • do basic optimization before you jump to parallelization
    • avoid needless copies
    • prefer the apply family over for loops
  • C++ (see the TBB sketch after this list)
    • tbb::parallel_for( )
    • std::vector is not thread safe; it is OK if threads operate on non-overlapping indices, otherwise use tbb::concurrent_vector
    • C++11 <thread> library
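
A minimal C sketch of the OpenMP reduction named above, matching the compile line gcc simple.c -std=c99 -fopenmp; the loop body is illustrative:

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    /* C/C++ spelling of the Fortran directive above: each thread keeps a
       private partial sum and OpenMP combines them when the loop finishes */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;
    printf("sum = %f\n", sum);
    return 0;
}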
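
A minimal R sketch of the lapply-to-mclapply swap, assuming only the parallel package that ships with R; slow_square is an illustrative work function:

library(parallel)

slow_square <- function(i) { Sys.sleep(0.1); i^2 }        # stand-in for real work

res_serial   <- lapply(1:8, slow_square)                  # one core
res_parallel <- mclapply(1:8, slow_square, mc.cores = 4)  # forked workers

mclapply forks the R process, so the workers read the same in-memory data without copying it; this fork-based backend is unavailable on Windows.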
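
A minimal C++ sketch of tbb::parallel_for, assuming TBB is installed (compile with g++ demo.cpp -std=c++11 -ltbb); because each iteration writes a distinct index, plain std::vector is safe here:

#include <cstddef>
#include <iostream>
#include <vector>
#include <tbb/parallel_for.h>

int main()
{
    std::vector<double> v(1000000, 1.0);
    // non-overlapping indices: each iteration touches only v[i]
    tbb::parallel_for(std::size_t(0), v.size(), [&](std::size_t i) {
        v[i] *= 2.0;
    });
    std::cout << v[0] << '\n';
    return 0;
}

If threads instead appended elements concurrently, tbb::concurrent_vector and its thread-safe push_back would be the right container.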

GPU

  • CUDA = Compute Unified Device Architecture (see the kernel sketch after this list)
    • adds language constructs that the nvcc compiler lowers to low-level runtime calls: cudaMallocManaged for unified memory, and the <<< >>> kernel-launch syntax
  • R (see the gpuR sketch after this list)
    • library(gpuR), detectGPUs(), vclMatrix
    • transfer a GPU result back to regular RAM with as.matrix(result)
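
A minimal CUDA sketch showing both constructs named above, cudaMallocManaged and the <<< >>> launch syntax (compile with nvcc add.cu); the add kernel is illustrative:

#include <cstdio>

__global__ void add(int n, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per element */
    if (i < n) y[i] = x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));        /* unified memory: visible to CPU and GPU */
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    add<<<(n + 255) / 256, 256>>>(n, x, y);          /* <<<blocks, threads per block>>> */
    cudaDeviceSynchronize();                         /* wait for the GPU before reading */
    printf("y[0] = %f\n", y[0]);                     /* 3.0 */
    cudaFree(x);
    cudaFree(y);
    return 0;
}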
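
A minimal gpuR sketch tying the calls above together, assuming a working OpenCL setup; the matrix size is illustrative:

library(gpuR)

detectGPUs()                                          # how many GPUs are visible
A <- vclMatrix(rnorm(1e6), nrow = 1000, ncol = 1000)  # allocate directly in GPU memory
B <- A %*% A                                          # multiplication runs on the GPU
result <- as.matrix(B)                                # copy the result back to regular RAM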

Apache Spark

  • data is so large it doesn't fit on a single computer
  • mapreduce = the data sits immutable while a map step transforms records and a reduce step aggregates them, e.g. computing a sum (see the PySpark sketch below)
  • start spark cluster
  • module load spark

spark start -t 120 3   # launch a cluster of 3 nodes for 120 minutes
spark list -d          # obtain details

  • pyspark --master spark://cn0600:7077 --executor-memory=40g
  • txt = spark.sparkContext.textFile( "test.bed" )
  • txt.filter( lambda l: l.startswith('chr7'))

txt.map( lambda l: len(l) ).reduce( lambda a, b: a + b )   # e.g. total characters across all lines
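
A slightly fuller PySpark sketch of the map/reduce pattern above, assuming the same hypothetical test.bed file with the chromosome name in the first tab-separated column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bed-demo").getOrCreate()
lines = spark.sparkContext.textFile("test.bed")

# map each line to (chromosome, 1), then reduce by key to count lines per chromosome
counts = (lines.map(lambda l: (l.split('\t')[0], 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())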

Slurm