R

RStudio Server Admin

RStudio

PDF

Installing packages

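sudo R    # run R as root so packages land in the system-wide library
install.packages('mclust')
# biocLite() has been retired by Bioconductor; BiocManager is its replacement
install.packages("BiocManager")
BiocManager::install("Biobase")
source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()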

General

library() - show all installed packages
message() - preferred over print() and cat() for status output: it writes to stderr and can be silenced with suppressMessages()
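
A minimal sketch of why message() is the better choice for status output. The function f() is hypothetical and exists only for illustration.

```r
# f() is a made-up helper: it reports progress, then returns a value
f <- function(x) {
  message("processing ", length(x), " values")  # goes to stderr, not stdout
  sum(x)
}

total <- f(1:10)                    # prints the status line
total <- suppressMessages(f(1:10))  # same result, status line silenced
```

Output from cat() or print() inside f() could not be switched off this way by the caller.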

Jupyter Notebook Tips

Don't call dev.new(): it opens a separate graphics device, so your plots won't show up inline in the notebook

Links

Seven Ways To Plot Dendrograms in R

Functions

distance = dissimilarity
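
As a small illustration, assuming Euclidean distance and three made-up points, dist() returns the pairwise dissimilarities that clustering functions such as hclust() take as input:

```r
# three illustrative points in the plane (made-up data)
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d <- dist(pts)          # Euclidean distance by default
dm <- as.matrix(d)      # full symmetric matrix, easier to index
dm["a", "b"]            # distance between a and b

hc <- hclust(d)         # hierarchical clustering works on the dissimilarities
```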

ggplot

piechart:

library(tidyverse)  # attaches ggplot2, dplyr and tidyr
library(readxl)

# read the spreadsheet
var <- read_excel("GENES.xlsx")

# count rows per biotype, ignoring rows with a missing biotype
stats <- var %>% select(`Gene Biotype`) %>% drop_na() %>%
  group_by(`Gene Biotype`) %>% summarize(count = n())

names(stats) <- c("name", "count")

# a pie chart is a single stacked bar drawn in polar coordinates
# (the palette has 16 colours, so it covers at most 16 biotypes)
ggplot(stats, aes(x = "", y = count, fill = factor(name))) +
  geom_col(width = 1) +
  scale_fill_manual(values = c('red', 'blue', 'green', 'darkorchid', 'sienna4', 'thistle', 'gray0',
                               'khaki4', 'seagreen', 'midnightblue', 'azure4', 'cornflowerblue',
                               'olivedrab', 'lightgreen', 'purple4', 'turquoise')) +
  coord_polar(theta = "y", start = 0)

Data Types

parallel package

R-core package (bundled with R); replaces the third-party multicore and snow packages

references

coarse-grained parallelization

large chunks of computations in parallel
chunks of computation are unrelated, do not need to communicate in any way
great for bootstrapping
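
A minimal bootstrapping sketch under stated assumptions: the data are simulated, the worker count of 2 and the 1000 replicates are arbitrary, and boot_mean() is a hypothetical helper. Each replicate is independent of the others, which is what makes this coarse-grained.

```r
library(parallel)

set.seed(1)
x <- rnorm(200, mean = 5)   # simulated data, for illustration only

# one bootstrap replicate: resample with replacement, return the mean
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

cl <- makeCluster(2)                  # 2 workers is an arbitrary choice
clusterSetRNGStream(cl, iseed = 42)   # independent, reproducible RNG streams
reps <- unlist(parLapply(cl, 1:1000, boot_mean, data = x))
stopCluster(cl)

boot_se <- sd(reps)   # bootstrap estimate of the standard error of the mean
```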

load-balancing paradigm

start up M worker processes
allow worker processes access to data
split task into N tasks (chunks)
send the first M tasks to the M workers
- implementation detail: tasks are sent via serialization, so what can be sent is not unlimited
when a worker finishes a task, send it another one
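
The steps above can be sketched with clusterApplyLB(), the load-balancing variant of clusterApply() in the parallel package. The task() function and the durations are made up so that one task is much slower than the rest.

```r
library(parallel)

# hypothetical task: sleep for the given time, then report the worker's process id
task <- function(secs) {
  Sys.sleep(secs)
  Sys.getpid()
}

durations <- c(0.4, 0.05, 0.05, 0.05, 0.05, 0.05)  # N = 6 tasks, one slow

cl <- makeCluster(2)                                # M = 2 workers
pids <- unlist(clusterApplyLB(cl, durations, task)) # next task goes to whichever worker is free
stopCluster(cl)

table(pids)  # how many tasks each worker ended up running
```

With load balancing, the worker stuck on the slow task usually ends up with fewer tasks than the other one.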

Worker paradigms

"Cluster" - SAME MACHINE, start new processes ("snow" style)
"fork" - SAME MACHINE, POSIX forks, copy on write, theoretically cheaper, not available on Windows
MPI - MULTIPLE MACHINES

CPUs/Cores

physical CPU has 2 or more cores that run independently
ergo concept of "logical CPU"
considerations
- we know how many there are, but how many are available TO YOU?
- hopefully the chunk of code that you want to run in parallel does not itself run on multiple cores
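
A short sketch of querying the core count with detectCores(). Leaving one core free is a common convention rather than a rule, and n_workers is just an illustrative name.

```r
library(parallel)

n_logical  <- detectCores()                 # logical CPUs
n_physical <- detectCores(logical = FALSE)  # physical cores; can be NA on some platforms

# detectCores() reports what the machine has, not what you are allowed to use,
# so treat it as an upper bound; here one core is left free
n_workers <- max(1, n_logical - 1, na.rm = TRUE)
```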

apply functions

mclapply() - sets up an ephemeral group of forked workers for this one computation
makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stopCluster() when done
parRapply() / parCapply() - per-row / per-column apply for a matrix
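
The three styles side by side, as a sketch assuming 2 workers and toy inputs. square() is a hypothetical function used only for illustration.

```r
library(parallel)

square <- function(x) x^2

# mclapply(): workers exist only for this call (forking, so 1 core on Windows)
cores <- if (.Platform$OS.type == "windows") 1 else 2
res_fork <- unlist(mclapply(1:8, square, mc.cores = cores))

# makeCluster() + parLapply() + stopCluster(): a pool reused across calls
cl <- makeCluster(2)
res_pool <- unlist(parLapply(cl, 1:8, square))

# per-row and per-column apply on a matrix, using the same pool
m <- matrix(1:6, nrow = 2)
row_sums <- parRapply(cl, m, sum)
col_sums <- parCapply(cl, m, sum)
stopCluster(cl)
```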

"Cluster", i.e., multiprocessing API

makeCluster(n_cores, type = "PSOCK" or "FORK") - calls down to one of two functions
- makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
- makeForkCluster() - not available on Windows - workers automatically inherit the environment of the parent session
stdout and stderr are discarded by default, unless logged by outfile option
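
A sketch of the practical consequence for PSOCK clusters: the workers are fresh R sessions, so variables from the parent have to be exported explicitly, whereas fork workers would inherit them. The variable offset and the log file location are illustrative choices.

```r
library(parallel)

offset <- 100                          # exists only in the parent session
log_file <- tempfile(fileext = ".log") # worker stdout/stderr is written here

cl <- makeCluster(2, type = "PSOCK", outfile = log_file)

# copy `offset` to every worker; without this the call below fails on the workers
clusterExport(cl, "offset", envir = environment())

res <- unlist(parLapply(cl, 1:3, function(i) {
  message("worker ", Sys.getpid(), " got task ", i)  # ends up in log_file
  i + offset
}))
stopCluster(cl)
```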

library doParallel

registerDoParallel(myCluster)
stopCluster(myCluster)
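
A minimal end-to-end sketch, assuming the doParallel package is installed and using 2 workers. The registered cluster becomes the backend for foreach's %dopar% operator.

```r
library(doParallel)   # also attaches foreach and parallel

myCluster <- makeCluster(2)
registerDoParallel(myCluster)

# each iteration runs on a worker; .combine = c collects the results into a vector
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(myCluster)
```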
