R
RStudio Server Admin
File:R-datatypes-and-syntax.pdf
Installing packages
sudo R
install.packages('mclust')
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")
source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()
- note: biocLite() is the legacy Bioconductor installer; on current R use BiocManager::install() instead
General
library()
- show all installed packages
- message() is superior to print() and cat() for diagnostics
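Why message() beats print()/cat() for diagnostics: it writes to stderr and can be silenced without losing the result. A small sketch (noisy() is a hypothetical example function):

```r
# message() writes to stderr and can be silenced with suppressMessages(),
# unlike print()/cat(), which write to stdout
noisy <- function(x) {
  message("processing ", x)   # diagnostic only, goes to stderr
  x * 2
}

r <- suppressMessages(noisy(21))  # diagnostic suppressed, result kept
print(r)                          # prints [1] 42
```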
Jupyter Notebook Tips
- Don't use dev.new() or your plots won't show up
Links
Functions
- distance = dissimilarity
ggplot
- piechart:
library(tidyverse)
library(ggplot2)
library(readxl)

var <- read_excel("GENES.xlsx")
stats <- var %>%
  select(`Gene Biotype`) %>%
  drop_na() %>%
  group_by(`Gene Biotype`) %>%
  summarize(count = n())
names(stats) <- c("name", "count")

ggplot(stats, aes(x = "", y = count, fill = factor(name))) +
  geom_col(width = 1) +
  scale_fill_manual(values = c('red', 'blue', 'green', 'darkorchid', 'sienna4',
                               'thistle', 'gray0', 'khaki4', 'seagreen', 'midnightblue',
                               'azure4', 'cornflowerblue', 'olivedrab', 'lightgreen',
                               'purple4', 'turquoise')) +
  coord_polar(theta = "y", start = 0)
Data Types
parallel package
- R-core package, replaces 3rd party multicore and snow
references
- package manual
- HPC and Parallel Computing in R
- Sept 2017 parallel foreach guide
- http://pablobarbera.com/POIR613/code/06-parallel-computing.html (ditto)
coarse-grained parallelization
- large chunks of computation run in parallel
- chunks of computation are independent and do not need to communicate in any way
- great for bootstrapping
load-balancing paradigm
- start up M worker processes
- allow worker processes access to data
- split task into N tasks (chunks)
- send the first M tasks to the M workers
- implementation detail - tasks and data are shipped to workers via serialization, so payload size is not unlimited
- when a worker finishes a task send it another one
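The M-workers/N-tasks scheme above is what parallel's load-balanced apply implements; a minimal local sketch, assuming a toy squaring task:

```r
library(parallel)

cl <- makeCluster(2)             # M = 2 worker processes
# clusterExport(cl, "mydata")    # give workers access to shared data if needed

tasks <- 1:8                     # N = 8 chunks of work
# parLapplyLB() sends one task per worker, then feeds each worker
# a new task as soon as it finishes its previous one
res <- parLapplyLB(cl, tasks, function(i) i^2)

stopCluster(cl)
unlist(res)   # 1 4 9 16 25 36 49 64
```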
Worker paradigms
- "Cluster" - SAME MACHINE, start new processes ("snow" style)
- "fork" - SAME MACHINE, POSIX forks, copy on write, theoretically cheaper, not avail on windows
- MPI - MULTIPLE MACHINES
CPUs/Cores
- physical CPU has 2 or more cores that run independently
- ergo concept of "logical CPU"
- considerations
- we know how many there are, but how many are available TO YOU?
- hopefully the chunk of code you want to run in parallel does not, itself, use multiple cores
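Counting cores is built into the parallel package; note it reports what the machine has, not what a scheduler has allocated to you:

```r
library(parallel)

n_logical  <- detectCores(logical = TRUE)   # logical CPUs (includes hyperthreads)
n_physical <- detectCores(logical = FALSE)  # physical cores (NA on some platforms)

# detectCores() reports the machine's capacity, not what a batch
# scheduler (e.g. Slurm) has actually granted your job - check job limits too
n_logical
```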
apply functions
- mclapply() - forks an ephemeral set of worker processes for this one computation
- makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stopCluster() when done
- parRapply()/parCapply() for per-row/per-column apply on a matrix
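mclapply() fits the bootstrapping use case mentioned above, since every replicate is independent. A sketch (fork-based, so POSIX only; on Windows use mc.cores = 1 or a PSOCK cluster):

```r
library(parallel)

# coarse-grained bootstrap: each replicate is independent, so replicates
# can run in parallel with no communication between workers
set.seed(1)
x <- rnorm(1000)
boot_mean <- function(i) mean(sample(x, replace = TRUE))

# mclapply() forks ephemeral workers just for this call
reps <- mclapply(1:200, boot_mean, mc.cores = 2)
quantile(unlist(reps), c(0.025, 0.975))   # bootstrap CI for the mean
```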
"Cluster", i.e., multiprocessing API
- makeCluster( n_cores, type="PSOCK|FORK") - calls down to one of two subfunctions
- makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
- makeForkCluster() - not available on windows - workers automatically inherit environment of parent session
- stdout and stderr are discarded by default, unless logged by outfile option
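A full PSOCK life cycle tying the pieces above together; because PSOCK workers start as fresh R sessions, data must be shipped to them explicitly:

```r
library(parallel)

# capture worker stdout/stderr via outfile (discarded by default)
cl <- makeCluster(2, type = "PSOCK", outfile = "workers.log")

base <- 10
clusterExport(cl, "base")                   # copy `base` into each worker's environment
out <- parLapply(cl, 1:4, function(i) i + base)

stopCluster(cl)
unlist(out)   # 11 12 13 14
```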
library doParallel
- registerDoParallel(myCluster)
- stopCluster(myCluster)
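Putting the two calls together - registerDoParallel() makes the cluster the backend for foreach's %dopar% operator (assumes the doParallel and foreach packages are installed):

```r
library(doParallel)   # also loads foreach

myCluster <- makeCluster(2)
registerDoParallel(myCluster)   # make the cluster the %dopar% backend

# iterations run on the registered workers; .combine = c flattens the result
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(myCluster)
squares   # 1 4 9 16
```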