R

From Colettapedia

RStudio Server Admin

PDF

File:R-datatypes-and-syntax.pdf

Installing packages

sudo R   # launch R as root so packages install into the system-wide library
install.packages('mclust')
# Bioconductor, via the legacy biocLite installer
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")
# OOMPA suite from MD Anderson
source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()

General

  • library() - with no arguments, lists all installed packages
  • message() is preferable to print() and cat() for diagnostics: it writes to stderr and can be silenced with suppressMessages()
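A minimal sketch of why message() beats print()/cat() for diagnostics (the function f() is an illustrative example, not from the original notes):

```r
f <- function(x) {
  message("processing ", x)  # goes to stderr, not stdout
  x * 2
}

f(3)                    # prints the diagnostic, returns 6
suppressMessages(f(3))  # same result, diagnostic silenced
```

Because message() targets stderr, its output also stays out of captured results (e.g., sink() or capture.output() on stdout).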

Jupyter Notebook Tips

  • Don't use dev.new() or your plots won't show up

Links

Functions

  • distance = dissimilarity (e.g., dist() computes a pairwise distance/dissimilarity matrix)

ggplot

  • piechart - a pie chart in ggplot2 is a stacked bar chart drawn in polar coordinates:
library(tidyverse)  # attaches ggplot2, dplyr, tidyr, ...
library(readxl)     # read_excel() is not attached by library(tidyverse)

var <- read_excel("GENES.xlsx")

stats <- var %>%
  select(`Gene Biotype`) %>%
  drop_na() %>%
  group_by(`Gene Biotype`) %>%
  summarize(count = n())

names(stats) <- c("name", "count")

ggplot(stats, aes(x = "", y = count, fill = factor(name))) +
  geom_col(width = 1) +
  scale_fill_manual(values = c('red', 'blue', 'green', 'darkorchid', 'sienna4', 'thistle', 'gray0',
                               'khaki4', 'seagreen', 'midnightblue', 'azure4', 'cornflowerblue',
                               'olivedrab', 'lightgreen', 'purple4', 'turquoise')) +
  coord_polar(theta = "y", start = 0)

Data Types

parallel package

  • R-core package; replaces the 3rd-party multicore and snow packages

references

coarse-grained parallelization

  • large chunks of computations in parallel
  • chunks of computation are unrelated, do not need to communicate in any way
  • great for bootstrapping
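The bootstrap point above can be sketched with mclapply() from the parallel package (the data x and replicate count are illustrative; mclapply() forks, so on Windows mc.cores must be 1):

```r
library(parallel)

set.seed(1)
x <- rnorm(1000)

# 200 independent bootstrap replicates of the mean -- each replicate is
# unrelated to the others, so this is an ideal coarse-grained workload
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L  # forking is unix-only
boot_means <- mclapply(1:200,
                       function(i) mean(sample(x, replace = TRUE)),
                       mc.cores = n_cores)
```

No replicate ever needs to talk to another, which is exactly the "unrelated chunks" property the bullets describe.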

load-balancing paradigm

  • start up M worker processes
  • allow worker processes access to data
  • split task into N tasks (chunks)
  • send the first M tasks to the M workers
    • Implementation detail - tasks and results are shipped to workers via serialization, so object size is not unlimited
  • when a worker finishes a task send it another one
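The steps above are what the load-balanced variants in the parallel package implement; a minimal sketch with parLapplyLB() (the task function is illustrative):

```r
library(parallel)

cl <- makeCluster(2)   # M = 2 worker processes
tasks <- 1:8           # N = 8 task chunks

# parLapplyLB() hands each worker a new task as soon as it finishes its
# current one, rather than pre-splitting the tasks evenly up front
res <- parLapplyLB(cl, tasks, function(i) i * i)
stopCluster(cl)

unlist(res)  # 1 4 9 16 25 36 49 64
```

Results come back in input order even though tasks may finish out of order; load balancing pays off most when task runtimes vary widely.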

Worker paradigms

  • "Cluster" - SAME MACHINE, starts new processes ("snow" style)
  • "fork" - SAME MACHINE, POSIX forks, copy-on-write, theoretically cheaper, not available on Windows
  • MPI - MULTIPLE MACHINES

CPUs/Cores

  • a physical CPU has 2 or more cores that run independently
  • hence the concept of a "logical CPU"
  • considerations
    • we know how many there are, but how many are available TO YOU?
    • hopefully the chunk of code you want to run in parallel does not itself use multiple cores
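The counting question can be probed with detectCores(); note it reports the machine's total, not what a batch scheduler or container has actually allotted you (the third-party parallelly package's availableCores() is one way to respect such limits):

```r
library(parallel)

detectCores()                  # logical CPUs visible on this machine
detectCores(logical = FALSE)   # physical cores only (NA on some platforms)
```

A common rule of thumb is to leave one core free for the OS and the master R session, e.g. max(1, detectCores() - 1).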

apply functions

  • mclapply() - forks an ephemeral group of workers just for this computation
  • makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stopCluster() when done
  • parRapply()/parCapply() - per-row/per-column apply for matrices
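A minimal sketch of the pool-of-workers pattern from the bullets above (the inputs are illustrative):

```r
library(parallel)

cl <- makeCluster(2)                          # pool of 2 workers
res <- parLapply(cl, 1:4, function(i) i + 1)  # list of 2 3 4 5

m <- matrix(1:6, nrow = 2)
row_sums <- parRapply(cl, m, sum)             # per-row apply: 9 12

stopCluster(cl)                               # always release the pool
```

Reusing one cluster across several parLapply()/parRapply() calls amortizes the worker start-up cost, which is the main advantage over one-shot mclapply().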

"Cluster", i.e., multiprocessing API

  • makeCluster(n_cores, type = "PSOCK" | "FORK") - dispatches to one of two subfunctions
    • makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
    • makeForkCluster() - not available on Windows; workers automatically inherit the environment of the parent session
  • workers' stdout and stderr are discarded by default, unless logged via the outfile option
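A quick sketch of the outfile option mentioned above ("workers.log" is an illustrative path):

```r
library(parallel)

# route worker stdout/stderr to a log file instead of discarding it;
# outfile must be set at cluster creation time
cl <- makeCluster(2, type = "PSOCK", outfile = "workers.log")
clusterEvalQ(cl, cat("worker", Sys.getpid(), "started\n"))  # lands in workers.log
stopCluster(cl)
```

Setting outfile = "" instead sends worker output to the master's console, which is handy for interactive debugging.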

library doParallel

  • registerDoParallel(myCluster)
  • stopCluster(myCluster)
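The two calls above bracket a foreach %dopar% loop; a minimal sketch (the loop body is illustrative):

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)
registerDoParallel(cl)  # make this cluster the backend for %dopar%

# iterations run on the registered workers; .combine = c flattens
# the per-iteration results into a single vector
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)
```

registerDoParallel() can also take a core count directly (e.g. registerDoParallel(2)), in which case it manages the workers itself.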