
RStudio Server Admin



R Environment info

  • R.home() - path to R installation
  • R.Version() - returns a list
  • version - returns a simple list which prints nicely interactively
  • .Platform - Platform Specific Variables
  • sessionInfo() - Collect Information About the Current R Session
  • Sys.*
    • Sys.info() - Extract System and User Information
    • Sys.glob( c( "/path1/*", "/path2/*.tiff" ) ) - wildcard expansion; takes a single character vector of patterns
  • list.files()
  • list.dirs()
  • file.*()
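
As a rough illustration of the calls listed above, the sketch below queries the running R installation. The glob patterns are assumptions chosen only because they point inside the R home directory; substitute your own paths.

```r
# Where is R installed, and which version is running?
r_home <- R.home()
ver    <- R.Version()                      # a plain list of version components
message("R home: ", r_home)
message("R version: ", ver$major, ".", ver$minor)

# Platform-specific constants and basic system/user information
message("OS type: ", .Platform$OS.type, ", file separator: ", .Platform$file.sep)
info <- Sys.info()
message("Node: ", info[["nodename"]], ", user: ", info[["user"]])

# Sys.glob() takes ONE character vector of wildcard patterns.
# These patterns are illustrative only.
matches <- Sys.glob(file.path(r_home, c("bin/*", "COPYING*")))

# Immediate subdirectories of the R home directory
top_dirs <- list.dirs(r_home, recursive = FALSE)
```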

Package Management

  • find.package( "BiocManager" ) - path to package installation directory
    • See also: system.file( package="BiocManager" )
  • .Library - global string of path to default library for R packages
  • .Library.site - locations of site libraries. Can be set via the environment variable $R_LIBS_SITE (as a non-empty colon-separated list of library trees).
  • .libPaths() - R library tree getter and setter function
  • install.packages()
  • available.packages() - list every package available for download and installation from the configured repositories
  • remove.packages()
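
A small sketch of the library-path helpers above. It uses the "stats" package because it ships with R, and a scratch directory named "extra-lib" that is purely hypothetical.

```r
# Getter: the library trees R searches, in order
old_paths <- .libPaths()

# Two ways to locate an installed package ("stats" ships with R)
stats_dir <- find.package("stats")
same_dir  <- system.file(package = "stats")

# Setter: prepend a scratch library (hypothetical directory name)
extra_lib <- file.path(tempdir(), "extra-lib")
dir.create(extra_lib, showWarnings = FALSE)
.libPaths(c(extra_lib, old_paths))
first_path <- .libPaths()[1]      # new packages would now install here by default

# Restore the original search path
.libPaths(old_paths)
```

Directories that do not exist are silently dropped by the setter, which is why the sketch creates the directory first.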

Bioconductor Package management

  • BiocManager vignette
  • install.packages("BiocManager")
  • BiocManager::install() - install core packages
  • BiocManager::version()
  • BiocManager::valid()
  • BiocManager::available()


  • library() - show all installed packages
  • message() - preferred over print() and cat() for diagnostics: it writes to stderr and can be silenced with suppressMessages()
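
To illustrate the difference, here is a minimal sketch with a made-up function, noisy(), that emits one diagnostic and one line of real output. Only the cat() output lands on stdout, and the diagnostic can be switched off by the caller.

```r
# Hypothetical function: one diagnostic, one line of real output
noisy <- function() {
  message("progress: starting")   # diagnostic, written to stderr
  cat("result line\n")            # actual output, written to stdout
  invisible(42)
}

# capture.output() collects stdout only; suppressMessages() drops the diagnostic
stdout_lines <- capture.output(value <- suppressMessages(noisy()))
```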

Jupyter Notebook Tips

  • Don't use dev.new() or your plots won't show up



  • distance = dissimilarity


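  • pie chart:
library( tidyverse )
library( readxl )   # read_excel() is installed with tidyverse but not attached by it

# One row per gene; the `Gene Biotype` column holds the category to count
var <- read_excel("GENES.xlsx")

# Count genes per biotype, skipping rows where the biotype is missing
stats <- var %>% select( `Gene Biotype` ) %>% drop_na() %>% 
  group_by(`Gene Biotype`) %>% summarize( count=n() )

names(stats) <- c( "name", "count")

# A pie chart is a single stacked bar drawn in polar coordinates.
# The palette has 16 colors, so this works for at most 16 biotypes.
ggplot( stats, aes( x="", y=count, fill=factor(name) ) ) + 
  geom_col( width= 1) +  
  scale_fill_manual( values=c('red', 'blue', 'green', 'darkorchid', 'sienna4', 'thistle', 'gray0',
                              'khaki4', 'seagreen', 'midnightblue', 'azure4', 'cornflowerblue',
                              'olivedrab', 'lightgreen', 'purple4', 'turquoise') ) +
  coord_polar( theta="y", start=0 )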

Data Types

parallel package

  • Ships with R itself (since R 2.14.0); supersedes the third-party multicore and snow packages


coarse-grained parallelization

  • large chunks of computations in parallel
  • chunks of computation are unrelated, do not need to communicate in any way
  • great for bootstrapping

load-balancing paradigm

  • start up M worker processes
  • allow worker processes access to data
  • split task into N tasks (chunks)
  • send the first M number of tasks to the M workers
    • Implementation detail - tasks and data are passed to the workers via serialization, so the amount that can be sent is not unlimited
  • when a worker finishes a task send it another one
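
The steps above roughly correspond to clusterApplyLB() in the parallel package. The sketch below assumes two workers and six simulated tasks of uneven duration; the sleep times are invented for illustration.

```r
library(parallel)

n_workers <- 2                                # M workers (arbitrary choice)
tasks <- c(0.3, 0.1, 0.1, 0.1, 0.1, 0.1)      # N simulated task durations, in seconds

cl <- makeCluster(n_workers)

# The first M tasks go out immediately; each worker gets its next task
# as soon as it returns a result.
results <- clusterApplyLB(cl, tasks, function(secs) {
  Sys.sleep(secs)
  list(pid = Sys.getpid(), secs = secs)
})
stopCluster(cl)

# How many tasks each worker process ended up handling
tasks_per_worker <- table(vapply(results, function(r) r$pid, integer(1)))
```

Results come back in task order regardless of which worker ran them. The split between workers depends on timing, so it can differ from run to run.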

Worker paradigms

  • "Cluster" - start new R processes ("snow" style), typically on the SAME MACHINE (PSOCK workers can also run on other hosts)
  • "fork" - SAME MACHINE, POSIX forks, copy-on-write, theoretically cheaper, not available on Windows


  • physical CPU has 2 or more cores that run independently
  • ergo concept of "logical CPU"
  • considerations
    • we know how many there are, but how many are available TO YOU?
    • hopefully the chunk of code that you want to run in parallel does not, itself, use multiple cores
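
A short sketch of counting cores with detectCores(). Leaving one core free is only a common rule of thumb assumed here, not something the package prescribes.

```r
library(parallel)

# Logical CPUs (includes hyper-threads) versus physical cores.
# On some platforms the logical argument is ignored, and either call may return NA.
logical_cpus  <- detectCores(logical = TRUE)
physical_cpus <- detectCores(logical = FALSE)

# detectCores() reports what the machine has, not what a scheduler or
# container allows you to use. A conservative choice (an assumption):
# leave one core free and fall back to a single worker if the count is unknown.
n_workers <- max(1, logical_cpus - 1, na.rm = TRUE)
```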

apply functions

  • mclapply() - forks an ephemeral set of worker processes just for this computation
  • makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stopCluster() when done
  • parRapply() / parCapply() - parallel row-wise / column-wise apply over a matrix
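
The three styles side by side, as a minimal sketch. The function square() and the small matrix are made up for illustration; the Windows check is there because forking with more than one core is not supported on that platform.

```r
library(parallel)

square <- function(v) v^2

# 1. Fork-based: workers exist only for the duration of this call
n_fork   <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- mclapply(1:4, square, mc.cores = n_fork)

# 2. Explicit pool: create it, reuse it for several calls, then stop it
cl <- makeCluster(2)
res_pool <- parLapply(cl, 1:4, square)

# 3. Row-wise and column-wise apply over a matrix, using the same pool
m <- matrix(1:6, nrow = 2)
row_sums <- parRapply(cl, m, sum)
col_sums <- parCapply(cl, m, sum)

stopCluster(cl)
```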

"Cluster", i.e., multiprocessing API

  • makeCluster( n_cores, type="PSOCK|FORK") - calls down to one of two subfunctions
    • makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
    • makeForkCluster() - not available on Windows - workers automatically inherit the environment of the parent session
  • stdout and stderr are discarded by default, unless logged by outfile option
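
A sketch of capturing worker output through the outfile option, assuming a temporary log file and two PSOCK workers. Workers append to the file on their own schedule, so the contents may lag slightly behind the calls that produced them.

```r
library(parallel)

# Temporary log file (any writable path would do)
log_file <- tempfile(fileext = ".log")

# Without outfile, anything the workers print is discarded
cl <- makeCluster(2, type = "PSOCK", outfile = log_file)

invisible(clusterEvalQ(cl, {
  cat("hello from worker", Sys.getpid(), "\n")        # stdout of the worker
  message("a diagnostic from worker ", Sys.getpid())  # stderr of the worker
}))

stopCluster(cl)

# Inspect what the workers wrote (may be incomplete if read immediately)
worker_log <- readLines(log_file, warn = FALSE)
```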

library doParallel

  • registerDoParallel(myCluster)
  • stopCluster(myCluster)
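
A minimal sketch of how these two calls fit around a foreach loop. It assumes the third-party doParallel and foreach packages are installed, and falls back to a serial loop otherwise so that the snippet still runs.

```r
library(parallel)

if (requireNamespace("doParallel", quietly = TRUE) &&
    requireNamespace("foreach", quietly = TRUE)) {
  library(foreach)
  library(doParallel)

  myCluster <- makeCluster(2)        # pool size is arbitrary
  registerDoParallel(myCluster)      # make %dopar% use this cluster

  # Iterations run on the workers; .combine = c flattens the results
  squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

  stopCluster(myCluster)
  registerDoSEQ()                    # do not leave a stopped cluster registered
} else {
  # Serial fallback when the packages are not available
  squares <- vapply(1:4, function(i) i^2, numeric(1))
}
```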