R


General

  • library() - show all installed packages
  • message() - preferred over print() and cat() for diagnostics; it writes to stderr, so it can be silenced with suppressMessages()
  • UseMethod() is a primitive function but uses standard argument matching. One of the means of dispatching S3 methods.
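
A minimal sketch of how UseMethod() dispatches S3 methods (the area generic and circle class are made up for illustration):

```
# Generic: dispatches on class(shape) via UseMethod()
area <- function( shape, ... ) UseMethod( "area" )

# Method for objects of class "circle"
area.circle <- function( shape, ... ) pi * shape$r^2

# Fallback when no class-specific method exists
area.default <- function( shape, ... ) stop( "no area() method for this class" )

circ <- structure( list( r = 2 ), class = "circle" )
area( circ )                            # dispatches to area.circle()
message( "area is ", area( circ ) )     # message() writes to stderr
```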

RStudio Server Admin

PDF

R Environment info

  • R.home() - path to R installation
  • R.Version() - returns a list
  • version - returns a simple list which prints nicely interactively
  • .Platform - Platform Specific Variables
  • sessionInfo() - Collect Information About the Current R Session
  • Sys.*
    • Sys.info() - Extract System and User Information
    • Sys.glob( c( "/path1/*", "/path2/*.tiff" ) ) - wildcard expansion of file paths (takes a character vector of patterns)
  • list.files()
  • list.dirs()
  • file.*()
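
A quick sketch pulling several of these together (the paths are illustrative):

```
R.home()                          # path to the R installation
R.Version()$version.string        # e.g., "R version 4.x.x ..."
.Platform$OS.type                 # "unix" or "windows"
Sys.info()[["user"]]              # current user
sessionInfo()                     # attached packages, locale, platform

Sys.glob( c( "~/*.R", "~/*.Rmd" ) )     # expand wildcards against the filesystem
list.files( "~", pattern = "\\.R$" )    # files matching a regex
list.dirs( "~", recursive = FALSE )     # immediate subdirectories
file.exists( "~/.Rprofile" )            # one of the file.*() helpers
```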

Package Management

  • find.package( "BiocManager" ) - path to package installation directory
    • See also: system.file( package="BiocManager" )
  • .Library - global string of path to default library for R packages
  • .Library.site - locations of site libraries. Can be set via the environment variable $R_LIBS_SITE (as a non-empty colon-separated list of library trees).
  • .libPaths() - R library tree getter and setter function
  • update.packages(ask=FALSE)
  • old.packages()
  • new.packages()
  • install.packages()
  • available.packages() - list every package available for installation from the configured repositories/mirror
  • remove.packages()
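
A sketch of a routine update pass using these functions (the commented package name is hypothetical):

```
.libPaths()                        # library trees searched, in order
find.package( "BiocManager" )      # where a given package is installed
old.packages()                     # installed packages with a newer version available
new.packages()                     # packages in the repos not yet installed
update.packages( ask = FALSE )     # update everything without prompting
# install.packages( "somePackage" )
# remove.packages( "somePackage" )
```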

Bioconductor Package management

  • BiocManager vignette
  • install.packages("BiocManager")
  • BiocManager::install() - install core packages
  • BiocManager::version()
  • BiocManager::valid()
  • BiocManager::available()
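
A minimal sketch of the standard BiocManager workflow (the package named in the second install() call is just an example):

```
if ( !requireNamespace( "BiocManager", quietly = TRUE ) )
    install.packages( "BiocManager" )

BiocManager::install()             # install/upgrade the core Bioconductor packages
BiocManager::install( "limma" )    # install a specific package, e.g., limma
BiocManager::version()             # Bioconductor release in use
BiocManager::valid()               # check installed packages match that release
```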

Jupyter Notebook Tips

  • Don't use dev.new() or your plots won't show up

Links

Functions

  • distance = dissimilarity


foreach package

  • vignette
  • manual
  • x <- foreach(i=1:3) %do% sqrt(i)
  • x <- foreach(a=1:3, b=rep(10, 3)) %do% (a + b)
  • x <- foreach(i=1:3, .combine='c') %do% exp(i)
  • x <- foreach(i=1:4, .combine='cbind') %do% rnorm(4)
  • foreach(i=4:1, .combine='c', .inorder=FALSE) %dopar% { ... }
  • x <- foreach(a=irnorm(1, count=10), .combine='c') %:% when(a >= 0) %do% sqrt(a) - list comprehensions


parallel package

  • R-core package, replaces 3rd party multicore and snow

References

coarse-grained parallelization

  • large chunks of computations in parallel
  • chunks of computation are unrelated, do not need to communicate in any way
  • great for bootstrapping

load-balancing paradigm

  • start up M worker processes
  • allow worker processes access to data
  • split task into N tasks (chunks)
  • send the first M number of tasks to the M workers
    • Implementation detail - tasks and data go to workers via serialization, so their size is not unlimited
  • when a worker finishes a task send it another one
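
parLapplyLB() in the parallel package implements this paradigm directly; a sketch with made-up, unequal task timings:

```
library( parallel )

cl <- makeCluster( 4 )     # M = 4 workers

# The load-balanced variant hands each worker one task at a time and sends
# the next task to whichever worker finishes first
res <- parLapplyLB( cl, 1:20, function( i ) { Sys.sleep( runif( 1 ) ); i^2 } )

stopCluster( cl )
```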

Worker paradigms

  • "Cluster" - SAME MACHINE, start new processes ("snow" style)
  • "fork" - SAME MACHINE, POSIX forks, copy on write, theoretically cheaper, not avail on windows
  • MPI - MULTIPLE MACHINES

CPUs/Cores

  • physical CPU has 2 or more cores that run independently
  • ergo concept of "logical CPU"
  • considerations
    • we know how many cores the machine has, but how many are available TO YOU?
    • hopefully the chunk of code you want to run in parallel does not itself use multiple cores
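
A sketch for counting cores; note that parallel::detectCores() reports what the machine has, not necessarily what a scheduler or container will let you use (the parallelly package, not mentioned above, has availableCores() for that):

```
library( parallel )

detectCores( logical = TRUE )    # logical CPUs (hyperthreads included)
detectCores( logical = FALSE )   # physical cores only

# parallelly::availableCores()   # respects cgroup / scheduler limits,
                                 # i.e., how many are available TO YOU
```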

apply functions

  • mclapply() - forks an ephemeral set of workers just for this computation
  • makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stopCluster() when done
  • parRapply()/parCapply() - parallel per-row/per-column apply for a matrix
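
A sketch of both styles on toy data:

```
library( parallel )

# mclapply(): forks ephemeral workers just for this call (not on Windows)
res1 <- mclapply( 1:8, function( i ) i^2, mc.cores = 2 )

# makeCluster() + parLapply() + stopCluster(): explicit worker pool
cl <- makeCluster( 2 )
res2 <- parLapply( cl, 1:8, function( i ) i^2 )

# parRapply()/parCapply(): parallel row/column apply over a matrix
m <- matrix( rnorm( 20 ), nrow = 4 )
row_sums <- parRapply( cl, m, sum )
col_sums <- parCapply( cl, m, sum )

stopCluster( cl )
```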

"Cluster", i.e., multiprocessing API

  • makeCluster( n_cores, type="PSOCK|FORK") - calls down to one of two subfunctions
    • makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
    • makeForkCluster() - not available on windows - workers automatically inherit environment of parent session
  • stdout and stderr are discarded by default, unless logged by outfile option
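
A sketch of the PSOCK lifecycle, including the outfile option (the exported object is arbitrary):

```
library( parallel )

# outfile = "" routes worker stdout/stderr to the master console;
# by default both streams are discarded
cl <- makeCluster( 2, type = "PSOCK", outfile = "" )

# PSOCK workers start with a fresh R session, so ship objects explicitly
offset <- 100
clusterExport( cl, "offset" )

res <- parLapply( cl, 1:4, function( i ) i + offset )

stopCluster( cl )
```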

library doParallel

  • registerDoParallel(myCluster)
  • stopCluster(myCluster)
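
A minimal sketch combining doParallel with foreach's %dopar%:

```
library( doParallel )    # also attaches foreach and parallel

myCluster <- makeCluster( 2 )
registerDoParallel( myCluster )    # make the cluster the %dopar% backend

x <- foreach( i = 1:4, .combine = 'c' ) %dopar% sqrt( i )

stopCluster( myCluster )
```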



Tidy evaluation

References

  • Programming with dplyr
    • data masking - use data (columnar) variables like they are environment variables
      • env variables are programming variables usually created by a <-
      • data variables are "statistical" variables that live in a data frame
    • When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])) (see the sketch after this list)
  • rlang quasiquotation examples
  • Tidy Eval cheatsheet from 2018-11
  • More examples from 2017
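
A sketch of the .data pronoun pattern, using dplyr's built-in starwars data (mean_of is a made-up helper):

```
library( dplyr )

# The column name arrives as a *string* in an env-variable,
# so index into the .data pronoun with [[
mean_of <- function( df, var ) {
    df %>% summarise( mean = mean( .data[[ var ]], na.rm = TRUE ) )
}

mean_of( starwars, "height" )
```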

General

  • Loading library(rlang) turns on many of these features
  • Metaprogramming, similar to non-standard evaluation, similar to unquoted variable names
  • Change the context of computation: when you need to delay the evaluation of an expression, change the context, and then resume evaluation
  • To evaluate an expression, you search environments for name-value bindings. Non-standard evaluation means you might modify the expression first, or modify the chain of searched environments
  • Indirect specification of variable names is HARDER, i.e., passing a variable name as a function argument (ease of interactive use was prioritized)
  1. Every expression has a tree
  2. Capture the tree by quoting
    • Expressions are either evaluated using usual R rules, or quoted and evaluated with special rules
  3. Unquoting makes it easier to build trees
  4. Using this pattern in your function means you need to deal with non-standard evaluation by quoting one or more of your arguments and unquoting them later
  5. Quosures ("Closure + quotation") capture expression and environment
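
A sketch of those ideas with rlang (the values in the data mask are made up):

```
library( rlang )

e <- expr( a + b )        # quote: capture the expression tree, no environment
q <- quo( a + b )         # quosure: expression + the environment it was written in

f <- function( x ) {
    xq <- enquo( x )      # capture the caller's expression as a quosure
    # resume evaluation later, in a changed context (a data mask)
    eval_tidy( xq, data = list( a = 1, b = 2 ) )
}

f( a + b )                # 3, even though a and b are not defined here
```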

Increasing levels of lazy evaluation

  1. "Pass the dots" : for a single function arg, use ellipsis as the one func arg and then use it as the placeholder to handle NSE
  2. For more than one func arg, you "embrace" or double moustache the argument to do in-place indirection
  3. If you need to write functions that make names from user input, you need := ("colon-equals") operator
 
 # Requires: dplyr (or magrittr for %>%), ggplot2, rlang, glue
 SpaghettiPlot <- function( observations, Y_var, time_var, dot_color_var, alpha=0.5 ) {
    #options(warn=-1) # Turn warnings off
    
    # This function uses Tidy Evaluation:
    # https://dplyr.tidyverse.org/articles/programming.html
    
    # Tidy evaluation concept 1:
    # Wrap function arguments in "{{ }}", a.k.a., the "embrace" open and close
    # operators, to get the name of the data variable out of the
    # function argument.
    
    # Tidy evaluation concept 2:
    # Use ensym() to capture the expression the user passed when calling
    # the function as a symbol (glue() deparses it to text below):
    Y_var_name_str <- ensym( Y_var )
    
    # glue() is a drop-in replacement for the "paste()" function
    # (has nothing to do with Tidy Evaluation, is just a cool library)
    plot_title <- glue( "{Y_var_name_str} Trajectories" )
    
    plot <- observations %>%
        ggplot( 
            aes( 
                x={{ time_var }},          # Tidy Eval
                y={{ Y_var }},             # Tidy Eval
                group=id_individual
            )
        ) +
        geom_line( color='gray' ) +
        geom_point(
            aes(
                color={{ dot_color_var }}  # Tidy Eval
            ),
            alpha=alpha
        ) +
        geom_smooth( 
            inherit.aes=F,
            aes( 
                x={{ time_var }},         # Tidy Eval
                y={{ Y_var }}             # Tidy Eval
            ),
            method='lm'
        ) +
        facet_grid( Sex ~ AgeCat ) +
        labs( y = Y_var_name_str  ) +
        labs( title = plot_title )
    
    #options( warn=0 )
    return( plot )
 }
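
The SpaghettiPlot() example above covers level 2 (embracing); below is a minimal sketch of levels 1 and 3, using dplyr's starwars data (grouped_mean is a made-up helper):

```
library( dplyr )

grouped_mean <- function( df, stat_col, ... ) {
    df %>%
        group_by( ... ) %>%     # level 1: pass the dots straight through
        summarise(
            # level 3: ":=" plus a glue string builds the result name from user input
            "{{stat_col}}_mean" := mean( {{ stat_col }}, na.rm = TRUE ),
            .groups = "drop"
        )
}

grouped_mean( starwars, height, species, sex )
```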

Selection versus action

  • group_by() vs group_by_at( vars( ends_with( "color" ) ) )
  • compare: lapply style
```
starwars %>%
   summarize(
       height = mean( height, na.rm = TRUE ),
       mass = mean( mass, na.rm = TRUE )
   )
# =======VERSUS==========
starwars %>%
   summarize_at(
       vars( height, mass ),
       ~ mean( ., na.rm = TRUE )
   )
# =======VERSUS==========
summary_functions <- list(
   ~ mean( ., na.rm = TRUE ),
   ~ sd( ., na.rm = TRUE )
)
# Results spread across columns
starwars %>%
   summarize_at(
       vars( height:mass ),
       summary_functions
   )
```
  • transmute() creates new columns and returns only those columns (unlike mutate(), which also keeps the existing ones)

Nested Data

Combining Summary Data frames

```
# Requires: dplyr, tidyr (nest/unnest/pivot_wider), tibble (enframe)
full_dataset %>%
   group_by( Race, Sex ) %>%
   nest() %>%
   rowwise() %>%
   transmute(
       stats = list( pivot_wider( enframe( summary( data$Age ) ) ) ),
       count = nrow( data )
   ) %>%
   unnest( cols = c( stats, count ) ) %>%
   arrange( desc( count ) )
```