R


General

  • library() - show all installed packages
  • message() - preferred over print() and cat() for diagnostics; it writes to stderr, so it can be silenced with suppressMessages()
  • UseMethod() is a primitive function but uses standard argument matching. One of the means of dispatching S3 methods.
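
A minimal sketch of how UseMethod() dispatches S3 methods (the area generic and circle class are made up for illustration):

```
# Generic: dispatches on class(shape) via UseMethod()
area <- function( shape, ... ) UseMethod( "area" )

# Method for objects of class "circle"
area.circle <- function( shape, ... ) pi * shape$r^2

# Fallback when no class-specific method exists
area.default <- function( shape, ... ) stop( "no area() method for this class" )

circ <- structure( list( r = 2 ), class = "circle" )
area( circ )                            # dispatches to area.circle()
message( "area is ", area( circ ) )     # message() writes to stderr
```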

RStudio Server Admin

PDF

R Environment info

  • R.home() - path to R installation
  • R.Version() - returns a list
  • version - returns a simple list which prints nicely interactively
  • .Platform - Platform Specific Variables
  • sessionInfo() - Collect Information About the Current R Session
  • Sys.*
    • Sys.info() - Extract System and User Information
    • Sys.glob( c( "/path1/*", "/path2/*.tiff" ) ) - wildcard expansion of file paths (takes a character vector of patterns)
  • list.files()
  • list.dirs()
  • file.*()
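
A quick sketch pulling several of these together (the paths are illustrative):

```
R.home()                          # path to the R installation
R.Version()$version.string        # e.g., "R version 4.x.x ..."
.Platform$OS.type                 # "unix" or "windows"
Sys.info()[["user"]]              # current user
sessionInfo()                     # attached packages, locale, platform

Sys.glob( c( "~/*.R", "~/*.Rmd" ) )     # expand wildcards against the filesystem
list.files( "~", pattern = "\\.R$" )    # files matching a regex
list.dirs( "~", recursive = FALSE )     # immediate subdirectories
file.exists( "~/.Rprofile" )            # one of the file.*() helpers
```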

Package Management

  • find.package( "BiocManager" ) - path to package installation directory
    • See also: system.file( package="BiocManager" )
  • .Library - global string of path to default library for R packages
  • .Library.site - locations of site libraries. Can be set via the environment variable $R_LIBS_SITE (as a non-empty colon-separated list of library trees).
  • .libPaths() - R library tree getter and setter function
  • update.packages(ask=FALSE)
  • old.packages()
  • new.packages()
  • install.packages()
  • available.packages() - list every package available for installation from the configured repositories/mirror
  • remove.packages()
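
A sketch of a routine update pass using these functions (the commented package name is hypothetical):

```
.libPaths()                        # library trees searched, in order
find.package( "BiocManager" )      # where a given package is installed
old.packages()                     # installed packages with a newer version available
new.packages()                     # packages in the repos not yet installed
update.packages( ask = FALSE )     # update everything without prompting
# install.packages( "somePackage" )
# remove.packages( "somePackage" )
```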

Bioconductor Package management

  • BiocManager vignette
  • install.packages("BiocManager")
  • BiocManager::install() - install core packages
  • BiocManager::version()
  • BiocManager::valid()
  • BiocManager::available()
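
A minimal sketch of the standard BiocManager workflow (the package named in the second install() call is just an example):

```
if ( !requireNamespace( "BiocManager", quietly = TRUE ) )
    install.packages( "BiocManager" )

BiocManager::install()             # install/upgrade the core Bioconductor packages
BiocManager::install( "limma" )    # install a specific package, e.g., limma
BiocManager::version()             # Bioconductor release in use
BiocManager::valid()               # check installed packages match that release
```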

Jupyter Notebook Tips

  • Don't use dev.new() or your plots won't show up

Links

Functions

  • distance = dissimilarity


foreach package

  • vignette
  • manual
  • x <- foreach(i=1:3) %do% sqrt(i)
  • x <- foreach(a=1:3, b=rep(10, 3)) %do% (a + b)
  • x <- foreach(i=1:3, .combine='c') %do% exp(i)
  • x <- foreach(i=1:4, .combine='cbind') %do% rnorm(4)
  • foreach(i=4:1, .combine='c', .inorder=FALSE) %dopar% { ... }
  • x <- foreach(a=irnorm(1, count=10), .combine='c') %:% when(a >= 0) %do% sqrt(a) - list comprehensions


parallel package

  • R-core package, replaces 3rd party multicore and snow

References

coarse-grained parallelization

  • large chunks of computations in parallel
  • chunks of computation are unrelated, do not need to communicate in any way
  • great for bootstrapping

load-balancing paradigm

  • start up M worker processes
  • allow worker processes access to data
  • split task into N tasks (chunks)
  • send the first M number of tasks to the M workers
    • Implementation detail - tasks and data go to workers via serialization, so their size is not unlimited
  • when a worker finishes a task send it another one
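
parLapplyLB() in the parallel package implements this paradigm directly; a sketch with made-up, unequal task timings:

```
library( parallel )

cl <- makeCluster( 4 )     # M = 4 workers

# The load-balanced variant hands each worker one task at a time and sends
# the next task to whichever worker finishes first
res <- parLapplyLB( cl, 1:20, function( i ) { Sys.sleep( runif( 1 ) ); i^2 } )

stopCluster( cl )
```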

Worker paradigms

  • "Cluster" - SAME MACHINE, start new processes ("snow" style)
  • "fork" - SAME MACHINE, POSIX forks, copy on write, theoretically cheaper, not avail on windows
  • MPI - MULTIPLE MACHINES

CPUs/Cores

  • physical CPU has 2 or more cores that run independently
  • ergo concept of "logical CPU"
  • considerations
    • we know how many cores the machine has, but how many are available TO YOU?
    • hopefully the chunk of code you want to run in parallel does not itself use multiple cores
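
A sketch for counting cores; note that parallel::detectCores() reports what the machine has, not necessarily what a scheduler or container will let you use (the parallelly package, not mentioned above, has availableCores() for that):

```
library( parallel )

detectCores( logical = TRUE )    # logical CPUs (hyperthreads included)
detectCores( logical = FALSE )   # physical cores only

# parallelly::availableCores()   # respects cgroup / scheduler limits,
                                 # i.e., how many are available TO YOU
```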

apply functions

  • mclapply() - forks an ephemeral set of workers just for this computation
  • makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stopCluster() when done
  • parRapply()/parCapply() - parallel per-row/per-column apply for a matrix
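
A sketch of both styles on toy data:

```
library( parallel )

# mclapply(): forks ephemeral workers just for this call (not on Windows)
res1 <- mclapply( 1:8, function( i ) i^2, mc.cores = 2 )

# makeCluster() + parLapply() + stopCluster(): explicit worker pool
cl <- makeCluster( 2 )
res2 <- parLapply( cl, 1:8, function( i ) i^2 )

# parRapply()/parCapply(): parallel row/column apply over a matrix
m <- matrix( rnorm( 20 ), nrow = 4 )
row_sums <- parRapply( cl, m, sum )
col_sums <- parCapply( cl, m, sum )

stopCluster( cl )
```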

"Cluster", i.e., multiprocessing API

  • makeCluster( n_cores, type="PSOCK|FORK") - calls down to one of two subfunctions
    • makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
    • makeForkCluster() - not available on windows - workers automatically inherit environment of parent session
  • stdout and stderr are discarded by default, unless logged by outfile option
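
A sketch of the PSOCK lifecycle, including the outfile option (the exported object is arbitrary):

```
library( parallel )

# outfile = "" routes worker stdout/stderr to the master console;
# by default both streams are discarded
cl <- makeCluster( 2, type = "PSOCK", outfile = "" )

# PSOCK workers start with a fresh R session, so ship objects explicitly
offset <- 100
clusterExport( cl, "offset" )

res <- parLapply( cl, 1:4, function( i ) i + offset )

stopCluster( cl )
```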

library doParallel

  • registerDoParallel(myCluster)
  • stopCluster(myCluster)
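
A minimal sketch combining doParallel with foreach's %dopar%:

```
library( doParallel )    # also attaches foreach and parallel

myCluster <- makeCluster( 2 )
registerDoParallel( myCluster )    # make the cluster the %dopar% backend

x <- foreach( i = 1:4, .combine = 'c' ) %dopar% sqrt( i )

stopCluster( myCluster )
```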



Tidy evaluation

References

  • Programming with dplyr
    • data masking - use data (columnar) variables like they are environment variables
      • env variables are programming variables usually created by a <-
      • data variables are "statistical" variables that live in a data frame
    • When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])) (see the sketch after this list)
  • rlang quasiquotation examples
  • Tidy Eval cheatsheet from 2018-11
  • More examples from 2017
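
A sketch of the .data pronoun pattern, using dplyr's built-in starwars data (mean_of is a made-up helper):

```
library( dplyr )

# The column name arrives as a *string* in an env-variable,
# so index into the .data pronoun with [[
mean_of <- function( df, var ) {
    df %>% summarise( mean = mean( .data[[ var ]], na.rm = TRUE ) )
}

mean_of( starwars, "height" )
```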

General

  • Loading library(rlang) turns on many of these features
  • Metaprogramming, similar to non-standard evaluation, similar to unquoted variable names
  • Change the context of computation: when you need to delay the evaluation of an expression, change the context, and then resume evaluation
  • To evaluate an expression, you search environments for name-value bindings. Non-standard evaluation means you might modify the expression first, or modify the chain of searched environments
  • Indirect specification of variable names is HARDER, i.e., passing a variable name as a function argument (ease of interactive use was prioritized)
  1. Every expression has a tree
  2. Capture the tree by quoting
    • Expressions are either evaluated using usual R rules, or quoted and evaluated with special rules
  3. Unquoting makes it easier to build trees
  4. Using this pattern in your function means you need to deal with non-standard evaluation by quoting one or more of your arguments and unquoting them later
  5. Quosures ("Closure + quotation") capture expression and environment
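
A sketch of those ideas with rlang (the values in the data mask are made up):

```
library( rlang )

e <- expr( a + b )        # quote: capture the expression tree, no environment
q <- quo( a + b )         # quosure: expression + the environment it was written in

f <- function( x ) {
    xq <- enquo( x )      # capture the caller's expression as a quosure
    # resume evaluation later, in a changed context (a data mask)
    eval_tidy( xq, data = list( a = 1, b = 2 ) )
}

f( a + b )                # 3, even though a and b are not defined here
```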

Increasing levels of lazy evaluation

  1. "Pass the dots" : for a single function arg, use ellipsis as the one func arg and then use it as the placeholder to handle NSE
  2. For more than one func arg, you "embrace" or double moustache the argument to do in-place indirection
  3. If you need to write functions that make names from user input, you need := ("colon-equals") operator
 
 # Requires: dplyr (or magrittr for %>%), ggplot2, rlang, glue
 SpaghettiPlot <- function( observations, Y_var, time_var, dot_color_var, alpha=0.5 ) {
    #options(warn=-1) # Turn warnings off
    
    # This function uses Tidy Evaluation:
    # https://dplyr.tidyverse.org/articles/programming.html
    
    # Tidy evaluation concept 1:
    # Wrap function arguments in "{{ }}", a.k.a., the "embrace" open and close
    # operators, to get the name of the data variable out of the
    # function argument.
    
    # Tidy evaluation concept 2:
    # Use ensym() to capture the expression the user passed when calling
    # the function as a symbol (glue() deparses it to text below):
    Y_var_name_str <- ensym( Y_var )
    
    # glue() is a drop-in replacement for the "paste()" function
    # (has nothing to do with Tidy Evaluation, is just a cool library)
    plot_title <- glue( "{Y_var_name_str} Trajectories" )
    
    plot <- observations %>%
        ggplot( 
            aes( 
                x={{ time_var }},          # Tidy Eval
                y={{ Y_var }},             # Tidy Eval
                group=id_individual
            )
        ) +
        geom_line( color='gray' ) +
        geom_point(
            aes(
                color={{ dot_color_var }}  # Tidy Eval
            ),
            alpha=alpha
        ) +
        geom_smooth( 
            inherit.aes=F,
            aes( 
                x={{ time_var }},         # Tidy Eval
                y={{ Y_var }}             # Tidy Eval
            ),
            method='lm'
        ) +
        facet_grid( Sex ~ AgeCat ) +
        labs( y = Y_var_name_str  ) +
        labs( title = plot_title )
    
    #options( warn=0 )
    return( plot )
 }
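
The SpaghettiPlot() example above covers level 2 (embracing); below is a minimal sketch of levels 1 and 3, using dplyr's starwars data (grouped_mean is a made-up helper):

```
library( dplyr )

grouped_mean <- function( df, stat_col, ... ) {
    df %>%
        group_by( ... ) %>%     # level 1: pass the dots straight through
        summarise(
            # level 3: ":=" plus a glue string builds the result name from user input
            "{{stat_col}}_mean" := mean( {{ stat_col }}, na.rm = TRUE ),
            .groups = "drop"
        )
}

grouped_mean( starwars, height, species, sex )
```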

Selection versus action

  • group_by() vs group_by_at( vars( ends_with( "color" ) ) )
  • compare: lapply style
```
starwars %>%
   summarize(
       height = mean( height, na.rm = TRUE ),
       mass = mean( mass, na.rm = TRUE )
   )
# =======VERSUS==========
starwars %>%
   summarize_at(
       vars( height, mass ),
       ~ mean( ., na.rm = TRUE )
   )
# =======VERSUS==========
summary_functions <- list(
   ~ mean( ., na.rm = TRUE ),
   ~ sd( ., na.rm = TRUE )
)
# Results spread across columns
starwars %>%
   summarize_at(
       vars( height:mass ),
       summary_functions
   )
```
  • transmute() creates new columns and returns only those columns (unlike mutate(), which also keeps the existing ones)

Nested Data

Combining Summary Data frames

```
# Requires: dplyr, tidyr (nest/unnest/pivot_wider), tibble (enframe)
full_dataset %>%
   group_by( Race, Sex ) %>%
   nest() %>%
   rowwise() %>%
   transmute(
       stats = list( pivot_wider( enframe( summary( data$Age ) ) ) ),
       count = nrow( data )
   ) %>%
   unnest( cols = c( stats, count ) ) %>%
   arrange( desc( count ) )
```