R
General
- library() - show all installed packages
- message() as superior to print() and cat()
UseMethod()
A primitive function that nonetheless uses standard argument matching; one of the mechanisms for S3 method dispatch.
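A minimal S3 sketch of how `UseMethod()` dispatches (the `area` generic and `circle` class are made up for illustration):

```r
# A generic: UseMethod() dispatches on the class of the first argument
area <- function(shape, ...) UseMethod("area")

# Methods are plain functions named generic.class
area.circle  <- function(shape, ...) pi * shape$r^2
area.default <- function(shape, ...) stop("no area method for this class")

circ <- structure(list(r = 2), class = "circle")
area(circ)  # dispatches to area.circle
```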
RStudio Server Admin
R Environment info
- R.home() - path to R installation
- R.Version() - returns a list
- version - returns a simple list which prints nicely interactively
- .Platform - Platform Specific Variables
- sessionInfo() - Collect Information About the Current R Session
- Sys.*
- Sys.info() - Extract System and User Information
- Sys.glob( c( "/path1/*", "/path2/*.tiff" ) ) - takes a character vector of patterns
- list.files()
- list.dirs()
- file.*()
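A quick tour of the environment-info calls above, runnable in any R session:

```r
# Where R lives and what we're running
R.home()                      # path to the R installation
R.Version()$version.string    # e.g. "R version 4.x.y ..."
.Platform$OS.type             # "unix" or "windows"
Sys.info()[["sysname"]]       # OS name as reported by the system

# Sys.glob() takes a character vector of wildcard patterns
docs <- Sys.glob(c(file.path(R.home("doc"), "*")))
head(list.files(R.home("doc")))
```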
Package Management
- find.package( "BiocManager" ) - path to package installation directory
- See also: system.file( package="BiocManager" )
- .Library - global string of path to default library for R packages
- .Library.site - locations of site libraries. Can be set via the environment variable $R_LIBS_SITE (as a non-empty colon-separated list of library trees).
- .libPaths() - R library tree getter and setter function
- update.packages(ask=FALSE)
- old.packages()
- new.packages()
- install.packages()
- available.packages() - list every package available for download/installation from the given mirror
- remove.packages()
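A sketch of a typical inspection workflow. The runnable parts use the base package "stats" so they work offline; the network-dependent calls are shown but commented out:

```r
# Where does an installed package live? Two equivalent ways:
p1 <- find.package("stats")
p2 <- system.file(package = "stats")
stopifnot(identical(p1, p2))

# Library search path; install.packages() writes to the first writable element
.libPaths()

# Network-dependent maintenance steps (not run here):
# old.packages()               # which installed packages are outdated?
# update.packages(ask = FALSE) # update them without prompting
# install.packages("pkgname")  # "pkgname" is a placeholder
```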
Bioconductor Package management
- BiocManager vignette
- install.packages("BiocManager")
- BiocManager::install() - install core packages
- BiocManager::version()
- BiocManager::valid()
- BiocManager::available()
Jupyter Notebook Tips
- Don't use dev.new() or your plots won't show up
Links
Functions
- distance = dissimilarity
foreach package
- vignette
- manual
x <- foreach(i=1:3) %do% sqrt(i)
x <- foreach(a=1:3, b=rep(10, 3)) %do% (a + b)
x <- foreach(i=1:3, .combine='c') %do% exp(i)
x <- foreach(i=1:4, .combine='cbind') %do% rnorm(4)
foreach(i=4:1, .combine='c', .inorder=FALSE) %dopar% { ... }
x <- foreach(a=irnorm(1, count=10), .combine='c') %:% when(a >= 0) %do% sqrt(a)  # irnorm() is from the iterators package
- list comprehensions
parallel package
- R-core package, replaces 3rd party multicore and snow
references
- package manual
- HPC and Parallel Computing in R
- Sept 2017 parallel foreach guide
- http://pablobarbera.com/POIR613/code/06-parallel-computing.html - ditto
Coarse-grained parallelization
- large chunks of computations in parallel
- chunks of computation are unrelated, do not need to communicate in any way
- great for bootstrapping
load-balancing paradigm
- start up M worker processes
- allow worker processes access to data
- split task into N tasks (chunks)
- send the first M number of tasks to the M workers
- Implementation detail - tasks and data are sent to workers via serialization, which is not unlimited (or free)
- when a worker finishes a task send it another one
Worker paradigms
- "Cluster" - SAME MACHINE, start new processes ("snow" style)
- "fork" - SAME MACHINE, POSIX forks, copy on write, theoretically cheaper, not avail on windows
- MPI - MULTIPLE MACHINES
CPUs/Cores
- physical CPU has 2 or more cores that run independently
- ergo concept of "logical CPU"
- considerations
- we know how many there are, but how many are available TO YOU?
- hopefully the chunk of code you want to run in parallel does not, itself, use multiple cores
apply functions
- mclapply() - forks an ephemeral set of workers for this one computation (not available on Windows)
- makeCluster() + parLapply() + stopCluster() - set up a pool of workers, then call stop cluster when done
- parRapply/parCapply for per row/col apply for matrix
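A minimal sketch of both styles (the worker count is illustrative; `mclapply()` falls back to sequential on Windows via `mc.cores = 1`):

```r
library(parallel)

n_workers <- max(1L, detectCores() - 1L)

# Fork-based, Unix-only; use 1 core on Windows so the call still runs
mc <- if (.Platform$OS.type == "windows") 1L else n_workers
res_fork <- mclapply(1:8, function(i) i^2, mc.cores = mc)

# Portable version: explicit worker pool, reused across calls
cl <- makeCluster(n_workers)
res_pool <- parLapply(cl, 1:8, function(i) i^2)

# Row-wise apply over a matrix on the same pool
m <- matrix(1:6, nrow = 2)
row_sums <- parRapply(cl, m, sum)
stopCluster(cl)
```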
"Cluster", i.e., multiprocessing API
- makeCluster( n_cores, type="PSOCK|FORK") - calls down to one of two subfunctions
- makePSOCKcluster() - uses Rscript to launch further copies of R on this or other hosts
- makeForkCluster() - not available on windows - workers automatically inherit environment of parent session
- stdout and stderr are discarded by default, unless logged by outfile option
library doParallel
- registerDoParallel(myCluster)
- stopCluster(myCluster)
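Putting those two calls together (assumes the doParallel package is installed; it also loads foreach and parallel):

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)  # %dopar% now dispatches to this pool

x <- foreach(i = 1:4, .combine = "c") %dopar% sqrt(i)

stopCluster(cl)
```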
Tidy evaluation
References
- Programming with dplyr
- data masking - use data (columnar) variables like they are environment variables
- env variables are programming variables usually created by a <-
- data variables are "statistical" variables that live in a data frame
- When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]]))
- rlang quasiquotation examples
- Tidy Eval cheatsheet from 2018-11
- More examples from 2017
General
- Importing library(rlang) turns on many of these features
- Metaprogramming, similar to non-standard evaluation, similar to unquoted variable names
- Change the context of computation. When you need to delay the computation of expression, change the context, and resume computation
- To evaluate an expression, you search environments for name-value bindings. Non-standard evaluation means you might modify the expression first, or modify the chain of searched environments
- Indirect specification of variable names is HARDER, i.e., passing a variable name as a function argument (ease of interactive use was prioritized)
- Every expression has a tree
- Capture the tree by quoting
- Expressions are either evaluated using usual R rules, or quoted and evaluated with special rules
- Unquoting makes it easier to build trees
- Using this pattern in your function means you need to deal with non-standard evaluation by quoting one or more of your expressions and unquoting them later
- Quosures ("Closure + quotation") capture expression and environment
Increasing levels of lazy evaluation
- "Pass the dots" : for a single function arg, use ellipsis as the one func arg and then use it as the placeholder to handle NSE
- For more than one func arg, you "embrace" (double-moustache, {{ }}) the argument to do in-place indirection
- If you need to write functions that make names from user input, you need := ("colon-equals") operator
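A sketch of the three levels (function names are made up; assumes dplyr and rlang are installed):

```r
library(dplyr)

# Level 1: "pass the dots" - forward ... straight through
my_group_summarise <- function(df, ...) {
  df %>% group_by(...) %>% summarise(n = n(), .groups = "drop")
}

# Level 2: embrace a single argument with {{ }}
my_mean <- function(df, var) {
  df %>% summarise(mean = mean({{ var }}, na.rm = TRUE))
}

# Level 3: build a result name from user input with :=
my_named_mean <- function(df, var, name) {
  df %>% summarise("{name}_mean" := mean({{ var }}, na.rm = TRUE))
}
```

For example, `my_named_mean(mtcars, mpg, "mpg")` returns a one-column data frame named `mpg_mean`.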
```r
SpaghettiPlot <- function( observations, Y_var, time_var, dot_color_var, alpha=0.5 ) {
  #options(warn=-1) # Turn warnings off

  # This function uses Tidy Evaluation:
  # https://dplyr.tidyverse.org/articles/programming.html

  # Tidy evaluation concept 1:
  # Wrap function arguments in "{{ }}", a.k.a. the "embrace" open and close
  # operators, to get the name of the data variable out of the
  # function argument.

  # Tidy evaluation concept 2:
  # Use ensym() to create a symbol based on the expression
  # that the user passed when calling the function:
  Y_var_name_str <- ensym( Y_var )

  # glue() is a drop-in replacement for the "paste()" function
  # (has nothing to do with Tidy Evaluation, is just a cool library)
  plot_title <- glue( "{Y_var_name_str} Trajectories" )

  plot <- observations %>%
    ggplot( aes( x={{ time_var }},   # Tidy Eval
                 y={{ Y_var }},      # Tidy Eval
                 group=id_individual ) ) +
    geom_line( color='gray' ) +
    geom_point( aes( color={{ dot_color_var }} ),  # Tidy Eval
                alpha=alpha ) +
    geom_smooth( inherit.aes=F,
                 aes( x={{ time_var }},  # Tidy Eval
                      y={{ Y_var }} ),   # Tidy Eval
                 method='lm' ) +
    facet_grid( Sex ~ AgeCat ) +
    labs( y = Y_var_name_str ) +
    labs( title = plot_title )

  #options( warn=0 )
  return( plot )
}
```
Selection versus action
- group_by() vs group_by_at( vars( ends_with( "color" ) ) )
- compare: lapply style
```
starwars %>%
  summarize( height = mean( height, na.rm = TRUE ),
             mass = mean( mass, na.rm = TRUE ) )

# =======VERSUS==========
starwars %>%
  summarize_at( vars( height, mass ), ~ mean( ., na.rm = TRUE ) )

# =======VERSUS==========
summary_functions <- list( ~ mean( ., na.rm = TRUE ),
                           ~ sd( ., na.rm = TRUE ) )

# Results spread across columns
starwars %>%
  summarize_at( vars( height:mass ), summary_functions )
```
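The `summarize_at()`/`vars()` style is superseded in dplyr 1.0+ by `across()`; a sketch of the equivalents (assumes dplyr is installed, `starwars` ships with it):

```r
library(dplyr)

# Same columns, one function
h_mass <- starwars %>%
  summarize(across(c(height, mass), ~ mean(.x, na.rm = TRUE)))
h_mass

# Multiple functions; results spread across suffixed columns
both_stats <- starwars %>%
  summarize(across(height:mass,
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))
both_stats
```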
- transmute creates new vectors and passes only them along
Nested Data
Combining Summary Data frames
```
full_dataset %>%
  group_by( Race, Sex ) %>%
  nest %>%
  rowwise %>%
  transmute( stats = list( pivot_wider( enframe( summary( data$Age ) ) ) ),
             count = nrow( data ) ) %>%
  unnest( cols = c(stats, count) ) %>%
  arrange( desc( count ) )
```