A future for R

2016/07/04

useR 2016에서 있었던 것중에 특징적인 것을 가져왔습니다.

R markdown을 사용한 문서입니다.

특히 말미에 ODBC가 언급된 것이 흥미롭습니다.


A Future for R

Henrik Bengtsson, UC San Francisco

useR 2016, Stanford, CA, 2016-06-28

Slide comments

This is a 15 + 3 minute presentation.

My first R assignment

Calculating the sum 1 + 2 + … + 100:

> y <- sum(1:100)
> y
[1] 5050

My first R assignment

Calculating the sum 1 + 2 + … + 100:

> y <- slow_sum(1:100)    ## 2 min
> y
[1] 5050

Divide-and-conquer alternative: Calculate a = 1 + 2 + … + 50, and then b = 51 + 52 + … + 100, and then add up a and b:

> a <- slow_sum(1:50)     ## 1 min
> b <- slow_sum(51:100)   ## 1 min
> y <- a + b
> y
[1] 5050

Slide comments

My first future assignment

Parallel divide-and-conquer: Calculate a = 1 + 2 + … + 50 and b = 51 + 52 + … + 100 at the same time, and then add up a and b:

> library("future")
> plan(multiprocess)       ## Parallel processing

Future assignment: y %<-% x

> a %<-% slow_sum(1:50)    ## These two assignments are
> b %<-% slow_sum(51:100)  ## non-blocking and in parallel
>
> y <- a + b               ## Waits for a and b
> y
[1] 5050

Friedman & Wise (1976, 1977), Hibbard (1976), Baker & Hewitt (1977)

Definition: Future

y %<-% { expr }

Slide comments

References

R package: future (!)

Image: CRAN version: 1.0.0 Image: Travis CI: passing Image: Codecov: 97%

Many ways to resolve futures

Strategy: | :———————|:———————- eager | sequentially lazy | only if needed multiprocess | in parallel cluster | on a set of machines |

> plan(lazy)

> a %<-% slow_sum(1:50)
> b %<-% slow_sum(51:100)
>
> b
[1] 3775

Future a will never be evaluated!

Consistent futures everywhere (!)

> library("future")
> plan(multiprocess)
> demo("mandelbrot")

Calculating and plotting
Mandelbrot regions ...

Image: Screenshot of demo(mandelbrot)

- Region 1 done
- Region 2 done
- Region 7 done
- Region 5 done
- ...

Image: Screenshot of demo(mandelbrot)

Future takes care of globals (!)

Global variables and functions that are needed for the future expression to be resolved are identified automatically and frozen / exported. Packages are automatically loaded.

x <- rnorm(n=100)

y %<-% { slow_sum(x) }

Globals identified and frozen / exported: 1. slow_sum() - a function.

  1. x - a numeric vector of length 100.

Nested futures (!)

x <- rnorm(n=100)

a %<-% {
  c %<-% slow_sum(x[1:25])
  d %<-% slow_sum(x[26:50])
  c + d
}

b %<-% {
  c %<-% slow_sum(x[51:75])
  d %<-% slow_sum(x[76:100])
  c + d
}

y <- a + b

Different strategies for resolving, e.g.

High Performance Compute (HPC) clusters

Image: Future Art: Mainframe computer room

Map-Reduce for HPC

## Find our 40 FASTQ files
fastq <- dir(pattern = "[.]fq$")           ## 200 GB each!

## Align them
bam <- lapply(fastq, FUN = DNAseq::align)  ## 6 hours each!
library("BatchJobs")
reg <- makeRegistry(id="DNASEQseq")

fastq <- dir(pattern = "[.]fq$")
batchMap(reg, fastq, fun = DNAseq::align)
submitJobs(reg)
bam <- loadResults(reg)

Image: CRAN version: 0.12.0 Image: Travis CI: passing Image: Codecov: 90%

future.BatchJobs: Futures for HPC

future.BatchJobs: Job scheduler:
batchjobs_slurm Slurm
batchjobs_sge Sun Grid Engine
batchjobs_torque TORQUE / PBS
batchjobs_lsf Load Sharing Facility
batchjobs_openlava OpenLava
library("future.BatchJobs")
plan(batchjobs_slurm)
bam <- listenv()
for (i in seq_along(fastq)) {
  bam[[i]] %<-% DNAseq::align(fastq[i])
}

Parallel alternatives

Image: Future Art: Inside of the Stanford Torus by Donald Davis, NASA paintings, 1975 (public domain)

foreach: Futures with foreach()

## Align them
foreach(i = seq_along(fastq), .export = "fastq") %dopar% {
  DNAseq::align(fastq[i])
}

Image: CRAN version: 0.2.0 Image: Travis CI: passing Image: Codecov: 100% (48 lines!)

The doFuture package provides a foreach %dopar% adapter such that any type of futures can be used wherever the foreach package is used.

library("doFuture")
registerDoFuture()
plan(multiprocess)

Slide comments

Other foreach backend adapter

1100+ packages can now parallelize on HPC

Package that depends on foreach: ~300 directly + ~800 indirectly

These can now also take advantage of compute clusters:

library("doFuture")
registerDoFuture()      ## (a) Tell foreach to use futures

library("future.BatchJobs")
plan(batchjobs_slurm)   ## (b) Resolve via Slurm scheduler
library("plyr")
fastq <- dir(pattern = "[.]fq$")
bam <- llply(fastq, DNAseq::align, .parallel = TRUE)

Slide comments

> pkgs <- tools::dependsOnPkgs("foreach", recursive=TRUE, installed=utils::available.packages())
> str(pkgs)
 chr [1:1099] "abcrf" "admixturegraph" "ApacheLogProcessor" "BANFF" ...

Future bonuses

Image: Future Art: Vision of a future with video calls

Plot remotely - display locally

> library("future")
> plan(remote, workers="remote.server.org")
## Plot remotely
> g %<-% R.devices::capturePlot({
+   filled.contour(volcano, color.palette = terrain.colors)
+   title(main = "volcano data: filled contour map")
+ })
## Display locally
> g

Image: Screenshot of example(volcano) plot

Slide comments

Building a better future

@HenrikBengtsson

Thank you!

Appenix

A1. Summary

                 | **future**  | parallel  | foreach     | BatchJobs | BiocParallel

:——————–|:————|:———-|:————|:———-|:————- Synchroneous |** yes | yes | yes | yes | yes Asynchroneous | yes | yes | yes | yes | �� Uniform API | yes | | yes | yes | yes Extendable API | yes | | yes | yes | �� Globals | yes | | (yes) | | Packages | yes | | (yes) | | For loops | yes | | foreach() | | While loops | yes | | | | Nested config | yes | | | | Recursive protection | yes | mc | | mc | Early stopping | yes | | | | �� Traceback | yes | | | | �� RNG stream | manual **| mc & SNOW | doRNG | manual | mc & SNOW

Slide comments

Early stopping is not supported by foreach

> library("foreach")
> registerDoSEQ()
> v <- foreach(i = 1:3) %dopar% { message(i); if (i == 2) stop("x"); i }
1
2
3
Error in { : task 2 failed - "x"

A2. Future has a minimalistic API (!)

A3. Nested futures with plyr
library("future.BatchJobs")
plan(list(batchjobs_slurm, multiprocess))

library(plyr)

bam <- llply(fastq, function(fq) {

  chrs <- llply(1:24, function(chr) {
    DNAseq::align(fq, chromosome = chr)
  }, .parallel = TRUE)

  merge_chromosomes(chrs)

}, .parallel = TRUE)

A4. BiocParallel: Futures with Bioconductor

You can use futures with BiocParallel, e.g.

library("BiocParallel")
register(DoparParam(), default = TRUE)

library("doFuture")
registerDoFuture()

plan(multiprocess)

bplapply(fastq, function(fq) {
  DNAseq::align(fq)
})

A5. Profile code remotely - display locally

> library("future")
> plan(remote, workers="remote.server.org")

> library("profvis")

> dat <- data.frame(
+   x = rnorm(50e3),
+   y = rnorm(50e3)
+ )

## Profile remotely
> p %<-% profvis({
+   plot(x ~ y, data = dat)
+   m <- lm(x ~ y, data = dat)
+   abline(m, col = "red")
+ })
## Browse locally
> p

Image: Screenshot of profvis HTML report

A6. Implement Future API for new backend

To implement a new type of future, create the following methods:

  • f <- myfuture({ expr })

    • creates a future of class MyFuture extending Future
  • value(f) for MyFuture

    • gets value of future (blocking)
  • resolved(f) for MyFuture

    • checks if future is resolved or not (non-blocking)

A7. Futures I’d like to see

  • plan(r32)

    • e.g. RODBC::odbcConnectAccess() works only on 32-bit R.
  • plan(p2p)

    • Private and / or community-based peer-to-peer computer cluster
  • plan(rhelp)

    • Post R scripts to R-help and ask for the results :P

A8. Future improvements

Standardization

  • Capturing stdout and stderr uniformly
  • Random number generation (Pierre L’Ecuyer’s RNG streams)
    • easy to do manually right now

Optional features

  • Logging
  • Progress bars
  • Memoization (caching of results)
  • On the-fly time and memory benchmark statistics