A future for R

2016/07/04

Here is one of the more distinctive items from useR 2016.

It is a document written in R Markdown.

It is particularly interesting that ODBC gets a mention toward the end.


A Future for R

Henrik Bengtsson, UC San Francisco

useR 2016, Stanford, CA, 2016-06-28

Slide comments

This is a 15 + 3 minute presentation.

My first R assignment

Calculating the sum 1 + 2 + … + 100:

> y <- sum(1:100)
> y
[1] 5050

My first R assignment

Calculating the sum 1 + 2 + … + 100:

> y <- slow_sum(1:100)    ## 2 min
> y
[1] 5050

Divide-and-conquer alternative: Calculate a = 1 + 2 + … + 50, and then b = 51 + 52 + … + 100, and then add up a and b:

> a <- slow_sum(1:50)     ## 1 min
> b <- slow_sum(51:100)   ## 1 min
> y <- a + b
> y
[1] 5050
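The slides never define slow_sum(); a hypothetical stand-in that simply adds an artificial delay to the summation would be:

## Hypothetical stand-in for the slides' slow_sum():
## sums the elements one by one, sleeping briefly per element
slow_sum <- function(x) {
  total <- 0
  for (value in x) {
    Sys.sleep(1)            ## pretend each element is expensive
    total <- total + value
  }
  total
}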


My first future assignment

Parallel divide-and-conquer: Calculate a = 1 + 2 + … + 50 and b = 51 + 52 + … + 100 at the same time, and then add up a and b:

> library("future")
> plan(multiprocess)       ## Parallel processing

Future assignment: y %<-% x

> a %<-% slow_sum(1:50)    ## These two assignments are
> b %<-% slow_sum(51:100)  ## non-blocking and in parallel
>
> y <- a + b               ## Waits for a and b
> y
[1] 5050

Friedman & Wise (1976, 1977), Hibbard (1976), Baker & Hewitt (1977)
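A side note: availableCores() in the future package reports how many cores it detects, which is what multiprocess uses for its default number of workers; on an 8-core machine it looks roughly like this:

> library("future")
> availableCores()
system
     8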

Definition: Future

A future is an abstraction for a value that may become available at some point in the future; at any time it is either unresolved or resolved.

y %<-% { expr }
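The same construct can be written with the package's explicit functions (see also Appendix A2); a minimal sketch:

library("future")
f <- future({ 2 * 21 })   ## explicitly create a future for an expression
resolved(f)               ## poll: is its value available yet?
value(f)                  ## block if needed, then return the value (42)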


References

R package: future (!)

Badges: CRAN version 1.0.0, Travis CI passing, Codecov 97%

Many ways to resolve futures

Strategy:
- eager: sequentially
- lazy: only if needed
- multiprocess: in parallel
- cluster: on a set of machines
> plan(lazy)

> a %<-% slow_sum(1:50)
> b %<-% slow_sum(51:100)
>
> b
[1] 3775

Future a will never be evaluated!
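If a were used later, the lazy future would still be resolved at that point:

> a                        ## first use resolves the lazy future
[1] 1275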

Consistent futures everywhere (!)

> library("future")
> plan(multiprocess)
> demo("mandelbrot")

Calculating and plotting
Mandelbrot regions ...

Image: Screenshot of demo(mandelbrot)

- Region 1 done
- Region 2 done
- Region 7 done
- Region 5 done
- ...

Image: Screenshot of demo(mandelbrot)

Future takes care of globals (!)

Global variables and functions that are needed for the future expression to be resolved are identified automatically and frozen / exported. Packages are automatically loaded.

x <- rnorm(n=100)

y %<-% { slow_sum(x) }

Globals identified and frozen / exported:

  1. slow_sum() - a function.

  2. x - a numeric vector of length 100.

Nested futures (!)

x <- rnorm(n=100)

a %<-% {
  c %<-% slow_sum(x[1:25])
  d %<-% slow_sum(x[26:50])
  c + d
}

b %<-% {
  c %<-% slow_sum(x[51:75])
  d %<-% slow_sum(x[76:100])
  c + d
}

y <- a + b

Different strategies can be used for resolving the outer and inner futures, e.g. on High Performance Compute (HPC) clusters.
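For example, each level of nesting can get its own strategy via plan(list(...)); the combination below is the one used later in Appendix A3:

library("future")
library("future.BatchJobs")

## Outer futures (a, b): jobs submitted to a Slurm scheduler
## Inner futures (c, d): parallel processes on the allocated node
plan(list(batchjobs_slurm, multiprocess))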

Image: Future Art: Mainframe computer room

Map-Reduce for HPC

## Find our 40 FASTQ files
fastq <- dir(pattern = "[.]fq$")           ## 200 GB each!

## Align them
bam <- lapply(fastq, FUN = DNAseq::align)  ## 6 hours each!

The same map step submitted to a job scheduler via BatchJobs:

library("BatchJobs")
reg <- makeRegistry(id="DNASEQseq")

fastq <- dir(pattern = "[.]fq$")
batchMap(reg, fastq, fun = DNAseq::align)
submitJobs(reg)
waitForJobs(reg)                           ## wait for the scheduler to finish
bam <- loadResults(reg)

Badges: CRAN version 0.12.0, Travis CI passing, Codecov 90%

future.BatchJobs: Futures for HPC

future.BatchJobs strategies map onto job schedulers:
- batchjobs_slurm: Slurm
- batchjobs_sge: Sun Grid Engine
- batchjobs_torque: TORQUE / PBS
- batchjobs_lsf: Load Sharing Facility
- batchjobs_openlava: OpenLava

library("future.BatchJobs")
library("listenv")
plan(batchjobs_slurm)

bam <- listenv()
for (i in seq_along(fastq)) {
  bam[[i]] %<-% DNAseq::align(fastq[i])
}
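One way to then gather the results into a regular list (a sketch, not from the slides) is as.list(), which blocks until all jobs have finished:

bam <- as.list(bam)   ## waits for and collects the values of all futures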

Parallel alternatives

Image: Future Art: Inside of the Stanford Torus by Donald Davis, NASA paintings, 1975 (public domain)

foreach: Futures with foreach()

## Align them
bam <- foreach(i = seq_along(fastq), .export = "fastq") %dopar% {
  DNAseq::align(fastq[i])
}

Badges: CRAN version 0.2.0, Travis CI passing, Codecov 100% (48 lines!)

The doFuture package provides a foreach %dopar% adapter such that any type of futures can be used wherever the foreach package is used.

library("doFuture")
registerDoFuture()
plan(multiprocess)
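A minimal self-contained run of this setup, with a toy sqrt() computation standing in for the alignment step:

library("doFuture")
library("foreach")
registerDoFuture()           ## %dopar% now creates futures
plan(multiprocess)           ## ...which are resolved in parallel

y <- foreach(i = 1:4) %dopar% sqrt(i)
unlist(y)
## [1] 1.000000 1.414214 1.732051 2.000000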


Other foreach backend adapters

1100+ packages can now parallelize on HPC

Packages that depend on foreach: ~300 directly + ~800 indirectly

These can now also take advantage of compute clusters:

library("doFuture")
registerDoFuture()      ## (a) Tell foreach to use futures

library("future.BatchJobs")
plan(batchjobs_slurm)   ## (b) Resolve via Slurm scheduler
library("plyr")
fastq <- dir(pattern = "[.]fq$")
bam <- llply(fastq, DNAseq::align, .parallel = TRUE)


> pkgs <- tools::dependsOnPkgs("foreach", recursive=TRUE, installed=utils::available.packages())
> str(pkgs)
 chr [1:1099] "abcrf" "admixturegraph" "ApacheLogProcessor" "BANFF" ...

Future bonuses

Image: Future Art: Vision of a future with video calls

Plot remotely - display locally

> library("future")
> plan(remote, workers="remote.server.org")
## Plot remotely
> g %<-% R.devices::capturePlot({
+   filled.contour(volcano, color.palette = terrain.colors)
+   title(main = "volcano data: filled contour map")
+ })
## Display locally
> g

Image: Screenshot of example(volcano) plot


Building a better future

@HenrikBengtsson

Thank you!

Appendix

A1. Summary

|                      | **future** | parallel  | foreach   | BatchJobs | BiocParallel |
|:---------------------|:-----------|:----------|:----------|:----------|:-------------|
| Synchronous          | **yes**    | yes       | yes       | yes       | yes          |
| Asynchronous         | **yes**    | yes       | yes       | yes       | ��           |
| Uniform API          | **yes**    |           | yes       | yes       | yes          |
| Extendable API       | **yes**    |           | yes       | yes       | ��           |
| Globals              | **yes**    |           | (yes)     |           |              |
| Packages             | **yes**    |           | (yes)     |           |              |
| For loops            | **yes**    |           | foreach() |           |              |
| While loops          | **yes**    |           |           |           |              |
| Nested config        | **yes**    |           |           |           |              |
| Recursive protection | **yes**    | mc        |           | mc        |              |
| Early stopping       | **yes**    |           |           |           | ��           |
| Traceback            | **yes**    |           |           |           | ��           |
| RNG stream           | **manual** | mc & SNOW | doRNG     | manual    | mc & SNOW    |

Slide comments

Early stopping is not supported by foreach

> library("foreach")
> registerDoSEQ()
> v <- foreach(i = 1:3) %dopar% { message(i); if (i == 2) stop("x"); i }
1
2
3
Error in { : task 2 failed - "x"

A2. Future has a minimalistic API (!)

x <- 101:200

Create implicit futures

a %<-% slow_sum(x[1:50])
b %<-% slow_sum(x[51:100])

Get their values

y <- a + b

y
[1] 15050

x <- 101:200

Create explicit futures

f <- future( slow_sum(x[1:50]) )
g <- future( slow_sum(x[51:100]) )

Get their values

y <- value(f) + value(g)

y
[1] 15050
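The two flavors are connected: futureOf() (also part of the future package) returns the Future object behind an implicit %<-% assignment, so it can be polled or read explicitly:

a %<-% slow_sum(x[1:50])
f <- futureOf(a)      ## the Future object underlying 'a'
resolved(f)           ## poll without blocking
value(f)              ## the same value that 'a' yields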

A3. Nested futures with plyr

library("future.BatchJobs")
plan(list(batchjobs_slurm, multiprocess))

library(plyr)

bam <- llply(fastq, function(fq) {

  chrs <- llply(1:24, function(chr) {
    DNAseq::align(fq, chromosome = chr)
  }, .parallel = TRUE)

  merge_chromosomes(chrs)

}, .parallel = TRUE)

A4. BiocParallel: Futures with Bioconductor

You can use futures with BiocParallel, e.g.

library("BiocParallel")
register(DoparParam(), default = TRUE)

library("doFuture")
registerDoFuture()

plan(multiprocess)

bplapply(fastq, function(fq) {
  DNAseq::align(fq)
})

A5. Profile code remotely - display locally

> library("future")
> plan(remote, workers="remote.server.org")

> library("profvis")

> dat <- data.frame(
+   x = rnorm(50e3),
+   y = rnorm(50e3)
+ )

## Profile remotely
> p %<-% profvis({
+   plot(x ~ y, data = dat)
+   m <- lm(x ~ y, data = dat)
+   abline(m, col = "red")
+ })
## Browse locally
> p

Image: Screenshot of profvis HTML report

A6. Implement Future API for new backend

To implement a new type of future, create the following methods:

A7. Futures I’d like to see

A8. Future improvements

Standardization

Optional features