Basil Okola: Parallel programming

Basil Okola

Sometimes you get to run repetitive tasks that consume a lot of time and computing memory. A good workaround is distributing these tasks across processors, what is universally referred to as core. parallel package comes in handy in such situations. The only thing you need to do is try bundle your procedures into a function so that you can easily apply them within the parallel library.

library(parallel)

Parallel programming is achieved by the parallel::mclapply() function. This function does not however work if you are running on windows like I am. For that, you’d have to use parallel::parLapply(). First you register available processors. If using windows:

cl = makeCluster(detectCores())

If in Linux/Damian systems:

numcores = detectCores()

I used parallel programming in a sample-size simulation that saved considerable amount of time, running 1000 simulations at each sample size.

set.seed(0123)
sample_power = function(n_sample = seq(50, 450, 25),
                        n_simulations = 1:1000,
                        alpha = 0.05 / 14,
                        effect = 26.57964) {
  set.seed(0123)
  n = c()
  pval = c()
  z = c() # z statistic
  powr = c()
  for (j in seq_along(n_sample)) {
    for (i in seq_along(n_simulations)) {
      treatment = rpois(n_sample[j], lambda = effect + 1.947)
      # Mean of the roses in water (mean from pilot study M = 12.53)
      control = rpois(n_sample[j], lambda = effect)
      sim_data = data.frame(
        response = c(treatment, control),
        treatment = rep(c(0, 1), each = n_sample[j])
        # ,
        # species = rep(c(0, 1), each = n_sample[j])
      )
      z[i] = summary(glm(
        response ~ treatment,
        #+ species,
        data = sim_data,
        family = poisson(link = 'log')
      ))$coeff["treatment", "z value"]
      pval[i] =  summary(glm(
        response ~ treatment,
        #+ species,
        data = sim_data,
        family = poisson(link = 'log')
      ))$coeff["treatment", "Pr(>|z|)"] / 2
      
    }
    n = c(n, n_sample[j])
    powr = c(powr, sum(pval < alpha) / length(n_simulations))
    out = data.frame(n, powr)
    
  }
  return(out)
}

Running the function in a normal way using lapply():

# normal run
system.time({
hh = lapply(seq(50, 500, 50),sample_power)
})

   user  system elapsed 
 140.65    0.80  143.33

Using parallel:parLapply(). Always remember to stop clusters after your task completes.

system.time({
  out = parLapply(cl, seq(50, 500, 50),sample_power)
})

   user  system elapsed 
   0.02    0.02   39.47

# remember to stop cluster
stopCluster(cl)

Thanks for reading!

Distill is a publication format for scientific and technical writing, native to the web.

Learn more about using Distill at https://rstudio.github.io/distill.

Parallel programming

Citation