Parallel programming

parallel programming simulation sample size power

Running large tasks in parallel in R: Power and Sample size Simulation with parallel package.

Basil Okola https://github.com/Bokola
04-23-2021

Sometimes you get to run repetitive tasks that consume a lot of time and computing memory. A good workaround is distributing these tasks across processors, what is universally referred to as core. parallel package comes in handy in such situations. The only thing you need to do is try bundle your procedures into a function so that you can easily apply them within the parallel library.

library(parallel)

Parallel programming is achieved by the parallel::mclapply() function. This function does not however work if you are running on windows like I am. For that, you’d have to use parallel::parLapply(). First you register available processors. If using windows:

If in Linux/Damian systems:

numcores = detectCores()

I used parallel programming in a sample-size simulation that saved considerable amount of time, running 1000 simulations at each sample size.

set.seed(0123)
sample_power = function(n_sample = seq(50, 450, 25),
                        n_simulations = 1:1000,
                        alpha = 0.05 / 14,
                        effect = 26.57964) {
  set.seed(0123)
  n = c()
  pval = c()
  z = c() # z statistic
  powr = c()
  for (j in seq_along(n_sample)) {
    for (i in seq_along(n_simulations)) {
      treatment = rpois(n_sample[j], lambda = effect + 1.947)
      # Mean of the roses in water (mean from pilot study M = 12.53)
      control = rpois(n_sample[j], lambda = effect)
      sim_data = data.frame(
        response = c(treatment, control),
        treatment = rep(c(0, 1), each = n_sample[j])
        # ,
        # species = rep(c(0, 1), each = n_sample[j])
      )
      z[i] = summary(glm(
        response ~ treatment,
        #+ species,
        data = sim_data,
        family = poisson(link = 'log')
      ))$coeff["treatment", "z value"]
      pval[i] =  summary(glm(
        response ~ treatment,
        #+ species,
        data = sim_data,
        family = poisson(link = 'log')
      ))$coeff["treatment", "Pr(>|z|)"] / 2
      
    }
    n = c(n, n_sample[j])
    powr = c(powr, sum(pval < alpha) / length(n_simulations))
    out = data.frame(n, powr)
    
  }
  return(out)
}

Running the function in a normal way using lapply():

# normal run
system.time({
hh = lapply(seq(50, 500, 50),sample_power)
})
   user  system elapsed 
 140.65    0.80  143.33 

Using parallel:parLapply(). Always remember to stop clusters after your task completes.

system.time({
  out = parLapply(cl, seq(50, 500, 50),sample_power)
})
   user  system elapsed 
   0.02    0.02   39.47 
# remember to stop cluster
stopCluster(cl)

Thanks for reading!

Distill is a publication format for scientific and technical writing, native to the web.

Learn more about using Distill at https://rstudio.github.io/distill.

Citation

For attribution, please cite this work as

Okola (2021, April 23). Basil Okola: Parallel programming. Retrieved from https://bokola214.netlify.app/posts/2021-04-23-parallel-programming/

BibTeX citation

@misc{okola2021parallel,
  author = {Okola, Basil},
  title = {Basil Okola: Parallel programming},
  url = {https://bokola214.netlify.app/posts/2021-04-23-parallel-programming/},
  year = {2021}
}