A simple guide to sample size and power calculation, with a Poisson distribution as a case study.
We were recently tasked with determining sample size and power for a project in one of the courses in the Master of Statistics program. Sample size and power calculations are well documented and involve the following five steps:
Specify the parameter, hypothesis and test
Specify the significance level
Specify the effect size
Obtain values or estimates of any other parameters needed
Specify a target value for power
The project involved determining which of 14 compounds guaranteed a longer vase life for roses compared with plain water. We chose to model the count of days a flower remained fresh, adjusting for flower species and including some random effects as well.
We set \(\alpha=0.05\), an effect size of 1, and a target power of \(1 - \beta = 0.85\), with a right-tailed alternative hypothesis. Because 14 compounds were compared, we applied a Bonferroni correction for multiple testing, giving a per-comparison significance level of \(0.05/14\). A pilot study of flowers preserved in water provided the estimate of mean vase life used in the simulations.
We sampled from a Poisson distribution (\(Y \sim \text{Poisson}(\lambda)\)) to simulate power across a grid of sample sizes. We did not, however, include the random intercepts in the simulation, so it leaves out that extra source of variability and the resulting power estimates are likely somewhat optimistic.
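As a rough sanity check on the simulation (this is not part of the original analysis), the standard normal approximation for comparing two independent Poisson means with a one-sided test gives a per-group sample size of \(n \approx (z_{1-\alpha} + z_{1-\beta})^2 (\lambda_0 + \lambda_1) / (\lambda_1 - \lambda_0)^2\). Plugging in the values above suggests a little over 700 flowers per group, which is consistent with the wide grid of sample sizes searched below.
# rough closed-form check, assuming a simple two-sample Poisson comparison
lambda0 <- 26          # pilot estimate of mean vase life in water
lambda1 <- 27          # mean under the alternative (effect size of 1 day)
alpha   <- 0.05 / 14   # Bonferroni-corrected significance level
power   <- 0.85
n_per_group <- (qnorm(1 - alpha) + qnorm(power))^2 *
  (lambda0 + lambda1) / (lambda1 - lambda0)^2
ceiling(n_per_group)   # roughly 740 flowers per group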
Poisson_sims <-
  function(n_grid = seq(50, 1500, 50),
           lambda0 = 26,
           lambda1 = 27,
           alpha = 0.05 / 14,
           test = "two_sided",
           n_sims = 10000,
           seed_nr = 1234) {
    power_vec <- numeric(length(n_grid))
    for (j in seq_along(n_grid)) {
      # 1. Choose the sample size per group
      N <- n_grid[j]
      # 2. Select the parameters
      lambda.control <- lambda0
      lambda.treated <- lambda1
      # 3. Simulate a large number of experiments and test each one
      pval <- numeric(n_sims)
      zval <- numeric(n_sims)
      # reset the seed for every sample size so the runs are comparable
      set.seed(seed_nr)
      for (i in 1:n_sims) {
        # simulate both groups from a Poisson distribution
        controlGroup <- rpois(N, lambda = lambda.control)
        treatedGroup <- rpois(N, lambda = lambda.treated)
        simData <- data.frame(
          response = c(controlGroup, treatedGroup),
          treatment = rep(c(0, 1), each = N)
        )
        # Poisson GLM to test the treatment effect (Wald test)
        glm_fit <- summary(glm(response ~ treatment,
                               data = simData,
                               family = poisson()))
        pval[i] <- glm_fit$coefficients["treatment", "Pr(>|z|)"]
        zval[i] <- glm_fit$coefficients["treatment", "z value"]
        # convert the two-sided Wald p-value to a one-sided one if requested
        if (test == "greater" && zval[i] > 0) pval[i] <- pval[i] / 2
        if (test == "greater" && zval[i] < 0) pval[i] <- 1 - pval[i] / 2
        if (test == "less" && zval[i] < 0) pval[i] <- pval[i] / 2
        if (test == "less" && zval[i] > 0) pval[i] <- 1 - pval[i] / 2
      }
      # 4. Estimate power as the proportion of rejections
      power_vec[j] <- sum(pval < alpha) / n_sims
    }
    return(list(n_grid = n_grid, power_vec = power_vec))
  }
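Before farming this out to a cluster, a quick serial call with a much smaller number of simulations is a cheap way to confirm the function runs and returns the expected list (the arguments here are purely illustrative):
# serial smoke test with a reduced number of simulations
chk <- Poisson_sims(n_grid = c(100, 200), lambda1 = 28,
                    test = "greater", n_sims = 500)
chk$power_vec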
library(parallel)
library(dplyr)    # for the pipe used when assembling results
library(ggplot2)  # for the power curve plot
# detect the available cores and start a cluster (works on Linux and Windows)
numcores <- detectCores()
cl <- makeCluster(numcores)
# serial alternative, for comparison:
# system.time({
#   out_serial <- Poisson_sims(test = "greater")
# })
# clusterMap lets us pass additional named arguments to Poisson_sims
system.time({
  out_list <- clusterMap(cl, Poisson_sims, test = "greater")
})
user system elapsed
0.19 0.45 6207.24
# remember to stop cluster
stopCluster(cl)
out <- do.call(cbind, out_list$greater) %>% as.data.frame()
names(out) <- c('n', 'powr')
# split `out` into chunks of 10 rows so the power table prints side by side
outb <- split(out, (seq(nrow(out)) - 1) %/% 10)
out_tab <- do.call(cbind, outb)
# strip the chunk prefix ("0.", "1.", ...) that cbind adds to the column names
names(out_tab) <- sub('^[0-9]+\\.', "", names(out_tab))
# knitr::kable(out_tab, caption = "Power as simulated for different sample sizes")
# zoom in on a finer grid of sample sizes, this time with lambda1 = 28
cl <- makeCluster(detectCores())
system.time({
  out_list <- clusterMap(cl, Poisson_sims,
                         n_grid = seq(101, 250, 10),
                         test = "greater",
                         lambda1 = 28)
})
user system elapsed
0.03 0.00 185.35
# remember to stop cluster
stopCluster(cl)
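The clusterMap() call above relies on recycling to repeat test and lambda1 across the fifteen n_grid values. An equivalent and arguably more explicit form passes the constant arguments through MoreArgs (a sketch only; it needs a running cluster):
# constant arguments go in MoreArgs; only n_grid is mapped across workers
clusterMap(cl, Poisson_sims,
           n_grid = seq(101, 250, 10),
           MoreArgs = list(test = "greater", lambda1 = 28))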
out <- do.call(rbind.data.frame, out_list)
names(out) <- c('n', 'powr')
# split into chunks of 30 rows for display (here everything fits in one chunk)
outb <- split(out, (seq(nrow(out)) - 1) %/% 30)
out_tab <- do.call(cbind, outb)
names(out_tab) <- sub('^[0-9]+\\.', "", names(out_tab))
# knitr::kable(out_tab, caption = "Power as simulated for different sample sizes")
p2 <- ggplot(out, aes(x = n, y = powr)) +
  geom_line(color = 'red', size = 1) +
  # dashed reference lines at 80%, 85% and 90% power
  geom_hline(yintercept = c(0.80, 0.85, 0.90), linetype = 'dashed') +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(breaks = seq(101, 250, 10)) +
  labs(x = "Sample size", y = "Power")
print(p2)
ggsave(p2, filename = file.path(data_dir, "sample size zoom.png"))
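From the zoomed-in results, the smallest simulated sample size per group that reaches the 85% target can be read straight off the out data frame built above:
# smallest n per group whose simulated power meets the 0.85 target
min(out$n[out$powr >= 0.85])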
Additional links on sample size determination:
https://nickch-k.github.io/EconometricsSlides/Week_08/Power_Simulations.html
https://cran.r-project.org/web/packages/paramtest/vignettes/Simulating-Power.html