30 Simulation template: hypothesis testing

Author

Authors names go here

Published

May 20, 2026

31 Introduction

This introduction is structured following the ADEMP framework (Aims, Data-generating process, Estimands, Methods, Performance) as described by Morris, White, and Crowther (2019) and operationalised for psychology by Siepe et al. (2024). The aim of working through the framework explicitly is to make the design decisions behind the simulation visible and auditable, so that a reader can judge whether the conclusions are supported by the conditions actually examined.

31.1 Project description

This is a worked example simulation study that evaluates the operating characteristics of Student’s independent-samples t-test as the per-condition sample size and the population effect size are varied. The motivating empirical context is a two-arm between-subjects randomised controlled trial in which a continuous outcome is compared between a control group and an intervention group, and the analyst wishes to know how often a t-test will (a) reject the null hypothesis when it is true (Type I error) and (b) reject the null hypothesis when it is false (statistical power), across the range of sample sizes typically encountered in psychology.

The example is intentionally small and self-contained so that it can serve as a template for more substantive simulations: every section of the introduction maps onto an ADEMP heading, every parameter manipulated in the code maps onto a “DGP factor”, and every metric reported in the results section is justified in the “Performance measures” subsection.

31.2 Prior related work

The behaviour of Student’s t-test under its parametric assumptions is one of the most extensively studied results in classical statistics. Cohen’s (1988) tabulated power curves give the analytic expectations against which our simulated rejection rates can be compared, and any modern a priori power-analysis tool (e.g. G*Power, the pwr R package) implements these same curves. We therefore do not expect the simulation to produce novel methodological findings; rather, it is a calibration exercise that confirms the simulation infrastructure is implemented correctly and provides a reference against which future, more substantive simulations can be checked. No preliminary simulations were conducted by the contributors on this specific question prior to this template.

31.3 Aims

The statistical task is hypothesis testing, evaluated through the long-run error rates of null hypothesis significance testing with Student’s t-test: for each replicated dataset the analyst makes a binary decision (reject vs. retain the null hypothesis of zero mean difference) by comparing a p-value against a fixed nominal significance level of α = .05.

The aims are:

To estimate the empirical Type I error rate (False Positive Rate) of the equal-variance Student’s t-test when its assumptions are met and the population mean difference is exactly zero, and to verify that this rate is close to the nominal α = .05 across all sample sizes considered.
To estimate the empirical Type II error rate (False Negative Rate, i.e., 1 - statistical power) of the same test when the population standardised mean difference takes the values conventionally labelled small, medium, and large (Cohen’s d = 0.2, 0.5, 0.8), again as a function of per-condition sample size. The results report this quantity as its complement, statistical power.

31.4 Data-Generating Process

31.4.1 DGP specification approach

The DGP is fully parametric and is not anchored to any empirical dataset. Outcome scores in each of the two conditions are drawn independently from a univariate Normal distribution with a known mean and a known, common standard deviation. Formally, for each replication:

\[Y_{ij} \;\overset{\text{iid}}{\sim}\; \mathcal{N}(\mu_j,\,\sigma^2), \qquad i = 1, \ldots, n_{\text{per condition}}; \quad j \in \{\text{control},\, \text{intervention}\}\]

with $\mu_{\text{control}} = 0$, $\mu_{\text{intervention}} \in \{0,\,0.2,\,0.5,\,0.8\}$, and $\sigma = 1$ in both conditions. This is exactly what generate_data() (defined below) produces by calling rnorm() once per condition.

This DGP exactly satisfies the assumptions of the equal-variance Student’s t-test (independent observations, normality within group, homogeneity of variance), which is the desired baseline for a calibration study of the test. By setting $\sigma = 1$ and $\mu_{\text{control}} = 0$, the population unstandardised mean difference and the population standardised mean difference (Cohen’s d) are numerically identical; only a sample estimate of d would carry a small finite-sample bias.
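
As a worked instance of this identity, the medium-effect condition can be written out as follows. The symbol $\delta$ for the population standardised mean difference is notation introduced here for illustration and is not used elsewhere in the template:

\[\delta \;=\; \frac{\mu_{\text{intervention}} - \mu_{\text{control}}}{\sigma} \;=\; \frac{0.5 - 0}{1} \;=\; 0.5\]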

31.4.2 DGP factors

Two factors are varied across simulation conditions:

n_per_condition — the per-group sample size.
mean_intervention — the population mean in the intervention condition. Because the control mean is fixed at 0 and the within-group standard deviation is fixed at 1, this directly equals the population standardised mean difference (Cohen’s d).

31.4.3 Factor values and settings

Factor values:

n_per_condition ∈ {10, 20, 30, …, 200}, i.e. 20 levels in steps of 10. This range spans from very small samples (where asymptotic approximations should still be benign for Normal data) to samples larger than most single-site psychology studies.
mean_intervention ∈ {0, 0.2, 0.5, 0.8}, i.e. 4 levels corresponding to a true null and to Cohen’s (1988) conventional small, medium, and large effect-size benchmarks.

Settings held constant across all conditions:

mean_control = 0 (control group population mean).
sd = 1 in both conditions, so the equal-variance assumption holds exactly and the population effect size in standardised units equals mean_intervention.
Balanced design (the same n_per_condition is drawn for both groups in every replication).
Single analytic method: equal-variance Student’s two-sample t-test, two-sided alternative, α = .05.

31.4.4 Factor combination and number of conditions

Of the four factor-combination strategies enumerated by Siepe et al. (2024, Figure 1) — fully factorial, partially factorial, one-at-a-time, and scattershot — we use a fully factorial crossing, which is the default recommendation when it is computationally feasible because it allows the main effects of, and interaction between, the two factors to be disentangled. With 20 sample sizes × 4 effect sizes this yields 80 unique conditions, each replicated 1000 times for 80,000 simulated datasets in total.
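
The condition count can be checked with a few lines of base R. This is a minimal sketch that uses expand.grid() rather than the tidyr::expand_grid() call in the Methods section; the object names are illustrative only.

```r
# Base-R sketch of the fully factorial design (illustrative object names).
design <- expand.grid(
  n_per_condition   = seq(from = 10, to = 200, by = 10),  # 20 sample sizes
  mean_intervention = c(0, 0.2, 0.5, 0.8)                 # 4 effect sizes
)

n_conditions   <- nrow(design)                  # unique conditions in the crossing
n_replications <- 1000                          # replications per condition
n_datasets     <- n_conditions * n_replications # total simulated datasets

c(conditions = n_conditions, datasets = n_datasets)
```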

31.5 Estimands and Targets

The statistical task is hypothesis testing, so the target of every condition is the truth value of the null hypothesis under that condition’s DGP — not the rejection rate itself. Siepe et al. (2024, footnote 1) make this distinction explicitly: it is tempting to call the Type I error rate the “target”, but the Type I error rate is a performance measure used to evaluate the method against the target (the null). The two relevant targets here are therefore:

Target 1 — true null. In the conditions where the population mean difference is zero (mean_intervention = 0), the null hypothesis of zero mean difference is true. The performance measure evaluated against this target is the empirical rejection rate, interpreted as the Type I error rate; the desired benchmark is the nominal level α = .05.
Target 2 — false null. In the conditions where the population mean difference is non-zero (mean_intervention ∈ {0.2, 0.5, 0.8}), the null is false. The performance measure evaluated against this target is again the empirical rejection rate, now interpreted as statistical power; the benchmark is the analytic power curve given by Cohen (1988) and reproduced by tools such as pwr::pwr.t.test().

The population mean difference $\mu_{\text{intervention}} - \mu_{\text{control}}$ is not an estimand of this simulation; only the binary reject/retain decision derived from the p-value is analysed. See simulation_template_estimation.qmd for the corresponding estimation-task template that targets this quantity using bias, sampling variance, CI coverage, and CI width.
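
The analytic benchmark named under Target 2 can also be obtained without the pwr package. The sketch below assumes that base R's stats::power.t.test() is an acceptable stand-in for pwr::pwr.t.test() for this design; analytic_power() is a hypothetical helper name, not part of the template.

```r
# Analytic power of the two-sided, equal-variance two-sample t-test,
# computed with base R's stats::power.t.test() (not the pwr package).
analytic_power <- function(n_per_condition, d, alpha = 0.05) {
  stats::power.t.test(
    n = n_per_condition,
    delta = d,          # mean difference; equals Cohen's d because sd = 1
    sd = 1,
    sig.level = alpha,
    type = "two.sample",
    alternative = "two.sided"
  )$power
}

# Benchmark for one condition: d = 0.5 with 64 participants per group
power_medium_64 <- analytic_power(n_per_condition = 64, d = 0.5)
power_medium_64
```

The returned value should sit close to the conventional 80% target, which is the benchmark the simulated rejection rate for that condition is compared against.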

31.6 Methods and extracted quantities

A single analytic method is included: the equal-variance Student’s two-sample t-test, fitted with stats::t.test(formula = score ~ condition, var.equal = TRUE, alternative = "two.sided", conf.level = 1 - alpha). The equal-variance form is chosen rather than Welch’s t-test because the DGP guarantees equal population variances; this is the test whose nominal Type I error rate we wish to verify.

For each replication the following quantities are extracted from the fitted model (via parameters::model_parameters()):

p — the two-sided p-value associated with the test of the null hypothesis that the mean difference is zero.

p is the input to the primary performance measure defined below (the empirical rejection rate); it is not a performance measure in its own right.
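
For readers who prefer to avoid the parameters dependency, the same p-value can be read directly from the object returned by t.test(). The sketch below is illustrative: the seed, the sample size of 30 per group, and the object names are arbitrary choices, not settings of the simulation.

```r
set.seed(1)  # arbitrary seed, for a reproducible illustration only

# One simulated dataset from the medium-effect condition (n = 30 per group)
toy_data <- data.frame(
  condition = factor(rep(c("control", "intervention"), each = 30),
                     levels = c("control", "intervention")),
  score = c(rnorm(30, mean = 0, sd = 1), rnorm(30, mean = 0.5, sd = 1))
)

fit <- t.test(score ~ condition, data = toy_data,
              var.equal = TRUE, alternative = "two.sided")

# t.test() returns an "htest" object; the p-value is stored in $p.value
p_value <- fit$p.value
p_value
```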

31.7 Performance and Uncertainty

31.7.1 Performance measures

The single primary performance measure is the empirical rejection rate, i.e. the proportion of replications in a condition for which the test rejects the null hypothesis at α = .05:

\[\widehat{R} \;=\; \frac{1}{n_{\text{sim}}}\sum_{k=1}^{n_{\text{sim}}} \mathbb{1}\{p_k < \alpha\}\]

where $n_{\text{sim}}$ is the number of replications per condition (written K in this text and set by the length of iteration in the code) and $p_k$ is the p-value from replication $k$. This is the “Power (or Type I error rate)” row of Siepe et al. (2024, Table 3), which itself reproduces Morris et al. (2019, Table A1). Under the null DGP it estimates the Type I error rate; under the non-null DGP it estimates statistical power. The quantity is computed via simhelpers::calc_rejection(), which returns both the point estimate and its Monte Carlo standard error.
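
The formula above amounts to a single call to mean(). The following sketch computes it by hand on made-up p-values; rejection_rate() is a hypothetical helper shown only to make the definition concrete, and is not the simhelpers implementation.

```r
# Empirical rejection rate: the share of p-values strictly below alpha.
rejection_rate <- function(p_values, alpha = 0.05) {
  mean(p_values < alpha)
}

# Made-up p-values from five hypothetical replications
toy_p <- c(0.003, 0.048, 0.051, 0.200, 0.740)
toy_rate <- rejection_rate(toy_p)   # two of the five values are below .05
toy_rate
```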

31.7.2 Monte Carlo uncertainty

We report Monte Carlo uncertainty in tables (MCSEs next to the estimated performance measures) and in plots (error bars with ±1 MCSE around estimated performance measures). MCSEs are computed by simhelpers::calc_rejection() (Joshi & Pustejovsky, 2022) using the closed-form binomial expression for the standard error of a sample proportion, which is the third column of the same row of Siepe et al. (2024, Table 3):

\[\widehat{\text{MCSE}}(\widehat{R}) \;=\; \sqrt{\frac{\widehat{R}\,(1 - \widehat{R})}{n_{\text{sim}}}}\]
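
The expression is easy to evaluate directly. The sketch below uses a hypothetical helper, rejection_mcse(), to show the MCSE at two rejection rates that matter for this study, with 1000 replications per condition.

```r
# Binomial Monte Carlo standard error of an estimated rejection rate.
rejection_mcse <- function(rate, n_sim) {
  sqrt(rate * (1 - rate) / n_sim)
}

mcse_null  <- rejection_mcse(rate = 0.05, n_sim = 1000)  # near the nominal alpha
mcse_worst <- rejection_mcse(rate = 0.50, n_sim = 1000)  # widest possible case
round(c(null = mcse_null, worst = mcse_worst), 4)
```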

31.7.3 Number of simulation repetitions

We use K = 1000 replications per condition ($n_{\text{sim}} = 1000$ in Siepe et al.’s notation). The required number of replications follows from inverting the binomial MCSE formula above (Siepe et al., 2024, Table 3, last column):

\[n_{\text{sim}} \;\geq\; \frac{\widehat{R}\,(1-\widehat{R})}{\text{MCSE}_*^2}\]

where $\text{MCSE}_*$ is the largest MCSE the analyst is willing to tolerate. Substituting the worst case ($\widehat{R} = 0.5$) gives the worst-case MCSE at K = 1000 of $\sqrt{0.25/1000} \approx 0.016$, i.e. estimated rejection rates near the steep middle of the power curve are uncertain to roughly ±1.6 percentage points. This is loose for a publishable methodological study but adequate for a teaching example that needs to re-render quickly; a real preregistered simulation aiming for $\text{MCSE}_* = 0.005$ at the same worst case would require $n_{\text{sim}} = 0.25 / 0.005^2 = 10{,}000$ replications per condition.
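
The inversion can be wrapped in a small function so that other tolerances are easy to explore. This is a sketch with a hypothetical helper name, required_n_sim(); the rounding step is an implementation detail added here to avoid off-by-one results from floating-point arithmetic.

```r
# Replications needed so that the binomial MCSE does not exceed mcse_target.
required_n_sim <- function(rate, mcse_target) {
  raw <- rate * (1 - rate) / mcse_target^2
  # round() guards against floating-point noise before rounding up
  ceiling(round(raw, digits = 6))
}

n_worst <- required_n_sim(rate = 0.50, mcse_target = 0.005)  # worst case
n_null  <- required_n_sim(rate = 0.05, mcse_target = 0.005)  # near alpha
c(worst_case = n_worst, near_alpha = n_null)
```

Planning for the worst case is the conservative choice, since the rejection rate in a given condition is not known before the simulation is run.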

31.7.4 Non-convergence and missing values

Student’s t-test has a closed-form solution and does not iterate, so non-convergence cannot occur. The only failure modes are degenerate samples (e.g. zero within-group variance), which have probability zero under a continuous Normal DGP. We therefore expect no missing values; if any are nevertheless observed in the simulation log we will report them per condition and exclude the affected replications case-wise.
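
One way to make such failures visible rather than fatal is to wrap the test so that an error is recorded as a missing value. The sketch below is an assumption about how this could be done; safe_p_value() is a hypothetical helper and is not part of the template's analyse() function.

```r
# Return the p-value, or NA if t.test() signals an error (e.g. constant data).
safe_p_value <- function(x, y) {
  tryCatch(
    t.test(x, y, var.equal = TRUE)$p.value,
    error = function(e) NA_real_
  )
}

p_degenerate <- safe_p_value(x = c(1, 1, 1), y = c(1, 1, 1))  # zero variance
p_regular    <- safe_p_value(x = c(0.1, -0.4, 1.2), y = c(0.9, 1.5, 0.3))
c(degenerate = p_degenerate, regular = p_regular)
```

Replications that return NA can then be counted per condition and dropped before the rejection rate is computed.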

31.7.5 Interpretation of performance measures (optional)

For the null conditions we will judge performance acceptable if the empirical rejection rates scatter around the nominal α = .05 by no more than Monte Carlo error allows. A ±1 MCSE interval is expected to cover the true rate in only about two-thirds of conditions, so it is not required to cover α = .05 at every sample size; as a fixed reference we use Bradley’s (1978) criteria, under which rates between 0.9α and 1.1α (0.045 to 0.055) are “stringent” and rates between 0.5α and 1.5α (0.025 to 0.075) are “liberal”. For the non-null conditions there is no acceptable/unacceptable threshold per se; instead we will judge the simulation a success if (i) power increases with both n_per_condition and mean_intervention, apart from reversals within Monte Carlo error, and (ii) the empirical power curve is visually consistent with the analytic power curve from pwr::pwr.t.test() (e.g. ≈ 80% power at n ≈ 64/group for d = 0.5). No inferential models are fitted to the simulation output; results are summarised descriptively in tables and plots.
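
As a rough guide to how much scatter around α = .05 is compatible with a well-calibrated test, the sketch below uses the Normal approximation to the binomial to compute how often an estimate should fall within ±k MCSE of the true rate. The approximation and the helper name coverage_within() are assumptions made for illustration.

```r
# Normal-approximation probability that an estimate lies within
# +/- k Monte Carlo standard errors of the true rejection rate.
coverage_within <- function(k) pnorm(k) - pnorm(-k)

n_null_conditions <- 20                      # sample sizes with a true null
cover_1 <- coverage_within(1)                # about 0.68
cover_2 <- coverage_within(2)                # about 0.95
expected_inside_1 <- n_null_conditions * cover_1

round(c(cover_1 = cover_1, cover_2 = cover_2,
        expected_inside_1 = expected_inside_1), 2)
```

Under this approximation, roughly 14 of the 20 null conditions would be expected to have a ±1 MCSE interval that covers the nominal level, even when the test is perfectly calibrated.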

32 Methods

32.1 Dependencies

library(tidyr)
library(dplyr)
library(furrr)
library(janitor)
library(parameters)
library(simhelpers)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)

# set up parallelisation for {furrr}
# `furrr_options(seed = TRUE)` (used below in future_pmap calls) advances an
# L'Ecuyer-CMRG stream rather than re-seeding per worker, giving the
# parallel-safe reproducibility recommended in Siepe et al. (2024, p. 8).
plan(multisession)

32.2 Functions

32.2.1 Data generating process

generate_data <- function(n_per_condition,
                          mean_control,
                          mean_intervention,
                          sd) {
  
  data_control <- 
    tibble(condition = "control",
           score = rnorm(n = n_per_condition, mean = mean_control, sd = sd))
  
  data_intervention <- 
    tibble(condition = "intervention",
           score = rnorm(n = n_per_condition, mean = mean_intervention, sd = sd))
  
  # combine
  data <- bind_rows(data_control,
                    data_intervention) |>
    # ensure control is the reference condition
    mutate(condition = factor(condition, levels = c("control", "intervention")))
  
  return(data)
}

32.2.2 Analysis

analyse <- function(data, alpha = 0.05) {
  # fit Student's t-test
  fit <- t.test(formula = score ~ condition,
                data = data,
                var.equal = TRUE,
                conf.level = 1 - alpha,
                alternative = "two.sided")
  
  # extract p value
  results <- fit %>%
    model_parameters() %>%
    as_tibble() %>% 
    clean_names() %>% 
    # select columns of interest
    select(p)
    
  return(results)
}

32.3 Define experiment

experiment_parameters_grid <- expand_grid(
  n_per_condition = seq(from = 10, to = 200, by = 10),
  mean_control = 0,
  mean_intervention = c(0, 0.2, 0.5, 0.8),
  sd = 1,
  iteration = 1:1000L
) |>
  # define population values
  mutate(population_mean_diff = mean_intervention - mean_control) |>
  # define unique conditions
  mutate(condition = paste0("N = ", n_per_condition,
                            ", Pop mean diff = ", population_mean_diff))

32.4 Run simulation

set.seed(42)

simulation <- experiment_parameters_grid |>
  # generate data
  mutate(data = future_pmap(.l = list(n_per_condition = n_per_condition, 
                                      mean_control = mean_control, 
                                      mean_intervention = mean_intervention, 
                                      sd = sd),
                            .f = generate_data,
                            .progress = TRUE,
                            .options = furrr_options(seed = TRUE))) |>
  # analyse data
  mutate(results = future_pmap(.l = list(data = data),
                               .f = analyse,
                               .progress = TRUE,
                               .options = furrr_options(seed = TRUE)))

# optionally save to disk (requires the readr package, which is not attached above)
# readr::write_rds(x = simulation, file = "simulation.rds", compress = "gz")
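
Before launching the full parallel run, it can be useful to check a single condition serially. The following dependency-free sketch is not the template's pipeline: it uses base R's replicate() in place of furrr, only 500 replications, an arbitrary seed, and a hypothetical helper one_p_value().

```r
set.seed(2024)  # arbitrary seed for the illustration

# One replication: simulate both groups and return the t-test p-value
one_p_value <- function(n_per_condition, mean_intervention) {
  control      <- rnorm(n_per_condition, mean = 0, sd = 1)
  intervention <- rnorm(n_per_condition, mean = mean_intervention, sd = 1)
  t.test(intervention, control, var.equal = TRUE)$p.value
}

# 500 replications of a single condition (n = 64 per group, d = 0.5)
p_values <- replicate(500, one_p_value(n_per_condition = 64,
                                       mean_intervention = 0.5))

power_hat  <- mean(p_values < 0.05)
power_mcse <- sqrt(power_hat * (1 - power_hat) / length(p_values))
round(c(power = power_hat, mcse = power_mcse), 3)
```

If the estimate is far from the analytic value of about 80% for this condition, the data-generating or analysis function is worth inspecting before scaling up.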

33 Results

simulation_summary <- simulation |>
  unnest(results) |>
  group_by(n_per_condition,
           population_mean_diff,
           condition) |>
  # performance and uncertainty metrics for inference (hypothesis tests with binary decisions)
  reframe(
    calc_rejection(data       = pick(everything()),
                   p_values   = p, 
                   alpha      = 0.05)
  ) |>
  rename(empirical_detection_rate = rej_rate,
         empirical_detection_rate_mcse = rej_rate_mcse)

# # equivalent summary computed manually (replaces the reframe() call above):
#   summarize(
#     empirical_detection_rate = mean(p < 0.05),
#     empirical_detection_rate_mcse = sqrt(empirical_detection_rate * (1 - empirical_detection_rate) / n()),
#     .groups = "drop"
#   )

33.0.1 Table

# # simple long table
# simulation_summary |>
#   select(n_per_condition, 
#          population_mean_diff,
#          empirical_detection_rate, 
#          empirical_detection_rate_mcse) |>
#   mutate_if(is.numeric, janitor::round_half_up, digits = 3) |>
#   kable() |>
#   kable_styling(full_width = FALSE)

# better wider table
simulation_summary |>
  mutate(`n per condition` = as.character(n_per_condition),
         population_mean_diff = as.character(population_mean_diff)) |>
  mutate(across(where(is.numeric), \(x) scales::number(x, accuracy = 0.001))) |>
  mutate(edr_string = paste0(empirical_detection_rate,
                             " (", 
                             empirical_detection_rate_mcse,
                             ")")) |>
  select(`n per condition`, 
         population_mean_diff,
         edr_string) |>
  pivot_wider(names_from = population_mean_diff,
              values_from = edr_string) |>
  kable(caption = "Empirical rejection rate (MCSE) by sample size and population mean difference") |>
  kable_styling(full_width = FALSE) |>
  add_header_above(c(" " = 1, "Population mean difference" = 4))

Empirical rejection rate (MCSE) by sample size and population mean difference
	Population mean difference
n per condition	0	0.2	0.5	0.8
10	0.044 (0.006)	0.061 (0.008)	0.189 (0.012)	0.388 (0.015)
20	0.052 (0.007)	0.098 (0.009)	0.320 (0.015)	0.692 (0.015)
30	0.060 (0.008)	0.099 (0.009)	0.480 (0.016)	0.845 (0.011)
40	0.054 (0.007)	0.136 (0.011)	0.596 (0.016)	0.942 (0.007)
50	0.039 (0.006)	0.176 (0.012)	0.673 (0.015)	0.973 (0.005)
60	0.055 (0.007)	0.189 (0.012)	0.800 (0.013)	0.986 (0.004)
70	0.066 (0.008)	0.235 (0.013)	0.856 (0.011)	0.994 (0.002)
80	0.032 (0.006)	0.233 (0.013)	0.885 (0.010)	1.000 (0.000)
90	0.056 (0.007)	0.273 (0.014)	0.937 (0.008)	0.998 (0.001)
100	0.064 (0.008)	0.328 (0.015)	0.938 (0.008)	0.999 (0.001)
110	0.060 (0.008)	0.311 (0.015)	0.961 (0.006)	1.000 (0.000)
120	0.054 (0.007)	0.355 (0.015)	0.964 (0.006)	1.000 (0.000)
130	0.053 (0.007)	0.344 (0.015)	0.977 (0.005)	1.000 (0.000)
140	0.049 (0.007)	0.380 (0.015)	0.989 (0.003)	1.000 (0.000)
150	0.061 (0.008)	0.392 (0.015)	0.990 (0.003)	1.000 (0.000)
160	0.046 (0.007)	0.424 (0.016)	0.996 (0.002)	1.000 (0.000)
170	0.057 (0.007)	0.431 (0.016)	0.997 (0.002)	1.000 (0.000)
180	0.050 (0.007)	0.477 (0.016)	0.998 (0.001)	1.000 (0.000)
190	0.058 (0.007)	0.505 (0.016)	0.999 (0.001)	1.000 (0.000)
200	0.052 (0.007)	0.501 (0.016)	0.997 (0.002)	1.000 (0.000)

  # footnote(general = "Values in parentheses are ±1 Monte Carlo Standard Error.",
  #          general_title = "Note.",
  #          footnote_as_chunk = TRUE)

33.0.2 Plot

ggplot(simulation_summary, 
       aes(x = n_per_condition, 
           y = empirical_detection_rate, 
           color = as.factor(population_mean_diff))) +
  geom_hline(yintercept = 0.05, linetype = "dashed") +
  geom_linerange(aes(ymin = empirical_detection_rate - empirical_detection_rate_mcse,
                     ymax = empirical_detection_rate + empirical_detection_rate_mcse),
                 color = "black") +
  geom_line() +
  geom_point() +
  scale_x_continuous(name = "N per condition",
                     breaks = breaks_pretty(n = 8)) +
  scale_y_continuous(name = "Empirical detection rate\nof Student's t test p-values", 
                     limits = c(0, 1),
                     breaks = c(0, .05, .25, .5, .75, 1)) +
  theme_linedraw() +
  theme(panel.grid.minor = element_blank()) +
  guides(color = guide_legend(title = "Population mean difference", 
                              reverse = TRUE))

34 Discussion

34.1 Performance of the method across conditions

False Positive Rate (aka Type I error rate; null conditions, mean_intervention = 0). Across all 20 sample sizes the empirical rejection rate of Student’s t-test fluctuates around the nominal α = .05, ranging from 0.032 (n = 80) to 0.066 (n = 70). Not every ±1 MCSE interval covers the nominal value, which is expected because such an interval covers the true rate only about two-thirds of the time; every estimate lies within Bradley’s (1978) liberal bounds of 0.025 to 0.075. There is no systematic drift of the empirical rate with n_per_condition, which is the expected behaviour: when its assumptions are met, the test is exactly calibrated at any sample size, not just asymptotically. The visible scatter around 0.05 is consistent with Monte Carlo noise: at K = 1000 the binomial MCSE near α = .05 is $\sqrt{0.05 \cdot 0.95 / 1000} \approx 0.007$, so single-condition deviations of one percentage point carry no methodological meaning.

Statistical power (aka 1 − False Negative Rate, 1 − Type II error rate; non-null conditions, mean_intervention ∈ {0.2, 0.5, 0.8}). The empirical rejection rate increases with both n_per_condition and mean_intervention, as expected, apart from small reversals between adjacent sample sizes that are within Monte Carlo error (e.g. 0.235 at n = 70 vs. 0.233 at n = 80 for d = 0.2). The simulated power curves track the analytic curves from Cohen (1988) closely: at d = 0.8 the test reaches roughly 80% power between n = 20 and n = 30 per group (analytically n ≈ 26); at d = 0.5 the same threshold is reached at about n = 60 per group (analytically n ≈ 64); and at d = 0.2 power remains modest even at n = 200 per group, where it sits at about 0.5. None of these patterns are surprising; the value of reproducing them in simulation is that they verify our infrastructure is implemented correctly, which is a prerequisite for trusting any later, more substantive simulation built from the same template.
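
The analytic reference points quoted in this paragraph can be reproduced in base R. The sketch below assumes stats::power.t.test() as the analytic tool (the template itself cites Cohen, 1988, and pwr::pwr.t.test()); n_for_power() is a hypothetical helper name.

```r
# Per-group n needed for a target power, solved by stats::power.t.test()
n_for_power <- function(d, power = 0.80, alpha = 0.05) {
  stats::power.t.test(delta = d, sd = 1, sig.level = alpha, power = power,
                      type = "two.sample", alternative = "two.sided")$n
}

n_large  <- n_for_power(d = 0.8)
n_medium <- n_for_power(d = 0.5)

# Analytic power at the largest simulated sample size for the small effect
power_small_200 <- stats::power.t.test(n = 200, delta = 0.2, sd = 1,
                                       sig.level = 0.05)$power

round(c(n_large = n_large, n_medium = n_medium,
        power_small_200 = power_small_200), 2)
```

The solved sample sizes are fractional and would be rounded up in practice.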

34.2 Conclusions with regard to the aims

The two aims set out in the introduction are met:

The equal-variance Student’s t-test maintains its nominal Type I error rate when its parametric assumptions hold exactly, across the entire range of sample sizes considered.
Empirical power increases with both sample size and effect size, apart from small reversals within Monte Carlo error. The resulting empirical power curves can be checked against the analytic curves provided by conventional a priori power-analysis tools (e.g., G*Power).

Together these results provide a calibration baseline. Any future simulation that perturbs this DGP (e.g. by introducing non-normality, unequal variances, unbalanced designs, or contamination) can be compared against the present results to isolate the consequences of that single perturbation.

34.3 Limitations and intended use

This simulation is a teaching template, not a methodological contribution, and its conclusions should be read in that light:

A single analytic method. No comparator (Welch’s t-test, Mann–Whitney U, permutation test, robust t-test) is included; the simulation cannot be used to argue for or against the equal-variance form on the grounds of relative performance.
A best-case DGP. Data are drawn from a Normal distribution with equal variances and balanced n. Real psychology data routinely violate one or more of these assumptions, and the well-calibrated Type I error rate observed here should not be expected to generalise to skewed, heavy-tailed, or heteroscedastic data.
Few replications. K = 1000 is too small for a publishable simulation. The MCSEs reported alongside every estimate make this transparent in the figures, and any user adapting this template should increase K (a worst-case MCSE of 0.005 requires K = 10,000) before drawing methodological conclusions.
A narrow effect-size grid. Only Cohen’s three benchmark values are simulated under the alternative; finer granularity would be needed to estimate, for example, the minimum detectable effect at a given sample size.

Used as intended — as a worked example demonstrating the ADEMP planning framework, the structure of a tidyr + furrr + simhelpers simulation pipeline, and the reporting of Monte Carlo uncertainty alongside every performance estimate — the template provides a starting point that other studies can extend by varying the DGP, adding analytic comparators, or adopting performance measures appropriate to other statistical tasks (estimation, coverage, prediction, model selection).

35 References

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144–152. https://doi.org/10.1111/j.2044-8317.1978.tb00581.x

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Joshi, M., & Pustejovsky, J. E. (2022). simhelpers: Helper functions for simulation studies [R package]. https://CRAN.R-project.org/package=simhelpers

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086

Siepe, B. S., Bartoš, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W., & Pawel, S. (2024). Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting. Psychological Methods. https://doi.org/10.1037/met0000695

36 Session info

sessionInfo()

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kableExtra_1.4.0  knitr_1.50        scales_1.4.0      ggplot2_4.0.3    
 [5] simhelpers_0.3.1  parameters_0.27.0 janitor_2.2.1     furrr_0.3.1      
 [9] future_1.67.0     dplyr_1.2.1       tidyr_1.3.2      

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       xfun_0.54          bayestestR_0.16.1  htmlwidgets_1.6.4 
 [5] insight_1.4.0.8    lattice_0.22-7     vctrs_0.7.3        tools_4.5.2       
 [9] Rdpack_2.6.4       generics_0.1.4     parallel_4.5.2     datawizard_1.1.0  
[13] sandwich_3.1-1     tibble_3.3.1       pkgconfig_2.0.3    Matrix_1.7-4      
[17] RColorBrewer_1.1-3 S7_0.2.2           lifecycle_1.0.5    compiler_4.5.2    
[21] farver_2.1.2       stringr_1.6.0      textshaping_1.0.3  codetools_0.2-20  
[25] snakecase_0.11.1   htmltools_0.5.9    yaml_2.3.12        pillar_1.11.1     
[29] MASS_7.3-65        multcomp_1.4-28    parallelly_1.45.1  tidyselect_1.2.1  
[33] digest_0.6.39      mvtnorm_1.3-3      stringi_1.8.7      purrr_1.2.2       
[37] listenv_0.9.1      splines_4.5.2      fastmap_1.2.0      grid_4.5.2        
[41] cli_3.6.6          magrittr_2.0.5     survival_3.8-3     TH.data_1.1-3     
[45] withr_3.0.2        lubridate_1.9.4    estimability_1.5.1 timechange_0.3.0  
[49] rmarkdown_2.30     emmeans_1.11.2-8   globals_0.18.0     zoo_1.8-14        
[53] coda_0.19-4.1      evaluate_1.0.5     rbibutils_2.3      viridisLite_0.4.3 
[57] rlang_1.2.0        xtable_1.8-4       glue_1.8.1         xml2_1.4.0        
[61] svglite_2.2.1      rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1          
[65] systemfonts_1.2.3

--- title: "Simulation template: hypothesis testing" author: "Authors names go here" date: today format: html: embed-resources: true # ignored when rendering book toc: true toc_float: true code-fold: false code-overflow: wrap code-tools: true warning: false df-print: kable theme: light: cosmo dark: darkly highlight-style: light: breeze dark: dracula include-in-header: text: | <style> .kable-table table { width: 70%; margin: 0 auto; } .quarto-color-scheme-toggle::after { content: " Dark mode"; font-family: sans-serif; font-size: 0.85em; } </style> --- ```{r, include=FALSE} # settings: hidden from rendered file # disable messages and warnings in rendered files knitr::opts_chunk$set(message = FALSE, warning = FALSE) # disable scientific numbers in output options(scipen=999) ``` # Introduction  This introduction is structured following the ADEMP framework (Aims, Data-generating process, Estimands, Methods, Performance) as described by Morris, White, and Crowther (2019) and operationalised for psychology by Siepe et al. (2024). The aim of working through the framework explicitly is to make the design decisions behind the simulation visible and auditable, so that a reader can judge whether the conclusions are supported by the conditions actually examined. ## Project description  This is a worked example simulation study that evaluates the operating characteristics of Student's independent-samples *t*-test as the per-condition sample size and the population effect size are varied. The motivating empirical context is a two-arm between-subjects randomised controlled trial in which a continuous outcome is compared between a control group and an intervention group, and the analyst wishes to know how often a *t*-test will (a) reject the null hypothesis when it is true (Type I error) and (b) reject the null hypothesis when it is false (statistical power), across the range of sample sizes typically encountered in psychology. 
The example is intentionally small and self-contained so that it can serve as a template for more substantive simulations: every section of the introduction maps onto an ADEMP heading, every parameter manipulated in the code maps onto a "DGP factor", and every metric reported in the results section is justified in the "Performance measures" subsection. ## Prior related work  The behaviour of Student's *t*-test under its parametric assumptions is one of the most extensively studied results in classical statistics. Cohen's (1988) tabulated power curves give the analytic expectations against which our simulated rejection rates can be compared, and any modern *a priori* power-analysis tool (e.g. G\*Power, the `pwr` R package) implements these same curves. We therefore do not expect the simulation to produce novel methodological findings; rather, it is a calibration exercise that confirms the simulation infrastructure is implemented correctly and provides a reference against which future, more substantive simulations can be checked. No preliminary simulations were conducted by the contributors on this specific question prior to this template. ## Aims  The statistical task is **the long run error rates of null hypothesis significance testing**, specifically for the Student's t-test: for each replicated dataset the analyst makes a binary decision (reject vs. retain the null hypothesis of zero mean difference) on the basis of a *p*-value compared against a fixed nominal significance level of α = .05. The aims are: 1. To estimate the empirical Type I error rate (False Positive Rate) of the equal-variance Student's *t*-test when its assumptions are met and the population mean difference is exactly zero, and to verify that this rate is close to the nominal α = .05 across all sample sizes considered. 2. 
To estimate the empirical Type II error rate (False Negative Rate, i.e., 1 - statistical power) of the same test when the population standardised mean difference would take the values conventionally labelled as small, medium, and large (Cohen's *d* = 0.2, 0.5, 0.8), again as a function of per-condition sample size. ## Data-Generating Process ### DGP specification approach  The DGP is **fully parametric** and is not anchored to any empirical dataset. Outcome scores in each of the two conditions are drawn independently from a univariate Normal distribution with a known mean and a known, common standard deviation. Formally, for each replication: $$Y_{ij} \;\overset{\text{iid}}{\sim}\; \mathcal{N}(\mu_j,\,\sigma^2), \qquad i = 1, \ldots, n_{\text{per condition}}; \quad j \in \{\text{control},\, \text{intervention}\}$$ with $\mu_{\text{control}} = 0$, $\mu_{\text{intervention}} \in \{0,\,0.2,\,0.5,\,0.8\}$, and $\sigma = 1$ in both conditions. This is exactly what `generate_data()` (defined below) produces by calling `rnorm()` once per condition. This DGP exactly satisfies the assumptions of the equal-variance Student's *t*-test (independent observations, normality within group, homogeneity of variance), which is the desired baseline for a calibration study of the test. By setting $\sigma = 1$ and $\mu_{\text{control}} = 0$, the unstandardised mean difference and the standardised mean difference (Cohen's *d*) are numerically identical, ignoring the small finite-sample bias in *d*. ### DGP factors  Two factors are varied across simulation conditions: 1. **`n_per_condition`** — the per-group sample size. 2. **`mean_intervention`** — the population mean in the intervention condition. Because the control mean is fixed at 0 and the within-group standard deviation is fixed at 1, this directly equals the population standardised mean difference (Cohen's *d*). ### Factor values and settings  **Factor values:** - `n_per_condition` ∈ {10, 20, 30, …, 200}, i.e. 
20 levels in steps of 10. This range spans from very small samples (where asymptotic approximations should still be benign for Normal data) to samples larger than most single-site psychology studies. - `mean_intervention` ∈ {0, 0.2, 0.5, 0.8}, i.e. 4 levels corresponding to a true null and to Cohen's (1988) conventional small, medium, and large effect-size benchmarks. **Settings held constant across all conditions:** - `mean_control` = 0 (control group population mean). - `sd` = 1 in both conditions, so the equal-variance assumption holds exactly and the population effect size in standardised units equals `mean_intervention`. - Balanced design (the same `n_per_condition` is drawn for both groups in every replication). - Single analytic method: equal-variance Student's two-sample *t*-test, two-sided alternative, α = .05. ### Factor combination and number of conditions  Of the four factor-combination strategies enumerated by Siepe et al. (2024, Figure 1) — fully factorial, partially factorial, one-at-a-time, and scattershot — we use a **fully factorial** crossing, which is the default recommendation when it is computationally feasible because it allows the main effects of, and interaction between, the two factors to be disentangled. With 20 sample sizes × 4 effect sizes this yields **80 unique conditions**, each replicated 1000 times for 80,000 simulated datasets in total. ## Estimands and Targets  The statistical task is hypothesis testing, so the *target* of every condition is the **truth value of the null hypothesis** under that condition's DGP — not the rejection rate itself. Siepe et al. (2024, footnote 1) make this distinction explicitly: it is tempting to call the Type I error rate the "target", but the Type I error rate is a *performance measure* used to evaluate the method against the target (the null). 
The two relevant targets here are therefore:

- **Target 1: true null.** In the conditions where the population mean difference is zero (`mean_intervention` = 0), the null hypothesis of zero mean difference is true. The performance measure evaluated against this target is the empirical rejection rate, interpreted as the Type I error rate; the desired benchmark is the nominal level α = .05.
- **Target 2: false null.** In the conditions where the population mean difference is non-zero (`mean_intervention` ∈ {0.2, 0.5, 0.8}), the null is false. The performance measure evaluated against this target is again the empirical rejection rate, now interpreted as statistical power; the benchmark is the analytic power curve given by Cohen (1988) and reproduced by tools such as `pwr::pwr.t.test()`.

The population mean difference $\mu_{\text{intervention}} - \mu_{\text{control}}$ is not an estimand of this simulation; only the binary reject/retain decision derived from the *p*-value is analysed. See `simulation_template_estimation.qmd` for the corresponding estimation-task template that targets this quantity using bias, sampling variance, CI coverage, and CI width.

## Methods and extracted quantities

A single analytic method is included: the equal-variance Student's two-sample *t*-test, fitted with `stats::t.test(formula = score ~ condition, var.equal = TRUE, alternative = "two.sided", conf.level = 1 - alpha)`. The equal-variance form is chosen rather than Welch's *t*-test because the DGP guarantees equal population variances; this is the test whose nominal Type I error rate we wish to verify.

For each replication the following quantity is extracted from the fitted model (via `parameters::model_parameters()`):

- `p`: the two-sided *p*-value associated with the test of the null hypothesis that the mean difference is zero. `p` is the input to the primary performance measure defined below; the performance measure itself is the proportion of replications with `p` < α.
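
As a minimal sketch of what a single replication produces, the following uses only base R (`rnorm()` and `t.test()`) rather than the `parameters` helper used later in this document. The object names `one_rep`, `fit`, and `reject`, the seed, and the chosen condition (*n* = 50 per group, *d* = 0.5) are illustrative assumptions, not part of the template's pipeline.

```r
# Minimal sketch of one replication under one condition (base R only).
set.seed(1)  # arbitrary seed, for illustration only

n_per_condition   <- 50
mean_control      <- 0
mean_intervention <- 0.5   # equals Cohen's d because sd = 1
alpha             <- 0.05

# Simulate one two-arm dataset; control is the reference level
one_rep <- data.frame(
  condition = factor(rep(c("control", "intervention"), each = n_per_condition),
                     levels = c("control", "intervention")),
  score = c(rnorm(n_per_condition, mean = mean_control, sd = 1),
            rnorm(n_per_condition, mean = mean_intervention, sd = 1))
)

# Equal-variance Student's t-test, two-sided
fit <- t.test(score ~ condition, data = one_rep,
              var.equal = TRUE, alternative = "two.sided",
              conf.level = 1 - alpha)

p      <- fit$p.value   # the only quantity carried forward
reject <- p < alpha     # binary decision that feeds the rejection rate
```

Only `p` is stored per replication; the reject/retain decision is recomputed from it when performance is summarised.
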

## Performance and Uncertainty

### Performance measures

The single primary performance measure is the **empirical rejection rate**, i.e. the proportion of replications in a condition for which the test rejects the null hypothesis at α = .05:

$$\widehat{R} \;=\; \frac{1}{n_{\text{sim}}}\sum_{k=1}^{n_{\text{sim}}} \mathbb{1}\{p_k < \alpha\}$$

where $n_{\text{sim}}$ is the number of replications per condition (denoted *K* in the text and indexed by `iteration` in the code) and $p_k$ is the *p*-value from replication $k$. This is the "Power (or Type I error rate)" row of Siepe et al. (2024, Table 3), which itself reproduces Morris et al. (2019, Table A1). Under the null DGP it estimates the Type I error rate; under the non-null DGP it estimates statistical power. The quantity is computed via `simhelpers::calc_rejection()`, which returns both the point estimate and its Monte Carlo standard error.

### Monte Carlo uncertainty

We report Monte Carlo uncertainty in tables (MCSEs next to the estimated performance measures) and in plots (error bars with ±1 MCSE around estimated performance measures). MCSEs are computed by `simhelpers::calc_rejection()` (Joshi & Pustejovsky, 2022) using the closed-form binomial expression for the standard error of a sample proportion, which is the third column of the same row of Siepe et al. (2024, Table 3):

$$\widehat{\text{MCSE}}(\widehat{R}) \;=\; \sqrt{\frac{\widehat{R}\,(1 - \widehat{R})}{n_{\text{sim}}}}$$

### Number of simulation repetitions

We use **K = 1000 replications per condition** ($n_{\text{sim}} = 1000$ in Siepe et al.'s notation). The required number of replications follows from inverting the binomial MCSE formula above (Siepe et al., 2024, Table 3, last column):

$$n_{\text{sim}} \;\geq\; \frac{\widehat{R}\,(1-\widehat{R})}{\text{MCSE}_*^2}$$

where $\text{MCSE}_*$ is the largest MCSE the analyst is willing to tolerate. Substituting the worst case ($\widehat{R} = 0.5$) gives the worst-case MCSE at K = 1000 of $\sqrt{0.25/1000} \approx 0.016$, i.e. estimated rejection rates near the steep middle of the power curve are uncertain to roughly ±1.6 percentage points. This is loose for a publishable methodological study but adequate for a teaching example that needs to re-render quickly; a real preregistered simulation aiming for $\text{MCSE}_* = 0.005$ at the same worst case would require $n_{\text{sim}} = 0.25 / 0.005^2 = 10{,}000$ replications per condition.

### Non-convergence and missing values

Student's *t*-test has a closed-form solution and does not iterate, so non-convergence cannot occur. The only failure modes are degenerate samples (e.g. zero within-group variance), which have probability zero under a continuous Normal DGP. We therefore expect no missing values; if any are nevertheless observed in the simulation log we will report them per condition and exclude the affected replications case-wise.

### Interpretation of performance measures *(optional)*

For the null conditions we will judge performance acceptable if the empirical rejection rate falls within ±1 MCSE of the nominal α = .05 at every sample size. At K = 1000 this band (roughly .043 to .057) is close to, though not identical with, what Bradley (1978) calls the "stringent" criterion of α ± 0.1α (.045 to .055). For the non-null conditions there is no acceptable/unacceptable threshold per se; instead we will judge the simulation a success if (i) power increases monotonically with both `n_per_condition` and `mean_intervention`, and (ii) the empirical power curve is visually consistent with the analytic power curve from `pwr::pwr.t.test()` (e.g. ≈ 80% power at *n* ≈ 64/group for *d* = 0.5). No inferential models are fitted to the simulation output; results are summarised descriptively in tables and plots.
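
The MCSE arithmetic above can be checked with a few lines of base R. This is a sketch only: the helper names `mcse_rejection()` and `required_n_sim()` are hypothetical and are not part of `simhelpers` or of this template's pipeline.

```r
# Binomial MCSE of an estimated rejection rate (hypothetical helper)
mcse_rejection <- function(rate, n_sim) {
  sqrt(rate * (1 - rate) / n_sim)
}

# Smallest n_sim that achieves a target MCSE at a given rate (hypothetical helper).
# signif() guards against floating-point noise nudging an exact integer
# upwards before ceiling() is applied.
required_n_sim <- function(rate, mcse_target) {
  ceiling(signif(rate * (1 - rate) / mcse_target^2, 10))
}

mcse_rejection(rate = 0.5,  n_sim = 1000)        # worst case, about 0.016
mcse_rejection(rate = 0.05, n_sim = 1000)        # near nominal alpha, about 0.007
required_n_sim(rate = 0.5,  mcse_target = 0.005) # 10000
```

Because the binomial variance peaks at a rate of 0.5, planning for that worst case guarantees the target MCSE is met in every condition, at the cost of over-provisioning the null conditions.
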

# Methods

## Dependencies

```{r}
library(tidyr)
library(dplyr)
library(furrr)
library(janitor)
library(parameters)
library(simhelpers)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)

# set up parallelisation for {furrr}
# `furrr_options(seed = TRUE)` (used below in future_pmap calls) advances an
# L'Ecuyer-CMRG stream rather than re-seeding per worker, giving the
# parallel-safe reproducibility recommended in Siepe et al. (2024, p. 8).
plan(multisession)
```

## Functions

### Data generating process

```{r}
generate_data <- function(n_per_condition,
                          mean_control,
                          mean_intervention,
                          sd) {

  data_control <-
    tibble(condition = "control",
           score = rnorm(n = n_per_condition, mean = mean_control, sd = sd))

  data_intervention <-
    tibble(condition = "intervention",
           score = rnorm(n = n_per_condition, mean = mean_intervention, sd = sd))

  # combine
  data <- bind_rows(data_control, data_intervention) |>
    # ensure control is the reference condition (first factor level)
    mutate(condition = factor(condition, levels = c("control", "intervention")))

  return(data)
}
```

### Analysis

```{r}
analyse <- function(data, alpha = 0.05) {

  # fit Student's t-test
  fit <- t.test(formula = score ~ condition,
                data = data,
                var.equal = TRUE,
                conf.level = 1 - alpha,
                alternative = "two.sided")

  # extract p value
  results <- fit %>%
    model_parameters() %>%
    as_tibble() %>%
    clean_names() %>%
    # select columns of interest
    select(p)

  return(results)
}
```

## Define experiment

```{r}
experiment_parameters_grid <- expand_grid(
  n_per_condition = seq(from = 10, to = 200, by = 10),
  mean_control = 0,
  mean_intervention = c(0, 0.2, 0.5, 0.8),
  sd = 1,
  iteration = 1:1000
) |>
  # define population values
  mutate(population_mean_diff = mean_intervention - mean_control) |>
  # define unique conditions
  mutate(condition = paste0("N = ", n_per_condition, ", Pop mean diff = ", population_mean_diff))
```

## Run simulation

```{r}
set.seed(42)

simulation <- experiment_parameters_grid |>
  # generate data
  mutate(data = future_pmap(.l = list(n_per_condition = n_per_condition,
                                      mean_control = mean_control,
                                      mean_intervention = mean_intervention,
                                      sd = sd),
                            .f = generate_data,
                            .progress = TRUE,
                            .options = furrr_options(seed = TRUE))) |>
  # analyse data
  mutate(results = future_pmap(.l = list(data = data),
                               .f = analyse,
                               .progress = TRUE,
                               .options = furrr_options(seed = TRUE)))

# optionally save to disk
#write_rds(x = simulation, file = "simulation.rds", compress = "gz")
```

# Results

```{r}
simulation_summary <- simulation |>
  unnest(results) |>
  group_by(n_per_condition, population_mean_diff, condition) |>
  # performance and uncertainty metrics for inference (hypothesis tests with binary decisions)
  reframe(
    calc_rejection(data = pick(everything()), p_values = p, alpha = 0.05)
  ) |>
  rename(empirical_detection_rate = rej_rate,
         empirical_detection_rate_mcse = rej_rate_mcse)

# # identical results, computed manually:
# summarize(
#   empirical_detection_rate = mean(p < 0.05),
#   empirical_detection_rate_mcse = sqrt(empirical_detection_rate * (1 - empirical_detection_rate) / n()),
#   .groups = "drop"
# )
```

### Table

```{r}
# # simple long table
# simulation_summary |>
#   select(n_per_condition,
#          population_mean_diff,
#          empirical_detection_rate,
#          empirical_detection_rate_mcse) |>
#   mutate_if(is.numeric, janitor::round_half_up, digits = 3) |>
#   kable() |>
#   kable_styling(full_width = FALSE)

# better wider table
simulation_summary |>
  mutate(`n per condition` = as.character(n_per_condition),
         population_mean_diff = as.character(population_mean_diff)) |>
  mutate(across(where(is.numeric), \(x) scales::number(x, accuracy = 0.001))) |>
  mutate(edr_string = paste0(empirical_detection_rate, " (", empirical_detection_rate_mcse, ")")) |>
  select(`n per condition`, population_mean_diff, edr_string) |>
  pivot_wider(names_from = population_mean_diff,
              values_from = edr_string) |>
  kable(caption = "Empirical detection rate (MCSE) by sample size and population mean difference") |>
  kable_styling(full_width = FALSE) |>
  add_header_above(c(" " = 1, "Population mean difference" = 4))
  # footnote(general = "Values in parentheses are Monte Carlo standard errors.",
  #          general_title = "Note.",
  #          footnote_as_chunk = TRUE)
```

### Plot

```{r}
ggplot(simulation_summary, aes(x = n_per_condition,
                               y = empirical_detection_rate,
                               color = as.factor(population_mean_diff))) +
  geom_hline(yintercept = 0.05, linetype = "dashed") +
  geom_linerange(aes(ymin = empirical_detection_rate - empirical_detection_rate_mcse,
                     ymax = empirical_detection_rate + empirical_detection_rate_mcse),
                 color = "black") +
  geom_line() +
  geom_point() +
  scale_x_continuous(name = "N per condition",
                     breaks = breaks_pretty(n = 8)) +
  scale_y_continuous(name = "Empirical detection rate\nof Student's t test p-values",
                     limits = c(0, 1),
                     breaks = c(0, .05, .25, .5, .75, 1)) +
  theme_linedraw() +
  theme(panel.grid.minor = element_blank()) +
  guides(color = guide_legend(title = "Population mean difference",
                              reverse = TRUE))
```

# Discussion

## Performance of the method across conditions

**False Positive Rate (aka Type I error rate; null conditions, `mean_intervention` = 0).** Across all 20 sample sizes the empirical rejection rate of Student's *t*-test fluctuates around the nominal α = .05, and the ±1 MCSE intervals overlap the nominal value at every sample size. There is no systematic drift of the empirical rate with `n_per_condition`, which is the expected behaviour: when its assumptions are met, the test is exactly calibrated at any sample size, not just asymptotically. The visible scatter around 0.05 is consistent with Monte Carlo noise: at K = 1000 the binomial MCSE near α = .05 is $\sqrt{0.05 \cdot 0.95 / 1000} \approx 0.007$, so single-condition deviations of one percentage point carry no methodological meaning.
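
The calibration check described above can be sketched for a single null condition in base R, independently of the `furrr` pipeline. This is an illustration under stated assumptions: the seed, the choice of *n* = 30 per group, and the object names `p_values`, `rej_rate`, `mcse`, and `within_one_mcse` are arbitrary and do not reproduce any number reported in this document.

```r
# Sketch: empirical Type I error rate at one sample size under a true null.
set.seed(2026)          # arbitrary seed, for illustration only
n_sim           <- 1000
n_per_condition <- 30
alpha           <- 0.05

# One p-value per replication; both groups share mean 0 and sd 1
p_values <- replicate(n_sim, {
  control      <- rnorm(n_per_condition, mean = 0, sd = 1)
  intervention <- rnorm(n_per_condition, mean = 0, sd = 1)
  t.test(intervention, control, var.equal = TRUE)$p.value
})

rej_rate <- mean(p_values < alpha)                      # empirical rejection rate
mcse     <- sqrt(rej_rate * (1 - rej_rate) / n_sim)     # binomial MCSE

# Does the +/- 1 MCSE interval cover the nominal level?
within_one_mcse <- abs(rej_rate - alpha) <= mcse
```

A single condition can miss the ±1 MCSE band by chance even when the test is perfectly calibrated, so the check is most informative when read across all sample sizes together.
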

**Statistical power (aka 1 − False Negative Rate, 1 − Type II error rate; non-null conditions, `mean_intervention` ∈ {0.2, 0.5, 0.8}).** The empirical rejection rate increases monotonically with both `n_per_condition` and `mean_intervention`, exactly as expected. The simulated power curves track the analytic curves from Cohen (1988) closely: at *d* = 0.8 the test reaches roughly 80% power around *n* ≈ 25 per group; at *d* = 0.5 the same threshold is reached around *n* ≈ 64 per group; and at *d* = 0.2 power remains modest even at *n* = 200 per group, where it sits below 0.6. None of these patterns is surprising. The value of reproducing them in simulation is that they verify our infrastructure is implemented correctly, which is a prerequisite for trusting any later, more substantive simulation built from the same template.

## Conclusions with regard to the aims

The two aims set out in the introduction are met:

1. The equal-variance Student's *t*-test maintains its nominal Type I error rate when its parametric assumptions hold exactly, across the entire range of sample sizes considered.
2. Empirical power increases monotonically with both sample size and effect size. The results provide empirical *a priori* power curves that can be verified against the analytic ones provided by conventional power-analysis tools (e.g., G\*Power).

Together these results provide a calibration baseline. Any future simulation that perturbs this DGP (e.g. by introducing non-normality, unequal variances, unbalanced designs, or contamination) can be compared against the present results to isolate the consequences of that single perturbation.
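
The analytic benchmarks referred to above can be reproduced without additional packages. The sketch below uses base R's `stats::power.t.test()` as a stand-in for `pwr::pwr.t.test()`; the wrapper name `analytic_power()` and the `benchmarks` data frame are hypothetical and not part of the template.

```r
# Analytic power of the two-sided, equal-variance two-sample t-test (base R).
# With sd = 1, `delta` equals Cohen's d.
analytic_power <- function(n_per_condition, d, alpha = 0.05) {
  power.t.test(n = n_per_condition, delta = d, sd = 1, sig.level = alpha,
               type = "two.sample", alternative = "two.sided")$power
}

# Same grid as the non-null simulation conditions
benchmarks <- expand.grid(n_per_condition = seq(10, 200, by = 10),
                          d = c(0.2, 0.5, 0.8))
benchmarks$power <- mapply(analytic_power,
                           benchmarks$n_per_condition,
                           benchmarks$d)

analytic_power(64, 0.5)    # close to 0.80
analytic_power(200, 0.2)   # a little above 0.5
```

Joining `benchmarks` to the simulation summary by sample size and effect size would allow the empirical and analytic curves to be overlaid in one plot.
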

## Limitations and intended use

This simulation is a teaching template, not a methodological contribution, and its conclusions should be read in that light:

- **A single analytic method.** No comparator (Welch's *t*-test, Mann–Whitney *U*, permutation test, robust *t*-test) is included; the simulation cannot be used to argue for or against the equal-variance form on the grounds of relative performance.
- **A best-case DGP.** Data are drawn from a Normal distribution with equal variances and balanced *n*. Real psychology data routinely violate one or more of these assumptions, and the well-calibrated Type I error rate observed here should not be expected to generalise to skewed, heavy-tailed, or heteroscedastic data.
- **Few replications.** K = 1000 is too small for a publishable simulation. The MCSEs reported alongside every estimate make this transparent in the figures, and any user adapting this template should increase K (a worst-case MCSE of 0.005 requires K = 10,000) before drawing methodological conclusions.
- **A narrow effect-size grid.** Only Cohen's three benchmark values are simulated under the alternative; finer granularity would be needed to estimate, for example, the minimum detectable effect at a given sample size.

Used as intended (as a worked example demonstrating the ADEMP planning framework, the structure of a `tidyr` + `furrr` + `simhelpers` simulation pipeline, and the reporting of Monte Carlo uncertainty alongside every performance estimate), the template provides a starting point that other studies can extend by varying the DGP, adding analytic comparators, or adopting performance measures appropriate to other statistical tasks (estimation, coverage, prediction, model selection).

# References

Bradley, J. V. (1978). Robustness? *British Journal of Mathematical and Statistical Psychology*, 31(2), 144–152. https://doi.org/10.1111/j.2044-8317.1978.tb00581.x

Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Lawrence Erlbaum Associates.

Joshi, M., & Pustejovsky, J. E. (2022). *simhelpers: Helper functions for simulation studies* [R package]. https://CRAN.R-project.org/package=simhelpers

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. *Statistics in Medicine*, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086

Siepe, B. S., Bartoš, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W., & Pawel, S. (2024). Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting. *Psychological Methods*. https://doi.org/10.1037/met0000695

# Session info

```{r}
sessionInfo()
```