Sample size calculation with GPower

Wilfried Cools (ICDS) & Sven Van Laere (BiSI)
https://www.icds.be/

sample size calculation: why & how

  • what about my sample size ?
  • more is better ≡ observation → information
    • but increasingly less so
    • but limited resources (time/money)
    • but ethical and practical considerations
  • how many is good enough ?
    • depends on what you want
  • workshop → how to calculate it ?
    • understand the reasoning
    • apply for simple but common situations
  • not one simple formula for all → GPower to the rescue

sample size calculation: a design issue

  • linked to statistical inference
    • testing → power [probability to detect existing effects]
    • estimation → accuracy [size of confidence intervals]
  • before data collection, during design of study
    • requires understanding: future data, analysis, inference (effect size, focus, ...)
    • conditional on assumptions & decisions
  • not always possible nor meaningful !
    • easier for experiments (control), less for observational studies
    • easier for confirmatory studies, much less for exploratory studies
    • NO retrospective power analyses → OK for future study only
      Hoenig, J., & Heisey, D. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

simple example

  • evaluation of radiotherapy to reduce a tumor in mice
  • comparing treatment group with control (=conditions)
  • tumor induced, random assignment treatment or control (equal if no effect)
  • after 20 days, measurement of tumor size (=observations)
  • analysis:
    • unpaired t-test to compare averages for treatment and control
  • goal:
    • if the average size in treatment is at least 20% less than control
      then we want to detect it (significance)
  • the main issue:
    • how to calculate the required sample size to detect the effect aimed for ?

overview

  • PART I: building blocks in action for t-test
    • sizes: effect size, sample size
    • errors: type I (\(\alpha\)), type II (\(\beta\))
    • distributions: Ho, Ha
    • criterion: confidence (estimation), power (testing)
  • PART II: moving beyond independent t-test
    • dependent groups
    • non-parametric distributions
    • multiple groups (ANOVA: omnibus, pairwise, focused)
    • proportions, correlations, ...

PART I: building blocks in action for t-test

reference example

  • sample sizes easy and meaningful to calculate for well understood problems
  • a priori specifications
    • intend to perform a statistical test
    • comparing 2 equally sized groups
    • to detect difference of at least 2
    • assuming an uncertainty of 4 SD on each mean
    • which results in an effect size of .5
    • evaluated on a Student t-distribution
    • allowing for a type I error prob. of .05 \((\alpha)\)
    • allowing for a type II error prob. of .2 \((\beta)\)
  • sample size conditional on specifications being true



https://icds.shinyapps.io/shinyt/

a formula you could use

  • for this particular case:
    • sample size (n → ?)
    • difference (d=signal → 2)
    • uncertainty (\(\sigma\)=noise → 4)
    • type I errors (\(\alpha\) → .05, so \(Z_{\alpha/2}\) → -1.96)
    • type II errors (\(\beta\) → .2, so \(Z_\beta\) → -0.84)

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2}\) → \(n = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

  • sample size = 2 groups x 63 observations = 126
  • note: formulas are test- and statistic-specific, but the logic remains the same
  • this and other formulas are implemented in various tools, our focus: GPower
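The formula can be checked numerically; a minimal sketch in Python (assuming scipy is available for the normal quantiles):

```python
# Normal-approximation sample size per group for two equal groups
# (the slide's formula, with the reference example's numbers)
from math import ceil
from scipy.stats import norm

d, sigma = 2, 4            # difference (signal) and SD (noise)
alpha, beta = 0.05, 0.20   # type I and type II error probabilities

z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)   # 1.96 + 0.84
n = z_sum ** 2 * 2 * sigma ** 2 / d ** 2               # per group
print(round(n, 2), ceil(n))   # 62.79 -> 63 per group, 126 total
```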

GPower: a useful tool

  • popular and well established
  • free @ http://www.gpower.hhu.de/
  • implements wide variety of tests
  • implements various visualizations
  • documented fairly well
  • note: not all tests are included !

GPower: the building blocks in action

  • For this example, calculate sample size based on
    • effect size (difference of interest, scaled on standard deviation)
    • type I and type II error

GPower input

  • t-test : difference two indep. means
  • a priori: calculate sample size
  • effect size = standardized difference
    • Cohen's \(d\)
    • \(d\) = |difference| / SD_pooled
    • \(d\) = |0-2| / 4 = .5
  • \(\alpha\) = .05, two-tailed (\(\alpha\)/2 → .025 & .975)
  • \(power = 1-\beta\) = .8
  • allocation ratio = 1
  • ~ reference example

GPower output

  • sample size (\(n\)) = 64 x 2 = 128
  • degrees of freedom (\(df\)) = 126 (128 - 2)
  • plot showing null Ho and alternative Ha distribution
    • in GPower central and non-central distribution
    • Ho & critical value → decision boundaries
      • critical t = 1.979, qt(.975,126)
    • Ha, shift with non-centrality parameter → truth
      • non centrality parameter (\(\delta\)) = 2.8284
        2/(4*sqrt(2))*sqrt(64)
  • power ≥ .80 (1-\(\beta\)) = 0.8015

reference example protocol

t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size

Input: Tail(s) = Two
Effect size d = 0.5000000
α err prob = 0.05
Power (1-β err prob) = .8
Allocation ratio N2/N1 = 1

Output: Noncentrality parameter δ = 2.8284271
Critical t = 1.9789706
Df = 126
Sample size group 1 = 64
Sample size group 2 = 64
Total sample size = 128
Actual power = 0.8014596
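These protocol numbers can be reproduced outside GPower; a sketch in Python (scipy is an assumption here, not part of GPower):

```python
# Cross-check of the GPower protocol: ncp, critical t and power
from math import sqrt
from scipy.stats import t, nct

n, d = 64, 0.5                # per-group size and Cohen's d
df = 2 * n - 2                # 126
ncp = d * sqrt(n / 2)         # non-centrality parameter, 2.8284
t_crit = t.ppf(0.975, df)     # critical t, ~1.979
# two-tailed power = P(|t| > t_crit | Ha)
power = 1 - nct.cdf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)
print(round(ncp, 4), round(t_crit, 4), round(power, 4))
```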





  • distributions
  • their cut-offs (type I and II errors)
  • the distance between them (sample and effect sizes)

GPower distributions

  • distribution based test selection
    • Exact Tests (8)
    • \(t\)-tests (11) → reference
    • \(z\)-tests (2)
    • \(\chi^2\)-tests (7)
    • \(F\)-tests (16)
  • focus on the density functions

  • design based test selection
    • correlation & regression (15)
    • means (19) → reference
    • proportions (8)
    • variances (2)
  • focus on the type of parameters


Ho and Ha distributions

  • Ho acts as \(\color{red}{benchmark}\) → eg., no difference
    • Ho ~ t(0,df) \(\color{green}{cut off}\) using \(\alpha\),
    • reject Ho if test returns implausible value
  • Ha acts as \(\color{blue}{truth}\) → eg., difference of .5 SD

    • Ha ~ t(ncp,df)
    • ncp as violation of Ho → shift (location/shape)

non-centrality: Ho → Ha

  • ncp : non-centrality parameter

    • shift between Ho and Ha
      • assumed effect size (target or signal)
      • conditional on sample size (information)
    • overlap → power or sample size
      using \(\alpha\) on Ho and \(\beta\) on Ha
  • Ha is NOT interchangeable with Ho

    • absence of evidence \(\neq\) evidence of absence
    • equivalence testing (Ha for 'no effect')

https://icds.shinyapps.io/shinyt/

divide by n perspective on distributions

  • remember: \(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2}\) \(n = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2}\) \(n = 62.79\)

no Ha for estimation

  • focus on estimation, plausible values of effect, not testing or power
  • sample size without type II error \(\beta\), power, Ho or Ha
  • ~ divide by n perspective but shifted to estimate
  • precision analysis → set maximum width confidence interval
    • let E = maximum half width of confidence interval to accept
    • for confidence level \(1-\alpha\)
    • \(n = z^2_{\alpha/2} * \sigma^2 * 2 / \:E^2\) (for 2 groups)
  • equivalence with statistical testing
    • if 0 (or other reference) outside confidence bounds → significant
  • NOT GPower
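The precision formula translates directly into code; a sketch where E = 2 is an assumed example half-width (not from the slides):

```python
# Per-group n so the confidence interval half-width stays below E
# (slide's formula for 2 groups; E = 2 is an assumed example value)
from math import ceil
from scipy.stats import norm

sigma, alpha = 4, 0.05
E = 2                                  # assumed maximum half-width
z = norm.ppf(1 - alpha / 2)
n = z ** 2 * sigma ** 2 * 2 / E ** 2   # ~30.73
print(ceil(n))                         # 31 per group
```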

type I/II error probability

  • distribution cut-offs (density → AUC=1)
  • decide whether to reject Ho assuming Ha
  • two types of error
    • P(infer=Ha|truth=Ho) = \(\alpha\)
    • P(infer=Ho|truth=Ha) = \(\beta\)
  • two types of correct inference
    • P(infer=Ho|truth=Ho) = \(1-\alpha\)
    • P(infer=Ha|truth=Ha) = \(1-\beta\) → power
  • cut-off 'known' Ho for statistical test
    • two tailed → both sides informative on Ho
    • one tailed → one side not informative on Ho

             infer=Ha        infer=Ho        sum
truth=Ho     \(\alpha\)      \(1-\alpha\)    1
truth=Ha     \(1-\beta\)     \(\beta\)       1

error exercise : create plot

  • create plot
    (X-Y plot for range of values)
  • plot sample size by type I error
  • set plot to 4 curves
    • for power .8 in steps of .05
  • set \(\alpha\) on x-axis
    • from .01 to .2 in steps of .01
  • use effect size .5

notice Table option

error exercise : interpret plot

  • where on the red curve (right)
    type II error = 4 * type I error ?
  • when smaller effect size (.25), what changes ?
  • switch power and sample size (32 in step of 32)
    what is relation type I and II error ?

  • where on the yellow curve (left)
    type II error = 4 * type I error ?

  • for allocation ratio 4, compare plots

decide type I/II error probability

  • rules of thumb ?
    • \(\alpha\) in range .01 - .05 → 1/100 - 1/20
    • \(\beta\) in range .1 to .2 → power = 80% to 90%
  • \(\alpha\) & \(\beta\) inversely related
    • if \(\alpha = 0\) → never reject, no power
    • if power 99% → high \(\alpha\) for same sample size
  • determine the balance
    • which error you want to avoid most ?
      • cheap aids test ? → avoid type II
      • heavy cancer treatment ? → avoid type I
    • \(\alpha\) & \(\beta\) often selected in 1/4 ratio
      type I error is 4 times worse !!

interim analyses: control type I error

  • analyze and proceed ? (peeking)
    • multiple testing as data is collected
    • inflates type I error \(\alpha\)
  • correct \(\alpha\)
    • interim analysis specific \(\alpha_i\) with overall \(\alpha\) under control
  • suggested technique: alpha spending
    • plan in advance
    • use O'Brien–Fleming bounds
    • NOT GPower

for fun: P(effect exists | test says so)

  • power → P(test says there is effect | effect exists)
  • \(P(infer=Ha|truth=Ho) = \alpha\)
  • \(P(infer=Ho|truth=Ha) = \beta\)
  • \(P(infer=Ha|truth=Ha) = power\)
  • \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
  • \(P(truth=Ha|infer=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
  • \(P(truth=Ha|infer=Ha) = \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
  • IF very low probability model is true, eg., .01 ? → \(P(truth=Ha) = .01\)
  • THEN probability effect exists if test says so is low, in this case only 14% !!
  • \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\)

effect sizes

  • degree to which a certain phenomenon holds (~ Ho is false)
    • part of non-centrality (as is sample size) → shift in GPower
    • signal to noise ratio
      • typically not just the signal, to provide scale
      • eg., difference on scale of pooled standard deviation
    • bigger effect → easier to detect (pushing away Ha)
  • 2 main families of effect sizes → test specific
    • differences d-family // association r-family
    • transformations, eg., d = .5 → r = .243
      • \(d = \frac{2r}{\sqrt{1-r^2}}\); \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\); \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
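The conversion formulas translate directly into code (a sketch; the function names are ours, not GPower's):

```python
# Conversions between d- and r-family effect sizes (slide's formulas)
from math import sqrt, log, pi

def d_to_r(d):
    return d / sqrt(d ** 2 + 4)

def r_to_d(r):
    return 2 * r / sqrt(1 - r ** 2)

def or_to_d(odds_ratio):          # logistic: d = ln(OR) * sqrt(3) / pi
    return log(odds_ratio) * sqrt(3) / pi

print(round(d_to_r(0.5), 3))      # 0.243
```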

effect sizes of Cohen

  • Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
  • famous Cohen conventions
    • beware, just rules of thumb
    • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed).

effect sizes of d and r family

  • most important effect size of
    • d: dichotomous - continuous
    • r: correlation - proportion variance
  • Ellis, P. D. (2010). The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.
  • more than 70 different effect sizes... most of them related to each other
  • NOT p-value ~ partly effect size, partly sample size or power
    • do not simply compare p-values !

effect sizes in GPower (Determine)

  • often very difficult to specify
  • GPower offers help with Determine
    • difference group means
      0-2 → signal ~ minimally relevant (or expected)
    • standard deviations (sd)
      4 each group → expected noise ~ natural diversity
    • written to Effect Size d
      .5 → difference in sd
  • Reminder: effect size statistic depends on statistical test

effect size exercise : ingredients cohen d

For the reference example:

  • change mean values from 0 and 2 to 4 and 6, what changes ?
  • change sd values to 2 for each, what changes ?
    • effect size ?
    • total sample size ?
    • non-centrality ?
    • critical t ?
  • change sd values to 6 for each, what changes ?

effect size exercise : plot

  • plot power by effect size
  • set plot to 6 curves
    • for sample sizes, 34 in steps of 34
  • set effect sizes on x-axis
    • from .2 to 1.2 in steps of .05
  • use \(\alpha\) equal to .05
  • create plot
    (X-Y plot for range of values)
  • determine (approximately) the three situations from previous slide on the plot
  • how does power change when doubling the effect size, eg., from .5 to 1 ?

effect size exercise : imbalance

For the reference example:

  • change allocation ratio from 1
    • to 2, .5, 3 and 4, what to conclude ?
      • ratio 2 and .5 ?
      • imbalance + 1 or * 2 ?
  • unless there is a reason for n1 \(\neq\) n2, keep the design balanced

effect sizes, how to determine them in theory

  • choice of effect size matters → justify choice
  • choice of effect size
    • NOT significant → meaningless, dependent on sample size
    • realistic (eg., previously observed effect) → replicate
    • important (eg., minimally relevant effect)
  • use Determine to get started (check the manual)
    • for independent t-test → means and standard deviations
    • possible alternative is to use variance explained, eg., 1 versus 16

effect sizes, how to determine them in practice

  • experts / patients → use if possible → importance
  • literature (earlier study / systematic review) → realistic
  • pilot → guesstimate dispersion, but very small sample size
  • internal pilot → stopping rule (sequential/conditional)
  • turn to Cohen → use if everything else fails (rules of thumb)
  • guesstimate the input parameters, what can you do ?
    • sd from assumed range / 6 assuming normal distribution
    • sd for proportions (& percentages) at conservative .5
    • sd from control, assume treatment the same

relation samples & effect size, errors I & II

  • building blocks:
    • sample size (\(n\))
    • effect size (\(\Delta\))
    • alpha (\(\alpha\))
    • power (\(1-\beta\))
  • each parameter
    conditional on others
  • GPower → type of power analysis
    • Apriori: \(n\) ~ \(\alpha\), power, \(\Delta\)
    • Post Hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
    • Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
    • Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
    • Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)

type of power analysis exercise

  • for given example, step through...
    • retrieve power given n, \(\alpha\) and \(\Delta\)
    • [1] for power .8, take half the sample size, how does \(\Delta\) change ?
    • [2] set \(\beta\)/\(\alpha\) ratio to 4, what is \(\alpha\) & \(\beta\) ? what is the critical value ?
    • [3] keep \(\beta\)/\(\alpha\) ratio to 4 for effect size .5, what is \(\alpha\) & \(\beta\) ? critical value ?
  • [1] effect size increases from .5 to .7115 (+.2115): a bigger effect size compensates for the loss of sample size (sensitivity)

  • [2] critical value 1.9990 with errors approx. .05 and .2 (compromise)

  • [3] critical value 1.6994 with errors approx. double, .09 and .38

getting your hands dirty

# calculator (MATLAB-style: tinv, nctcdf)
m1=0;m2=2;s1=4;s2=4
alpha=.025;N=128             # total N; alpha already halved (two-tailed)
var=.5*s1^2+.5*s2^2          # pooled variance
d=abs(m1-m2)/sqrt(2*var)     # standardized difference / sqrt(2)
d=d*sqrt(N/2)                # non-centrality parameter
tc=tinv(1-alpha,N-2)         # critical t, df = N-2 = 126
power=1-nctcdf(tc,N-2,d)     # power from the non-central t
  • in R, assuming normality
    • qt → get quantile on Ho (\(Z_{1-\alpha/2}\))
    • pt → get probability on Ha (non-central)
.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.power <- 1 -
    pt(
        qt(.975,df=.df),
        df=.df, ncp=.ncp
    ) - 
    pt( qt(.025,df=.df), df=.df, ncp=.ncp)
round(.power,4)
## [1] 0.8015

PART II: moving beyond independent t-test

GPower beyond independent t-test

  • so far, comparing two independent means
  • selected topics with small exercises
    • dependent instead of independent
    • non-parametric instead of assuming normality
    • relations instead of groups (regression)
    • correlations
    • proportions, dependent and independent
    • more than 2 groups (compare jointly, pairwise, focused)
    • more than 1 predictor
    • repeated measures
  • GPower manual 27 tests: effect size, non-centrality parameter and example !!

dependence between groups

  • if 2 dependent groups (eg., before/after treatment) → account for correlations
  • matched pairs (t-test / means, difference 2 dependent means)
  • use reference example
    • [1] use correlation .5 to compare (effect size, ncp, n)
    • [2] how many observations if no correlation exists (reference example) ?
    • [3] difference in sample size for correlation .875 ?
    • [4] set original sample size (n=64*2) and effect size (dz=.5), compare ?
  • [1] \(\Delta\) looks same: \(\sqrt{2*(1-\rho)}\), n much smaller (1 group), ncp bit bigger
  • [2] approx. independent means, here 65 (estimate the correlation)
  • [3] effect size * 2 → sample size from 34 to 10
  • [4] -posthoc- power > .975: for 64 subjects 2 measurements, ncp > 4
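Exercise [1] can be cross-checked by solving for the smallest number of pairs reaching power .8 (a sketch assuming scipy; dz = d / √(2(1-ρ)) as on the slide):

```python
# Matched pairs: dz = d / sqrt(2*(1 - rho)); smallest n (pairs) with power >= .8
from math import sqrt
from scipy.stats import t, nct

def paired_n(d=0.5, rho=0.5, alpha=0.05, target=0.8):
    dz = d / sqrt(2 * (1 - rho))      # effect size of the differences
    n = 2
    while True:
        df, ncp = n - 1, dz * sqrt(n)
        tc = t.ppf(1 - alpha / 2, df)
        power = 1 - nct.cdf(tc, df, ncp) + nct.cdf(-tc, df, ncp)
        if power >= target:
            return n
        n += 1

print(paired_n())          # 34 pairs for rho = .5
print(paired_n(rho=0.0))   # ~65, close to the independent design
```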

non-parametric distribution

  • expect non-normally distributed residuals, avoid normality assumption
  • only considers ranks or uses permutations → price is efficiency
  • avoid when possible, eg., transformations
  • two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
  • use reference example
    • [1] how about n ? compared to parametric → what is % loss efficiency ?
    • [2] change parent distribution to 'min ARE' ? what now ?
  • [1] a few more observations (3 more per group), less than 5 % loss

  • [2] several more observations, less efficient, more than 13 % loss (min ARE)

relations instead of group differences

  • differences between groups → relation observations & categorization
  • example → d = .5 → r = .243
  • note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\)
  • regression coefficient (t-test / regression, one group size of slope)
  • sample size for comparing the slope Ha with 0 (=Ho)
    • [1] determine slope (\(\beta\), with \(\sigma_y\) = 4.12 and \(\sigma_x\) = .5)
    • [2] calculate sample size
    • [3] what if \(\sigma_x\) (predictor values) or \(\sigma_y\) (effect and error) increase ?
  • [1] for r = .243, \(\sigma_y\) = \(\sqrt{17}\) = 4.12, and \(\sigma_x\) = \(\sqrt{.25}\) = .5 (binary) → 2

  • [2] 128, same as for reference example, now with effect size \(\beta\)

  • [3] sample size decreases with \(\sigma_x\) (opposite \(\sigma_y\) ~ effect size), for same slope

relations: a variance perspective

  • between and within group variance → relation observations & categorization
  • regression coefficient (t-test / regression, fixed model single regression coef)
  • use reference example, regression style
    • variance within 4\(^2\) and between 1\(^2\), totaling \(\sigma_y^2\) = 17
    • [1] calculate sample size, compare effect sizes ?
    • [2] what if also other predictors in the model ?
  • [1] 128, same as for reference example, now with f\(^2\) = .25\(^2\) = .0625.

  • [2] loss of degree of freedom, very little impact

  • note: \(f^2={R^2/{(1-R^2)}}\)

more groups to compare, 4 cases

  • simple example: assume one control, and two treatments
  • if more than two groups, several options
    • test whether at least one differs → omnibus F-test (variances)
    • test whether all differ from each other → pairwise comparisons
    • test whether selected pairs differ → contrast (t-test)
    • test whether linear combinations of pairs differ → contrasts (t-tests)
      eg., control versus each of the average of treatments


  • if multiple tests → inflation of type I error (\(\alpha\))
    • correct \(\alpha\) or p-value, eg., using Bonferroni
    • make more tentative inferences

F-test statistic

  • multiple groups → not one effect size d
  • F-test statistic & effect size f
  • f is the ratio of standard deviations \(\sigma_{between} / \sigma_{within}\) (so \(f^2\) is a ratio of variances)
  • example: one control and two treatments
    • reference example + 1 group
    • within group observations normally distributed
    • means C=0, T1=2 and T2=4
    • sd for all groups (C,T1,T2) = 4
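The effect size f for this three-group example follows from the means and the common sd (a sketch):

```python
# f = sd_between / sd_within for means C=0, T1=2, T2=4 and sd 4
from math import sqrt

means, sd_within = [0, 2, 4], 4
grand = sum(means) / len(means)
sd_between = sqrt(sum((m - grand) ** 2 for m in means) / len(means))
f = sd_between / sd_within
print(round(f, 4))   # 0.4082
```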

more groups: omnibus

  • for one control and two treatments → test that at least one differs
  • one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
  • effect size f, with numerator/denominator df (derived from \(\eta^2\))
  • start from reference example,
    • [1] what is the sample size ? ncp ? critical F ? does size matter ?
    • [2] set extra group, either mean 1 or 4, what are the effect / sample sizes ?
      and with the mean of middle group away from global mean ?
    • [3] derive effect size with variance between .5 and within 3, 1 and 6 ?
  • [1] different effect size (f), distribution, same sample size 128 (size ~ imbalance)

  • [2] n=237 or 63, with ncp 9.87 or 10.5, middle group obscures and vice versa
    effect size increases, picks up difference 0 and 4!

  • [3] same effect size, so, same sample size, 63, ncp 10.5 (1/7th explained)

more groups: pairwise

  • assume one control, and two treatments
    • interested in all three pairwise comparisons → maybe Tukey
      • typically run a posteriori, after omnibus shows effect
    • use t-test with correction of \(\alpha\) for multiple testing
  • apply Bonferroni correction for original 3 group example
    • [1] resulting sample size for three tests ?
    • [2] what if biggest difference is ignored, sample size ?
    • [3] with original 64 sized groups, what is the power ?
  • [1] divide \(\alpha\) by 3 (86*2) → overall 86*3 = 258

  • [2] or divide by 2 (78*2) (biggest difference implied) → overall 78*3 = 234

  • [3] .6562 when /3 or .7118 when /2, power-loss

sample size calculation benefit from focus

  • better to focus during the design on specific questions

    • only consider the main comparisons in focus (eg. primary endpoints)
      • only interested in comparing two treatments → t-test
    • only consider smallest of relevant effects, largest sample size
    • set up contrasts (next slide)
  • sample size calculations (design)

    • not necessarily equivalent to statistics
    • requires justification to convince yourself and/or reviewers
  • example:

    • statistics: group difference evolution 4 repeated measurements → mixed model
    • power: difference treatment and control last time point → t-test

more groups: contrasts

  • assume one control and two treatments
    • set up 2 contrasts for T1 - C and T2 - C
    • set up 1 contrast for average(T1,T2) - C
  • each contrast requires 1 degree of freedom
  • each contrast combines a specific number of levels
  • effect sizes for planned comparisons must be calculated !!
    • contrasts (linear combination)
    • standard deviation of contrasts

      \(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)
      with group means \(\mu_i\), pre-specified coefficients \(c_i\), sample sizes \(n_i\) and total sample size \(N\)

more groups: contrasts exercise

  • one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
  • obtain effect sizes for contrasts (assume equally sized for convenience)
    • \(\sigma_{contrast}\) T1-C: \(\frac{|-1*0 + 1*2 + 0*4|}{\sqrt{2*((-1)^2+1^2+0^2)}} = 1\); T2-C: \(= 2\); (T1+T2)/2-C: \(= 1.4142\)
    • with \(\sigma\) = 4 → ratio of variances for effect sizes f .25, .5, .3536
  • sample size for each contrast, each 1 df and 2 groups
    • [1] contrasts nrs. 1 or 2
    • [2] contrasts nrs. 1 AND 2
    • [3] contrasts nr. 3
  • [1] total sample size 128 (again!!) as \(d=2f\), 64 C and 64 T1, or 34 = 17 C and 17 T2

  • [2] same with Bonferroni correction → 155 and 41 → 78 C, 78 T1, 21 T2

  • [3] total sample size 65 → 22 in each group
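The contrast effect sizes used above can be reproduced (a sketch; the group count k includes only groups with non-zero coefficients, matching the slide's computation):

```python
# f for a contrast: sigma_contrast / sd, with equal group sizes
from math import sqrt

def f_contrast(means, coefs, sd):
    k = sum(1 for c in coefs if c != 0)          # groups involved
    signal = abs(sum(m * c for m, c in zip(means, coefs)))
    sigma_contrast = signal / sqrt(k * sum(c ** 2 for c in coefs))
    return sigma_contrast / sd

means, sd = [0, 2, 4], 4
print(round(f_contrast(means, [-1, 1, 0], sd), 4))     # T1-C: 0.25
print(round(f_contrast(means, [-1, 0, 1], sd), 4))     # T2-C: 0.5
print(round(f_contrast(means, [-1, .5, .5], sd), 4))   # (T1+T2)/2-C: 0.3536
```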

multiple factors

  • multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
  • multiple main effects and interaction effects
    • interaction: group specific difference between groups
      • degrees of freedom (#A-1)*(#B-1)
    • main effects: if no interaction (#X-1)
    • get effect sizes for two way anova
      https://icds.shinyapps.io/effectsizes/
  • sample size for reference example, assume a second predictor is trivial
    • [1] what is partial \(\eta^2\) ?
    • [2] sample size ?
  • [1] .0588 (0-2, sd=4)

  • [2] 128 again, with 2 groups (158 with 3 groups, df=2)

dependence within groups (repeated)

  • if repeated measures → account for correlations
  • repeated measures (F-test / Means, repeated measures...)
  • 3 main types
    • within: like dependent t-test for 2 or more measurements
    • between: use of multiple measurements per group
    • interaction: difference of change over groups
  • correlation within subject (unit)
    • informative on group differences within subject
    • redundancy for between group differences

repeated measures within

  • possible to have only 1 group (within subject comparison)
  • use effect size f = .25 (1/16 explained versus unexplained)
    • [1] use zero correlation to compare with sample size independent t-test
    • [2] for one group use correlation .5, compare sample size dependent t-test
    • [3] double number of groups to 2
    • [4] double number of measurements to 4 (correlation 0 and .5), impact ?
  • number of groups = 1, number of measurements = 2, sample size = [1] 65 and [2] 34
  • [3] changed degrees of freedom, sample size could have changed (power, crit. F)
  • [4] impact of more measurements bigger with higher correlation

repeated measures between

  • use effect size f = .25 (1/16 for variance or 2/4 for means)
    • [1] use correlation 0 and .5 with 2 groups and 2 measurements, sample size ?
    • [2] for correlation .5, compare 2 or 4 measurements, sample size ?
    • [3] double number of groups to 4
  • [1] sample size higher when higher correlation (66x2=132 for 0, 98x2=196 for .5)
  • [2] sample size lower when more measurements, unless correlation is 1 (82x2=164)
  • [3] more groups require higher sample size

repeated measures within x between

  • SPSS idiosyncrasies: https://www.youtube.com/watch?v=CEQUNYg80Y0
  • use effect size f = .25 (1/16 for variance)
    • [1] use correlation 0, compare 2 groups 2 measurements with rep. between ?
    • [2] use correlation 0.5, compare 2 groups 2 measurements with rep. within ?
    • [3] use correlation .5, compare 2 groups and 4 measurements, sample size ?
    • [4] repeat with 4 groups and 4 measurements, sample size ?
  • [1] same with indep and [2] same with dependent
  • [3] more groups, higher sample size (identical to within)
  • [4] difference between within and between

correlations

  • when comparing two independent correlations
  • z-tests / correlation & regressions: 2 indep. Pearson r's
  • makes use of Fisher Z transformations → z = .5 * log(\(\frac{1+r}{1-r}\)) → q = z1-z2
  • [1] assume correlation coefficients .7844 and .5 effect size & sample size ?
  • [2] assume .9844 and .7, effect size & sample size ?
  • [3] assume .1 and .3844 effect size & sample size ?
  • [1] effect size q = 0.5074, sample size 64*2 = 128

  • [2] effect size q = 1.5556, sample size 10*2 = 20, same difference, bigger effect

  • [3] effect size q = -0.3048, sample size 172*2 = 344, negative and smaller effect

  • note that dependent correlations are more difficult, see manual
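Exercise [1] can be sketched with the Fisher z transformation; the per-group n uses the normal-approximation standard error \(\sqrt{2/(n-3)}\) (an assumption, consistent with the result above):

```python
# Two independent Pearson r's: effect size q and per-group sample size
from math import atanh, ceil
from scipy.stats import norm

r1, r2 = 0.7844, 0.5
q = atanh(r1) - atanh(r2)             # Fisher z difference, ~0.5074
z = norm.ppf(0.975) + norm.ppf(0.8)   # z_{alpha/2} + z_{beta}
n = 2 * (z / q) ** 2 + 3              # per group
print(round(q, 4), ceil(n))           # 0.5074, 64 per group
```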

proportions

  • comparing two independent proportions → bounded between 0 and 1
  • Fisher Exact Test (exact / proportions, difference 2 independent proportions)
  • effect sizes in odds ratio, relative risk, difference proportion
    • [1] for odds ratio 2, p2 = .60, what is p1 ?
    • [2] sample size for equal sized, and type I and II .05 and .2 ?
    • [3] sample size when .95 and .8 (difference of .15) and .05 and .2 ?
  • [1] odds ratio 2 * (.6/.4) = 3 (odds), 3/(3+1) = .75
  • [2] total sample size 328, [3] total sample size 164, either at .05 or .95
  • treat as if unbounded, ok within .2 - .8, variance is p*(1-p) → maximally .25 !!

    • [4] use t-test for difference of .15
  • [4] effect size .3, sample size 352 (> 328)
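Answer [4] (the conservative t-test detour) can be reproduced by solving for the smallest per-group n (a sketch assuming scipy):

```python
# Two-sample t-test: smallest per-group n with power >= .8
from math import sqrt
from scipy.stats import t, nct

def two_sample_n(d, alpha=0.05, target=0.8):
    n = 2
    while True:
        df, ncp = 2 * n - 2, d * sqrt(n / 2)
        tc = t.ppf(1 - alpha / 2, df)
        power = 1 - nct.cdf(tc, df, ncp) + nct.cdf(-tc, df, ncp)
        if power >= target:
            return n
        n += 1

d = 0.15 / 0.5                 # difference .15 on conservative sd .5
print(2 * two_sample_n(d))     # 352 total
```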

proportions exercise

  • Fisher Exact Test
    • power over proportions .5 to 1
    • 5 curves, sample sizes 328, 428, 528...
    • type I error .05
    • [1] what happens ?
    • [2] repeat for one-tailed, what is different ?
  • [1] power for proportion compared to reference .6, sample size determines impact

  • [2] one-tailed increases power on both sides (absolute value of difference);
    choose the tail on the correct side

dependent proportions

  • when comparing two dependent proportions

  • McNemar test (exact / proportions, difference 2 dependent proportions)

    • include correlations implicitly, discordant pairs → change
    • effect size as odds ratio → ratio of discordance ?!
  • assume odds ratio equal to 2, equal sized, type I and II errors .05 and .2, two-way

    • [1] what is the sample size for .25 proportion discordant, and [2] .5, and [3] 1
    • [4] for odds ratio 4 or .25, how the proportion p12 and p21 change ?
    • [5] repeat for third alpha option, and consider total sample size, what happens ?
  • [1] total sample size 288 & [2] 144 & [3] impossible, but limits to 72

  • [4] for proportion discordant, 1 to 4 or 4 to 1

  • [5] sample size differs because of side effects

conclusion: keep it simple, keep it real

  • sample size calculation is a design issue, not a statistical one
  • building blocks: sample & effect sizes, type I & II errors, each conditional on rest
  • effect sizes express the amount of signal compared to the background noise
  • complex models imply complex sample size calculations, if at all possible
  • GPower deals with not too complex models
    • simplify using a focus, if justifiable → then GPower can get you a long way
    • use more complex specification for more complex sample size calculations
    • leave GPower, simulation is always an option


about us ...