ICDS

Workshop offered by the Interfaculty Center Data processing and Statistics (icds.be).

Current draft (Apr 04, 2021) aims to introduce researchers to the key ideas in sample size calculation that would help them design their study (55 pages). Our target audience is primarily the research community at VUB / UZ Brussel.

We invite you to help us improve this document by sending us feedback
wilfried.cools@vub.be or anonymously at icds.be/consulting (right side, bottom)

01 Sample Size Calculation

our program
- part I: understand the reasoning
  - introduce building blocks
  - implement on t-test
- part II: explore more complex situations
  - beyond the t-test
  - simple but common
not one formula for all → GPower to the rescue
- a few exercises

02 Sample Size Calculation: demarcation

how many observations will be sufficient ?
- avoid too many, because typically observations imply a cost
  - money / time → limited resources
  - risk / harm → ethical constraints
- depends on the aim of the study
  - research aim → statistical inference
linked to statistical inference (using standard error)
- testing → power [probability to detect effect]
- estimation → accuracy [size of confidence interval]

03 Sample Size Calculation: a difficult design issue

before data collection, during design of study
- requires understanding: future data, analysis, inference (effect size, focus, …)
- conditional on assumptions & decisions
not always possible nor meaningful !
- easier for experiments (control), less for observational studies
- easier for confirmatory studies, much less for exploratory studies
- not possible for predictive models, because no standard error
- NO retrospective power analyses → OK for future study only
  Hoenig, J., & Heisey, D. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.
alternative justifications:
- common practice, feasibility → non-statistical (importance, low cost, …)

04 Simple Example

experimental - confirmatory
evaluation of radiotherapy to reduce a tumor in mice
comparing treatment group with control (=conditions)
tumor induced, random assignment treatment or control (equal if no effect)
after 20 days, measurement of tumor size (=observations)
intended analysis: unpaired t-test to compare averages for treatment and control
SAMPLE SIZE CALCULATION:
- IF average tumor size for treatment at least 20% less than control (4 vs. 5mm)
- THEN how many observations, sufficient to detect that difference (significance) ?

05 Reference Example

sample sizes easy and meaningful to calculate for well understood problems

apriori specifications

intend to perform a statistical test
comparing 2 equally sized groups
to detect difference of at least 2
assuming an uncertainty of 4 SD on each mean
which results in an effect size of .5
evaluated on a Student t-distribution
allowing for a type I error prob. of .05 \((\alpha)\)
allowing for a type II error prob. of .2 \((\beta)\)

sample size
conditional on specifications being true

https://apps.icds.be/shinyt/

06 A formula you could use

for this particular case:
- sample size (n → ?)
- difference (d=signal → 2)
- uncertainty (\(\sigma\)=noise → 4)
- type I errors (\(\alpha\) → .05, so \(Z_{\alpha/2}\) → -1.96)
- type II errors (\(\beta\) → .2, so \(Z_\beta\) → -0.84)
sample size = 2 groups x 63 observations = 126
note: formula’s are test and statistic specific but logic remains same
this and other formula’s implemented in various tools
our focus: GPower

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2}\) \(n = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79\)

07 GPower: the building blocks in action

SIZES: effect size, sample size
ERRORS:
- Type I (\(\alpha\)) defined on distribution Ho
- Type II (\(\beta\)) evaluated on distribution Ha
calculate sample size based on effect size, and type I / II error

08 GPower: a useful tool

popular and well established
free @ http://www.gpower.hhu.de/
implements wide variety of tests
implements various visualizations
documented fairly well
note: not all tests are included !
note: not without flaws !
other tools exist (some paying)
for complex models: impossible
alternative: simulation (generate and analyze)

09 GPower input

~ reference example
t-test : difference two indep. means
apriori: calculate sample size
effect size = standardized difference [Determine]
- Cohen’s \(d\)
- \(d\) = |difference| / SD_pooled
- \(d\) = |0-2| / 4 = .5
\(\alpha\) = .05
2 - tailed (\(\alpha\)/2 → .025 & .975)
\(power = 1-\beta\) = .8
allocation ratio = 1
(equally sized groups)

10 GPower output

sample size (\(n\)) = 64 x 2 = (128)
degrees of freedom (\(df\)) = 126 (128 - 2)
critical t = 1.979
- decision boundary given \(\alpha\) and \(df\)
  qt(.975,126)
non centrality parameter (\(\delta\)) = 2.8284
- shift Ha (true) away from Ho (null)
  2/(4*sqrt(2))*sqrt(64)
distributions: central Ho and non-central Ha
power ≥ .80 (1-\(\beta\)) = 0.8015

11 Protocol: reference example

Protocol: summary for future reference or communication
File/Edit save or print file (copy-paste)

t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size

Input:
Tail(s) = Two
Effect size d = 0.5000000
α err prob = 0.05
Power (1-β err prob) = .8
Allocation ratio N2/N1 = 1 Output:
Noncentrality parameter δ = 2.8284271
Critical t = 1.9789706
Df = 126
Sample size group 1 = 64
Sample size group 2 = 64
Total sample size = 128
Actual power = 0.8014596

12 Building Blocks

distributions: Ho & Ha ~ test dependent shape
sizes: sample size & effect size ~ shift between Ho & Ha
errors: type I error & type II error ~ cut-off at Ho & Ha

13 GPower Statistical Tests

test family - statistical tests [in window] Exact Tests (8) \(t\)-tests (11) → `reference` \(z\)-tests (2) \(\chi^2\)-tests (7) \(F\)-tests (16) focus on the density functions		tests [in menu] correlation & regression (15) means (19) → `reference` proportions (8) variances (2) focus on the type of parameters

14 Central Ho and Non-Central Ha Distributions

Ho acts as \(\color{red}{benchmark}\) → eg., no difference
- set \(\color{green}{cut off}\) on Ho ~ t(ncp=0,df) using \(\alpha\),
- reject Ho if test returns implausible value
Ha acts as \(\color{blue}{truth}\) → eg., difference of .5 SD
- Ha ~ t(ncp!=0,df)
- ncp as violation of Ho → shift (location/shape)
ncp : non-centrality parameter combines
- assumed effect size (target or signal)
- conditional on sample size (information)
ncp : determines overlap → power ↔︎ sample size
- probability beyond \(\color{green}{cut off}\) at Ho evaluated on Ha

https://apps.icds.be/shinyt/

15 Note: Divide by N Perspective as alternative

divide by n: sample size ~ standard deviation
non-centrality parameter: sample size ~ location

\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2}\)
\(n = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2}\)
\(n = 62.79\)

https://apps.icds.be/shinyt/

16 Note: Ho and Ha, asymmetry in statistical testing

Ha is NOT interchangeable with Ho
cut-off at Ho using \(\alpha\)
- in statistics → observe test statistics (Ha unknown)
- in sample size calculation → assume Ha
in statistics → if fail to reject then remain in doubt
- absence of evidence \(\neq\) evidence of absence
  - p-value → P(statistic|Ho) != P(Ho|statistic)
  - example: evidence for insignificant \(\eta\) same as for \(\eta\) * 2
equivalence testing → Ha for ‘no effect’
- reject Ho that smaller than 0 - |\(\delta\)| AND bigger than 0 + |\(\delta\)|
- acts as two superiority tests combined

17 Type I/II Error Probability

inference test based on cut-off’s (density → AUC=1) type I error: incorrectly reject `Ho` (false positive): cut-off at `Ho`, error prob. \(\alpha\) controlled one/two tailed → one/both sides informative ? type II error: incorrectly fail to reject `Ho` (false negative): cut-off at `Ho`, error prob. \(\beta\) depends on `Ha` `Ha` assumed known in a power analyses power = 1 - \(\beta\) = probability correct rejection (true positive) inference versus truth infer: effect exists vs. unsure truth: effect exist vs. does not
	infer=Ha	infer=Ho	sum
truth=Ho	\(\alpha\)	1-\(\alpha\)	1
truth=Ha	1-\(\beta\)	\(\beta\)	1

18 Exercise on Errors, create plot

~ reference example
create plot
(X-Y plot for range of values)
plot sample size by type I error
set plot to 4 curves
- for power .8 in steps of .05
set \(\alpha\) on x-axis
- from .01 to .2 in steps of .01
use effect size .5

notice Table option

19 Exercise on Errors, interpret plot

where on the red curve (right)
type II error = 4 * type I error ?
when smaller effect size (.25), what changes ?
switch power and sample size (32 in step of 32)
what is relation type I and II error ?

what would be difference between curves for \(\alpha\) = 0 ?

20 Decide Type I/II Error Probability

popular choices
- \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
- \(\beta\) often in range .2 to .1 → power = 80% to 90%
\(\alpha\) & \(\beta\) inversely related
- \(\alpha\) & \(\beta\) often selected in 1/4 ratio
  type I error is 4 times worse !!
- which error you want to avoid most ?
  - cheap aids test ? → avoid type II
  - heavy cancer treatment ? → avoid type I
- probability for errors always exists

21 Control Type I Error

multiple testing
- inflates type I error \(\alpha\)
- family of tests: \(1-(1-\alpha)^k\) → correct, eg., Bonferroni (\(\alpha/k\))
- interim analysis (analyze and proceed) → correct, eg., alpha spending
interim analysis
- plan in advance
- O’Brien-Flemming bounds, more efficient than Bonferroni
- NOT GPower
  - simulation tool: http://apps.icds.be/simAlphaSpending/
- determine boundaries with PASS, R (ldbounds), …

22 for fun: P(effect exists | test says so)

power → P(test says there is effect | effect exists)
\(P(infer=Ha|truth=Ho) = \alpha\)
\(P(infer=Ho|truth=Ha) = \beta\)
\(P(infer=Ha|truth=Ha) = power\)
\(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
\(P(truth=Ha|infer=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
\(P(truth=Ha|infer=Ha) = \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
IF very low probability model is true, eg., .01 ? → \(P(truth=Ha) = .01\)
THEN probability effect exists if test says so is low, in this case only .14 !!
\(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\)

23 Effect Sizes, in principle

estimate/guestimate of minimal magnitude of interest
typically standardized: signal to noise ratio (noise provides scale)
- eg., difference on scale of pooled standard deviation
- eg., effect size \(d\)=.5 means .5 standard deviations
part of non-centrality (as is sample size) → pushing away Ha
~ practical significance (as opposed to statistical significance ~ sample size)
2 main families of effect sizes (test specific)
- d-family (differences) and r-family (associations)
- transform one into other, eg., d = .5 → r = .243
  \(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\) \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\) \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
NOT p-value ~ partly effect size, but also partly sample size

24 Effect Sizes, in literature

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed).
famous Cohen conventions but beware, just rules of thumb

more than 70 different effect sizes… most of them related
Ellis, P. D. (2010). The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.

25 Effect Sizes, in GPower (Determine)

effect sizes are test specific
- t-test → group means and sd’s
- one-way anova →
  variance explained & error
- regression →
  again other parameters
- . . . .
GPower helps with Determine
- sliding window
- one or more effect size specifications

26 Exercise on Effect Sizes, ingredients Cohen’s d

For the reference example:

change mean values from 0 and 2 to 4 and 6, what changes ?
change sd values to 2 for each, what changes ?
- effect size ?
- total sample size ?
- critical t ?
- non-centrality ?
change sd values to 8 for each, what changes ?
change sd to 2 and 5.3, or 1 and 5.5,
how does it compare to 4 and 4 ?

27 Exercise on Effect Sizes, plot

plot powercurve: power by effect size
compare 6 sample sizes: 34 in steps of 34
for a range of effect sizes in between .2 and 1.2
use \(\alpha\) equal to .05
pinpoint the situations from previous section on the plot (sd=4 and 2).
how does power change when doubling the effect size ?

powercurve → X-Y plot for range of values

28 Exercise on Effect Size, imbalance

For the reference example:

compare for allocation ratios 1, .5, 2, 10, 50
repeat for effect size 1, and compare
? no idea why n1 \(\neq\) n2

after calculate plot, to change allocation ratio

29 Effect Sizes, how to determine them in theory

choice of effect size matters → justify choice !!
choice of effect size depends on aim of the study
- realistic (eg., previously observed effect) → replicate
- important (eg., minimally relevant effect)
- NOT significant → meaningless, dependent on sample size
choice of effect size dependent on statistical test of interest
- for independent t-test → means and standard deviations
- possible alternative: variance explained, eg., 1 versus 16+1
  - with one-way ANOVA (f=.25 instead of d=.5)
  - with linear regression (f\(^2\)=.0625 instead of d=.5)
  - https://www.psychometrica.de/effect_size.html#transform

30 Effect Sizes, how to determine them in practice

experts / patients → use if possible → importance
literature (earlier study / systematic review) → beware of publication bias → realistic
pilot → guestimate dispersion estimate (not effect size → small sample)
internal pilot → conditional power (sequential)
guestimate uncertainty…
- sd from assumed range, assume normal and divide by 6
- sd for proportions at conservative .5
- sd from control, assume treatment the same
- ...
turn to Cohen → use if everything else fails (rules of thumb)
- eg., .2 - .5 - .8 for Cohen’s d

31 Relation Sample & Effect Size, type I & II Errors

building blocks:
- sample size (\(n\))
- effect size (\(\Delta\))
- alpha (\(\alpha\))
- power (\(1-\beta\))
each parameter
conditional on others

GPower → type of power analysis
- Apriori: \(n\) ~ \(\alpha\), power, \(\Delta\)
- Post Hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
- Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
- Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
- Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)

32 Exercise on Type of Power Analysis

retrieve power given n, \(\alpha\) and \(\Delta\) of reference case
then, for power .8, take half the sample size, how does \(\Delta\) change ?
then, set \(\beta\)/\(\alpha\) ratio to 4, what is \(\alpha\) & \(\beta\) ? what is the critical value ?
then, keep \(\beta\)/\(\alpha\) ratio to 4 for effect size .7, what is \(\alpha\) & \(\beta\) ? critical value ?

Solution for Type of Power Analysis

retrieve power given n, \(\alpha\) and \(\Delta\) of reference case
- use post-hoc 64x2 → .8
then, for power .8, take half the sample size, how does \(\Delta\) change ?
- use sensitivity 32x2 (d=.7114)
- \(\Delta\) from .5 to .7115 = .2115
- bigger effect \(\Delta\) compensates loss of sample size n
then, set \(\beta\)/\(\alpha\) ratio to 4, what is \(\alpha\) & \(\beta\) ? what is the critical value ?
- use compromise 32x2
- \(\alpha\) =.09 and \(\beta\) =.38, critical value 1.6994
then, keep \(\beta\)/\(\alpha\) ratio to 4 for effect size .7
- use compromise 32x2
- \(\alpha\) =.05 and \(\beta\) =.2, critical value 1.9990

33 getting your hands dirty

# calculator
m1=0;m2=2;s1=4;s2=4
alpha=.025;N=128
var=.5*s1^2+.5*s2^2
d=abs(m1-m2)/sqrt(2*var)*sqrt(N/2)
tc=tinv(1-alpha,N-1)
power=1-nctcdf(tc,N-1,d)

in R
qt → get quantile on Ho (\(Z_{1-\alpha/2}\))
pt → get probability on Ha (non-central)

.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.power <- 1 -
    pt(
        qt(.975,df=.df),
        df=.df, ncp=.ncp
    ) - 
    pt( qt(.025,df=.df), df=.df, ncp=.ncp)
round(.power,4)

## [1] 0.8015

34 GPower, beyond the independent t-test

so far, comparing two independent means
selected topics with small exercises
- dependent instead of independent
- non-parametric instead of assuming normality
- relations instead of groups (regression)
- correlations
- proportions, dependent and independent
- more than 2 groups (compare jointly, pairwise, focused)
- more than 1 predictor
- repeated measures
GPower manual 27 tests: effect size, non-centrality parameter and example !!

35 Dependence between groups

if 2 dependent groups (eg., before/after treatment) → account for correlations
correlation typically obtained from pilot data, earlier research
GPower: matched pairs (t-test / means, difference 2 dependent means)
- use reference example, and assume correlation .5 to compare with reference effect size, ncp, n
- how many observations if no correlation exists (think then try) ? effect size ?
- what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
- what would the power be with the reference sample size, n=128, but now cor=.5 ?

Solution for dependence between groups

GPower: matched pairs (t-test / means, difference 2 dependent means)
use reference example, and assume correlation .5 to compare with reference effect size, ncp, n
- \(\Delta\) looks same, n much smaller = 34
- different type of effect size: dz ~ d / \(\sqrt{2*(1-\rho)}\)
- also note: 34x2 measurements
how many observations if no correlation exists (think then try) ? effect size ?
- 65, approx. same as INdependent means → 64 (*2=128) but also estimate the correlation
- \(\Delta\) = dz = .3535 (~ d = .5)
what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
- effect size * 2 → sample size from 34 to 10 (almost / 4)
what would the power be with the reference sample size, correlation .5 ? what is the ncp ?
- post - hoc power, 64 * 2 measurements, with .5 correlation
- power > .976, ncp > 4,

36 Non-parametric distribution

expect non-normally distributed residuals, not possible to avoid (eg., transformations)
only considers ranks or uses permutations → price is efficiency and flexibility
requires parent distribution (alternative hypothesis), ‘min ARE’ should be default
GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
- use reference example, with normal parent distribution, how much efficiency is lost ?
- for a parent distribution ‘min ARE’, how much efficiency is lost ?

Solution for non-parametric distribution

GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
use reference example, with normal parent distribution, how much efficiency is lost ?
- requires a few more observations (3 more per group)
- less than 5 % loss (~134/128)
for a parent distribution ‘min ARE’, how much efficiency is lost ?
- requires several more observations
- more than 15 % loss (~148/128)
- min ARE is safest choice without extra information, least efficient

37 A relations perspective, regression analysis

differences between groups → relation observations & grouping (categorization)
example → d = .5 → r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\))
- .243*sqrt(\(4^2+1^2\))/sqrt(\(.5^2\)) = 2
GPower: regression coefficient (t-test / regression, one group size of slope)
- determine slope \(\beta\) and \(\sigma_y\) for reference values, d=.5 (hint:d~r), SD = 4 and \(\sigma_x\) = .5 (1/0)
- calculate sample size
- what happens with slope and sample size if predictor values are taken as 1/-1 ?
- determine \(\sigma_y\) for slope 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?

Solution on a relations perspective

GPower: regression coefficient (t-test / regression, one group size of slope)
determine slope \(\beta\) and \(\sigma_y\) for reference values, d=.5, SD = 4 and \(\sigma_x\) = .5 (1/0)
- \(\sigma_x\) = \(\sqrt{.25}\) = .5 (binary, 2 groups: 0 and 1) → slope = 2, \(\sigma_y\) = 4.12 = \(\sqrt{4^2+1^2}\)
calculate sample size
- 128, same as for reference example, now with effect size slope H1 given 1/0 predictor values
what happens with slope and sample size if predictor values are taken as 1/-1 ?
- \(\beta\) is 1, a difference of 2 over 2 units instead of 1
- no difference in sample size, compensated by variance of design
determine \(\sigma_y\) for slope 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?
- \(\sigma_y\) = 5 = \(\sqrt{4^2+3^2}\) (assuming balanced data)
- bigger effect → smaller sample size, only 17

38 A variance ratio perspective, ANOVA

difference between groups or relation → ratio between and within group variance
GPower: regression coefficient (t-test / regression, fixed model single regression coef)
- use reference example, regression style (sd of effect and error, but squared)
- calculate sample size, compare effect sizes ?
- what if also other predictors in the model ?
- what if 3 predictors extra reduce residual variance to 50% ?
note:
- partial \(R^2\) = variance predictor / total variance
- \(f^2\) = variance predictor / residual variance = \({R^2/{(1-R^2)}}\)

Solution on a variance ratio perspective

GPower: regression coefficient (t-test / regression, fixed model single regression coef)
- use reference example, regression style (sd of effect and error, but squared)
calculate sample size, compare effect sizes ?
- 128, same as for reference example, now with \(f^2\) = \(.25^2\) = .0625 (d=.5,r=.243)
what if also other predictors in the model ?
- very little impact → loss of degree of freedom
- ignore that predictors explain variance → reduce residual variance
what if 3 predictors extra reduce residual variance to 50% ?
- less noise → bigger effect size
- sample size much less (65)

39 A variance ratio perspective on multiple groups

multiple groups → not one effect size d
F-test statistic & effect size f, ratio of variances \(\sigma_{between}^2 / \sigma_{within}^2\)
difference between multiple groups summarized in variance \(\sigma_{between}^2\)
example: one control and two treatments
- reference example + 1 group
- sd within each group, for all groups (C,T1,T2) = 4
- means C=0, T1=2 and for example T2=4

40 Multiple Groups: Omnibus

difference between some groups → at least two differ
GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
- effect size f, with numerator/denominator df
- obtain sample size for reference example, just 2 groups C and T1 (size=64)!
- play with sizes, how does size matter ?
- include third group, with mean 2, what are sample sizes (compare with 2 groups)?
- set third group mean to 0, how does it compare with mean 2 (think and try)?
- set third group mean to 4, but also vary middle group (eg., 1 or 3), does that have an effect ?
- change procedure: repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?

Solution for multiple groups omnibus

GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
obtain sample size for reference example, just 2 groups C and T1 (size=64)!
- 128, same again, despite different effect size (f) and distribution
- size used only to include imbalance
include third group, with mean 2, what are sample sizes (compare with 2 groups)?
- effect sizes f = .236; sample size 177 (59*3), requires more observations
set third group mean to 0, how does it compare with mean 2 (think and try)?
- effect and sample size same, no difference whether big 0 group or big 2 group.
set third group mean to 4, but also vary middle group (eg., 1 or 3), does that have an effect ?
- effect sizes f = .408 (4), .425 (1/3), increase with middle group away from middle.
change procedure: repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?
- sample size 21*3=63, for f = .408 (1/7th explained = 1 between / 6 within)

41 Multiple Groups: Pairwise

assume one control, and two treatments
- interested in all three pairwise comparisons → maybe Tukey
  - typically run aposteriori, after omnibus shows effect
- use multiple t-tests with corrected \(\alpha\) for multiple testing
  GPower: t-tests/means difference two independent groups
apply Bonferroni correction for original 3 group example (0, 2, 4)
- what samples sizes are necessary for all three pairwise tests ?
- what if biggest difference ignored (C-T2), because know that easier to detect ?
- with original 64 sized groups, what is the power (both situations above) ?

Solution for multiple groups pairwise

GPower: t-tests/means difference two independent groups
apply Bonferroni correction for original 3 group example (0, 2, 4)
what samples sizes are necessary for all three pairwise tests ?
- 0-2 and 2-4 → d=.5, 0-4 → d=1
- divide \(\alpha\) by 3 → .05/3=.0167
- sample size 86 * 2 for 0-2 and 2-4, 23 * 2 for 0-4 → 86 * 3 = 258
what if biggest difference ignored (C-T2), because know that easier to detect ?
- divide \(\alpha\) by 2 → .05/2=.025
- sample size 78 * 2 for 0-2 and 2-4 → 78 * 3 = 234 (24 less)
with original 64 sized groups, what is the power (both situations above) ?
- .6562 for 3 tests (\(\alpha\)=.0167)
- .7118 for 2 tests (\(\alpha\)=.0250)
- post-hoc test → power-loss (lower \(\alpha\) → higher \(\beta\))

42 Multiple Groups: Contrasts

contrasts are linear combinations → planned comparison
- eg., 1 * T1 -1 * C \(\neq\) 0 & 1 * T2 -1 * C \(\neq\) 0
- eg., .5 * (1 * T1 + 1 * T2) -1 * C \(\neq\) 0
effect sizes for planned comparisons must be calculated !!
- variance ratios
- standard deviation of contrasts → between variance
- compare between variance for contrast with within variance
each contrast
- requires 1 degree of freedom
- combines a specific number of levels
multiple testing correction may be required

group means \(\mu_i\)
pre-specified coefficients \(c_i\)
sample sizes \(n_i\)
total sample size \(N\)

\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)

43 Multiple Groups: Contrasts (continued)

GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
obtain effect sizes for contrasts (assume equally sized for convenience)
- \(\sigma_{contrast}\) T1-C: \(\frac{(-1*0 + 1*2 + 0*4)}{\sqrt(2*((-1)^2+1^2+0^2))} = 1\); \(\sigma_{error}\) = 4 → \(f\)=.25
- \(\sigma_{contrast}\) T2-C:\(~= 2\); \(\sigma_{error}\) = 4 → \(f\)=.5
- \(\sigma_{contrast}\) (T1+T2)/2-C:\(~= 1.4142\); \(\sigma_{error}\) = 4 → \(f\)=.3535
sample size for each contrast, each 1 df
- what samples sizes for either contrast 1 or contrast 2 ?
- what samples sizes for both contrast 1 and contrast 2 combined ?
- if taking that sample size, what will be the power for T1-T2 ?
- what samples size for contrast 3 ?

Solution for multiple groups contrasts

GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
what samples sizes for either contrast 1 or contrast 2 ?
- variance explained \(1^2\) or \(2^2\)
- for T1-C \(f\) = \(\sqrt{1^2/4^2}\) = .25 = d/2 → 128 (64 C - 64 T1)
- for T2-C \(f\) = \(\sqrt{2^2/4^2}\) = .50 = d/2 → 34 (17 C - 17 T2)
what samples sizes for both contrast 1 and contrast 2 combined ?
- multiple testing, consider Bonferroni correction → /2
- for T1-C 155, for T2-C 41 → total 175 (78 C, 77 T1, 20 T2)
if taking that sample size, what will be the power for T1-T2 ?
- post-hoc, 77 and 20, with d=.5 and \(\alpha\) = .5 → power \(\approx\) .5
what samples size for contrast 3 ?
- variance contrast \(1.4142^2\)
- 3 groups, little impact if any
- for .5*(T1+T2) - C \(f\) = \(\sqrt{2/16}\) = .3535 → 65 (22 C, 21 T1, 22 T2)

44 Multiple Factors

multiple main effects and possibly interaction effects (eg., treatment and type)
- main effects (average effects, additive) & interaction (factor level specific effects)
- note: numerator degrees of freedom → main effect (nr-1), interaction (nr1-1)*(nr2-1)
- \(\eta^2\) = \(f^2 / (1+f^2)\), remember \(f = d/2\) for two groups
- note: get effect sizes for two way anova: http://apps.icds.be/effectSizes/
GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
- determine \(\eta^2\) and sample size for reference example, remember the between group variance ?
- use the app: use for means only values 0 and 2, and 4 and 6 if necessary
  - for treatment use C-T1-T2, for type (second predictor) use B1-B2
  - get \(\eta^2\) for treatment effect but no type effect ? recognize \(f\) ?
  - specify such that types differ, not treatment → \(f\) and sample size ?
  - specify such that treatment effect only for one type → \(f\) and sample size ?
  - specify effect for both treatment and type, without interaction → \(f\) and sample size ?

Solution for multiple factors

GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
determine sample size for reference example, remember the between group variance ?
- between group variance 1, within 16, sample size 128 (numerator df = 2-1)
- 2 x 2 with 0-2 → \(\eta^2\) as expected = .0588
get \(\eta^2\) for treatment effect but no type effect ? recognize \(f\) ?
- 0-2-4 for both types → \(f\) = .4082 of the omnibus F-test (compare all groups)
specify such that types differ, not treatment → \(f\) and sample size ?
- 0-0-0 versus 2-2-2 → \(f\) = .25 of t-test (compare two groups)
specify such that treatment effect only for one type → \(f\) and sample size ?
- 0-2-4 versus 0-0-0 → \(f\) = .2041, .25 and .2041
  - detect interaction (num df = 2) = 235 total (40 per combination)
  - detect only treatment effect (num df = 2) = 235 total (79 each group, 79/2 per combination)
  - detect only type effect (num df = 1) = 128 total (64 each group, 64/3 per combination)
  - detect both both main effects = 40 each combination ~ max(79/2,64/3)
specify effect for both treatment and type, without interaction → \(f\) and sample size ?
- 0-2-4 versus 2-4-6 → \(f\) = .4082, .25 and 0, sample size = 21 per combination

45 Repeated Measures

if repeated measures → account for correlations within
possible to focus on:
- within: similar to dependent t-test for multiple measurements
- between: group comparison, each based on multiple measurements
- interaction: difference between changes over measurements (within)
correlation within unit (eg., within subject)
- informative within unit (like paired t-test)
- redundancy on information between units (observations less informative)
beware: effect size could include or exclude correlation
GPower: repeated measures (F-test / Means, repeated measures…)
- correlation not yet included → Options: ‘as in GPower 3.0’
- correlation already included → Options: ‘as in SPSS’
- suggested youtube: https://www.youtube.com/watch?v=CEQUNYg80Y0

46 Repeated Measures Within

GPower: repeated measures (F-test / Means, repeated measures within factors)
use effect size f = .25 (1/16 explained versus unexplained)
- mimic dependent t-test, correlation .5 !
- mimic independent t-test, but only use 1 group !
- double number of groups to 2, or 4 (cor = .5), what changes ?
- double number of measurements to 4 (cor = .5), impact ?
- compare impact double number of measurements for correlations .5 with .25 ?

Solution for repeated measures within

GPower: repeated measures (F-test / Means, repeated measures within factors)
mimic dependent t-test, correlation .5 !
- only 1 group, 2 repeated measures, correlation .5 → 34 x 2 measurements
mimic independent t-test, but only use 1 group !
- only 1 group, 2 repeated measures, correlation 0 → 65 x 2 measurements
double number of groups to 2, or 4 (cor = .5), what changes ?
- number of groups not relevant for within group comparison
- but requires estimation, changed degrees of freedom
double number of measurements to 4 (cor = .5), impact ?
- sample size reduces from 34 to 24, but 34x2=68, 24*4=96
with 4 measurements (double) take halve the correlation (0.25), impact ?
- sample size 35, nearly 34
- 2 repeated measurements with corr .5, about same sample size as 4 repeats with corr .25

47 Repeated Measures Between

GPower: repeated measures (F-test / Means, repeated measures between factors)
use effect size f = .25 (1/16 explained versus unexplained)
- compare 2 groups, each 2 measurements… impact on sample size when correlation 0, .25 and .5 ?
- double number of groups to 2, or 4 (cor = .5), what changes ?
- double number of measurements to 4 (cor = .5), impact ?
- compare impact number of measurements for different correlations .5 with .25 ?
- mimic independent t-test ?

Solution for repeated measures between

GPower: repeated measures (F-test / Means, repeated measures between factors)
use effect size f = .25 (1/16 explained versus unexplained)
compare 2 groups, each 2 measurements… impact on sample size when correlation 0, .25 and .5 ?
- increase in correlations results in increase in sample size (redundancy)
double number of groups to 2, or 4 (cor = .5), what changes ?
- increase in number of groups, small increase (estimation required) IF same effect size \(f\)
double number of measurements to 4 (cor = .5), impact ?
- increase in number of measurements, increases total number, but reduces number of units
compare impact number of measurements for different correlations .5 with .25 ?
- increase stronger if correlations stronger
mimic independent t-test ?
- 128 units, if .99 correlation with fully redundant second set
- 132 (66/2 * 2), if 0 correlation with need to estimate four group averages and correlation

48 Repeated Measures Interaction Within x Between

GPower: repeated measures (F-test / Means, repeated measures within-between factors)
option: calculate effect sizes: http://apps.icds.be/effectSizes/
- for sd = 4, with group with average 0-2-4, and with non-responsive (all 0):
- compare effect sizes for interaction with correlation .5 and 0, conclude ?
- compare sample sizes for those 2 effect sizes with correlation .5 or 0 ?

Solution for repeated measures interaction within x between

GPower: repeated measures (F-test / Means, repeated measures within-between factors)
option: calculate effect sizes: http://apps.icds.be/effectSizes/
for sd = 4, with group with average 0-2-4, and with non-responsive (all 0):
compare effect sizes for interaction with correlation .5 and 0, conclude ?
- with 0 correlation → \(f\) for interaction = .25
- with .5 correlation → \(f\) = .3536
compare sample sizes for those 2 effect sizes with correlation .5 or 0 ?
- for \(f\) = .25, sample sizes are 54x2 (cor=0) and 28x2 (cor=.5)
- for \(f\) = .3535, sample sizes are 28x2 (cor=0) and 16x2 (cor=.5)
- either include .5 correlation to calculate effect size OR sample size

49 Correlations

if comparing two independent correlations
use Fisher Z transformations to normalize first
- z = .5 * log(\(\frac{1+r}{1-r}\)) → q = z1-z2
GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s
- with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
- with the same difference, but stronger correlations, eg., .9844 and .7, what changes ?
- with the same difference, but weaker correlations, eg., .1 and .3844, what changes ?
note that dependent correlations are more difficult, see manual

Solution for correlations

GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s
with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
- effect size q = 0.5074, sample size 64*2 = 128
- \(.5*log((1+.7844)/(1-.7844)) - .5*log((1+.5)/(1-.5))\)
- notice: effect size q \(\approx\) d, same sample size
with the same difference, but stronger correlations, eg., .9844 and .7, what changes ?
- effect size q = 1.5556, sample size 10*2 = 20
- same difference but bigger effect (higher correlations more easy to differentiate)
with the same difference, but weaker correlations, eg., .1 and .3844, what changes ?
- effect size q = 0.3048, sample size 172*2 = 344
- same difference, negative, and smaller effect (lower correlations more difficult to differentiate)

50 Proportions

if comparing two independent proportions → bounded between 0 and 1
GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
effect sizes in odds ratio, relative risk, difference proportion
- for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 ?
- what is the sample size to detect a difference for both situations ?
- for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
- for odds ratio 1/3 and p2 = .25, determine p1 and sample size, how does it compare with before ?
- compare sample size for a .15 difference, at p1=.5 ?

Solution for proportions

GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 ?
- odds ratio 3 → with p2 = .5 or odds_2 = 1, odds_1 = 3 thus p1 = 3/(3+1) = .75
what is the sample size to detect a difference for both situations ?
- 128, same for .5 versus .25 or .75 (unlike correlation)
for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
- p1 to .9, difference of .15, sample size increases to 220
for odds ratio 1/3 and p2 = .25, determine p1 and sample size, how does it compare with before ?
- p1 to .1, difference of .15, sample size increases to 220
compare sample size for a .15 difference, at p1=.5 ?
- sample size even higher, to 366, increase not because smaller difference

51 Exercise proportions

GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

for odds ratio = 2, with p2 reference probability .6
plot power over proportions .5 to 1
include 5 curves, sample sizes 328, 428, 528…
with type I error .05
explain curve minimum, relation sample size ?
repeat for one-tailed, difference ?

Solution for proportions

GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)

for odds ratio = 2, with p2 reference probability .6

plot power over proportions .5 to 1

include 5 curves, sample sizes 328, 428, 528…

with type I error .05

explain curve minimum, relation sample size ?

power for proportion compared to reference .6
minimum is type I error probability
sample size determines impact

repeat for one-tailed, difference ?

one-tailed, increases power (both sides !?)

52 Dependent Proportions

if comparing two dependent proportions → categorical shift
- if only two categories, McNemar test: compare \(p_{12}\) with \(p_{21}\)
- information from changes only → discordant pairs
- effect size as odds ratio → ratio of discordance
- like other exact tests, choice assignment alpha
GPower: McNemar test (exact / proportions, difference 2 dependent proportions)
- assume odds ratio equal to 2, equal sized, type I and II errors .05 and .2, two-way !
- what is the sample size for .25 proportion discordant, .5, and 1 ?
- odds ratio 1 versus .5, (prop discordant = .25), what are \(p_12\) and \(p_21\) and sample sizes ?
- repeat for third alpha option, and consider total sample size, what happens ?

Solution for dependent proportions

GPower: McNemar test (exact / proportions, difference 2 dependent proportions)
assume odds ratio equal to 2, equal sized, type I and II errors .05 and .2, two-way !
what is the sample size for .25 proportion discordant, .5, and 1 ?
- 288 (.25), 144 (.5), 73~144/2 (.99) → decrease in sample size with increased discordance
odds ratio .5 or 4, (prop discordant = .25), what are \(p_{12}\) and \(p_{21}\) and sample sizes ?
- same as 2 but reverse \(p_{12}\) and \(p_{21}\), with sample size 288
- with 4 as odds ratio, larger effect, requires smaller sample size, only 80
- odds ratio = \(p_{12}\) / \(p_{21}\)
repeat for third alpha option, with odds ratio 4, what happens ?
- changed lower / upper critical N, lower sample size
- BUT, is because lower power, closer to requested .8

53 Not Included

various statistical tests difficult to specify in GPower
- various statistics / parametervalues that are difficult to guestimate
- manual for more complex tests not always very elaborate
various statistical tests not included in GPower
- eg., survival analysis
- many tools online, most dedicated to a particular model
various statistical tests no formula to offer sample size
- simulation may be the only tool
  - iterate many times: generate and analyze → proportion of rejections
  - generate: simulated outcome ← model and uncertainties
  - analyze: simulated outcome → model and parameter estimates + statistics

54 Simulation Example t-test

gr <- rep(c('T','C'),64)
y <- ifelse(gr=='C',0,2)
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta))
 
my_sim_function <- function(){
    dta$y <- dta$y+rnorm(length(dta$X),0,4)     # generate (with sd=4)
    res <- t.test(data=dta,y~X)                 # analyze
    c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function())      # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')

mean(sims['p.val',] < .05)  # p-values  0.8029
mean(sims['t.stat',] < cutoff)  # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024

55 Focus / Simplify

complex statistical models
- simulate BUT it requires programming and a thorough understanding of the model
- alternative: focus on essential elements → simplify the aim
sample size calculations (design) for simpler research aim
- not necessarily equivalent to final statistical testing / estimation
- requires justification to convince yourself and/or reviewers
  - successful already if simple aim is satisfied
  - ignored part is not too costly
example:
- statistics: group difference evolution 4 repeated measurements → mixed model
- focus: difference treatment and control last time point is essential → t-test
- argument: first 3 measurements low cost, interesting to see change

56 Conclusion

sample size calculation is a design issue, not a statistical one
building blocks: sample & effect sizes, type I & II errors
- establish any of these building blocks, conditional on the rest
effect sizes express the amount of signal compared to the background noise
GPower deals with not too complex models
- more complex complex models imply more complex specification
- simplify using a focus, if justifiable → then GPower can get you a long way

Methodological and statistical support to help make a difference

ICDS provides complementary support in methodology and statistics to our research community, for both individual researchers and research groups, in order to get the best out of them
ICDS aims to address all questions related to quantitative research, and to further enhance the quality of both the research and how it is communicated

website: https://www.icds.be/ includes information on who we serve, and how

booking: https://www.icds.be/consulting/ for individual consultations

Sample Size Calculation with GPower

Wilfried Cools & Sven Van Laere

01 Sample Size Calculation

02 Sample Size Calculation: demarcation

03 Sample Size Calculation: a difficult design issue

04 Simple Example

05 Reference Example

06 A formula you could use

07 GPower: the building blocks in action

08 GPower: a useful tool

09 GPower input

10 GPower output

11 Protocol: reference example

12 Building Blocks

13 GPower Statistical Tests

14 Central Ho and Non-Central Ha Distributions

15 Note: Divide by N Perspective as alternative

16 Note: Ho and Ha, asymmetry in statistical testing

17 Type I/II Error Probability

18 Exercise on Errors, create plot

19 Exercise on Errors, interpret plot

20 Decide Type I/II Error Probability

21 Control Type I Error

22 for fun: P(effect exists | test says so)

23 Effect Sizes, in principle

24 Effect Sizes, in literature

25 Effect Sizes, in GPower (Determine)

26 Exercise on Effect Sizes, ingredients Cohen’s d

27 Exercise on Effect Sizes, plot

28 Exercise on Effect Size, imbalance

29 Effect Sizes, how to determine them in theory

30 Effect Sizes, how to determine them in practice

31 Relation Sample & Effect Size, type I & II Errors

32 Exercise on Type of Power Analysis

Solution for Type of Power Analysis

33 getting your hands dirty

34 GPower, beyond the independent t-test

35 Dependence between groups

Solution for dependence between groups

36 Non-parametric distribution

Solution for non-parametric distribution

37 A relations perspective, regression analysis

Solution on a relations perspective

38 A variance ratio perspective, ANOVA

Solution on a variance ratio perspective

39 A variance ratio perspective on multiple groups

40 Multiple Groups: Omnibus

Solution for multiple groups omnibus

41 Multiple Groups: Pairwise

Solution for multiple groups pairwise

42 Multiple Groups: Contrasts

43 Multiple Groups: Contrasts (continued)

Solution for multiple groups contrasts

44 Multiple Factors

Solution for multiple factors

45 Repeated Measures

46 Repeated Measures Within

Solution for repeated measures within

47 Repeated Measures Between

Solution for repeated measures between

48 Repeated Measures Interaction Within x Between

Solution for repeated measures interaction within x between

49 Correlations

Solution for correlations

50 Proportions

Solution for proportions

51 Exercise proportions

Solution for proportions

52 Dependent Proportions

Solution for dependent proportions

53 Not Included

54 Simulation Example t-test

55 Focus / Simplify

56 Conclusion