Workshop offered by the Interfaculty Center Data processing and Statistics (icds.be).
This draft (Apr 04, 2021, 55 pages) introduces researchers to the key ideas in sample size calculation needed to design their study. Our target audience is primarily the research community at VUB / UZ Brussel.
We invite you to help us improve this document by sending feedback to wilfried.cools@vub.be or anonymously via icds.be/consulting (right side, bottom).
01 Sample Size Calculation
- our program
- part I: understand the reasoning
- introduce building blocks
- implement on t-test
- part II: explore more complex situations
- beyond the t-test
- simple but common
- not one formula for all → GPower to the rescue
02 Sample Size Calculation: demarcation
- how many observations will be sufficient ?
- avoid too many, because typically observations imply a cost
- money / time → limited resources
- risk / harm → ethical constraints
- depends on the aim of the study
- research aim → statistical inference
- linked to statistical inference (using standard error)
- testing → power [probability to detect effect]
- estimation → accuracy [size of confidence interval]
03 Sample Size Calculation: a difficult design issue
- before data collection, during design of study
- requires understanding: future data, analysis, inference (effect size, focus, …)
- conditional on assumptions & decisions
- not always possible nor meaningful !
- easier for experiments (control), less for observational studies
- easier for confirmatory studies, much less for exploratory studies
- not possible for predictive models, because no standard error
- NO retrospective power analyses → OK for future study only
Hoenig, J., & Heisey, D. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.
- alternative justifications:
- common practice, feasibility → non-statistical (importance, low cost, …)
04 Simple Example
- experimental - confirmatory
- evaluation of radiotherapy to reduce a tumor in mice
- comparing treatment group with control (=conditions)
- tumor induced, random assignment treatment or control (equal if no effect)
- after 20 days, measurement of tumor size (=observations)
- intended analysis: unpaired t-test to compare averages for treatment and control
- SAMPLE SIZE CALCULATION:
- IF average tumor size for treatment at least 20% less than control (4 vs. 5mm)
- THEN how many observations, sufficient to detect that difference (significance) ?
05 Reference Example
- sample sizes easy and meaningful to calculate for well understood problems
- a priori specifications
- intend to perform a statistical test
- comparing 2 equally sized groups
- to detect difference of at least 2
- assuming a standard deviation of 4 around each mean
- which results in an effect size of .5
- evaluated on a Student t-distribution
- allowing for a type I error prob. of .05 \((\alpha)\)
- allowing for a type II error prob. of .2 \((\beta)\)
- sample size conditional on specifications being true
https://apps.icds.be/shinyt/
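The reference example can be cross-checked outside GPower. The sketch below (Python with scipy, illustrative only; `n_per_group` is our own helper, not a GPower function) searches for the smallest per-group sample size meeting the specifications above:

```python
from scipy.stats import nct, t

def n_per_group(d, alpha=0.05, power=0.80):
    """Smallest per-group n for a two-sided, two-sample t-test (illustrative)."""
    n = 2
    while True:
        df = 2 * n - 2
        ncp = d * (n / 2) ** 0.5            # non-centrality parameter
        tcrit = t.ppf(1 - alpha / 2, df)    # cut-off on Ho
        achieved = 1 - nct.cdf(tcrit, df, ncp) + nct.cdf(-tcrit, df, ncp)
        if achieved >= power:
            return n
        n += 1

print(n_per_group(0.5))   # 64 per group, 128 in total
```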
07 GPower: the building blocks in action
- SIZES: effect size, sample size
- ERRORS:
- Type I (\(\alpha\)) defined on distribution Ho
- Type II (\(\beta\)) evaluated on distribution Ha
- calculate sample size based on effect size, and type I / II error
10 GPower output
- sample size (\(n\)) = 64 x 2 = 128
- degrees of freedom (\(df\)) = 126 (128 - 2)
- critical t = 1.979
- decision boundary given \(\alpha\) and \(df\): qt(.975,126)
- non-centrality parameter (\(\delta\)) = 2.8284
- shift of Ha (true) away from Ho (null): 2/(4*sqrt(2))*sqrt(64)
- distributions: central Ho and non-central Ha
- power ≥ .80 (1-\(\beta\)) = 0.8015
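Each of these output values follows directly from the specifications; a quick cross-check (Python with scipy, illustrative, outside the GPower workflow):

```python
from scipy.stats import nct, t

# reference example: means 0 vs 2, sd 4, 64 per group, alpha .05 two-sided
n, alpha = 64, 0.05
df = 2 * n - 2                           # 126
ncp = 2 / (4 * 2 ** 0.5) * n ** 0.5      # 2.8284, shift of Ha away from Ho
tcrit = t.ppf(1 - alpha / 2, df)         # 1.9790, qt(.975,126)
power = 1 - nct.cdf(tcrit, df, ncp) + nct.cdf(-tcrit, df, ncp)   # 0.8015
print(df, round(ncp, 4), round(tcrit, 4), round(power, 4))
```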
11 Protocol: reference example
- Protocol: summary for future reference or communication
- File/Edit save or print file (copy-paste)
t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:  Tail(s) = Two
        Effect size d = 0.5000000
        α err prob = 0.05
        Power (1-β err prob) = .8
        Allocation ratio N2/N1 = 1
Output: Noncentrality parameter δ = 2.8284271
        Critical t = 1.9789706
        Df = 126
        Sample size group 1 = 64
        Sample size group 2 = 64
        Total sample size = 128
        Actual power = 0.8014596
12 Building Blocks
- distributions: Ho & Ha ~ test dependent shape
- sizes: sample size & effect size ~ shift between Ho & Ha
- errors: type I error & type II error ~ cut-off at Ho & Ha
13 GPower Statistical Tests
14 Central Ho and Non-Central Ha Distributions
- Ho acts as \(\color{red}{benchmark}\) → eg., no difference
- set \(\color{green}{cut off}\) on Ho ~ t(ncp=0,df) using \(\alpha\)
- reject Ho if test returns implausible value
- Ha acts as \(\color{blue}{truth}\) → eg., difference of .5 SD
- Ha ~ t(ncp!=0,df)
- ncp as violation of Ho → shift (location/shape)
- ncp: non-centrality parameter combines assumed effect size (target or signal), conditional on sample size (information)
- ncp: determines overlap → power ↔︎ sample size
15 Note: Divide by N Perspective as alternative
- divide by n: sample size ~ standard deviation
- non-centrality parameter: sample size ~ location
\(n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2}\)
\(n = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2}\)
\(n = 62.79\)
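Evaluating the normal-approximation formula above numerically (Python, illustrative):

```python
from scipy.stats import norm

d, sigma, alpha, beta = 2.0, 4.0, 0.05, 0.20
z_a = norm.ppf(1 - alpha / 2)   # 1.96
z_b = norm.ppf(1 - beta)        # 0.84
n = (z_a + z_b) ** 2 * 2 * sigma ** 2 / d ** 2
print(round(n, 2))              # 62.79 per group (z-approximation; the t-test gives 64)
```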
16 Note: Ho and Ha, asymmetry in statistical testing
- Ha is NOT interchangeable with Ho
- cut-off at Ho using \(\alpha\)
- in statistics → observe test statistic (Ha unknown)
- in sample size calculation → assume Ha
- in statistics → if fail to reject then remain in doubt
- absence of evidence \(\neq\) evidence of absence
- p-value → P(statistic|Ho) \(\neq\) P(Ho|statistic)
- example: evidence for insignificant \(\eta\) same as for \(\eta\) * 2
- equivalence testing → Ha for ‘no effect’
- reject Ho that effect is smaller than 0 - |\(\delta\)| AND reject Ho that it is bigger than 0 + |\(\delta\)|
- acts as two one-sided tests combined (TOST)
17 Type I/II Error Probability
- inference test based on cut-offs (density → AUC=1)
- type I error: incorrectly reject Ho (false positive)
- cut-off at Ho, error prob. \(\alpha\) controlled
- one/two tailed → one/both sides informative ?
- type II error: incorrectly fail to reject Ho (false negative)
- cut-off at Ho, error prob. \(\beta\) depends on Ha
- Ha assumed known in a power analysis
- power = 1 - \(\beta\) = probability correct rejection (true positive)
- inference versus truth
- infer: effect exists vs. unsure
- truth: effect exists vs. does not
|          | infer=Ha     | infer=Ho     | sum |
|----------|--------------|--------------|-----|
| truth=Ho | \(\alpha\)   | 1-\(\alpha\) | 1   |
| truth=Ha | 1-\(\beta\)  | \(\beta\)    | 1   |
18 Exercise on Errors, create plot
~ reference example
- create plot
(X-Y plot for range of values)
- plot sample size by type I error
- set plot to 4 curves
- for power .8 in steps of .05
- set \(\alpha\) on x-axis
- from .01 to .2 in steps of .01
- use effect size .5
- notice Table option
19 Exercise on Errors, interpret plot
- where on the red curve (right) is type II error = 4 * type I error ?
- when smaller effect size (.25), what changes ?
- switch power and sample size (32 in steps of 32), what is the relation between type I and II error ?
- what would be the difference between curves for \(\alpha\) = 0 ?
20 Decide Type I/II Error Probability
- popular choices
- \(\alpha\) often in range .01 - .05 → 1/100 - 1/20
- \(\beta\) often in range .2 to .1 → power = 80% to 90%
- \(\alpha\) & \(\beta\) inversely related
- \(\alpha\) & \(\beta\) often selected in 1/4 ratio → type I error considered 4 times worse !!
- which error do you want to avoid most ?
- cheap aids test ? → avoid type II
- heavy cancer treatment ? → avoid type I
- probability for errors always exists
21 Control Type I Error
- multiple testing
- inflates type I error \(\alpha\)
- family of tests: \(1-(1-\alpha)^k\) → correct, eg., Bonferroni (\(\alpha/k\))
- interim analysis (analyze and proceed) → correct, eg., alpha spending
- interim analysis
- plan in advance
- O’Brien-Fleming bounds, more efficient than Bonferroni
- NOT GPower
- determine boundaries with PASS, R (ldbounds), …
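The inflation and its Bonferroni correction are simple arithmetic (Python, illustrative; k = 5 tests is an assumed example, not from the slides):

```python
alpha, k = 0.05, 5                    # k = 5 tests, an assumed example
fwer = 1 - (1 - alpha) ** k           # family-wise type I error, inflated to ~.23
bonf = alpha / k                      # Bonferroni-corrected per-test alpha
fwer_bonf = 1 - (1 - bonf) ** k       # back below the nominal .05
print(round(fwer, 4), round(bonf, 4), round(fwer_bonf, 4))
```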
22 for fun: P(effect exists | test says so)
- power → P(test says there is effect | effect exists)
- \(P(infer=Ha|truth=Ho) = \alpha\)
- \(P(infer=Ho|truth=Ha) = \beta\)
- \(P(infer=Ha|truth=Ha) = power\)
- \(P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}\) → Bayes Theorem
- \(P(truth=Ha|infer=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}\)
- \(P(truth=Ha|infer=Ha) = \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}\) → depends on prior probabilities
- IF very low probability model is true, eg., .01 ? → \(P(truth=Ha) = .01\)
- THEN probability effect exists if test says so is low, in this case only .14 !!
- \(P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14\)
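The .14 follows directly from Bayes’ theorem (Python, illustrative):

```python
power, alpha, p_ha = 0.8, 0.05, 0.01   # prior P(truth=Ha) = .01
ppv = (power * p_ha) / (power * p_ha + alpha * (1 - p_ha))
print(round(ppv, 2))                   # 0.14
```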
23 Effect Sizes, in principle
- estimate/guestimate of minimal magnitude of interest
- typically standardized: signal to noise ratio (noise provides scale)
- eg., difference on scale of pooled standard deviation
- eg., effect size \(d\)=.5 means .5 standard deviations
- part of non-centrality (as is sample size) → pushing away Ha
- ~ practical significance (as opposed to statistical significance ~ sample size)
- 2 main families of effect sizes (test specific): d-family (differences) and r-family (associations)
- transform one into other, eg., d = .5 → r = .243
\(\hspace{20 mm}d = \frac{2r}{\sqrt{1-r^2}}\) \(\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}\) \(\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}\)
- NOT p-value ~ partly effect size, but also partly sample size
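The d ↔︎ r conversion formulas above, as a small sketch (Python; the names `d_to_r` / `r_to_d` are ours, for illustration):

```python
import math

def d_to_r(d):                       # r-family from d-family
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r):                       # d-family from r-family
    return 2 * r / math.sqrt(1 - r ** 2)

print(round(d_to_r(0.5), 3))         # 0.243, as in the slide
```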
24 Effect Sizes, in literature
- Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed).
- famous Cohen conventions but beware, just rules of thumb
- more than 70 different effect sizes… most of them related
- Ellis, P. D. (2010). The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.
25 Effect Sizes, in GPower (Determine)
- effect sizes are test specific
- t-test → group means and sd’s
- one-way anova → variance explained & error
- regression → again other parameters
- …
- GPower helps with Determine
- sliding window
- one or more effect size specifications
26 Exercise on Effect Sizes, ingredients Cohen’s d
For the reference example :
- change mean values from 0 and 2 to 4 and 6, what changes ?
- change sd values to 2 for each, what changes ?
- effect size ?
- total sample size ?
- critical t ?
- non-centrality ?
- change sd values to 8 for each, what changes ?
- change sd to 2 and 5.3, or 1 and 5.5, how does it compare to 4 and 4 ?
27 Exercise on Effect Sizes, plot
- plot powercurve: power by effect size
- compare 6 sample sizes: 34 in steps of 34
- for a range of effect sizes in between .2 and 1.2
- use \(\alpha\) equal to .05
- pinpoint the situations from previous section on the plot (sd=4 and 2).
- how does power change when doubling the effect size ?
- powercurve → X-Y plot for range of values
28 Exercise on Effect Size, imbalance
For the reference example :
- compare for allocation ratios 1, .5, 2, 10, 50
- repeat for effect size 1, and compare
- ? no idea why n1 \(\neq\) n2
- after calculating the plot, change the allocation ratio
29 Effect Sizes, how to determine them in theory
- choice of effect size matters → justify choice !!
- choice of effect size depends on aim of the study
- realistic (eg., previously observed effect) → replicate
- important (eg., minimally relevant effect)
- NOT significant → meaningless, dependent on sample size
- choice of effect size dependent on statistical test of interest
- for independent t-test → means and standard deviations
- possible alternative: variance explained, eg., 1 versus 16+1
30 Effect Sizes, how to determine them in practice
- experts / patients → use if possible → importance
- literature (earlier study / systematic review) → beware of publication bias → realistic
- pilot → guestimate dispersion estimate (not effect size → small sample)
- internal pilot → conditional power (sequential)
- guestimate uncertainty…
- sd from assumed range, assume normal and divide by 6
- sd for proportions at conservative .5
- sd from control, assume treatment the same
...
- turn to Cohen → use if everything else fails (rules of thumb)
- eg., .2 - .5 - .8 for Cohen’s d
31 Relation Sample & Effect Size, type I & II Errors
- building blocks:
- sample size (\(n\))
- effect size (\(\Delta\))
- alpha (\(\alpha\))
- power (\(1-\beta\))
- each parameter conditional on the others
- GPower → type of power analysis
- A priori: \(n\) ~ \(\alpha\), power, \(\Delta\)
- Post hoc: power ~ \(\alpha\), \(n\), \(\Delta\)
- Compromise: power, \(\alpha\) ~ \(\beta\:/\:\alpha\), \(\Delta\), \(n\)
- Criterion: \(\alpha\) ~ power, \(\Delta\), \(n\)
- Sensitivity: \(\Delta\) ~ \(\alpha\), power, \(n\)
32 Exercise on Type of Power Analysis
- retrieve power given n, \(\alpha\) and \(\Delta\) of reference case
- then, for power .8, take half the sample size, how does \(\Delta\) change ?
- then, set \(\beta\)/\(\alpha\) ratio to 4, what is \(\alpha\) & \(\beta\) ? what is the critical value ?
- then, keep \(\beta\)/\(\alpha\) ratio to 4 for effect size .7, what is \(\alpha\) & \(\beta\) ? critical value ?
Solution for Type of Power Analysis
- retrieve power given n, \(\alpha\) and \(\Delta\) of reference case
- then, for power .8, take half the sample size, how does \(\Delta\) change ?
- use sensitivity 32x2 (d=.7114)
- \(\Delta\) from .5 to .7115 = .2115
- bigger effect \(\Delta\) compensates loss of sample size n
- then, set \(\beta\)/\(\alpha\) ratio to 4, what is \(\alpha\) & \(\beta\) ? what is the critical value ?
- use compromise 32x2
- \(\alpha\) =.09 and \(\beta\) =.38, critical value 1.6994
- then, keep \(\beta\)/\(\alpha\) ratio to 4 for effect size .7
- use compromise 32x2
- \(\alpha\) =.05 and \(\beta\) =.2, critical value 1.9990
33 getting your hands dirty
% calculator (Matlab-style: tinv, nctcdf from the Statistics Toolbox)
m1=0;m2=2;s1=4;s2=4
alpha=.025;N=128                      % alpha = .05/2, two-sided
var=.5*s1^2+.5*s2^2                   % pooled variance
d=abs(m1-m2)/sqrt(2*var)*sqrt(N/2)    % non-centrality parameter
tc=tinv(1-alpha,N-2)                  % critical t, df = N-2 = 126
power=1-nctcdf(tc,N-2,d)
- in R
- qt → get quantile on Ho (\(t_{1-\alpha/2}\))
- pt → get probability on Ha (non-central)
.n <- 64
.df <- 2*.n - 2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
.power <- 1 - pt(qt(.975, df=.df), df=.df, ncp=.ncp) +
  pt(qt(.025, df=.df), df=.df, ncp=.ncp)
round(.power, 4)
## [1] 0.8015
34 GPower, beyond the independent t-test
- so far, comparing two independent means
- selected topics with small exercises
- dependent instead of independent
- non-parametric instead of assuming normality
- relations instead of groups (regression)
- correlations
- proportions, dependent and independent
- more than 2 groups (compare jointly, pairwise, focused)
- more than 1 predictor
- repeated measures
- GPower manual 27 tests: effect size, non-centrality parameter and example !!
35 Dependence between groups
- if 2 dependent groups (eg., before/after treatment) → account for correlation
- correlation typically obtained from pilot data, earlier research
- GPower: matched pairs (t-test / means, difference 2 dependent means)
- use reference example, and assume correlation .5 to compare with reference effect size, ncp, n
- how many observations if no correlation exists (think then try) ? effect size ?
- what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
- what would the power be with the reference sample size, n=128, but now cor=.5 ?
Solution for dependence between groups
- GPower: matched pairs (t-test / means, difference 2 dependent means)
- use reference example, and assume correlation .5 to compare with reference effect size, ncp, n
- \(\Delta\) looks same, n much smaller = 34
- different type of effect size: dz ~ d / \(\sqrt{2*(1-\rho)}\)
- also note: 34x2 measurements
- how many observations if no correlation exists (think then try) ? effect size ?
- 65, approx. same as INdependent means → 64 (*2=128) but also estimate the correlation
- \(\Delta\) = dz = .3535 (~ d = .5)
- what changes with correlation .875 (think: more or less n, higher or lower effect size) ?
- effect size * 2 → sample size from 34 to 10 (almost / 4)
- what would the power be with the reference sample size, correlation .5 ? what is the ncp ?
- post - hoc power, 64 * 2 measurements, with .5 correlation
- power \(\approx\) .976, ncp = 4
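The dz formula from the solution above can be tabulated for the three correlations in the exercise (Python; `dz` is an illustrative helper):

```python
import math

def dz(d, rho):
    """Matched-pairs effect size from between-group d: dz = d / sqrt(2*(1-rho))."""
    return d / math.sqrt(2 * (1 - rho))

# correlation .5 leaves dz = d; 0 shrinks it; .875 doubles it
print(dz(0.5, 0.5), round(dz(0.5, 0.0), 4), dz(0.5, 0.875))
```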
36 Non-parametric distribution
- expect non-normally distributed residuals, not possible to avoid (eg., transformations)
- only considers ranks or uses permutations → price is efficiency and flexibility
- requires parent distribution (alternative hypothesis), ‘min ARE’ should be default
- GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
- use reference example, with normal parent distribution, how much efficiency is lost ?
- for a parent distribution ‘min ARE’, how much efficiency is lost ?
Solution for non-parametric distribution
- GPower: two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
- use reference example, with normal parent distribution, how much efficiency is lost ?
- requires a few more observations (3 more per group)
- less than 5 % loss (~134/128)
- for a parent distribution ‘min ARE’, how much efficiency is lost ?
- requires several more observations
- more than 15 % loss (~148/128)
- min ARE is safest choice without extra information, least efficient
37 A relations perspective, regression analysis
- differences between groups → relation observations & grouping (categorization)
- example → d = .5 → r = .243 (note: slope \(\beta = {r*\sigma_y} / {\sigma_x}\))
- .243*sqrt(\(4^2+1^2\))/sqrt(\(.5^2\)) = 2
- GPower: regression coefficient (t-test / regression, one group size of slope)
- determine slope \(\beta\) and \(\sigma_y\) for reference values, d=.5 (hint:d~r), SD = 4 and \(\sigma_x\) = .5 (1/0)
- calculate sample size
- what happens with slope and sample size if predictor values are taken as 1/-1 ?
- determine \(\sigma_y\) for slope 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?
Solution on a relations perspective
- GPower: regression coefficient (t-test / regression, one group size of slope)
- determine slope \(\beta\) and \(\sigma_y\) for reference values, d=.5, SD = 4 and \(\sigma_x\) = .5 (1/0)
- \(\sigma_x\) = \(\sqrt{.25}\) = .5 (binary, 2 groups: 0 and 1) → slope = 2, \(\sigma_y\) = 4.12 = \(\sqrt{4^2+1^2}\)
- calculate sample size
- 128, same as for reference example, now with effect size slope H1 given 1/0 predictor values
- what happens with slope and sample size if predictor values are taken as 1/-1 ?
- \(\beta\) is 1, a difference of 2 over 2 units instead of 1
- no difference in sample size, compensated by variance of design
- determine \(\sigma_y\) for slope 6, \(\sigma_x\) = .5, and SD = 4, would it increase the sample size ?
- \(\sigma_y\) = 5 = \(\sqrt{4^2+3^2}\) (assuming balanced data)
- bigger effect → smaller sample size, only 17
38 A variance ratio perspective, ANOVA
- difference between groups or relation → ratio between and within group variance
- GPower: regression coefficient (t-test / regression, fixed model single regression coef)
- use reference example, regression style (sd of effect and error, but squared)
- calculate sample size, compare effect sizes ?
- what if also other predictors in the model ?
- what if 3 predictors extra reduce residual variance to 50% ?
- note:
- partial \(R^2\) = variance predictor / total variance
- \(f^2\) = variance predictor / residual variance = \({R^2/{(1-R^2)}}\)
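For the reference example the two notes above amount to (Python, illustrative):

```python
var_effect, var_error = 1.0, 16.0    # reference example: between sd 1, within sd 4
f2 = var_effect / var_error          # f^2 = .25^2 = .0625
r2 = f2 / (1 + f2)                   # partial R^2 = f^2 / (1 + f^2)
print(f2, round(r2, 4))
```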
Solution on a variance ratio perspective
- GPower: regression coefficient (t-test / regression, fixed model single regression coef)
- use reference example, regression style (sd of effect and error, but squared)
- calculate sample size, compare effect sizes ?
- 128, same as for reference example, now with \(f^2\) = \(.25^2\) = .0625 (d=.5,r=.243)
- what if also other predictors in the model ?
- very little impact → loss of degree of freedom
- ignore that predictors explain variance → reduce residual variance
- what if 3 predictors extra reduce residual variance to 50% ?
- less noise → bigger effect size
- sample size much less (65)
39 A variance ratio perspective on multiple groups
- multiple groups → not one effect size d
- F-test statistic & effect size f, ratio of variances \(\sigma_{between}^2 / \sigma_{within}^2\)
- difference between multiple groups summarized in variance \(\sigma_{between}^2\)
- example: one control and two treatments → reference example + 1 group
- sd within each group, for all groups (C,T1,T2) = 4
- means C=0, T1=2 and for example T2=4
40 Multiple Groups: Omnibus
- difference between some groups → at least two differ
- GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
- effect size f, with numerator/denominator df
- obtain sample size for reference example, just 2 groups C and T1 (size=64)!
- play with sizes, how does size matter ?
- include third group, with mean 2, what are sample sizes (compare with 2 groups)?
- set third group mean to 0, how does it compare with mean 2 (think and try)?
- set third group mean to 4, but also vary middle group (eg., 1 or 3), does that have an effect ?
- change procedure: repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?
Solution for multiple groups omnibus
- GPower: one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
- obtain sample size for reference example, just 2 groups C and T1 (size=64)!
- 128, same again, despite different effect size (f) and distribution
- size used only to include imbalance
- include third group, with mean 2, what are sample sizes (compare with 2 groups)?
- effect sizes f = .236; sample size 177 (59*3), requires more observations
- set third group mean to 0, how does it compare with mean 2 (think and try)?
- effect and sample size same, no difference whether big 0 group or big 2 group.
- set third group mean to 4, but also vary middle group (eg., 1 or 3), does that have an effect ?
- effect sizes f = .408 (4), .425 (1/3), increase with middle group away from middle.
- change procedure: repeat for between variance 2.67 (balanced: 0, 2, 4) and within variance 16 ?
- sample size 21*3=63, for f = .408 (1/7th explained = 1 between / 6 within)
41 Multiple Groups: Pairwise
- assume one control, and two treatments
- interested in all three pairwise comparisons → maybe Tukey
- typically run a posteriori, after omnibus shows effect
- use multiple t-tests with corrected \(\alpha\) for multiple testing
- GPower: t-tests / means, difference two independent groups
- apply Bonferroni correction for original 3 group example (0, 2, 4)
- what samples sizes are necessary for all three pairwise tests ?
- what if biggest difference ignored (C-T2), because know that easier to detect ?
- with original 64 sized groups, what is the power (both situations above) ?
Solution for multiple groups pairwise
- GPower: t-tests/means difference two independent groups
- apply Bonferroni correction for original 3 group example (0, 2, 4)
- what samples sizes are necessary for all three pairwise tests ?
- 0-2 and 2-4 → d=.5, 0-4 → d=1
- divide \(\alpha\) by 3 → .05/3=.0167
- sample size 86 * 2 for 0-2 and 2-4, 23 * 2 for 0-4 → 86 * 3 = 258
- what if biggest difference ignored (C-T2), because know that easier to detect ?
- divide \(\alpha\) by 2 → .05/2=.025
- sample size 78 * 2 for 0-2 and 2-4 → 78 * 3 = 234 (24 less)
- with original 64 sized groups, what is the power (both situations above) ?
- .6562 for 3 tests (\(\alpha\)=.0167)
- .7118 for 2 tests (\(\alpha\)=.0250)
- post-hoc test → power-loss (lower \(\alpha\) → higher \(\beta\))
42 Multiple Groups: Contrasts
- contrasts are linear combinations → planned comparison
- eg., 1 * T1 -1 * C \(\neq\) 0 & 1 * T2 -1 * C \(\neq\) 0
- eg., .5 * (1 * T1 + 1 * T2) -1 * C \(\neq\) 0
- effect sizes for planned comparisons must be calculated !!
- variance ratios
- standard deviation of contrasts → between variance
- compare between variance for contrast with within variance
- each contrast
- requires 1 degree of freedom
- combines a specific number of levels
- multiple testing correction may be required
- with group means \(\mu_i\), pre-specified coefficients \(c_i\), sample sizes \(n_i\), total sample size \(N\):
\(\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}\)
43 Multiple Groups: Contrasts (continued)
- GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
- obtain effect sizes for contrasts (assume equally sized for convenience)
- \(\sigma_{contrast}\) T1-C: \(\frac{|-1*0 + 1*2 + 0*4|}{\sqrt{2*((-1)^2+1^2+0^2)}} = 1\); \(\sigma_{error}\) = 4 → \(f\) = .25
- \(\sigma_{contrast}\) T2-C \(= 2\); \(\sigma_{error}\) = 4 → \(f\) = .5
- \(\sigma_{contrast}\) (T1+T2)/2-C \(= 1.4142\); \(\sigma_{error}\) = 4 → \(f\) = .3535
- sample size for each contrast, each 1 df
- what samples sizes for either contrast 1 or contrast 2 ?
- what samples sizes for both contrast 1 and contrast 2 combined ?
- if taking that sample size, what will be the power for T1-T2 ?
- what samples size for contrast 3 ?
Solution for multiple groups contrasts
- GPower: one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
- what samples sizes for either contrast 1 or contrast 2 ?
- variance explained \(1^2\) or \(2^2\)
- for T1-C \(f\) = \(\sqrt{1^2/4^2}\) = .25 = d/2 → 128 (64 C - 64 T1)
- for T2-C \(f\) = \(\sqrt{2^2/4^2}\) = .50 = d/2 → 34 (17 C - 17 T2)
- what samples sizes for both contrast 1 and contrast 2 combined ?
- multiple testing, consider Bonferroni correction → /2
- for T1-C 155, for T2-C 41 → total 175 (78 C, 77 T1, 20 T2)
- if taking that sample size, what will be the power for T1-T2 ?
- post-hoc, 77 and 20, with d=.5 and \(\alpha\) = .025 → power \(\approx\) .5
- what samples size for contrast 3 ?
- variance contrast \(1.4142^2\)
- 3 groups, little impact if any
- for .5*(T1+T2) - C \(f\) = \(\sqrt{2/16}\) = .3535 → 65 (22 C, 21 T1, 22 T2)
44 Multiple Factors
- multiple main effects and possibly interaction effects (eg., treatment and type)
- main effects (average effects, additive) & interaction (factor level specific effects)
- note: numerator degrees of freedom → main effect (nr-1), interaction (nr1-1)*(nr2-1)
- \(\eta^2\) = \(f^2 / (1+f^2)\), remember \(f = d/2\) for two groups
- note: get effect sizes for two way anova: http://apps.icds.be/effectSizes/
- GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
- determine \(\eta^2\) and sample size for
reference example
, remember the between group variance ?
- use the app: use for means only values 0 and 2, and 4 and 6 if necessary
- for treatment use C-T1-T2, for type (second predictor) use B1-B2
- get \(\eta^2\) for treatment effect but no type effect ? recognize \(f\) ?
- specify such that types differ, not treatment → \(f\) and sample size ?
- specify such that treatment effect only for one type → \(f\) and sample size ?
- specify effect for both treatment and type, without interaction → \(f\) and sample size ?
Solution for multiple factors
- GPower: multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
- determine sample size for
reference example
, remember the between group variance ?
- between group variance 1, within 16, sample size 128 (numerator df = 2-1)
- 2 x 2 with 0-2 → \(\eta^2\) as expected = .0588
- get \(\eta^2\) for treatment effect but no type effect ? recognize \(f\) ?
- 0-2-4 for both types → \(f\) = .4082 of the omnibus F-test (compare all groups)
- specify such that types differ, not treatment → \(f\) and sample size ?
- 0-0-0 versus 2-2-2 → \(f\) = .25 of t-test (compare two groups)
- specify such that treatment effect only for one type → \(f\) and sample size ?
- 0-2-4 versus 0-0-0 → \(f\) = .2041, .25 and .2041
- detect interaction (num df = 2) = 235 total (40 per combination)
- detect only treatment effect (num df = 2) = 235 total (79 each group, 79/2 per combination)
- detect only type effect (num df = 1) = 128 total (64 each group, 64/3 per combination)
- detect both main effects = 40 each combination ~ max(79/2,64/3)
- specify effect for both treatment and type, without interaction → \(f\) and sample size ?
- 0-2-4 versus 2-4-6 → \(f\) = .4082, .25 and 0, sample size = 21 per combination
45 Repeated Measures
- if repeated measures → account for correlations within
- possible to focus on:
- within: similar to dependent t-test for multiple measurements
- between: group comparison, each based on multiple measurements
- interaction: difference between changes over measurements (within)
- correlation within unit (eg., within subject)
- informative within unit (like paired t-test)
- redundancy on information between units (observations less informative)
- beware: effect size could include or exclude correlation
- GPower: repeated measures (F-test / Means, repeated measures…)
46 Repeated Measures Within
- GPower: repeated measures (F-test / Means, repeated measures within factors)
- use effect size f = .25 (1/16 explained versus unexplained)
- mimic dependent t-test, correlation .5 !
- mimic independent t-test, but only use 1 group !
- double number of groups to 2, or 4 (cor = .5), what changes ?
- double number of measurements to 4 (cor = .5), impact ?
- compare impact double number of measurements for correlations .5 with .25 ?
Solution for repeated measures within
- GPower: repeated measures (F-test / Means, repeated measures within factors)
- mimic dependent t-test, correlation .5 !
- only 1 group, 2 repeated measures, correlation .5 → 34 x 2 measurements
- mimic independent t-test, but only use 1 group !
- only 1 group, 2 repeated measures, correlation 0 → 65 x 2 measurements
- double number of groups to 2, or 4 (cor = .5), what changes ?
- number of groups not relevant for within group comparison
- but requires estimation, changed degrees of freedom
- double number of measurements to 4 (cor = .5), impact ?
- sample size reduces from 34 to 24, but 34x2=68, 24*4=96
- with 4 measurements (double) take half the correlation (0.25), impact ?
- sample size 35, nearly 34
- 2 repeated measurements with corr .5, about same sample size as 4 repeats with corr .25
47 Repeated Measures Between
- GPower: repeated measures (F-test / Means, repeated measures between factors)
- use effect size f = .25 (1/16 explained versus unexplained)
- compare 2 groups, each 2 measurements… impact on sample size when correlation 0, .25 and .5 ?
- double number of groups to 2, or 4 (cor = .5), what changes ?
- double number of measurements to 4 (cor = .5), impact ?
- compare impact number of measurements for different correlations .5 with .25 ?
- mimic independent t-test ?
Solution for repeated measures between
- GPower: repeated measures (F-test / Means, repeated measures between factors)
- use effect size f = .25 (1/16 explained versus unexplained)
- compare 2 groups, each 2 measurements… impact on sample size when correlation 0, .25 and .5 ?
- increase in correlations results in increase in sample size (redundancy)
- double number of groups to 2, or 4 (cor = .5), what changes ?
- increase in number of groups, small increase (estimation required) IF same effect size \(f\)
- double number of measurements to 4 (cor = .5), impact ?
- increase in number of measurements, increases total number, but reduces number of units
- compare impact number of measurements for different correlations .5 with .25 ?
- increase stronger if correlations stronger
- mimic independent t-test ?
- 128 units, if .99 correlation with fully redundant second set
- 132 (66/2 * 2), if 0 correlation with need to estimate four group averages and correlation
48 Repeated Measures Interaction Within x Between
- GPower: repeated measures (F-test / Means, repeated measures within-between factors)
- option: calculate effect sizes: http://apps.icds.be/effectSizes/
- for sd = 4, with group with average 0-2-4, and with non-responsive (all 0):
- compare effect sizes for interaction with correlation .5 and 0, conclude ?
- compare sample sizes for those 2 effect sizes with correlation .5 or 0 ?
Solution for repeated measures interaction within x between
- GPower: repeated measures (F-test / Means, repeated measures within-between factors)
- option: calculate effect sizes: http://apps.icds.be/effectSizes/
- for sd = 4, with group with average 0-2-4, and with non-responsive (all 0):
- compare effect sizes for interaction with correlation .5 and 0, conclude ?
- with 0 correlation → \(f\) for interaction = .25
- with .5 correlation → \(f\) = .3536
- compare sample sizes for those 2 effect sizes with correlation .5 or 0 ?
- for \(f\) = .25, sample sizes are 54x2 (cor=0) and 28x2 (cor=.5)
- for \(f\) = .3536, sample sizes are 28x2 (cor=0) and 16x2 (cor=.5)
- either include .5 correlation to calculate effect size OR sample size
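The two effect sizes above differ exactly by a factor 1/√(1−ρ): including the correlation in the effect size amounts to f corrected = f/√(1−ρ). A quick check of the .3536 value:

```r
# Effect size f corrected for the repeated measures correlation rho
f_corrected <- function(f, rho) f / sqrt(1 - rho)
f_corrected(.25, .5)  # 0.3536 (rounded), the value used above
f_corrected(.25, .0)  # 0.2500, unchanged when uncorrelated
```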
49 Correlations
if comparing two independent correlations
use Fisher Z transformations to normalize first
- \(z = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right)\) → \(q = z_1 - z_2\)
GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s
- with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
- with the same difference, but stronger correlations, e.g., .9844 and .7, what changes ?
- with the same difference, but weaker correlations, e.g., .1 and .3844, what changes ?
note that dependent correlations are more difficult, see manual
Solution for correlations
- GPower: z-tests / correlation & regressions: 2 indep. Pearson r’s
- with correlation coefficients .7844 and .5, what are the effect & sample sizes ?
- effect size q = 0.5074, sample size 64*2 = 128
- \(.5*log((1+.7844)/(1-.7844)) - .5*log((1+.5)/(1-.5))\)
- notice: effect size q \(\approx\) d, same sample size
- with the same difference, but stronger correlations, e.g., .9844 and .7, what changes ?
- effect size q = 1.5556, sample size 10*2 = 20
- same difference but a bigger effect (higher correlations are easier to differentiate)
- with the same difference, but weaker correlations, e.g., .1 and .3844, what changes ?
- effect size q = 0.3048, sample size 172*2 = 344
- same difference (sign reversed), but a smaller effect (lower correlations are more difficult to differentiate)
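These q values and sample sizes can be reproduced without GPower: Fisher's z is atanh(r), and the standard normal approximation (a transformed correlation has variance 1/(n−3), hence the +3) gives n per group ≈ 2((z_{1−α/2}+z_{1−β})/q)² + 3:

```r
# Effect size q for two independent correlations (Fisher z = atanh)
q_effect <- function(r1, r2) atanh(r1) - atanh(r2)

# Approximate n per group, two-sided test (normal approximation)
n_per_group <- function(q, alpha = .05, power = .80) {
  za <- qnorm(1 - alpha / 2)  # 1.96 for alpha = .05
  zb <- qnorm(power)          # 0.84 for power = .80
  ceiling(2 * ((za + zb) / abs(q))^2 + 3)
}

q_effect(.7844, .5)               # ~0.5074, the q reported above
n_per_group(q_effect(.7844, .5))  # 64 per group -> 128 total
n_per_group(q_effect(.9844, .7))  # 10 per group -> 20 total
n_per_group(q_effect(.1, .3844))  # 172 per group -> 344 total
```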
50 Proportions
if comparing two independent proportions → bounded between 0 and 1
GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
effect sizes in odds ratio, relative risk, difference proportion
- for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 ?
- what is the sample size to detect a difference for both situations ?
- for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
- for odds ratio 1/3 and p2 = .25, determine p1 and sample size, how does it compare with before ?
- compare sample size for a .15 difference, at p1=.5 ?
Solution for proportions
- GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
- for odds ratio 3 and p2 = .50, what is p1 ? and for odds ratio 1/3 ?
- odds ratio 3 → with p2 = .5 or \(odds_2\) = 1, \(odds_1\) = 3 thus p1 = 3/(3+1) = .75; for odds ratio 1/3 → p1 = .25
- what is the sample size to detect a difference for both situations ?
- 128, the same for .5 versus .25 as for .5 versus .75 (unlike correlations)
- for odds ratio 3 and p2 = .75, determine p1 and sample size, how does it compare with before ?
- p1 to .9, difference of .15, sample size increases to 220
- for odds ratio 1/3 and p2 = .25, determine p1 and sample size, how does it compare with before ?
- p1 to .1, difference of .15, sample size increases to 220
- compare sample size for a .15 difference, at p1=.5 ?
- sample size even higher, 366; the increase is not due to a smaller difference (still .15) but because proportions near .5 are hardest to differentiate
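The odds ratio to proportion conversions above are plain arithmetic: with reference odds \(odds_2\) = p2/(1−p2), the comparison odds are OR·\(odds_2\), and p1 = odds/(1+odds). A sketch:

```r
# Convert a reference proportion p2 plus an odds ratio into p1
or_to_p1 <- function(or, p2) {
  odds1 <- or * p2 / (1 - p2)  # comparison odds = OR * reference odds
  odds1 / (1 + odds1)          # back from odds to a proportion
}
or_to_p1(3,   .50)  # 0.75, as derived above
or_to_p1(1/3, .50)  # 0.25
or_to_p1(3,   .75)  # 0.90, again a .15 difference
or_to_p1(1/3, .25)  # 0.10
```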
51 Exercise proportions
- GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
- for odds ratio = 2, with p2 reference probability .6
- plot power over proportions .5 to 1
- include 5 curves, sample sizes 328, 428, 528…
- with type I error .05
- explain curve minimum, relation sample size ?
- repeat for one-tailed, difference ?
Solution for exercise proportions
- GPower: Fisher Exact Test (exact / proportions, difference 2 independent proportions)
- for odds ratio = 2, with p2 reference probability .6
- plot power over proportions .5 to 1
- include 5 curves, sample sizes 328, 428, 528…
- with type I error .05
- explain curve minimum, relation sample size ?
- power for proportion compared to reference .6
- minimum is type I error probability
- sample size determines impact
- repeat for one-tailed, difference ?
- one-tailed testing increases power (on both sides !?)
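GPower draws these curves for the Fisher exact test; the same qualitative picture (a minimum near the reference .6, curves rising with sample size) appears under the normal approximation with base R's power.prop.test, which takes n per group rather than the total:

```r
# Normal-approximation power over p1, reference p2 = .6, alpha = .05
# (values differ slightly from the Fisher exact curves in GPower)
p1 <- c(.50, .55, .65, .70, .75, .80)  # skip .6 itself (no difference)
for (n_total in c(328, 428, 528)) {
  pw <- sapply(p1, function(p)
    power.prop.test(n = n_total / 2, p1 = p, p2 = .6, sig.level = .05)$power)
  cat("total n =", n_total, ":", round(pw, 2), "\n")
}
```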
52 Dependent Proportions
if comparing two dependent proportions → categorical shift
- if only two categories, McNemar test: compare \(p_{12}\) with \(p_{21}\)
- information from changes only → discordant pairs
- effect size as odds ratio → ratio of discordance
- like other exact tests, a choice of how to assign the alpha
GPower: McNemar test (exact / proportions, difference 2 dependent proportions)
- assume odds ratio equal to 2, equal sized, type I and II errors .05 and .2, two-way !
- what is the sample size for .25 proportion discordant, .5, and 1 ?
- odds ratio .5 or 4 (prop discordant = .25), what are \(p_{12}\) and \(p_{21}\) and sample sizes ?
- repeat for third alpha option, and consider total sample size, what happens ?
Solution for dependent proportions
- GPower: McNemar test (exact / proportions, difference 2 dependent proportions)
- assume odds ratio equal to 2, equal sized, type I and II errors .05 and .2, two-way !
- what is the sample size for .25 proportion discordant, .5, and 1 ?
- 288 (.25), 144 (.5), 73 ≈ 144/2 (at .99) → sample size decreases with increased discordance
- odds ratio .5 or 4, (prop discordant = .25), what are \(p_{12}\) and \(p_{21}\) and sample sizes ?
- same as 2 but with \(p_{12}\) and \(p_{21}\) reversed, with sample size 288
- with 4 as odds ratio, a larger effect, requiring a smaller sample size, only 80
- odds ratio = \(p_{12}\) / \(p_{21}\)
- repeat for third alpha option, with odds ratio 4, what happens ?
- changed lower / upper critical N, lower sample size
- BUT, this is because the power is lower, closer to the requested .8
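The McNemar inputs are linked by \(p_{12}\) + \(p_{21}\) = proportion discordant and odds ratio = \(p_{12}\)/\(p_{21}\), so \(p_{12}\) = pd·OR/(1+OR) and \(p_{21}\) = pd/(1+OR). A quick check of the cells behind the numbers above:

```r
# Discordant cell probabilities from proportion discordant and odds ratio
discordant_cells <- function(pd, or) {
  p12 <- pd * or / (1 + or)  # so that p12 / p21 = or
  p21 <- pd / (1 + or)       # and p12 + p21 = pd
  c(p12 = p12, p21 = p21)
}
discordant_cells(.25, 2)   # p12 ~ .167, p21 ~ .083
discordant_cells(.25, .5)  # the same two cells, reversed
discordant_cells(.25, 4)   # p12 = .20, p21 = .05 (larger effect)
```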
53 Not Included
- various statistical tests are difficult to specify in GPower
- various statistics / parameter values are difficult to guesstimate
- the manual is not always very elaborate for more complex tests
- various statistical tests not included in GPower
- eg., survival analysis
- many tools online, most dedicated to a particular model
- various statistical tests have no formula to offer a sample size
- simulation may be the only tool
- iterate many times: generate and analyze → proportion of rejections
- generate: simulated outcome ← model and uncertainties
- analyze: simulated outcome → model and parameter estimates + statistics
54 Simulation Example t-test
gr <- rep(c('T','C'),64)
y <- ifelse(gr=='C',0,2)
dta <- data.frame(y=y,X=gr)
cutoff <- qt(.025,nrow(dta)-2) # critical t, df = n1+n2-2 = 126
my_sim_function <- function(){
dta$y <- dta$y+rnorm(length(dta$X),0,4) # generate (with sd=4)
res <- t.test(data=dta,y~X) # analyze
c(res$estimate %*% c(-1,1),res$statistic,res$p.value)
}
sims <- replicate(10000,my_sim_function()) # many iterations
dimnames(sims)[[1]] <- c('diff','t.stat','p.val')
mean(sims['p.val',] < .05) # p-values 0.8029
mean(sims['t.stat',] < cutoff) # t-statistics 0.8029
mean(sims['diff',] > sd(sims['diff',])*cutoff*(-1)) # differences 0.8024
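The simulated rejection proportion (~.80) can be checked against the closed-form power for the same design, e.g., with base R's power.t.test:

```r
# Closed-form check of the simulation: d = 2/4 = .5, n = 64 per group
power.t.test(n = 64, delta = 2, sd = 4, sig.level = .05)$power  # ~0.80
# ...or solve for the per-group n that reaches 80% power
ceiling(power.t.test(delta = 2, sd = 4, power = .80)$n)         # 64
```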
55 Focus / Simplify
- complex statistical models
- simulate BUT it requires programming and a thorough understanding of the model
- alternative: focus on essential elements → simplify the aim
- sample size calculations (design) for simpler research aim
- not necessarily equivalent to final statistical testing / estimation
- requires justification to convince yourself and/or reviewers
- successful already if simple aim is satisfied
- ignored part is not too costly
- example:
- statistics: group difference evolution 4 repeated measurements → mixed model
- focus: difference treatment and control last time point is essential → t-test
- argument: first 3 measurements low cost, interesting to see change
56 Conclusion
- sample size calculation is a design issue, not a statistical one
- building blocks: sample & effect sizes, type I & II errors
- establish any of these building blocks, conditional on the rest
- effect sizes express the amount of signal compared to the background noise
- GPower deals with not too complex models
- more complex models imply more complex specification
- simplify using a focus, if justifiable → then GPower can get you a long way

Methodological and statistical support to help make a difference
ICDS provides complementary support in methodology and statistics to our research community, for both individual researchers and research groups, to help them get the most out of their research
ICDS aims to address all questions related to quantitative research, and to further enhance the quality of both the research itself and how it is communicated
website: https://www.icds.be/ includes information on who we serve, and how
booking: https://www.icds.be/consulting/ for individual consultations