## Sample size calculation with GPower

Wilfried Cools (ICDS) & Sven Van Laere (BiSI)
https://www.icds.be/

## sample size calculation: why & how

• what about my sample size ?
• more is better ≡ observation → information
• but increasingly less so
• but limited resources (time/money)
• but ethical and practical considerations
• how many is good enough ?
• depends on what you want
• workshop → how to calculate it ?
• understand the reasoning
• apply for simple but common situations
• not one simple formula for all → GPower to the rescue

## sample size calculation: a design issue

• testing → power [probability to detect existing effects]
• estimation → accuracy [size of confidence intervals]
• before data collection, during design of study
• requires understanding: future data, analysis, inference (effect size, focus, ...)
• conditional on assumptions & decisions
• not always possible nor meaningful !
• easier for experiments (control), less for observational studies
• easier for confirmatory studies, much less for exploratory studies
• NO retrospective power analyses → OK for future study only
Hoenig, J., & Heisey, D. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55, 19–24.

## simple example

• evaluation of radiotherapy to reduce a tumor in mice
• comparing treatment group with control (=conditions)
• tumor induced, random assignment treatment or control (equal if no effect)
• after 20 days, measurement of tumor size (=observations)
• analysis:
• unpaired t-test to compare averages for treatment and control
• goal:
• if the average size in treatment is at least 20% less than control
then we want to detect it (significance)
• the main issue:
• how to calculate the sample size required to detect the effect aimed for ?

## overview

• PART I: building blocks in action for t-test
• sizes: effect size, sample size
• errors: type I ($\alpha$), type II ($\beta$)
• distributions: Ho, Ha
• criterion: confidence (estimation), power (testing)
• PART II: moving beyond independent t-test
• dependent groups
• non-parametric distributions
• multiple groups (ANOVA: omnibus, pairwise, focused)
• proportions, correlations, ...

## reference example

• sample sizes easy and meaningful to calculate for well understood problems
• a priori specifications
• intend to perform a statistical test
• comparing 2 equally sized groups
• to detect difference of at least 2
• assuming a standard deviation (SD) of 4 in each group
• which results in an effect size of .5
• evaluated on a Student t-distribution
• allowing for a type I error prob. of .05 $(\alpha)$
• allowing for a type II error prob. of .2 $(\beta)$
• sample size conditional on specifications being true

## a formula you could use

• for this particular case:
• sample size (n → ?)
• difference (d=signal → 2)
• uncertainty ($\sigma$=noise → 4)
• type I errors ($\alpha$ → .05, so $Z_{\alpha/2}$ → -1.96)
• type II errors ($\beta$ → .2, so $Z_\beta$ → -0.84)

$n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2}$

$n = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79$

• sample size = 2 groups x 63 observations = 126
• note: formulas are test and statistic specific but the logic remains the same
• this and other formulas are implemented in various tools, our focus: GPower
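The formula above can be checked numerically. What follows is a minimal sketch in Python (standard library only) using the reference values; the variable names are illustrative and not part of GPower.

```python
from math import ceil
from statistics import NormalDist

# Reference example: difference 2, SD 4, alpha .05 (two-tailed), beta .2
d, sigma, alpha, beta = 2.0, 4.0, 0.05, 0.20

z_alpha = NormalDist().inv_cdf(alpha / 2)  # about -1.96
z_beta = NormalDist().inv_cdf(beta)        # about -0.84

# n per group from the normal-approximation formula
n_per_group = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / d ** 2
n_rounded = ceil(n_per_group)  # round up to whole observations

print(round(n_per_group, 2), n_rounded, 2 * n_rounded)
```

The result is a normal approximation, so it can be slightly smaller than what a t-distribution based tool reports.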

## GPower: a useful tool

• popular and well established
• free @ http://www.gpower.hhu.de/
• implements wide variety of tests
• implements various visualizations
• documented fairly well
• note: not all tests are included !

## GPower: the building blocks in action

• For this example, calculate sample size based on
• effect size (difference of interest, scaled on standard deviation)
• type I and type II error

## GPower input

• t-test : difference two indep. means
• a priori: calculate sample size
• effect size = standardized difference
• Cohen's $d$
• $d$ = |difference| / SD_pooled
• $d$ = |0-2| / 4 = .5
• $\alpha$ = .05, 2-tailed ($\alpha$/2 → .025 & .975)
• $power = 1-\beta$ = .8
• allocation ratio = 1
• ~ reference example

## GPower output

• sample size ($n$) = 64 x 2 = (128)
• degrees of freedom ($df$) = 126 (128 - 2)
• plot showing null Ho and alternative Ha distribution
• in GPower central and non-central distribution
• Ho & critical value → decision boundaries
• critical t = 1.979, qt(.975,126)
• Ha, shift with non-centrality parameter → truth
• non centrality parameter ($\delta$) = 2.8284
2/(4*sqrt(2))*sqrt(64)
• power ≥ .80 (1-$\beta$) = 0.8015
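The non-centrality parameter can be reproduced by hand. The sketch below (Python, standard library only) also approximates the power by replacing both t distributions with normal ones; that replacement is an assumption made here for simplicity, so the power comes out slightly higher than GPower's 0.8015.

```python
from math import sqrt
from statistics import NormalDist

diff, sd, n = 2.0, 4.0, 64   # reference example, n per group
alpha = 0.05

# non-centrality parameter: standardized difference scaled by sample size
ncp = diff / (sd * sqrt(2)) * sqrt(n)

# normal approximation: Ho ~ N(0, 1), Ha ~ N(ncp, 1)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
ha = NormalDist(mu=ncp)
power_approx = (1 - ha.cdf(z_crit)) + ha.cdf(-z_crit)

print(round(ncp, 4), round(power_approx, 3))
```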

## reference example protocol

t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size

Input: Tail(s) = Two
Effect size d = 0.5000000
α err prob = 0.05
Power (1-β err prob) = .8
Allocation ratio N2/N1 = 1

Output: Noncentrality parameter δ = 2.8284271
Critical t = 1.9789706
Df = 126
Sample size group 1 = 64
Sample size group 2 = 64
Total sample size = 128
Actual power = 0.8014596

## GPower distributions

• distribution based test selection
• Exact Tests (8)
• $t$-tests (11) → reference
• $z$-tests (2)
• $\chi^2$-tests (7)
• $F$-tests (16)
• focus on the density functions

• design based test selection
• correlation & regression (15)
• means (19) → reference
• proportions (8)
• variances (2)
• focus on the type of parameters

## Ho and Ha distributions

• Ho acts as $\color{red}{benchmark}$ → eg., no difference
• Ho ~ t(0,df) $\color{green}{cut off}$ using $\alpha$,
• reject Ho if test returns implausible value
• Ha acts as $\color{blue}{truth}$ → eg., difference of .5 SD

• Ha ~ t(ncp,df)
• ncp as violation of Ho → shift (location/shape)

## non-centrality: Ho → Ha

• ncp : non-centrality parameter

• shift between Ho and Ha
• assumed effect size (target or signal)
• conditional on sample size (information)
• overlap → power or sample size
using $\alpha$ on Ho and $\beta$ on Ha
• Ha is NOT interchangeable with Ho

• absence of evidence $\neq$ evidence of absence
• equivalence testing (Ha for 'no effect')

## divide by n perspective on distributions

• remember: $n = \frac{(Z_{\alpha/2}+Z_\beta)^2 * 2 * \sigma^2}{d^2} = \frac{(-1.96-0.84)^2 * 2 * 4^2}{2^2} = 62.79$

## no Ha for estimation

• focus on estimation, plausible values of effect, not testing or power
• sample size without type II error $\beta$, power, Ho or Ha
• ~ divide by n perspective but shifted to estimate
• precision analysis → set maximum width confidence interval
• let E = maximum half width of confidence interval to accept
• for confidence level $1-\alpha$
• $n = z^2_{\alpha/2} * \sigma^2 * 2 / \:E^2$ (for 2 groups)
• equivalence with statistical testing
• if 0 (or other reference) outside confidence bounds → significant
• NOT GPower
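Since GPower does not cover precision analysis, the formula can be evaluated directly. This is a minimal sketch in Python, assuming the SD of 4 from the reference example and a hypothetical maximum half width E = 2 chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist

sigma = 4.0    # SD per group, as in the reference example
E = 2.0        # assumed maximum half width (hypothetical choice)
alpha = 0.05   # for a 95% confidence interval

z = NormalDist().inv_cdf(1 - alpha / 2)
n_per_group = z ** 2 * sigma ** 2 * 2 / E ** 2

print(ceil(n_per_group))
```

No $\beta$ appears anywhere, so the required n is smaller than for testing with power .8 at the same $\sigma$.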

## type I/II error probability

• distribution cut-offs (density → AUC=1)
• decide whether to reject Ho assuming Ha
• two types of error
• P(infer=Ha|truth=Ho) = $\alpha$
• P(infer=Ho|truth=Ha) = $\beta$
• two types of correct inference
• P(infer=Ho|truth=Ho) = $1-\alpha$
• P(infer=Ha|truth=Ha) = $1-\beta$ → power
• cut-off 'known' Ho for statistical test
• two tailed → both sides informative on Ho
• one tailed → one side not informative on Ho

|  | infer=Ha | infer=Ho | sum |
|---|---|---|---|
| truth=Ho | $\alpha$ | 1-$\alpha$ | 1 |
| truth=Ha | 1-$\beta$ | $\beta$ | 1 |

## error exercise : create plot

• create plot
(X-Y plot for range of values)
• plot sample size by type I error
• set plot to 4 curves
• for power .8 in steps of .05
• set $\alpha$ on x-axis
• from .01 to .2 in steps of .01
• use effect size .5

notice Table option

## error exercise : interpret plot

• where on the red curve (right)
type II error = 4 * type I error ?
• when smaller effect size (.25), what changes ?
• switch power and sample size (32 in step of 32)
what is relation type I and II error ?

• where on the yellow curve (left)
type II error = 4 * type I error ?

• for allocation ratio 4, compare plots

## decide type I/II error probability

• rules of thumb ?
• $\alpha$ in range .01 - .05 → 1/100 - 1/20
• $\beta$ in range .1 to .2 → power = 80% to 90%
• $\alpha$ & $\beta$ inversely related
• if $\alpha = 0$ → never reject, no power
• if power 99% → high $\alpha$ for same sample size
• determine the balance
• which error you want to avoid most ?
• cheap AIDS test ? → avoid type II
• heavy cancer treatment ? → avoid type I
• $\alpha$ & $\beta$ often selected in 1/4 ratio
type I error is 4 times worse !!
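The inverse relation between $\alpha$ and $\beta$ at a fixed sample size can be tabulated. The sketch below uses a normal approximation (an assumption, GPower uses the t distribution) with the reference effect size .5 and 64 observations per group; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def beta_for_alpha(alpha, d=0.5, n_per_group=64):
    """Approximate type II error for a two-tailed two-group comparison."""
    ncp = d * sqrt(n_per_group / 2)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ha = NormalDist(mu=ncp)
    power = (1 - ha.cdf(z_crit)) + ha.cdf(-z_crit)
    return 1 - power

# stricter alpha → larger beta, for the same n and effect size
for alpha in (0.01, 0.05, 0.10, 0.20):
    print(alpha, round(beta_for_alpha(alpha), 3))
```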

## interim analyses: control type I error

• analyze and proceed ? (peeking)
• multiple testing as data is collected
• inflates type I error $\alpha$
• correct $\alpha$
• interim analysis specific $\alpha_i$ with overall $\alpha$ under control
• suggested technique: alpha spending
• use O'Brien-Fleming bounds
• NOT GPower

## for fun: P(effect exists | test says so)

• power → P(test says there is effect | effect exists)
• $P(infer=Ha|truth=Ho) = \alpha$
• $P(infer=Ho|truth=Ha) = \beta$
• $P(infer=Ha|truth=Ha) = power$
• $P(\underline{truth}=Ha|\underline{infer}=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha)}$ → Bayes Theorem
• $P(truth=Ha|infer=Ha) = \frac{P(infer=Ha|truth=Ha) * P(truth=Ha)}{P(infer=Ha|truth=Ha) * P(truth=Ha) + P(infer=Ha|truth=Ho) * P(truth=Ho)}$
• $P(truth=Ha|infer=Ha) = \frac{power * P(truth=Ha)}{power * P(truth=Ha) + \alpha * P(truth=Ho)}$ → depends on prior probabilities
• IF very low probability model is true, eg., .01 ? → $P(truth=Ha) = .01$
• THEN probability effect exists if test says so is low, in this case only 14% !!
• $P(truth=Ha|infer=Ha) = \frac{.8 * .01}{.8 * .01 + .05 * .99} = .14$
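The Bayes calculation is easy to repeat for other prior probabilities. A minimal sketch in Python follows; the function name is illustrative, and the second call uses a prior of .5 purely as a contrasting assumption.

```python
def prob_effect_given_significant(power, alpha, prior):
    """P(truth=Ha | infer=Ha) by Bayes' theorem."""
    true_positive = power * prior          # P(infer=Ha | truth=Ha) * P(truth=Ha)
    false_positive = alpha * (1 - prior)   # P(infer=Ha | truth=Ho) * P(truth=Ho)
    return true_positive / (true_positive + false_positive)

print(round(prob_effect_given_significant(0.8, 0.05, 0.01), 2))  # 0.14
print(round(prob_effect_given_significant(0.8, 0.05, 0.50), 2))
```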

## effect sizes

• degree with which a certain phenomenon holds (~ Ho is false)
• part of non-centrality (as is sample size) → shift in GPower
• signal to noise ratio
• typically not just the signal, to provide scale
• eg., difference on scale of pooled standard deviation
• bigger effect → more easy to detect it (pushing away Ha)
• 2 main families of effect sizes → test specific
• differences d-family // association r-family
• transformations, eg., d = .5 → r = .243
• $d = \frac{2r}{\sqrt{1-r^2}}$; $\hspace{20 mm}r = \frac{d}{\sqrt{d^2+4}}$; $\hspace{20 mm}d = ln(OR) * \frac{\sqrt{3}}{\pi}$
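The transformations above can be written as small helper functions. This is a sketch in Python with illustrative function names; the d to r conversion in this form is usually stated for equally sized groups.

```python
from math import log, pi, sqrt

def d_to_r(d):
    """Difference family → association family."""
    return d / sqrt(d ** 2 + 4)

def r_to_d(r):
    """Association family → difference family."""
    return 2 * r / sqrt(1 - r ** 2)

def odds_ratio_to_d(odds_ratio):
    """Odds ratio → d, using the ln(OR) * sqrt(3) / pi transformation."""
    return log(odds_ratio) * sqrt(3) / pi

r = d_to_r(0.5)
print(round(r, 3), round(r_to_d(r), 3))
```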

## effect sizes of Cohen

• Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
• famous Cohen conventions
• beware, just rules of thumb
• Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed).

## effect sizes of d and r family

• most important effect size of
• d: dichotomous - continuous
• r: correlation - proportion variance
• Ellis, P. D. (2010). The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results.
• more than 70 different effect sizes... most of them related to each other
• NOT p-value ~ partly effect size, partly sample size or power
• do not simply compare p-values !

## effect sizes in GPower (Determine)

• often very difficult to specify
• GPower offers help with Determine
• difference group means
0-2 → signal ~ minimally relevant (or expected)
• standard deviations (sd)
4 each group → expected noise ~ natural diversity
• written to Effect Size d
.5 → difference in sd
• Reminder: effect size statistic depends on statistical test
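What Determine computes for this test can be sketched in a few lines of Python. The function below is an illustration that assumes equally sized groups when pooling the standard deviations; it is not GPower code.

```python
from math import sqrt

def cohen_d(mean1, mean2, sd1, sd2):
    """Standardized difference, pooling two SDs for equally sized groups."""
    sd_pooled = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return abs(mean1 - mean2) / sd_pooled

print(cohen_d(0, 2, 4, 4))  # 0.5
```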

## effect size exercise : ingredients cohen d

For the reference example:

• change mean values from 0 and 2 to 4 and 6, what changes ?
• change sd values to 2 for each, what changes ?
• effect size ?
• total sample size ?
• non-centrality ?
• critical t ?
• change sd values to 6 for each, what changes ?

## effect size exercise : plot

• plot power by effect size
• set plot to 6 curves
• for sample sizes, 34 in steps of 34
• set effect sizes on x-axis
• from .2 to 1.2 in steps of .05
• use $\alpha$ equal to .05
• create plot
(X-Y plot for range of values)
• determine (approximately) the three situations from previous slide on the plot
• how does power change when doubling the effect size, eg., from .5 to 1 ?

## effect size exercise : imbalance

For the reference example:

• change allocation ratio from 1
• to 2, .5, 3 and 4, what to conclude ?
• ratio 2 and .5 ?
• imbalance + 1 or * 2 ?
• ? no idea why n1 $\neq$ n2

## effect sizes, how to determine them in theory

• choice of effect size matters → justify choice
• choice of effect size
• NOT significant → meaningless, dependent on sample size
• realistic (eg., previously observed effect) → replicate
• important (eg., minimally relevant effect)
• use Determine to get started (check the manual)
• for independent t-test → means and standard deviations
• possible alternative is to use variance explained, eg., 1 versus 16

## effect sizes, how to determine them in practice

• experts / patients → use if possible → importance
• literature (earlier study / systematic review) → realistic
• pilot → guesstimate dispersion estimate, but very small sample size
• internal pilot → stopping rule (sequential/conditional)
• turn to Cohen → use if everything else fails (rules of thumb)
• guesstimate the input parameters, what can you do ?
• sd from assumed range / 6 assuming normal distribution
• sd for proportions (& percentages) at conservative .5
• sd from control, assume treatment the same

## relation samples & effect size, errors I & II

• building blocks:
• sample size ($n$)
• effect size ($\Delta$)
• alpha ($\alpha$)
• power ($1-\beta$)
• each parameter
conditional on others
• GPower → type of power analysis
• Apriori: $n$ ~ $\alpha$, power, $\Delta$
• Post Hoc: power ~ $\alpha$, $n$, $\Delta$
• Compromise: power, $\alpha$ ~ $\beta\:/\:\alpha$, $\Delta$, $n$
• Criterion: $\alpha$ ~ power, $\Delta$, $n$
• Sensitivity: $\Delta$ ~ $\alpha$, power, $n$
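That each building block is conditional on the others can be illustrated by solving the same relation for different unknowns. The sketch below uses a normal approximation (an assumption; GPower uses non-central t distributions, so its results are slightly more conservative), and the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def a_priori_n(d, alpha=0.05, power=0.80):
    """n per group, given effect size, alpha and power."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z_sum / d) ** 2)

def post_hoc_power(d, n, alpha=0.05):
    """Power, given effect size, n per group and alpha (upper tail only)."""
    return Z.cdf(d * sqrt(n / 2) - Z.inv_cdf(1 - alpha / 2))

def sensitivity_d(n, alpha=0.05, power=0.80):
    """Detectable effect size, given n per group, alpha and power."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return z_sum * sqrt(2 / n)

print(a_priori_n(0.5))
print(round(post_hoc_power(0.5, 64), 3))
print(round(sensitivity_d(32), 4))
```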

## type of power analysis exercise

• for given example, step through...
• retrieve power given n, $\alpha$ and $\Delta$
• [1] for power .8, take half the sample size, how does $\Delta$ change ?
• [2] set $\beta$/$\alpha$ ratio to 4, what is $\alpha$ & $\beta$ ? what is the critical value ?
• [3] keep $\beta$/$\alpha$ ratio to 4 for effect size .5, what is $\alpha$ & $\beta$ ? critical value ?
• [1] .5 to .7115 (+ .2115), a bigger effect size compensates for the loss of sample size (sensitivity)

• [2] critical value 1.9990 with errors approx. .05 and .2 (compromise)

• [3] critical value 1.6994 with errors approx. double, .09 and .38

# calculator
m1=0;m2=2;s1=4;s2=4
alpha=.025;N=128
var=.5*s1^2+.5*s2^2
d=abs(m1-m2)/sqrt(2*var)
d=d*sqrt(N/2)
tc=tinv(1-alpha,N-2)
power=1-nctcdf(tc,N-2,d)

• in R, assuming normality
• qt → get quantile on Ho ($Z_{1-\alpha/2}$)
• pt → get probability on Ha (non-central)
.n <- 64
.df <- 2*.n-2
.ncp <- 2 / (4 * sqrt(2)) * sqrt(.n)
# two-tailed power: area of Ha above the upper plus below the lower critical value
.power <- 1 -
  pt(qt(.975, df=.df), df=.df, ncp=.ncp) +
  pt(qt(.025, df=.df), df=.df, ncp=.ncp)
round(.power,4)

## [1] 0.8015


## GPower beyond independent t-test

• so far, comparing two independent means
• selected topics with small exercises
• non-parametric instead of assuming normality
• relations instead of groups (regression)
• correlations
• proportions, dependent and independent
• more than 2 groups (compare jointly, pairwise, focused)
• more than 1 predictor
• repeated measures
• GPower manual 27 tests: effect size, non-centrality parameter and example !!

## dependence between groups

• if 2 dependent groups (eg., before/after treatment) → account for correlations
• matched pairs (t-test / means, difference 2 dependent means)
• use reference example
• [1] use correlation .5 to compare (effect size, ncp, n)
• [2] how many observations if no correlation exists (reference example) ?
• [3] difference in sample size for correlation .875 ?
• [4] set original sample size (n=64*2) and effect size (dz=.5), compare ?
• [1] $\Delta$ looks same: $\sqrt{2*(1-\rho)}$, n much smaller (1 group), ncp bit bigger
• [2] approx. independent means, here 65 (estimate the correlation)
• [3] effect size * 2 → sample size from 34 to 10
• [4] -posthoc- power > .975: for 64 subjects 2 measurements, ncp > 4
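The role of the correlation in the matched-pairs effect size can be made explicit. This sketch assumes equal standard deviations for both measurements, in which case dz equals d divided by $\sqrt{2*(1-\rho)}$; the function name is illustrative.

```python
from math import sqrt

def dz_from_d(d, rho):
    """Effect size for matched pairs, assuming equal SDs in both measurements."""
    return d / sqrt(2 * (1 - rho))

# higher correlation → larger dz for the same raw difference
for rho in (0.0, 0.5, 0.875):
    print(rho, round(dz_from_d(0.5, rho), 4))
```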

## non-parametric distribution

• expect non-normally distributed residuals, avoid normality assumption
• only considers ranks or uses permutations → price is efficiency
• avoid when possible, eg., transformations
• two groups → Wilcoxon-Mann-Whitney (t-test / means, diff. 2 indep. means)
• use reference example
• [1] how about n ? compared to parametric → what is % loss efficiency ?
• [2] change parent distribution to 'min ARE' ? what now ?
• [1] a few more observations (3 more per group), less than 5 % loss

• [2] several more observations, less efficient, more than 13 % loss (min ARE)

## relations instead of group differences

• differences between groups → relation observations & categorization
• example → d = .5 → r = .243
• note: slope $\beta = {r*\sigma_y} / {\sigma_x}$
• regression coefficient (t-test / regression, one group size of slope)
• sample size for comparing the slope Ha with 0 (=Ho)
• [1] determine slope ($\beta$, with $\sigma_y$ = 4.12 and $\sigma_x$ = .5)
• [2] calculate sample size
• [3] what if $\sigma_x$ (predictor values) or $\sigma_y$ (effect and error) increase ?
• [1] for r = .243, $\sigma_y$ = $\sqrt{17}$ = 4.12, and $\sigma_x$ = $\sqrt{.25}$ = .5 (binary) → 2

• [2] 128, same as for reference example, now with effect size $\beta$

• [3] sample size decreases with $\sigma_x$ (opposite $\sigma_y$ ~ effect size), for same slope
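The slope in [1] follows directly from the reference example. A minimal sketch in Python, using the values quoted in the exercise:

```python
from math import sqrt

d = 0.5
r = d / sqrt(d ** 2 + 4)           # about .243
sigma_y = sqrt(4 ** 2 + 1 ** 2)    # within variance 16 plus between variance 1
sigma_x = sqrt(0.25)               # binary predictor with equal groups

slope = r * sigma_y / sigma_x
print(round(slope, 4))
```

The slope equals the raw difference between the two group means, as expected for a 0/1 predictor.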

## relations: a variance perspective

• between and within group variance → relation observations & categorization
• regression coefficient (t-test / regression, fixed model single regression coef)
• use reference example, regression style
• variance within 4$^2$ and between 1$^2$, totaling $\sigma_y^2$ = 17
• [1] calculate sample size, compare effect sizes ?
• [2] what if also other predictors in the model ?
• [1] 128, same as for reference example, now with f$^2$ = .25$^2$ = .0625.

• [2] loss of degree of freedom, very little impact

• note: $f^2={R^2/{(1-R^2)}}$
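The effect size in [1] can be derived from the variance components. A short sketch in Python with illustrative variable names:

```python
var_between, var_within = 1 ** 2, 4 ** 2

r_squared = var_between / (var_between + var_within)  # 1/17 explained
f_squared = r_squared / (1 - r_squared)                # explained versus unexplained

print(round(r_squared, 4), round(f_squared, 4))
```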

## more groups to compare, 4 cases

• simple example: assume one control, and two treatments
• if more than two groups, several options
• test whether at least one differs → omnibus F-test (variances)
• test whether all differ from each other → pairwise comparisons
• test whether selected pairs differ → contrast (t-test)
• test whether linear combinations of pairs differ → contrasts (t-tests)
eg., control versus the average of the treatments

• if multiple tests → inflation of type I error ($\alpha$)
• correct $\alpha$ or p-value, eg., using Bonferroni
• make more tentative inferences

## F-test statistic

• multiple groups → not one effect size d
• F-test statistic & effect size f
• f is ratio of standard deviations $\sigma_{between} / \sigma_{within}$, so $f^2$ is the ratio of variances $\sigma_{between}^2 / \sigma_{within}^2$
• example: one control and two treatments
• reference example + 1 group
• within group observations normally distributed
• means C=0, T1=2 and T2=4
• sd for all groups (C,T1,T2) = 4

## more groups: omnibus

• for one control and two treatments → test that at least one differs
• one-way Anova (F-test / Means, ANOVA - fixed effects, omnibus, one way)
• effect size f, with numerator/denominator df (derived from $\eta^2$)
• start from reference example,
• [1] what is the sample size ? ncp ? critical F ? does size matter ?
• [2] set extra group, either mean 1 or 4, what are the effect / sample sizes ?
and with the mean of middle group away from global mean ?
• [3] derive effect size with variance between .5 and within 3, 1 and 6 ?
• [1] different effect size (f), distribution, same sample size 128 (size ~ imbalance)

• [2] n=237 or 63, with ncp 9.87 or 10.5, middle group obscures and vice versa
effect size increases, picks up difference 0 and 4!

• [3] same effect size, so, same sample size, 63, ncp 10.5 (1/7th explained)
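The effect sizes in [2] can be reproduced from the group means. The sketch below assumes equally sized groups and takes the non-centrality parameter of the omnibus F-test as $f^2$ times the total sample size; the function names are illustrative.

```python
from math import sqrt

def cohen_f(means, sd_within):
    """Effect size f for equally sized groups: SD of the means over the within-group SD."""
    grand_mean = sum(means) / len(means)
    var_between = sum((m - grand_mean) ** 2 for m in means) / len(means)
    return sqrt(var_between) / sd_within

def ncp(f, total_n):
    """Non-centrality parameter of the omnibus F-test."""
    return f ** 2 * total_n

f_far = cohen_f([0, 2, 4], 4)   # extra group at mean 4
f_mid = cohen_f([0, 2, 1], 4)   # extra group at mean 1
print(round(f_far, 4), ncp(f_far, 63))
print(round(f_mid, 4), ncp(f_mid, 237))
```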

## more groups: pairwise

• assume one control, and two treatments
• interested in all three pairwise comparisons → maybe Tukey
• typically run aposteriori, after omnibus shows effect
• use t-test with correction of $\alpha$ for multiple testing
• apply Bonferroni correction for original 3 group example
• [1] resulting sample size for three tests ?
• [2] what if biggest difference is ignored, sample size ?
• [3] with original 64 sized groups, what is the power ?
• [1] divide $\alpha$ by 3 (86*2) → overall 86*3 = 258

• [2] or divide by 2 (78*2) (biggest difference implied) → overall 78*3 = 234

• [3] .6562 when /3 or .7118 when /2, power-loss

## sample size calculation benefit from focus

• better to focus during the design on specific questions

• only consider the main comparisons in focus (eg. primary endpoints)
• only interested in comparing two treatments → t-test
• only consider smallest of relevant effects, largest sample size
• set up contrasts (next slide)
• sample size calculations (design)

• not necessarily equivalent to statistics
• requires justification to convince yourself and/or reviewers
• example:

• statistics: group difference evolution 4 repeated measurements → mixed model
• power: difference treatment and control last time point → t-test

## more groups: contrasts

• assume one control and two treatments
• set up 2 contrasts for T1 - C and T2 - C
• set up 1 contrast for average(T1,T2) - C
• each contrast requires 1 degree of freedom
• each contrast combines a specific number of levels
• effect sizes for planned comparisons must be calculated !!
• contrasts (linear combination)
• standard deviation of contrasts

$\sigma_{contrast} = \frac{|\sum{\mu_i * c_i}|}{\sqrt{N \sum_i^k c_i^2 / n_i}}$
with group means $\mu_i$, pre-specified coefficients $c_i$, sample sizes $n_i$ and total sample size $N$

## more groups: contrasts exercise

• one-way ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
• obtain effect sizes for contrasts (assume equally sized for convenience)
• $\sigma_{contrast}$ T1-C: $\frac{(-1*0 + 1*2 + 0*4)}{\sqrt{2*((-1)^2+1^2+0^2)}} = 1$; T2-C: $~= 2$; (T1+T2)/2-C: $~= 1.4142$
• with $\sigma$ = 4 → ratio of standard deviations for effect sizes f .25, .5, .3536
• sample size for each contrast, each 1 df and 2 groups
• [1] contrasts nrs. 1 or 2
• [2] contrasts nrs. 1 AND 2
• [3] contrasts nr. 3
• [1] total sample size 128 (again!!) as $d=2f$, 64 C and 64 T1, or 34 = 17 C and 17 T2

• [2] same with Bonferroni correction → 155 and 41 → 78 C, 78 T1, 21 T2

• [3] total sample size 65 → 22 in each group
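The contrast effect sizes can be recomputed as follows. This is a sketch under one explicit assumption: for equally sized groups, $N / n_i$ is taken as the number of groups with a non-zero coefficient, which is what reproduces the values 1, 2 and 1.4142 of the exercise.

```python
from math import sqrt

def sigma_contrast(means, coefs):
    """Standard deviation of a contrast for equally sized groups.

    Assumption: N / n_i is the number of groups with a non-zero coefficient.
    """
    groups_involved = sum(1 for c in coefs if c != 0)
    numerator = abs(sum(m * c for m, c in zip(means, coefs)))
    return numerator / sqrt(groups_involved * sum(c ** 2 for c in coefs))

means, sd = [0, 2, 4], 4   # C, T1, T2
contrasts = {
    "T1-C": [-1, 1, 0],
    "T2-C": [-1, 0, 1],
    "(T1+T2)/2-C": [-1, 0.5, 0.5],
}
for name, coefs in contrasts.items():
    s = sigma_contrast(means, coefs)
    print(name, round(s, 4), round(s / sd, 4))  # sigma_contrast and effect size f
```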

## multiple factors

• multiway ANOVA (F-test / Means, ANOVA-fixed effects,special,main,interaction)
• multiple main effects and interaction effects
• interaction: group specific difference between groups
• degrees of freedom (#A-1)*(#B-1)
• main effects: if no interaction (#X-1)
• get effect sizes for two way anova
https://icds.shinyapps.io/effectsizes/
• sample size for reference example, assume a second predictor is trivial
• [1] what is partial $\eta^2$ ?
• [2] sample size ?
• [1] .0588 (0-2, sd=4)

• [2] 128 again, with 2 groups (158 with 3 groups, df=2)

## dependence within groups (repeated)

• if repeated measures → account for correlations
• repeated measures (F-test / Means, repeated measures...)
• 3 main types
• within: like dependent t-test for 2 or more measurements
• between: use of multiple measurements per group
• interaction: difference of change over groups
• correlation within subject (unit)
• informative on group differences within subject
• redundancy for between group differences

## repeated measures within

• possible to have only 1 group (within subject comparison)
• use effect size f = .25 (1/16 explained versus unexplained)
• [1] use zero correlation to compare with sample size independent t-test
• [2] for one group use correlation .5, compare sample size dependent t-test
• [3] double number of groups to 2
• [4] double number of measurements to 4 (correlation 0 and .5), impact ?
• number of groups = 1, number of measurements = 2, sample size = [1] 65 and [2] 34
• [3] changed degrees of freedom, sample size could have changed (power, crit. F)
• [4] impact of more measurements bigger with higher correlation

## repeated measures between

• use effect size f = .25 (1/16 for variance or 2/4 for means)
• [1] use correlation 0 and .5 with 2 groups and 2 measurements, sample size ?
• [2] for correlation .5, compare 2 or 4 measurements, sample size ?
• [3] double number of groups to 2
• [1] sample size higher when higher correlation (66x2=132 for 0, 98x2=196 for .5)
• [2] sample size lower when more measurements, unless correlation is 1 (82x2=164)
• [3] more groups require higher sample size

## repeated measures within x between

• use effect size f = .25 (1/16 for variance)
• [1] use correlation 0, compare 2 groups 2 measurements with rep. between ?
• [2] use correlation 0.5, compare 2 groups 2 measurements with rep. within ?
• [3] use correlation .5, compare 2 groups and 4 measurements, sample size ?
• [4] repeat with 4 groups and 4 measurements, sample size ?
• [1] same with indep and [2] same with dependent
• [3] more groups, higher sample size (identical to within)
• [4] difference between within and between

## correlations

• when comparing two independent correlations
• z-tests / correlation & regressions: 2 indep. Pearson r's
• makes use of Fisher Z transformations → z = .5 * log($\frac{1+r}{1-r}$) → q = z1-z2
• [1] assume correlation coefficients .7844 and .5 effect size & sample size ?
• [2] assume .9844 and .7, effect size & sample size ?
• [3] assume .1 and .3844 effect size & sample size ?
• [1] effect size q = 0.5074, sample size 64*2 = 128

• [2] effect size q = 1.5556, sample size 10*2 = 20, same difference, bigger effect

• [3] effect size q = -0.3048, sample size 172*2 = 344, negative and smaller effect

• note that dependent correlations are more difficult, see manual
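The effect size q for the three exercises can be recomputed from the Fisher Z transformation. A minimal sketch in Python with illustrative function names; small differences in the last decimal compared to the values above can occur through rounding.

```python
from math import log

def fisher_z(r):
    """Fisher Z transformation of a correlation coefficient."""
    return 0.5 * log((1 + r) / (1 - r))

def effect_size_q(r1, r2):
    """Difference between two Fisher-transformed correlations."""
    return fisher_z(r1) - fisher_z(r2)

# same raw difference in r, very different effect sizes
for r1, r2 in [(0.7844, 0.5), (0.9844, 0.7), (0.1, 0.3844)]:
    print(r1, r2, round(effect_size_q(r1, r2), 4))
```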

## proportions

• comparing two independent proportions → bounded between 0 and 1
• Fisher Exact Test (exact / proportions, difference 2 independent proportions)
• effect sizes in odds ratio, relative risk, difference proportion
• [1] for odds ratio 2, p2 = .60, what is p1 ?
• [2] sample size for equal sized, and type I and II .05 and .2 ?
• [3] sample size when .95 and .8 (difference of .15) and .05 and .2 ?
• [1] odds ratio 2 * (.6/.4) = 3 (odds), 3/(3+1) = .75
• [2] total sample size 328, [3] total sample size 164, either at .05 or .95
• treat as if unbounded, ok within .2 - .8, variance is p*(1-p) → maximally .25 !!

• [4] use t-test for difference of .15
• [4] effect size .3, sample size 352 (> 328)
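The steps in [1] and [4] can be written out as a short calculation. A sketch in Python; the function name is illustrative, and the SD of .5 is the conservative maximum for a proportion.

```python
def p1_from_odds_ratio(odds_ratio, p2):
    """Proportion p1 implied by an odds ratio relative to p2."""
    odds1 = odds_ratio * p2 / (1 - p2)   # odds for group 1
    return odds1 / (odds1 + 1)           # odds back to proportion

p1 = p1_from_odds_ratio(2, 0.60)
difference = p1 - 0.60

# effect size when treating proportions as unbounded, with conservative SD .5
d = difference / 0.5
print(round(p1, 2), round(difference, 2), round(d, 2))
```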

## proportions exercise

• Fisher Exact Test
• power over proportions .5 to 1
• 5 curves, sample sizes 328, 428, 528...
• type I error .05
• [1] what happens ?
• [2] repeat for one-tailed, what is different ?
• [1] power for proportion compared to reference .6, sample size determines impact

• [2] one-tailed, increases power, both sides (absolute value difference)
tail to choose on the correct side

## dependent proportions

• when comparing two dependent proportions

• McNemar test (exact / proportions, difference 2 dependent proportions)

• include correlations implicitly, discordant pairs → change
• effect size as odds ratio → ratio of discordance ?!
• assume odds ratio equal to 2, equal sized, type I and II errors .05 and .2, two-tailed

• [1] what is the sample size for .25 proportion discordant, and [2] .5, and [3] 1
• [4] for odds ratio 4 or .25, how the proportion p12 and p21 change ?
• [5] repeat for third alpha option, and consider total sample size, what happens ?
• [1] total sample size 288 & [2] 144 & [3] impossible, but limits to 72

• [4] for proportion discordant, 1 to 4 or 4 to 1

• [5] sample size differs because side effects

## conclusion: keep it simple, keep it real

• sample size calculation is a design issue, not a statistical one
• building blocks: sample & effect sizes, type I & II errors, each conditional on rest
• effect sizes express the amount of signal compared to the background noise
• complex models imply complex sample size calculations, if at all possible
• GPower deals with not too complex models
• simplify using a focus, if justifiable → then GPower can get you a long way
• use more complex specification for more complex sample size calculations
• leave GPower, simulation is always an option
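As an illustration of the simulation option, the reference example can be simulated directly. This is a minimal sketch in Python (standard library only), assuming normally distributed observations and using the critical t of 1.979 reported by GPower for 126 degrees of freedom; the function name, number of simulations and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, variance

def simulate_power(n=64, diff=2.0, sd=4.0, t_crit=1.979, n_sim=2000, seed=1):
    """Share of simulated studies with a significant two-tailed t-test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        treated = [rng.gauss(diff, sd) for _ in range(n)]
        # pooled variance for two equally sized groups
        pooled_var = (variance(control) + variance(treated)) / 2
        t = (mean(treated) - mean(control)) / sqrt(pooled_var * 2 / n)
        if abs(t) > t_crit:
            hits += 1
    return hits / n_sim

estimated_power = simulate_power()
print(estimated_power)
```

The estimate should land close to .80, with some simulation error. The same approach extends to designs GPower does not cover, by replacing the data generation and the test.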