Data Wrangling with R’s tidyverse
Compiled Sep 06, 2020
Current draft aims to introduce researchers to data manipulation in R with the dplyr
, tidyr
, and stringr
packages of the tidyverse
ecosystem.
Our target audience is primarily the research community at VUB / UZ Brussel, those who have some basic experience in R and want to know more.
We invite you to help improve this document by sending us feedback
wilfried.cools@vub.be or anonymously at icds.be/consulting (right side, bottom)
Data manipulation is inherent to data analysis, not just a precursor.
Data manipulation is best done with coding (as opposed to manual changes), provides the best guarantee to.
Data manipulation is easier and more intuitive when maintaining tidy data.
Focus in current draft is on R.
In particular, focus on the tidyverse
package (Hadley Wickham etal.), an ecosystem that includes.
dplyr
for manipulating data frames [main focus]tidyr
for tidying data [check Data Representation]stringr
for dealing with textsreadr
for reading in data [highlighted]tibble
for data representation [highlighted]forcats
for dealing with factorsggplot
for visualizing data [separate draft]purrr
for functional programming (advanced)
Find convenient cheat sheets at https://rstudio.com/resources/cheatsheets/.
Package tidyverse
is a very good extension of base R.
Install (at least once) and load (once per R session) the tidyverse
package.
install.packages('tidyverse')
The individual packages that are loaded are listed, as are their conflicts.
library(tidyverse)
Conflicts arise loaded packages use the same function names. Resolve such conflicts for example by referencing the package with ::
, eg., stat::filter( )
.
Conflicts can be checked for tidyverse.
tidyverse_conflicts( )
| -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
| x dplyr::filter() masks stats::filter()
| x dplyr::lag() masks stats::lag()
The tidyverse
ecosystem includes
broom, cli, crayon, dbplyr, dplyr, forcats, ggplot2, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse.
tidyverse_packages( )
Most of these packages should be loaded explicitly (not included in library(tidyverse)
).
Data: the tibble
is the tidyverse data type, defined in the tibble
package.
dataframe
, a list of equally sized vectors
tibble
is a dataframe
, enhanced for convenience and consistency
tibble
with tribble( )
functionclass( )
shows both data.frame
and tbl_df
(mytibble <- tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
))
| # A tibble: 3 x 2
| colA colB
| <chr> <dbl>
| 1 a 1
| 2 b 2
| 3 c 3
class(mytibble)
| [1] "tbl_df" "tbl" "data.frame"
- compare with dataframe:
mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)
| [1] "data.frame"
Process: the magrittr
package offer pipes %>%
mean(mytibble$colB)
mytibble %>% summarize(mean(colB))
x1 <- rnorm(10); x2 <- rnorm(10)
sqrt(sum((x1-x2)^2))
| [1] 4.174785
(x1-x2)^2 %>% sum( ) %>% sqrt( )
| [1] 4.174785
Exemplary data (part of base R), mtcars
, are used.
data( )
functionstr( )
and the first 6 observations with head( )
functiondata(mtcars)
str(mtcars)
| 'data.frame': 32 obs. of 11 variables:
| $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
| $ disp: num 160 160 108 258 360 ...
| $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
| $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
| $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
| $ qsec: num 16.5 17 18.6 19.4 17 ...
| $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
| $ am : num 1 1 1 0 0 0 0 0 0 0 ...
| $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
| $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
| Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
| Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: calling data( )
without arguments shows all the available data currently in reach.
tidyverse
look at the data works with glimpse( )
glimpse(mtcars)
| Observations: 32
| Variables: 11
| $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, ...
| $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, ...
| $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 16...
| $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, ...
| $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, ...
| $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3....
| $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18...
| $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, ...
| $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, ...
| $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, ...
| $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, ...
Exemplary data manipulation is depicted.
mtcars
data,
mtcars %>%
select(mpg, cyl, hpow=hp, am) %>%
filter(mpg > 15) %>%
mutate(mpgr = mpg/hpow) %>%
group_by(cyl, am) %>%
summarize(min=min(mpgr)) %>%
pivot_wider(names_from=am,values_from=min) %>%
select(cyl,am0=`0`,am1=`1`)
| # A tibble: 3 x 3
| # Groups: cyl [3]
| cyl am0 am1
| <dbl> <dbl> <dbl>
| 1 4 0.222 0.196
| 2 6 0.145 0.113
| 3 8 0.0844 0.0598
The dplyr
package (from tidyverse) is of interest.
The main -verbs- (see example above)
The verb to structure data (see example above)
The verbs to enhance control on scope (advanced)
Additional dplyr verbs:
Verbs to extend data
Final comment: only the core of dplyr is discussed, much more is possible and you will find on the Net.
Grouping prepares data for group specific operations.
tst <- mtcars %>% group_by(am,vs)
glimpse(tst,width=100)
| Observations: 32
| Variables: 11
| Groups: am, vs [4]
| $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2...
| $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4...
| $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8...
| $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, ...
| $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07...
| $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070...
| $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40...
| $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1...
| $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1...
| $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5...
| $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2...
n( )
)tst %>% summarize(n( ))
| # A tibble: 4 x 3
| # Groups: am [2]
| am vs `n()`
| <dbl> <dbl> <int>
| 1 0 0 12
| 2 0 1 7
| 3 1 0 6
| 4 1 1 7
ungroup( )
, good practice to avoid side effectstst <- tst %>% ungroup( )
tst %>% summarize(n( ))
| # A tibble: 1 x 1
| `n()`
| <int>
| 1 32
.add
and .drop
argumentsmpg
in 3 groupstst <- mtcars %>% group_by(mpg3 = cut(mpg, 3))
tst %>% summarize(n( ))
| # A tibble: 3 x 2
| mpg3 `n()`
| <fct> <int>
| 1 (10.4,18.2] 14
| 2 (18.2,26.1] 13
| 3 (26.1,33.9] 5
Filtering returns rows using matching conditions.
mpg
could be set with a minimum of 30mtcars %>% filter(mpg > 30)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
| Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
| Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
| Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
mpg
above 20 AND qsec
below or equal to 18.mtcars %>% filter(mpg > 20, qsec <= 18)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
| Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
| Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
mpg
above 30 OR qsec
below 20 AND am
equal to 0.mtcars %>% filter(mpg > 30 | qsec > 20, am==0)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
| Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
| Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
slice_head( )
), for each gender (group your data) !?slice_head
?slice_sample( )
to randomly select 5 observations !slice_max( )
to select 3 observations with highest values on height
!mass
and get the top 3 for each species !
Extract columns (variables) by name, rename and/or reorder them.
mpg
could be selected.mtcars %>% select(mpg)
| mpg
| Mazda RX4 21.0
| Mazda RX4 Wag 21.0
| Datsun 710 22.8
| Hornet 4 Drive 21.4
| Hornet Sportabout 18.7
| Valiant 18.1
| Duster 360 14.3
| Merc 240D 24.4
| Merc 230 22.8
| Merc 280 19.2
| Merc 280C 17.8
| Merc 450SE 16.4
| Merc 450SL 17.3
| Merc 450SLC 15.2
| Cadillac Fleetwood 10.4
| Lincoln Continental 10.4
| Chrysler Imperial 14.7
| Fiat 128 32.4
| Honda Civic 30.4
| Toyota Corolla 33.9
| Toyota Corona 21.5
| Dodge Challenger 15.5
| AMC Javelin 15.2
| Camaro Z28 13.3
| Pontiac Firebird 19.2
| Fiat X1-9 27.3
| Porsche 914-2 26.0
| Lotus Europa 30.4
| Ford Pantera L 15.8
| Ferrari Dino 19.7
| Maserati Bora 15.0
| Volvo 142E 21.4
# mtcars$mpg
dplyr
use pull( )
mtcars %>% pull(mpg)
| [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
| [18] 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
qsec
and mpg
(top 6 observations).mtcars %>% select(qsec,mpg) %>% head( )
| qsec mpg
| Mazda RX4 16.46 21.0
| Mazda RX4 Wag 17.02 21.0
| Datsun 710 18.61 22.8
| Hornet 4 Drive 19.44 21.4
| Hornet Sportabout 17.02 18.7
| Valiant 20.22 18.1
mtcars %>% select(3,1) %>% head( )
| disp mpg
| Mazda RX4 160 21.0
| Mazda RX4 Wag 160 21.0
| Datsun 710 108 22.8
| Hornet 4 Drive 258 21.4
| Hornet Sportabout 360 18.7
| Valiant 225 18.1
mtcars %>% select(-c(3:6)) %>% head( )
| mpg cyl qsec vs am gear carb
| Mazda RX4 21.0 6 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 17.02 0 1 4 4
| Datsun 710 22.8 4 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 17.02 0 0 3 2
| Valiant 18.1 6 20.22 1 0 3 1
- making use of helper functions, selections can be more automated.
- use partial string matching - directly with contains( )
- using regular expressions with matches( )
.
- example: extract columns with names that include the string ar
(show 6).
mtcars %>% select(contains('ar')) %>% head( )
| gear carb
| Mazda RX4 4 4
| Mazda RX4 Wag 4 4
| Datsun 710 4 1
| Hornet 4 Drive 3 1
| Hornet Sportabout 3 2
| Valiant 3 1
ar
but with at least one element before and after it (show 6).mtcars %>% select(matches('.ar.')) %>% head( )
| carb
| Mazda RX4 4
| Mazda RX4 Wag 4
| Datsun 710 1
| Hornet 4 Drive 1
| Hornet Sportabout 2
| Valiant 1
rename( )
cyl
into cyl468
to reflect its values, same for vs
and am
, and select it together with mpg
(show 6).mtcars %>% select(mpg,cyl468=cyl,vs01=vs,am01=am) %>% head( )
| mpg cyl468 vs01 am01
| Mazda RX4 21.0 6 0 1
| Mazda RX4 Wag 21.0 6 0 1
| Datsun 710 22.8 4 1 1
| Hornet 4 Drive 21.4 6 1 0
| Hornet Sportabout 18.7 8 0 0
| Valiant 18.1 6 1 0
cyl
, vs
and am
as before but without selection (show 6).mtcars %>% rename(cyl468=cyl,vs01=vs,am01=am) %>% head( )
| mpg cyl468 disp hp drat wt qsec vs01 am01 gear carb
| Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
| Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
| Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
select( )
will include the grouping variables by default.
group_cols( )
functionvs
and am
, and extract only those columns.mtcars %>% group_by(vs,am) %>% select(group_cols( ))
| # A tibble: 32 x 2
| # Groups: vs, am [4]
| vs am
| <dbl> <dbl>
| 1 0 1
| 2 0 1
| 3 1 1
| 4 1 0
| 5 0 0
| 6 1 0
| 7 0 0
| 8 1 0
| 9 1 0
| 10 1 0
| # ... with 22 more rows
:
operator for consecutive columns hair and eye color !color
(check help files on helper functions, use ?language) !homeworld
to home_world
!rename( )
function !where( )
and is.numeric( )
!height
, mass
and/or size
, with any_of( )
!
Create new variables based on existing ones.
mpg2
is the mpg
value squared (show 6).mtcars %>% mutate(mpg2=mpg^2) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb mpg2
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 441.00
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 441.00
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 519.84
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 457.96
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 349.69
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 327.61
mpg
variable is overwritten with its value squared (show 6).mtcars %>% mutate(mpg=mpg^2) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb
| 1 441.00 6 160 110 3.90 2.620 16.46 0 1 4 4
| 2 441.00 6 160 110 3.90 2.875 17.02 0 1 4 4
| 3 519.84 4 108 93 3.85 2.320 18.61 1 1 4 1
| 4 457.96 6 258 110 3.08 3.215 19.44 1 0 3 1
| 5 349.69 8 360 175 3.15 3.440 17.02 0 0 3 2
| 6 327.61 6 225 105 2.76 3.460 20.22 1 0 3 1
NEWVAR
is the mpg
value multiplied by the vs value (show 6).mtcars %>% mutate(NEWVAR=mpg*vs) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.0
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.0
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.0
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1
NEWVAR
is the mpg
value multiplied by the vs value and this new variable is divided by the disp
variable (show 6).mtcars %>% mutate(NEWVAR=mpg*vs,NEWVAR2=NEWVAR/disp) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR NEWVAR2
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.0 0.00000000
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.0 0.00000000
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8 0.21111111
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4 0.08294574
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.0 0.00000000
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1 0.08044444
transmute( )
selects them too.mtcars %>% transmute(NEWVAR=mpg*vs,NEWVAR2=NEWVAR/disp) %>% head( )
| NEWVAR NEWVAR2
| 1 0.0 0.00000000
| 2 0.0 0.00000000
| 3 22.8 0.21111111
| 4 21.4 0.08294574
| 5 0.0 0.00000000
| 6 18.1 0.08044444
mpg
using cumsum( )
(show 6).mtcars %>% mutate(NEWVAR=cumsum(mpg)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 42.0
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 64.8
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 86.2
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 104.9
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 123.0
mpg
is between 20 and 22 (show 6).mtcars %>% mutate(NEWVAR=between(mpg,20,22)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 TRUE
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 TRUE
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 FALSE
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 TRUE
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 FALSE
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 FALSE
mpg
values, with the rownumber( )
function. When arranged by mpg
this is more clear.mtcars %>% mutate(id=row_number(mpg)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb id
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 20
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 24
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 15
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 14
mtcars %>% mutate(id=row_number(mpg)) %>% arrange(mpg) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb id
| 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 1
| 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 2
| 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
| 4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 4
| 5 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 5
| 6 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 6
vs
and am
, and mutate only those columns.mtcars %>% group_by(vs,am) %>% mutate(id=row_number(mpg)) %>% head( )
| # A tibble: 6 x 12
| # Groups: vs, am [4]
| mpg cyl disp hp drat wt qsec vs am gear carb id
| <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
| 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 4
| 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 5
| 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 2
| 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 4
| 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 11
| 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2
vs
and am
, there will be a 1 (first), 2 (second)… for id
.
height_m
with height
divided by 100 !mass
/ height_m
to the power 2 !height_m
and BMI
!male
and female
labels with m
and f
(use recode( )
) !ifelse( )
))!
Reduce sets of values into their summaries, based on grouped data.
mpg
values can be obtained.mtcars %>% summarize(myAverage=mean(mpg))
| myAverage
| 1 20.09062
mpg
values can be obtainedmtcars %>% summarize(myAvMpg=mean(mpg),mySdMpg=sd(mpg),myAvDisp=mean(disp),mySdDisp=sd(disp))
| myAvMpg mySdMpg myAvDisp mySdDisp
| 1 20.09062 6.026948 230.7219 123.9387
summarize( )
.
mpg
values can be obtained for each level of vs
mtcars %>% group_by(vs) %>% summarize(myAverage=mean(mpg))
| # A tibble: 2 x 2
| vs myAverage
| <dbl> <dbl>
| 1 0 16.6
| 2 1 24.6
mpg
values can be obtained, for multiple variablesmtcars %>% group_by(vs,am) %>% summarize(myAvMpg=mean(mpg),mySdMpg=sd(mpg),myAvDisp=mean(disp),mySdDisp=sd(disp))
| # A tibble: 4 x 6
| # Groups: vs [2]
| vs am myAvMpg mySdMpg myAvDisp mySdDisp
| <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
| 1 0 0 15.0 2.77 358. 71.8
| 2 0 1 19.8 4.01 206. 95.2
| 3 1 0 20.7 2.47 175. 49.1
| 4 1 1 28.4 4.76 89.8 18.8
vs
, can be obtained with n( )
, or using the special verb count( )
.mtcars %>% group_by(vs) %>% count( )
| # A tibble: 2 x 2
| # Groups: vs [2]
| vs n
| <dbl> <int>
| 1 0 18
| 2 1 14
mtcars %>% group_by(vs) %>% summarize(mycount=n( ))
| # A tibble: 2 x 2
| vs mycount
| <dbl> <int>
| 1 0 18
| 2 1 14
vs
and am
can be obtained with n_distinct( )
nth( )
mtcars %>% group_by(vs,am) %>% summarize(nrDist=n_distinct(mpg),`3th`=nth(mpg,3))
| # A tibble: 4 x 4
| # Groups: vs [2]
| vs am nrDist `3th`
| <dbl> <dbl> <int> <dbl>
| 1 0 0 10 16.4
| 2 0 1 5 26
| 3 1 0 7 24.4
| 4 1 1 6 30.4
height
into the average height (some missing values need to be dealt with, check ?mean) !species
and sex
, and include the average mass
!
The across( )
function allows for selection of variables within the summarize( )
or mutate( )
function.
accross( )
replace the earlier functions *_at, *_if and *_all.am
and vs
into a factor before calling the structure with str( )
mtcars %>% select(mpg,cyl,am,vs) %>% mutate(across(c('am','vs'),factor)) %>% str( )
| 'data.frame': 32 obs. of 4 variables:
| $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
| $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
| $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
cyl
and vs
with a :
operatormtcars %>% select(mpg,cyl,am,vs) %>% mutate(across(cyl:vs,factor)) %>% str( )
| 'data.frame': 32 obs. of 4 variables:
| $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
| $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
| $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
ar
mtcars %>% select(mpg,cyl,gear,carb) %>% mutate(across(contains("ar"),factor)) %>% str( )
| 'data.frame': 32 obs. of 4 variables:
| $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
| $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
| $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
accross( )
function allows for applying a list of functions
descr <- list(
md = ~median(.x, na.rm = TRUE),
av = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE)
)
mtcars %>% mutate(across(c(1,3), descr)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb mpg_md mpg_av mpg_sd disp_md
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19.2 20.09062 6.026948 196.3
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 19.2 20.09062 6.026948 196.3
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 19.2 20.09062 6.026948 196.3
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 19.2 20.09062 6.026948 196.3
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 19.2 20.09062 6.026948 196.3
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 19.2 20.09062 6.026948 196.3
| disp_av disp_sd
| 1 230.7219 123.9387
| 2 230.7219 123.9387
| 3 230.7219 123.9387
| 4 230.7219 123.9387
| 5 230.7219 123.9387
| 6 230.7219 123.9387
select( )
, selections can be more automated.
all_of( )
, where( )
, matches( )
, starts_with( )
mutate( )
and summarize( )
where( )
) into their minimum and maximum (some missing values need to be dealt with) !
Different datafiles can be combined into one datafile using common variables that serve as key (cfr. relational databases).
mtcyl
, with a 2 cylinder but no 8 cylinder unlike the mtcars
(4,6,8)mtcyl <- tribble(
~cyl,~type,
2,'small',
4,'medium',
6,'large'
)
mtcars
and mtcyl
but ignore the irrelevant cyl
equal to 2 (not part of mtcars
), with a left_join( )
cyl
equal to 8 turns out missing, because it is not specified in the -right- datafilemtcars %>% left_join(mtcyl) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 large
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 <NA>
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 large
mtcars
and mtcyl
but ignore the cyl
equal to 8 because it lacks information on type, with a right_join( )
cyl
equal to 2 is included, but turns out missing for most variables because it is not specified in the -left- datafilemtcars %>% right_join(mtcyl) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 large
| 5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 large
| 6 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
mtcars
and mtcyl
for only those observations with the linking variable cyl
in both files, with an right_join( )
mtcars %>% inner_join(mtcyl) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 large
| 5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 large
| 6 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
mtcars
and mtcyl
keeping all available information, with a full_join( )
showing selected rows 1 to 3, 5, 7 and 33mtcars %>% full_join(mtcyl) %>% slice(c(1:3,5,7,33))
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 <NA>
| 5 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 <NA>
| 6 NA 2 NA NA NA NA NA NA NA NA NA small
semi_join( )
, nest_join( )
, anti_join()
, which are described in the help files.
band_members
and band_instruments
are probably loaded into your workspace automatically as part of the tidyverse !semi_join( )
and anti_join( )
and interpret what happens.
Compare the structure of the mtcars
data with a glimpse at that data.
Compare a select of mpg
with a pull of mpg
.
Check the help file and pull out the second before last column.
Select all columns except the am
.
Select all columns except the am
and vs
.
Keep only columns mpg
, cyl
and disp
, but rename mpg
to miles_gallon
.
Insist, keep only columns mpg
, cyl
and disp
, but rename mpg
to miles per gallon
.
Keep only the consecutive columns in between disp
and wt
, in addition to mpg
as a last column, use a :
.
Create a variable for the row names.
Change the mpg
(miles per gallon) into kpl
(kilometers per liter) with 1 mpg is 0.425 km/l, using mutate( )
.
Select about 10% of the observations, twice, check the help file on using sample_frac( )
.
note that the matrix way could be:
Select the 10th to 15th row, check the help file on using slice( )
.
Select the distinct combinations only, for variables am
and vs
.
Check the help files to determine how to keep all variables (for each first observation of that combination).
Filter the data to retain only cases with mpg > 20 and hp above or equal to 110.
Filter the data to retain only the Datsun 710.
Exemplary data are read in using a copy-paste and tidied.
read_delim( )
data are pivoted and disentangled using separate( )
and unite( )
repeated.txt
, a tab-delimited text file, by copy-pastingmyrepeated <- read_delim(clipboard(),delim='\t')
| # A tibble: 3 x 7
| id `t1 score` `t1 posit` `t2 score` `t2 posit` `t3 score` `t3 posit`
| <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
| 1 id1 1 x NA y 4 x
| 2 id2 2 y 3 x NA <NA>
| 3 id3 1 x 2 y 5 x
id
and scores at time points t1
to t3
, and pivot to get scores spread over rows, with times named type
id
and positions at time points t1
to t3
and pivot to get positions spread over rows, with times named type
time
join
, using id as key, but first eliminate at least one of the inconsistent type
variables| # A tibble: 9 x 3
| id type score
| <chr> <chr> <dbl>
| 1 id1 t1 score 1
| 2 id1 t2 score NA
| 3 id1 t3 score 4
| 4 id2 t1 score 2
| 5 id2 t2 score 3
| 6 id2 t3 score NA
| 7 id3 t1 score 1
| 8 id3 t2 score 2
| 9 id3 t3 score 5
| # A tibble: 9 x 3
| id type posit
| <chr> <chr> <chr>
| 1 id1 t1 posit x
| 2 id1 t2 posit y
| 3 id1 t3 posit x
| 4 id2 t1 posit y
| 5 id2 t2 posit x
| 6 id2 t3 posit <NA>
| 7 id3 t1 posit x
| 8 id3 t2 posit y
| 9 id3 t3 posit x
| # A tibble: 9 x 4
| id time type score
| <chr> <chr> <chr> <dbl>
| 1 id1 t1 score 1
| 2 id1 t2 score NA
| 3 id1 t3 score 4
| 4 id2 t1 score 2
| 5 id2 t2 score 3
| 6 id2 t3 score NA
| 7 id3 t1 score 1
| 8 id3 t2 score 2
| 9 id3 t3 score 5
| # A tibble: 9 x 4
| id time type posit
| <chr> <chr> <chr> <chr>
| 1 id1 t1 posit x
| 2 id1 t2 posit y
| 3 id1 t3 posit x
| 4 id2 t1 posit y
| 5 id2 t2 posit x
| 6 id2 t3 posit <NA>
| 7 id3 t1 posit x
| 8 id3 t2 posit y
| 9 id3 t3 posit x
| # A tibble: 9 x 5
| id time score type posit
| <chr> <chr> <dbl> <chr> <chr>
| 1 id1 t1 1 posit x
| 2 id1 t2 NA posit y
| 3 id1 t3 4 posit x
| 4 id2 t1 2 posit y
| 5 id2 t2 3 posit x
| 6 id2 t3 NA posit <NA>
| 7 id3 t1 1 posit x
| 8 id3 t2 2 posit y
| 9 id3 t3 5 posit x
| # A tibble: 7 x 4
| id time score posit
| <chr> <chr> <dbl> <chr>
| 1 id1 t1 1 x
| 2 id1 t3 4 x
| 3 id2 t1 2 y
| 4 id2 t2 3 x
| 5 id3 t1 1 x
| 6 id3 t2 2 y
| 7 id3 t3 5 x
scores <- myrepeated %>% select(id,`t1 score`,`t2 score`,`t3 score`) %>%
pivot_longer(-id,names_to='type',values_to='score') %>%
separate(type,c('time','type')))
positions <- myrepeated %>% select(id,`t1 posit`,`t2 posit`,`t3 posit`) %>%
pivot_longer(-id,names_to='type',values_to='posit') %>%
separate(type,c('time','type')))
longform <- scores %>%
select(-type) %>%
full_join(positions)) %>%
select(-type) %>%
filter(!is.na(score),!is.na(posit)))
It is possible to switch back to a wider data representation, for example to calculate correlations. Maybe fill in the missing values NA as 0 values.
| # A tibble: 3 x 6
| id t1_x t3_x t1_y t2_x t2_y
| <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
| 1 id1 1 4 NA NA NA
| 2 id2 NA NA 2 3 NA
| 3 id3 1 5 NA NA 2
longform %>% pivot_wider(values_from=score,names_from=c(time,posit))
# longform %>% pivot_wider(values_from=score,names_from=c(time,posit),values_fill=list(score=0))
In preparation of data manipulation (dplyr), other packages are of interest.
After the transformation, the data should be ready for
The tidyr (package within tidyverse) is focused on
The main -verbs- (see example above)
pivot_wider( )
and pivot_longer( )
: turn multiple columns or rows into one, making datafiles longer or widerseparate( )
and extract( )
: create multiple columns from one column using delimiters or regular expressions
Turning long form data into wide form and vise verse, is called pivoting.
Pivoting can go wider and longer.
pivoting can turn wide data into longer data.
iris
dataset, with 4 values for each unit within each species, showing only the first 6 observationslong_iris <- iris %>% pivot_longer(-Species,names_to='type',values_to='score')
long_iris %>% head( )
| # A tibble: 6 x 3
| Species type score
| <fct> <chr> <dbl>
| 1 setosa Sepal.Length 5.1
| 2 setosa Sepal.Width 3.5
| 3 setosa Petal.Length 1.4
| 4 setosa Petal.Width 0.2
| 5 setosa Sepal.Length 4.9
| 6 setosa Sepal.Width 3
long_iris %>% pivot_wider(values_from=score,names_from=type)
long_iris <- iris %>% mutate(id=1:n()) %>% pivot_longer(-c(Species,id),names_to='type',values_to='score')
long_iris %>% head( )
| # A tibble: 6 x 4
| Species id type score
| <fct> <int> <chr> <dbl>
| 1 setosa 1 Sepal.Length 5.1
| 2 setosa 1 Sepal.Width 3.5
| 3 setosa 1 Petal.Length 1.4
| 4 setosa 1 Petal.Width 0.2
| 5 setosa 2 Sepal.Length 4.9
| 6 setosa 2 Sepal.Width 3
pivoting can take long data into wider data.
wide_iris <- long_iris %>% pivot_wider(values_from=score,names_from=type)
wide_iris
| # A tibble: 150 x 6
| Species id Sepal.Length Sepal.Width Petal.Length Petal.Width
| <fct> <int> <dbl> <dbl> <dbl> <dbl>
| 1 setosa 1 5.1 3.5 1.4 0.2
| 2 setosa 2 4.9 3 1.4 0.2
| 3 setosa 3 4.7 3.2 1.3 0.2
| 4 setosa 4 4.6 3.1 1.5 0.2
| 5 setosa 5 5 3.6 1.4 0.2
| 6 setosa 6 5.4 3.9 1.7 0.4
| 7 setosa 7 4.6 3.4 1.4 0.3
| 8 setosa 8 5 3.4 1.5 0.2
| 9 setosa 9 4.4 2.9 1.4 0.2
| 10 setosa 10 4.9 3.1 1.5 0.1
| # ... with 140 more rows
world_bank_pop
dataset that is part of the tidyr
packagepivot the dataset to have univariate data for the scores over the different years
us_rent_income
dataset is also part of the tidyr
packagepivot the dataset to have a multivariate version with variables for all estimate x moe combinations
Splitting up information within a variable, or combining over variables, often helps dealing with messy data.
Columns (variables) can be split and united.
long_iris_x <- long_iris %>% separate(type,c('PT','lw'))
long_iris_x %>% head( )
| # A tibble: 6 x 5
| Species id PT lw score
| <fct> <int> <chr> <chr> <dbl>
| 1 setosa 1 Sepal Length 5.1
| 2 setosa 1 Sepal Width 3.5
| 3 setosa 1 Petal Length 1.4
| 4 setosa 1 Petal Width 0.2
| 5 setosa 2 Sepal Length 4.9
| 6 setosa 2 Sepal Width 3
unite_iris <- long_iris_x %>% unite('myType',PT:lw,sep='-')
unite_iris
| # A tibble: 600 x 4
| Species id myType score
| <fct> <int> <chr> <dbl>
| 1 setosa 1 Sepal-Length 5.1
| 2 setosa 1 Sepal-Width 3.5
| 3 setosa 1 Petal-Length 1.4
| 4 setosa 1 Petal-Width 0.2
| 5 setosa 2 Sepal-Length 4.9
| 6 setosa 2 Sepal-Width 3
| 7 setosa 2 Petal-Length 1.4
| 8 setosa 2 Petal-Width 0.2
| 9 setosa 3 Sepal-Length 4.7
| 10 setosa 3 Sepal-Width 3.2
| # ... with 590 more rows
tidyr
package includes other functions for more involved programming and simulation studies
expand( )
, crossover( )
, nesting( )
, best check the helpfile.
mtcars
should still be loaded into your workspacerownames_to_column( )
function, it consists of car type information, car subtype and subtype specification.
when using your own data, they have to be imported into the workspace.
*.RData
) can be loaded with the load( )
functionreadr
package deals with most common data, readxl
is dedicated to notorious excel, and haven
addresses the link with the main statistical software SAS, spss and Stata.
A common function of interest for import is read_delim( )
from the readr
package.
myrepeated <- read_delim(file='repeated.txt',delim='\t') # if repeated.txt is in the working directory (getwd( ) ?)
myrepeated <- read_delim(clipboard(),delim='\t')
myrepeated <- read_delim(file.choose(),delim='\t')
file.choose( )
and clipboard( )
can be used with other functions as well, like the ones discussed next.?read_delim( )
on how to set different parameters and gain flexibility to read in data.
Dedicated to Excel, the read_excel( )
function facilitates reading Excel files.
RealData_clean.xlsx
file making use of the defaultsread_excel('RealData_clean.xlsx')
read_delim( )
?read_excel
on alternative parameter values to extract specific columns, specifics Excel-tabs, …
Haven is dedicated to the major statistical software packages, SPSS, SAS and STATA.
read_sav( )
read the SPSS version of the iris data (part of the examples in the haven package)pathToIrisSpssData <- system.file("examples", "iris.sav", package = "haven")
read_sav(pathToIrisSpssData)
read_sas( )
read the SAS version of the iris data, the dataset has a sas7dat
extensionpath <- system.file("examples", "iris.sas7bdat", package = "haven")
read_sas(path)
read_dta( )
or read_stata( )
read the Stata version of the iris data, the dataset has a dta
extensionpath <- system.file("examples", "iris.dta", package = "haven")
read_dta(path)
write_
prefix, for dta
, sas
and sav
mtcars
into sas format.write_sas(mtcars,'mytryinSAS.sas7bdat')
Current draft provides a primer on data manipulation, tidy data and the importing of data, which are the main steps in preparation of most real data analyses and visualizations. It is strongly advised to play with the techniques discussed above to get some proficiency in using it, as it would add significantly to the flexibility of whatever you want to further do with your data.
A different draft addresses what is tidy data, with a focus on how data should be registered. A next draft will address how to visualize data, using the ggplot( )
function.
Several tidyverse packages are not yet discussed, which does suggest they are not useful but they are more specific. The consistency within tidyverse should give you a push though, to study the other packages yourself when of interest.
Base R still is a proper alternative to the tidyverse package, so be aware that others may do things differently.