Data Wrangling with R’s tidyverse
Compiled Sep 06, 2020
Current draft aims to introduce researchers to data manipulation in R with the dplyr, tidyr, and stringr packages of the tidyverse ecosystem.
Our target audience is primarily the research community at VUB / UZ Brussel, those who have some basic experience in R and want to know more.
We invite you to help improve this document by sending us feedback
wilfried.cools@vub.be or anonymously at icds.be/consulting (right side, bottom)
Data manipulation is inherent to data analysis, not just a precursor.
Data manipulation is best done with coding (as opposed to manual changes), provides the best guarantee to.
Data manipulation is easier and more intuitive when maintaining tidy data.
workflow: tidyverse lingo
Focus in current draft is on R.
In particular, focus on the tidyverse package (Hadley Wickham etal.), an ecosystem that includes.
dplyr for manipulating data frames [main focus]tidyr for tidying data [check Data Representation]stringr for dealing with textsreadr for reading in data [highlighted]tibble for data representation [highlighted]forcats for dealing with factorsggplot for visualizing data [separate draft]purrr for functional programming (advanced)
Find convenient cheat sheets at https://rstudio.com/resources/cheatsheets/.
Package tidyverse is a very good extension of base R.
Install (at least once) and load (once per R session) the tidyverse package.
install.packages('tidyverse')
The individual packages that are loaded are listed, as are their conflicts.
library(tidyverse)
Conflicts arise loaded packages use the same function names. Resolve such conflicts for example by referencing the package with ::, eg., stat::filter( ).
Conflicts can be checked for tidyverse.
tidyverse_conflicts( )
| -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
| x dplyr::filter() masks stats::filter()
| x dplyr::lag() masks stats::lag()
The tidyverse ecosystem includes
broom, cli, crayon, dbplyr, dplyr, forcats, ggplot2, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse.
tidyverse_packages( )
Most of these packages should be loaded explicitly (not included in library(tidyverse)).
Data: the tibble is the tidyverse data type, defined in the tibble package.
dataframe, a list of equally sized vectors
tibble is a dataframe, enhanced for convenience and consistency
tibble with tribble( ) functionclass( ) shows both data.frame and tbl_df(mytibble <- tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
))
| # A tibble: 3 x 2
| colA colB
| <chr> <dbl>
| 1 a 1
| 2 b 2
| 3 c 3
class(mytibble)
| [1] "tbl_df" "tbl" "data.frame"
- compare with dataframe:
mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)
| [1] "data.frame"
Process: the magrittr package offer pipes %>%
mean(mytibble$colB)
mytibble %>% summarize(mean(colB))
x1 <- rnorm(10); x2 <- rnorm(10)
sqrt(sum((x1-x2)^2))
| [1] 4.174785
(x1-x2)^2 %>% sum( ) %>% sqrt( )
| [1] 4.174785
Exemplary data (part of base R), mtcars, are used.
data( ) functionstr( ) and the first 6 observations with head( ) functiondata(mtcars)
str(mtcars)
| 'data.frame': 32 obs. of 11 variables:
| $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
| $ disp: num 160 160 108 258 360 ...
| $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
| $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
| $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
| $ qsec: num 16.5 17 18.6 19.4 17 ...
| $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
| $ am : num 1 1 1 0 0 0 0 0 0 0 ...
| $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
| $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
| Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
| Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: calling data( ) without arguments shows all the available data currently in reach.
tidyverse look at the data works with glimpse( )glimpse(mtcars)
| Observations: 32
| Variables: 11
| $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, ...
| $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, ...
| $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 16...
| $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, ...
| $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, ...
| $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3....
| $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18...
| $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, ...
| $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, ...
| $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, ...
| $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, ...
Exemplary data manipulation is depicted.
mtcars data,
mtcars %>%
select(mpg, cyl, hpow=hp, am) %>%
filter(mpg > 15) %>%
mutate(mpgr = mpg/hpow) %>%
group_by(cyl, am) %>%
summarize(min=min(mpgr)) %>%
pivot_wider(names_from=am,values_from=min) %>%
select(cyl,am0=`0`,am1=`1`)
| # A tibble: 3 x 3
| # Groups: cyl [3]
| cyl am0 am1
| <dbl> <dbl> <dbl>
| 1 4 0.222 0.196
| 2 6 0.145 0.113
| 3 8 0.0844 0.0598
The dplyr package (from tidyverse) is of interest.
The main -verbs- (see example above)
The verb to structure data (see example above)
The verbs to enhance control on scope (advanced)
Additional dplyr verbs:
Verbs to extend data
Final comment: only the core of dplyr is discussed, much more is possible and you will find on the Net.
Grouping prepares data for group specific operations.
tst <- mtcars %>% group_by(am,vs)
glimpse(tst,width=100)
| Observations: 32
| Variables: 11
| Groups: am, vs [4]
| $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2...
| $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4...
| $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8...
| $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, ...
| $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07...
| $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070...
| $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40...
| $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1...
| $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1...
| $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5...
| $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2...
n( ))tst %>% summarize(n( ))
| # A tibble: 4 x 3
| # Groups: am [2]
| am vs `n()`
| <dbl> <dbl> <int>
| 1 0 0 12
| 2 0 1 7
| 3 1 0 6
| 4 1 1 7
ungroup( ), good practice to avoid side effectstst <- tst %>% ungroup( )
tst %>% summarize(n( ))
| # A tibble: 1 x 1
| `n()`
| <int>
| 1 32
.add and .drop argumentsmpg in 3 groupstst <- mtcars %>% group_by(mpg3 = cut(mpg, 3))
tst %>% summarize(n( ))
| # A tibble: 3 x 2
| mpg3 `n()`
| <fct> <int>
| 1 (10.4,18.2] 14
| 2 (18.2,26.1] 13
| 3 (26.1,33.9] 5
Filtering returns rows using matching conditions.
mpg could be set with a minimum of 30mtcars %>% filter(mpg > 30)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
| Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
| Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
| Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
mpg above 20 AND qsec below or equal to 18.mtcars %>% filter(mpg > 20, qsec <= 18)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
| Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
| Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
mpg above 30 OR qsec below 20 AND am equal to 0.mtcars %>% filter(mpg > 30 | qsec > 20, am==0)
| mpg cyl disp hp drat wt qsec vs am gear carb
| Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
| Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
| Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
slice_head( )), for each gender (group your data) !?slice_head ?slice_sample( ) to randomly select 5 observations !slice_max( ) to select 3 observations with highest values on height !mass and get the top 3 for each species !
Extract columns (variables) by name, rename and/or reorder them.
mpg could be selected.mtcars %>% select(mpg)
| mpg
| Mazda RX4 21.0
| Mazda RX4 Wag 21.0
| Datsun 710 22.8
| Hornet 4 Drive 21.4
| Hornet Sportabout 18.7
| Valiant 18.1
| Duster 360 14.3
| Merc 240D 24.4
| Merc 230 22.8
| Merc 280 19.2
| Merc 280C 17.8
| Merc 450SE 16.4
| Merc 450SL 17.3
| Merc 450SLC 15.2
| Cadillac Fleetwood 10.4
| Lincoln Continental 10.4
| Chrysler Imperial 14.7
| Fiat 128 32.4
| Honda Civic 30.4
| Toyota Corolla 33.9
| Toyota Corona 21.5
| Dodge Challenger 15.5
| AMC Javelin 15.2
| Camaro Z28 13.3
| Pontiac Firebird 19.2
| Fiat X1-9 27.3
| Porsche 914-2 26.0
| Lotus Europa 30.4
| Ford Pantera L 15.8
| Ferrari Dino 19.7
| Maserati Bora 15.0
| Volvo 142E 21.4
# mtcars$mpg
dplyr use pull( )mtcars %>% pull(mpg)
| [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
| [18] 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
qsec and mpg (top 6 observations).mtcars %>% select(qsec,mpg) %>% head( )
| qsec mpg
| Mazda RX4 16.46 21.0
| Mazda RX4 Wag 17.02 21.0
| Datsun 710 18.61 22.8
| Hornet 4 Drive 19.44 21.4
| Hornet Sportabout 17.02 18.7
| Valiant 20.22 18.1
mtcars %>% select(3,1) %>% head( )
| disp mpg
| Mazda RX4 160 21.0
| Mazda RX4 Wag 160 21.0
| Datsun 710 108 22.8
| Hornet 4 Drive 258 21.4
| Hornet Sportabout 360 18.7
| Valiant 225 18.1
mtcars %>% select(-c(3:6)) %>% head( )
| mpg cyl qsec vs am gear carb
| Mazda RX4 21.0 6 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 17.02 0 1 4 4
| Datsun 710 22.8 4 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 17.02 0 0 3 2
| Valiant 18.1 6 20.22 1 0 3 1
- making use of helper functions, selections can be more automated.
- use partial string matching - directly with contains( ) - using regular expressions with matches( ).
- example: extract columns with names that include the string ar (show 6).
mtcars %>% select(contains('ar')) %>% head( )
| gear carb
| Mazda RX4 4 4
| Mazda RX4 Wag 4 4
| Datsun 710 4 1
| Hornet 4 Drive 3 1
| Hornet Sportabout 3 2
| Valiant 3 1
ar but with at least one element before and after it (show 6).mtcars %>% select(matches('.ar.')) %>% head( )
| carb
| Mazda RX4 4
| Mazda RX4 Wag 4
| Datsun 710 1
| Hornet 4 Drive 1
| Hornet Sportabout 2
| Valiant 1
rename( )
cyl into cyl468 to reflect its values, same for vs and am, and select it together with mpg (show 6).mtcars %>% select(mpg,cyl468=cyl,vs01=vs,am01=am) %>% head( )
| mpg cyl468 vs01 am01
| Mazda RX4 21.0 6 0 1
| Mazda RX4 Wag 21.0 6 0 1
| Datsun 710 22.8 4 1 1
| Hornet 4 Drive 21.4 6 1 0
| Hornet Sportabout 18.7 8 0 0
| Valiant 18.1 6 1 0
cyl, vs and am as before but without selection (show 6).mtcars %>% rename(cyl468=cyl,vs01=vs,am01=am) %>% head( )
| mpg cyl468 disp hp drat wt qsec vs01 am01 gear carb
| Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
| Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
| Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
select( ) will include the grouping variables by default.
group_cols( ) functionvs and am, and extract only those columns.mtcars %>% group_by(vs,am) %>% select(group_cols( ))
| # A tibble: 32 x 2
| # Groups: vs, am [4]
| vs am
| <dbl> <dbl>
| 1 0 1
| 2 0 1
| 3 1 1
| 4 1 0
| 5 0 0
| 6 1 0
| 7 0 0
| 8 1 0
| 9 1 0
| 10 1 0
| # ... with 22 more rows
: operator for consecutive columns hair and eye color !color (check help files on helper functions, use ?language) !homeworld to home_world !rename( ) function !where( ) and is.numeric( ) !height, mass and/or size, with any_of( ) !
Create new variables based on existing ones.
mpg2 is the mpg value squared (show 6).mtcars %>% mutate(mpg2=mpg^2) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb mpg2
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 441.00
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 441.00
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 519.84
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 457.96
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 349.69
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 327.61
mpg variable is overwritten with its value squared (show 6).mtcars %>% mutate(mpg=mpg^2) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb
| 1 441.00 6 160 110 3.90 2.620 16.46 0 1 4 4
| 2 441.00 6 160 110 3.90 2.875 17.02 0 1 4 4
| 3 519.84 4 108 93 3.85 2.320 18.61 1 1 4 1
| 4 457.96 6 258 110 3.08 3.215 19.44 1 0 3 1
| 5 349.69 8 360 175 3.15 3.440 17.02 0 0 3 2
| 6 327.61 6 225 105 2.76 3.460 20.22 1 0 3 1
NEWVAR is the mpg value multiplied by the vs value (show 6).mtcars %>% mutate(NEWVAR=mpg*vs) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.0
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.0
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.0
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1
NEWVAR is the mpg value multiplied by the vs value and this new variable is divided by the disp variable (show 6).mtcars %>% mutate(NEWVAR=mpg*vs,NEWVAR2=NEWVAR/disp) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR NEWVAR2
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.0 0.00000000
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.0 0.00000000
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8 0.21111111
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4 0.08294574
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.0 0.00000000
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1 0.08044444
transmute( ) selects them too.mtcars %>% transmute(NEWVAR=mpg*vs,NEWVAR2=NEWVAR/disp) %>% head( )
| NEWVAR NEWVAR2
| 1 0.0 0.00000000
| 2 0.0 0.00000000
| 3 22.8 0.21111111
| 4 21.4 0.08294574
| 5 0.0 0.00000000
| 6 18.1 0.08044444
mpg using cumsum( ) (show 6).mtcars %>% mutate(NEWVAR=cumsum(mpg)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 42.0
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 64.8
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 86.2
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 104.9
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 123.0
mpg is between 20 and 22 (show 6).mtcars %>% mutate(NEWVAR=between(mpg,20,22)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb NEWVAR
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 TRUE
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 TRUE
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 FALSE
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 TRUE
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 FALSE
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 FALSE
mpg values, with the rownumber( ) function. When arranged by mpg this is more clear.mtcars %>% mutate(id=row_number(mpg)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb id
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 20
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 24
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 15
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 14
mtcars %>% mutate(id=row_number(mpg)) %>% arrange(mpg) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb id
| 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 1
| 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 2
| 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
| 4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 4
| 5 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 5
| 6 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 6
vs and am, and mutate only those columns.mtcars %>% group_by(vs,am) %>% mutate(id=row_number(mpg)) %>% head( )
| # A tibble: 6 x 12
| # Groups: vs, am [4]
| mpg cyl disp hp drat wt qsec vs am gear carb id
| <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
| 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 4
| 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 5
| 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 2
| 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 4
| 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 11
| 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2
vs and am, there will be a 1 (first), 2 (second)… for id.
height_m with height divided by 100 !mass / height_m to the power 2 !height_m and BMI !male and female labels with m and f (use recode( )) !ifelse( )))!
Reduce sets of values into their summaries, based on grouped data.
mpg values can be obtained.mtcars %>% summarize(myAverage=mean(mpg))
| myAverage
| 1 20.09062
mpg values can be obtainedmtcars %>% summarize(myAvMpg=mean(mpg),mySdMpg=sd(mpg),myAvDisp=mean(disp),mySdDisp=sd(disp))
| myAvMpg mySdMpg myAvDisp mySdDisp
| 1 20.09062 6.026948 230.7219 123.9387
summarize( ).
mpg values can be obtained for each level of vsmtcars %>% group_by(vs) %>% summarize(myAverage=mean(mpg))
| # A tibble: 2 x 2
| vs myAverage
| <dbl> <dbl>
| 1 0 16.6
| 2 1 24.6
mpg values can be obtained, for multiple variablesmtcars %>% group_by(vs,am) %>% summarize(myAvMpg=mean(mpg),mySdMpg=sd(mpg),myAvDisp=mean(disp),mySdDisp=sd(disp))
| # A tibble: 4 x 6
| # Groups: vs [2]
| vs am myAvMpg mySdMpg myAvDisp mySdDisp
| <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
| 1 0 0 15.0 2.77 358. 71.8
| 2 0 1 19.8 4.01 206. 95.2
| 3 1 0 20.7 2.47 175. 49.1
| 4 1 1 28.4 4.76 89.8 18.8
vs, can be obtained with n( ), or using the special verb count( ).mtcars %>% group_by(vs) %>% count( )
| # A tibble: 2 x 2
| # Groups: vs [2]
| vs n
| <dbl> <int>
| 1 0 18
| 2 1 14
mtcars %>% group_by(vs) %>% summarize(mycount=n( ))
| # A tibble: 2 x 2
| vs mycount
| <dbl> <int>
| 1 0 18
| 2 1 14
vs and am can be obtained with n_distinct( )nth( )mtcars %>% group_by(vs,am) %>% summarize(nrDist=n_distinct(mpg),`3th`=nth(mpg,3))
| # A tibble: 4 x 4
| # Groups: vs [2]
| vs am nrDist `3th`
| <dbl> <dbl> <int> <dbl>
| 1 0 0 10 16.4
| 2 0 1 5 26
| 3 1 0 7 24.4
| 4 1 1 6 30.4
height into the average height (some missing values need to be dealt with, check ?mean) !species and sex, and include the average mass !
The across( ) function allows for selection of variables within the summarize( ) or mutate( ) function.
accross( ) replace the earlier functions *_at, *_if and *_all.am and vs into a factor before calling the structure with str( )mtcars %>% select(mpg,cyl,am,vs) %>% mutate(across(c('am','vs'),factor)) %>% str( )
| 'data.frame': 32 obs. of 4 variables:
| $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
| $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
| $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
cyl and vs with a : operatormtcars %>% select(mpg,cyl,am,vs) %>% mutate(across(cyl:vs,factor)) %>% str( )
| 'data.frame': 32 obs. of 4 variables:
| $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
| $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
| $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
armtcars %>% select(mpg,cyl,gear,carb) %>% mutate(across(contains("ar"),factor)) %>% str( )
| 'data.frame': 32 obs. of 4 variables:
| $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
| $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
| $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
| $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
accross( ) function allows for applying a list of functions
descr <- list(
md = ~median(.x, na.rm = TRUE),
av = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE)
)
mtcars %>% mutate(across(c(1,3), descr)) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb mpg_md mpg_av mpg_sd disp_md
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19.2 20.09062 6.026948 196.3
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 19.2 20.09062 6.026948 196.3
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 19.2 20.09062 6.026948 196.3
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 19.2 20.09062 6.026948 196.3
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 19.2 20.09062 6.026948 196.3
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 19.2 20.09062 6.026948 196.3
| disp_av disp_sd
| 1 230.7219 123.9387
| 2 230.7219 123.9387
| 3 230.7219 123.9387
| 4 230.7219 123.9387
| 5 230.7219 123.9387
| 6 230.7219 123.9387
select( ), selections can be more automated.
all_of( ), where( ), matches( ), starts_with( )mutate( ) and summarize( )
where( )) into their minimum and maximum (some missing values need to be dealt with) !
Different datafiles can be combined into one datafile using common variables that serve as key (cfr. relational databases).
mtcyl, with a 2 cylinder but no 8 cylinder unlike the mtcars (4,6,8)mtcyl <- tribble(
~cyl,~type,
2,'small',
4,'medium',
6,'large'
)
mtcars and mtcyl but ignore the irrelevant cyl equal to 2 (not part of mtcars), with a left_join( )
cyl equal to 8 turns out missing, because it is not specified in the -right- datafilemtcars %>% left_join(mtcyl) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 large
| 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 <NA>
| 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 large
mtcars and mtcyl but ignore the cyl equal to 8 because it lacks information on type, with a right_join( )
cyl equal to 2 is included, but turns out missing for most variables because it is not specified in the -left- datafilemtcars %>% right_join(mtcyl) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 large
| 5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 large
| 6 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
mtcars and mtcyl for only those observations with the linking variable cyl in both files, with an right_join( )
mtcars %>% inner_join(mtcyl) %>% head( )
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 large
| 5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 large
| 6 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
mtcars and mtcyl keeping all available information, with a full_join( ) showing selected rows 1 to 3, 5, 7 and 33mtcars %>% full_join(mtcyl) %>% slice(c(1:3,5,7,33))
| mpg cyl disp hp drat wt qsec vs am gear carb type
| 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 large
| 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 large
| 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 medium
| 4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 <NA>
| 5 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 <NA>
| 6 NA 2 NA NA NA NA NA NA NA NA NA small
semi_join( ), nest_join( ), anti_join(), which are described in the help files.
band_members and band_instruments are probably loaded into your workspace automatically as part of the tidyverse !semi_join( ) and anti_join( ) and interpret what happens.
Compare the structure of the mtcars data with a glimpse at that data.
Compare a select of mpg with a pull of mpg.
Check the help file and pull out the second before last column.
Select all columns except the am.
Select all columns except the am and vs.
Keep only columns mpg, cyl and disp, but rename mpg to miles_gallon.
Insist, keep only columns mpg, cyl and disp, but rename mpg to miles per gallon.
Keep only the consecutive columns in between disp and wt, in addition to mpg as a last column, use a :.
Create a variable for the row names.
Change the mpg (miles per gallon) into kpl (kilometers per liter) with 1 mpg is 0.425 km/l, using mutate( ).
Select about 10% of the observations, twice, check the help file on using sample_frac( ).
note that the matrix way could be:
Select the 10th to 15th row, check the help file on using slice( ).
Select the distinct combinations only, for variables am and vs.
Check the help files to determine how to keep all variables (for each first observation of that combination).
Filter the data to retain only cases with mpg > 20 and hp above or equal to 110.
Filter the data to retain only the Datsun 710.
Exemplary data are read in using a copy-paste and tidied.
read_delim( )data are pivoted and disentangled using separate( ) and unite( )
repeated.txt, a tab-delimited text file, by copy-pastingmyrepeated <- read_delim(clipboard(),delim='\t')
| # A tibble: 3 x 7
| id `t1 score` `t1 posit` `t2 score` `t2 posit` `t3 score` `t3 posit`
| <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
| 1 id1 1 x NA y 4 x
| 2 id2 2 y 3 x NA <NA>
| 3 id3 1 x 2 y 5 x
id and scores at time points t1 to t3, and pivot to get scores spread over rows, with times named typeid and positions at time points t1 to t3 and pivot to get positions spread over rows, with times named typetimejoin, using id as key, but first eliminate at least one of the inconsistent type variables| # A tibble: 9 x 3
| id type score
| <chr> <chr> <dbl>
| 1 id1 t1 score 1
| 2 id1 t2 score NA
| 3 id1 t3 score 4
| 4 id2 t1 score 2
| 5 id2 t2 score 3
| 6 id2 t3 score NA
| 7 id3 t1 score 1
| 8 id3 t2 score 2
| 9 id3 t3 score 5
| # A tibble: 9 x 3
| id type posit
| <chr> <chr> <chr>
| 1 id1 t1 posit x
| 2 id1 t2 posit y
| 3 id1 t3 posit x
| 4 id2 t1 posit y
| 5 id2 t2 posit x
| 6 id2 t3 posit <NA>
| 7 id3 t1 posit x
| 8 id3 t2 posit y
| 9 id3 t3 posit x
| # A tibble: 9 x 4
| id time type score
| <chr> <chr> <chr> <dbl>
| 1 id1 t1 score 1
| 2 id1 t2 score NA
| 3 id1 t3 score 4
| 4 id2 t1 score 2
| 5 id2 t2 score 3
| 6 id2 t3 score NA
| 7 id3 t1 score 1
| 8 id3 t2 score 2
| 9 id3 t3 score 5
| # A tibble: 9 x 4
| id time type posit
| <chr> <chr> <chr> <chr>
| 1 id1 t1 posit x
| 2 id1 t2 posit y
| 3 id1 t3 posit x
| 4 id2 t1 posit y
| 5 id2 t2 posit x
| 6 id2 t3 posit <NA>
| 7 id3 t1 posit x
| 8 id3 t2 posit y
| 9 id3 t3 posit x
| # A tibble: 9 x 5
| id time score type posit
| <chr> <chr> <dbl> <chr> <chr>
| 1 id1 t1 1 posit x
| 2 id1 t2 NA posit y
| 3 id1 t3 4 posit x
| 4 id2 t1 2 posit y
| 5 id2 t2 3 posit x
| 6 id2 t3 NA posit <NA>
| 7 id3 t1 1 posit x
| 8 id3 t2 2 posit y
| 9 id3 t3 5 posit x
| # A tibble: 7 x 4
| id time score posit
| <chr> <chr> <dbl> <chr>
| 1 id1 t1 1 x
| 2 id1 t3 4 x
| 3 id2 t1 2 y
| 4 id2 t2 3 x
| 5 id3 t1 1 x
| 6 id3 t2 2 y
| 7 id3 t3 5 x
scores <- myrepeated %>% select(id,`t1 score`,`t2 score`,`t3 score`) %>%
pivot_longer(-id,names_to='type',values_to='score') %>%
separate(type,c('time','type')))
positions <- myrepeated %>% select(id,`t1 posit`,`t2 posit`,`t3 posit`) %>%
pivot_longer(-id,names_to='type',values_to='posit') %>%
separate(type,c('time','type')))
longform <- scores %>%
select(-type) %>%
full_join(positions)) %>%
select(-type) %>%
filter(!is.na(score),!is.na(posit)))
It is possible to switch back to a wider data representation, for example to calculate correlations. Maybe fill in the missing values NA as 0 values.
| # A tibble: 3 x 6
| id t1_x t3_x t1_y t2_x t2_y
| <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
| 1 id1 1 4 NA NA NA
| 2 id2 NA NA 2 3 NA
| 3 id3 1 5 NA NA 2
longform %>% pivot_wider(values_from=score,names_from=c(time,posit))
# longform %>% pivot_wider(values_from=score,names_from=c(time,posit),values_fill=list(score=0))
workflow: tidyverse lingo
In preparation of data manipulation (dplyr), other packages are of interest.
After the transformation, the data should be ready for
The tidyr (package within tidyverse) is focused on
The main -verbs- (see example above)
pivot_wider( ) and pivot_longer( ): turn multiple columns or rows into one, making datafiles longer or widerseparate( ) and extract( ): create multiple columns from one column using delimiters or regular expressions
Turning long form data into wide form and vise verse, is called pivoting.
Pivoting can go wider and longer.
pivoting can turn wide data into longer data.
iris dataset, with 4 values for each unit within each species, showing only the first 6 observationslong_iris <- iris %>% pivot_longer(-Species,names_to='type',values_to='score')
long_iris %>% head( )
| # A tibble: 6 x 3
| Species type score
| <fct> <chr> <dbl>
| 1 setosa Sepal.Length 5.1
| 2 setosa Sepal.Width 3.5
| 3 setosa Petal.Length 1.4
| 4 setosa Petal.Width 0.2
| 5 setosa Sepal.Length 4.9
| 6 setosa Sepal.Width 3
long_iris %>% pivot_wider(values_from=score,names_from=type)
long_iris <- iris %>% mutate(id=1:n()) %>% pivot_longer(-c(Species,id),names_to='type',values_to='score')
long_iris %>% head( )
| # A tibble: 6 x 4
| Species id type score
| <fct> <int> <chr> <dbl>
| 1 setosa 1 Sepal.Length 5.1
| 2 setosa 1 Sepal.Width 3.5
| 3 setosa 1 Petal.Length 1.4
| 4 setosa 1 Petal.Width 0.2
| 5 setosa 2 Sepal.Length 4.9
| 6 setosa 2 Sepal.Width 3
pivoting can take long data into wider data.
wide_iris <- long_iris %>% pivot_wider(values_from=score,names_from=type)
wide_iris
| # A tibble: 150 x 6
| Species id Sepal.Length Sepal.Width Petal.Length Petal.Width
| <fct> <int> <dbl> <dbl> <dbl> <dbl>
| 1 setosa 1 5.1 3.5 1.4 0.2
| 2 setosa 2 4.9 3 1.4 0.2
| 3 setosa 3 4.7 3.2 1.3 0.2
| 4 setosa 4 4.6 3.1 1.5 0.2
| 5 setosa 5 5 3.6 1.4 0.2
| 6 setosa 6 5.4 3.9 1.7 0.4
| 7 setosa 7 4.6 3.4 1.4 0.3
| 8 setosa 8 5 3.4 1.5 0.2
| 9 setosa 9 4.4 2.9 1.4 0.2
| 10 setosa 10 4.9 3.1 1.5 0.1
| # ... with 140 more rows
world_bank_pop dataset that is part of the tidyr packagepivot the dataset to have univariate data for the scores over the different years
us_rent_income dataset is also part of the tidyr packagepivot the dataset to have a multivariate version with variables for all estimate x moe combinations
Splitting up information within a variable, or combining over variables, often helps dealing with messy data.
Columns (variables) can be split and united.
long_iris_x <- long_iris %>% separate(type,c('PT','lw'))
long_iris_x %>% head( )
| # A tibble: 6 x 5
| Species id PT lw score
| <fct> <int> <chr> <chr> <dbl>
| 1 setosa 1 Sepal Length 5.1
| 2 setosa 1 Sepal Width 3.5
| 3 setosa 1 Petal Length 1.4
| 4 setosa 1 Petal Width 0.2
| 5 setosa 2 Sepal Length 4.9
| 6 setosa 2 Sepal Width 3
unite_iris <- long_iris_x %>% unite('myType',PT:lw,sep='-')
unite_iris
| # A tibble: 600 x 4
| Species id myType score
| <fct> <int> <chr> <dbl>
| 1 setosa 1 Sepal-Length 5.1
| 2 setosa 1 Sepal-Width 3.5
| 3 setosa 1 Petal-Length 1.4
| 4 setosa 1 Petal-Width 0.2
| 5 setosa 2 Sepal-Length 4.9
| 6 setosa 2 Sepal-Width 3
| 7 setosa 2 Petal-Length 1.4
| 8 setosa 2 Petal-Width 0.2
| 9 setosa 3 Sepal-Length 4.7
| 10 setosa 3 Sepal-Width 3.2
| # ... with 590 more rows
tidyr package includes other functions for more involved programming and simulation studies
expand( ), crossover( ), nesting( ), best check the helpfile.
mtcars should still be loaded into your workspacerownames_to_column( ) function, it consists of car type information, car subtype and subtype specification.
when using your own data, they have to be imported into the workspace.
*.RData) can be loaded with the load( ) functionreadr package deals with most common data, readxl is dedicated to notorious excel, and haven addresses the link with the main statistical software SAS, spss and Stata.
A common function of interest for import is read_delim( ) from the readr package.
myrepeated <- read_delim(file='repeated.txt',delim='\t') # if repeated.txt is in the working directory (getwd( ) ?)
myrepeated <- read_delim(clipboard(),delim='\t')
myrepeated <- read_delim(file.choose(),delim='\t')
file.choose( ) and clipboard( ) can be used with other functions as well, like the ones discussed next.?read_delim( ) on how to set different parameters and gain flexibility to read in data.
Dedicated to Excel, the read_excel( ) function facilitates reading Excel files.
RealData_clean.xlsx file making use of the defaultsread_excel('RealData_clean.xlsx')
read_delim( )?read_excel on alternative parameter values to extract specific columns, specifics Excel-tabs, …
Haven is dedicated to the major statistical software packages, SPSS, SAS and STATA.
read_sav( ) read the SPSS version of the iris data (part of the examples in the haven package)pathToIrisSpssData <- system.file("examples", "iris.sav", package = "haven")
read_sav(pathToIrisSpssData)
read_sas( ) read the SAS version of the iris data, the dataset has a sas7dat extensionpath <- system.file("examples", "iris.sas7bdat", package = "haven")
read_sas(path)
read_dta( ) or read_stata( ) read the Stata version of the iris data, the dataset has a dta extensionpath <- system.file("examples", "iris.dta", package = "haven")
read_dta(path)
write_ prefix, for dta, sas and sav
mtcars into sas format.write_sas(mtcars,'mytryinSAS.sas7bdat')
Current draft provides a primer on data manipulation, tidy data and the importing of data, which are the main steps in preparation of most real data analyses and visualizations. It is strongly advised to play with the techniques discussed above to get some proficiency in using it, as it would add significantly to the flexibility of whatever you want to further do with your data.
workflow: tidyverse lingo
A different draft addresses what is tidy data, with a focus on how data should be registered. A next draft will address how to visualize data, using the ggplot( ) function.
Several tidyverse packages are not yet discussed, which does suggest they are not useful but they are more specific. The consistency within tidyverse should give you a push though, to study the other packages yourself when of interest.
Base R still is a proper alternative to the tidyverse package, so be aware that others may do things differently.