Data Visualization with R’s tidyverse
Compiled Sep 06, 2020
This draft aims to introduce researchers to visualizing both data and statistics in R with the ggplot2 package.
Our target audience is primarily the research community at VUB / UZ Brussel, those who have some basic experience in R and want to know more.
We invite you to help improve this document by sending us feedback at wilfried.cools@vub.be, or anonymously at icds.be/consulting (right side, bottom).
Data visualization is inherent to data analysis, not just a way of communicating the results.
Data visualization is best done with coding (as opposed to manual changes).
Data visualization is easier and more intuitive when maintaining tidy data.
The focus in this draft is on R, using the ggplot2 package of the tidyverse (Hadley Wickham et al.).
Install (at least once) and load (once per R session) the tidyverse package, or the ggplot2 package.
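install.packages('tidyverse')  # at least once
library(tidyverse)             # once per R session
library(gridExtra)             # provides grid.arrange( ), used below to combine plots; install it if missing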
Find a convenient cheat sheet on data visualization at https://rstudio.com/resources/cheatsheets/
Both base R and ggplot2 are highlighted to get a first impression.
To use the built-in iris data, include it with data( ).
data(iris)
Have a peek at its contents with str( ) and head( ).
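As a small sketch of that first look (base R only, nothing assumed beyond the built-in data):

```r
# Load the built-in iris data and inspect it
data(iris)
str(iris)    # variable types and dimensions
head(iris)   # first six rows
```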
Various visualizations of the data are possible, also in base R.
For example, consider the scatterplot, boxplot, histogram and dotplot.
par(mfrow=c(2,2))
plot(iris$Petal.Length,iris$Petal.Width,col=iris$Species,main='scatterplot')
boxplot(iris$Sepal.Length~iris$Species,main='boxplot')
hist(iris$Sepal.Width,main='histogram')
dotchart(iris$Sepal.Width,main='dotplot')
par(mfrow=c(1,1))
Additional control is available through graphical parameters, see ?par and ?options.
boxplot(iris$Sepal.Length~iris$Species,main='boxplot',
horizontal=TRUE, las=2, cex.axis=.75, ylab='',xlab='length',col=c(4,2,3));
Visualization, especially when slightly more complex, is often more intuitive with ggplot2.
For example, consider the scatterplot, boxplot, and histogram.
p1 <- ggplot(data=iris,aes(y=Petal.Width,x=Petal.Length,col=Species)) + geom_point()
p2 <- ggplot(data=iris,aes(y=Sepal.Length,col=Species)) + geom_boxplot()
p3 <- ggplot(data=iris,aes(x=Sepal.Width)) + geom_histogram()
grid.arrange(p1, p2, p3, ncol=3)
To save the last generated plot, the ggsave( ) function is available.
ggsave('plotname.png',width=12,height=6)
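By default ggsave( ) saves the last plot that was displayed. A plot object can also be passed explicitly through its plot argument, which is less dependent on what happened to be drawn last. The sketch below assumes an illustrative object name (p_iris) and writes to a temporary directory.

```r
library(ggplot2)

# Store the plot in an object so it can be saved explicitly
p_iris <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point()

# Illustrative file location; replace with your own path
out_file <- file.path(tempdir(), "iris_scatter.png")

# width and height are in inches by default
ggsave(out_file, plot = p_iris, width = 12, height = 6)
```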
Both base R and ggplot2, as well as other types of visualization, will do what they are supposed to.
Because ggplot2 is taking over, and because it is so much fun to work with, ggplot2 is presented here.
ggplot2 philosophy: Grammar of Graphics (Leland Wilkinson)
ggplot( ):
ggplot(data=iris,aes(y=Petal.Width,x=Petal.Length,col=Species)) + geom_point()
- ggplot( ) creates a ggplot object
- aes( ) function to assign variables from the dataframe to scales (x-axis, y-axis, color)
- geom_point( ) to visualize the ggplot object as a scatterplot
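To make the separation between the object and its layers more tangible, the sketch below builds the plot in two steps and uses layer_data( ), a ggplot2 helper, to inspect what the point layer will draw. The object names (base, scatter, drawn) are illustrative.

```r
library(ggplot2)

# Step 1: the ggplot object holds the data and the aesthetic mapping, no layer yet
base <- ggplot(data = iris, aes(y = Petal.Width, x = Petal.Length, col = Species))

# Step 2: adding geom_point() adds a layer that draws one point per row
scatter <- base + geom_point()

# layer_data() returns the data as prepared for the first layer:
# x and y positions plus the colour assigned to each Species
drawn <- layer_data(scatter, 1)
head(drawn[, c("x", "y", "colour")])
```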
General structure includes data, functions and arguments.
- ggplot( ) function to initialize the ggplot object
- geom_*( ) function to visualize data through their aesthetics
- stat_*( ) function largely similar to geom but with focus on statistics
- facet_*( ) function for grouping / conditional visualization
- theme( ), guides( ), scale_*( ), coord_*( ) functions for further control
Arguments of ggplot( ), geom_*( ) and stat_*( ):
- data argument to specify data (=input)
- aes( ) function as argument to specify aesthetic mapping (bridging the gap between input and output)
- ...
The Grammar of Graphics sparked further developments: ggforce, ggalt, ggpubr, ggraph, tidygraph, GGally, ggcorrplot, ggridges, ...
The first part addresses how to make a visualization,
afterwards further refinements are considered in less detail.
### step by step example
Use is made of the built-in mtcars dataset and the built-in iris dataset that is already loaded.
data(mtcars)
mtcars %>% head()
| mpg cyl disp hp drat wt qsec vs am gear carb
| Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
| Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
| Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
| Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
| Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
| Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The ggplot object is constructed.
- the data is the mtcars dataframe
- the aesthetics y and x are linked to mpg and disp from the mtcars data
p1 <- ggplot(data=mtcars, aes(y=mpg,x=disp))
No visualization is made yet, only the object is created.
A layer is added with the + sign, adding a geometric function.
p2 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) + geom_point()
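- geom_point( ): geometric function without arguments to visualize a scatterplot
- data and aesthetics are taken from ggplot( )
Aesthetics can also be specified in the geometric function instead of in ggplot( ).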
ggplot(data=mtcars, aes(x=disp)) + geom_point(aes(y=mpg))
- geom_point( ) requires x and y, here x is taken from ggplot( )
- y is specified within aes( ) because it is linked to data (mpg), it is not specified in ggplot( )
grid.arrange(p1, p2, ncol=2)
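A minimal sketch to check that placing an aesthetic in ggplot( ) or in the geometric function leads to the same points. It compares the positions prepared for the point layer, using layer_data( ) from ggplot2; the object names are illustrative.

```r
library(ggplot2)

# All aesthetics specified as defaults in ggplot()
pa <- ggplot(data = mtcars, aes(y = mpg, x = disp)) + geom_point()

# x as default in ggplot(), y specified locally in geom_point()
pb <- ggplot(data = mtcars, aes(x = disp)) + geom_point(aes(y = mpg))

# Compare the positions the point layer will draw
same_x <- all(layer_data(pa)$x == layer_data(pb)$x)
same_y <- all(layer_data(pa)$y == layer_data(pb)$y)
c(same_x, same_y)
```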
Non-essential aesthetics can be included, for example color.
Within aes( ), values for color are extracted from the variable gear in mtcars.
p1 <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=gear)) + geom_point()
With factor( ) within aes( ), values for color are extracted from the categorical variable gear in mtcars.
p2 <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=factor(gear))) + geom_point()
grid.arrange(p1, p2, ncol=2)
When an aesthetic is specified within aes( ) of the geometric function, the default is overwritten.
- the categorical color in geom_point( ) overwrites the numerical color in ggplot( )
ggplot(data=mtcars, aes(y=mpg,x=disp,color=gear)) + geom_point(aes(color=factor(gear)))
Outside aes( ) no relation with data exists.
- the color specified in ggplot( ) is overwritten
- aes( ) is essential for linking aesthetics with data
ggplot(data=mtcars, aes(y=mpg,x=disp,color=gear)) + geom_point(color='#FF6600')
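Transparency (alpha) can be specified outside or inside aes( ).
- outside aes( ) the transparency is set at .3 (30%)
- inside aes( ) a link to a variable is assumed, not a percentage as such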
p1 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) + geom_point(aes(color=factor(gear)),alpha=.3)
p2 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) + geom_point(aes(color=factor(gear),alpha=.3))
p3 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) +
geom_point(aes(color=factor(gear),alpha=qsec/max(qsec)))
grid.arrange(p1, p2, p3, ncol=3)
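The difference between setting and mapping transparency can be inspected in the data that ggplot2 prepares for the layer. This is a sketch using layer_data( ); the object names are illustrative.

```r
library(ggplot2)

# alpha outside aes(): a fixed setting, every point gets transparency .3
p_fixed <- ggplot(data = mtcars, aes(y = mpg, x = disp)) +
  geom_point(aes(color = factor(gear)), alpha = .3)

# alpha inside aes(): mapped to a variable, transparency varies per observation
p_mapped <- ggplot(data = mtcars, aes(y = mpg, x = disp)) +
  geom_point(aes(color = factor(gear), alpha = qsec / max(qsec)))

unique(layer_data(p_fixed)$alpha)           # a single value
length(unique(layer_data(p_mapped)$alpha))  # many values, rescaled by the alpha scale
```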
- geom_point( ) creates the dots
- geom_line( ) connects them over the x-axis
- geom_path( ) simply connects observations as presented in the data (re-ordering has an effect)
p1 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) + geom_line()
p2 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) + geom_point() + geom_line()
p3 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) + geom_point() + geom_path()
p4 <- ggplot(data=mtcars %>% arrange(drat), aes(y=mpg,x=disp)) + geom_point() + geom_path()
grid.arrange(p1, p2, p3, p4, ncol=4)
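The ordering difference between geom_line( ) and geom_path( ) is easier to see with a tiny, deliberately unordered data frame. The data frame df below is made up for illustration only.

```r
library(ggplot2)

# Illustrative data, deliberately not ordered by x
df <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

p_line <- ggplot(df, aes(x = x, y = y)) + geom_line()
p_path <- ggplot(df, aes(x = x, y = y)) + geom_path()

layer_data(p_line)$x  # sorted along the x-axis
layer_data(p_path)$x  # order as in the data
```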
- aes( ) in ggplot( ) controls part of the result of geometric functions, for example the grouping used by geom_line( )
- specify aesthetics inside ggplot( ) for default behavior, or inside the geom_*( ) for local behavior
- in the third plot, geom_point( ) uses the locally specified color aesthetic
- geom_line( ) uses the default black (size made smaller than default), as no color is specified in ggplot(aes( ))
p1 <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=factor(gear))) + geom_point() + geom_path()
p2 <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=factor(gear))) + geom_point() + geom_line()
p3 <- ggplot(data=mtcars, aes(y=mpg,x=disp)) +
geom_point(aes(color=factor(gear))) + geom_line(size=.3)
grid.arrange(p1, p2, p3, ncol=3)
- geom_smooth( ) offers averaging and standard errors
- see ?geom_smooth for more details
px <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=factor(gear)))
p1 <- px + geom_point() + geom_smooth()
p2 <- px + geom_point() + geom_smooth(method='lm')
p3 <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=factor(gear),group=1)) + geom_point() +
geom_smooth(method='lm',color="#FF6600")
grid.arrange(p1, p2, p3, ncol=3)
ggplot( ): function to create a ggplot object, ready for visualization.
aes( ): function to link variables (data) to an aesthetic (dimension).
- used within ggplot( ) or geom_*( )
- the group aesthetic within aes( ) serves to identify groups of observations
- set group to 1 to combine all observations into one group
- settings outside aes( ) are independent of the data
geom_*( ): function to use the internal ggplot representation and turn it into a visualization.
- uses the data and aesthetics from ggplot( )
- can include aes( ) to overwrite the default aes( ) from ggplot( )
While a ggplot object, layered with geom_*( )
functions allows you to visualize data and statistics, there is so much more to ggplot.
Geometric functions add layers on top of the ggplot object, as do statistical transformation functions.
- every geom_*( ) has a default stat argument, and every stat_*( ) has a default geom argument
- stat_smooth(geom="smooth") and geom_smooth(stat="smooth") are equivalent
- + geom_point(stat='summary',fun='mean',shape=13,size=16) and + stat_summary(geom='point',fun='mean',shape=13,size=16) are equivalent as well
- in some cases a stat has a matching geom_*( ), in some not, like ecdf
p1 <- ggplot(data=mtcars, aes(y=mpg,x=factor(carb),color=factor(gear))) + geom_point() +
stat_summary(geom='point',fun='mean',shape=13,size=16)
p2 <- ggplot(mtcars, aes(mpg)) + stat_ecdf(aes(color=factor(cyl)),geom = "step")
grid.arrange(p1, p2, ncol=2)
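The equivalence of the geom and stat spelling can be checked by comparing the data each layer computes. The sketch assumes ggplot2 3.3.0 or later, where the summary function is passed as fun; object names are illustrative.

```r
library(ggplot2)

base <- ggplot(data = mtcars, aes(y = mpg, x = factor(carb)))

# The same summary layer, written from the geom side and from the stat side
via_geom <- base + geom_point(stat = "summary", fun = mean)
via_stat <- base + stat_summary(geom = "point", fun = mean)

# Both layers hold the mean mpg per carb level
means_geom <- layer_data(via_geom)$y
means_stat <- layer_data(via_stat)$y
cbind(means_geom, means_stat)
```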
- aes( ) (see before)
p0 <- ggplot(data=mtcars,aes(x=disp,fill=factor(gear)))
p1 <- p0 + geom_histogram(binwidth=200, position = position_dodge(width=50),alpha=.8)
p2 <- p0 + geom_histogram(binwidth=200, position = position_stack(),alpha=.8,col='black') +
theme(legend.position='none')
p3 <- p0 + geom_histogram(aes(y=..density..),binwidth=200,
position = position_stack(),alpha=.8,col='black') + theme(legend.position='none')
grid.arrange(p1, p2, p3, ncol=3)
Note the use of ..density..: for geom_*( ) and stat_*( ) particular variables are pre-defined.
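In ggplot2 3.3.0 and later, the same computed variable can be referred to with after_stat(density). The sketch below, without groups to keep it simple, checks that on the density scale the bar areas add up to 1; object names are illustrative.

```r
library(ggplot2)

# after_stat(density) refers to the density variable computed by the bin stat
p_dens <- ggplot(data = mtcars, aes(x = disp)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50)

# On the density scale the bar areas (density times bin width) add up to 1
bars <- layer_data(p_dens)
total_area <- with(bars, sum(density * (xmax - xmin)))
total_area
```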
Each aesthetic has a scale, which serves as a legend that helps interpretation of the visualized values.
- guides( ) function for additional control
p1 <- ggplot(data=mtcars,aes(x=disp,fill=factor(gear))) +
geom_histogram(aes(y=..density..),binwidth=200, position = position_stack(),alpha=.8,col='black') +
theme(legend.position='none') + scale_fill_brewer()
p2 <- ggplot(data=mtcars,aes(y=mpg,x=cyl,color=drat)) + geom_point() +
scale_color_distiller(palette='Oranges')
grid.arrange(p1, p2, ncol=2)
Visualizations can be split for subgroups, facilitating conditional comparisons.
- facet_grid( ) uses a grid
- facet_wrap( ) keeps filling space
ggplot(data=mtcars,aes(y=mpg,x=drat,color=carb)) + geom_jitter()
- facet_grid( ) requires a row and/or column specification
- rows and columns are separated by ~, use . if none
- + allows for multiple row and/or column specification
ggplot(data=mtcars,aes(y=mpg,x=drat,color=carb)) + geom_jitter() + facet_grid(vs~gear)
ggplot(data=mtcars,aes(y=mpg,x=drat,color=carb)) + geom_jitter() + facet_grid(.~vs+cyl)
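The examples above use facet_grid( ) only. As a sketch of the alternative, facet_wrap( ) takes a single specification and wraps the panels into the requested number of columns. Points are used instead of jitter to keep the result deterministic; object names are illustrative.

```r
library(ggplot2)

p_base <- ggplot(data = mtcars, aes(y = mpg, x = drat)) + geom_point()

# facet_wrap() wraps one panel per level of gear into a layout with 2 columns
p_wrap <- p_base + facet_wrap(~ gear, ncol = 2)

# facet_grid() places the same panels in a single row
p_grid <- p_base + facet_grid(. ~ gear)

# Number of panels in each layout
n_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_grid <- length(unique(layer_data(p_grid)$PANEL))
c(n_wrap, n_grid)
```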
Control not related to data is specified with a theme.
- many elements can be specified (see ?theme)
- elements are set with functions like element_text( ), element_line( ), element_rect( ), margin( ), …
- complete themes exist, like theme_minimal( ), and new ones can be created
- labs( ) is a simpler way to specify titles and labels
Examples will make it more clear.
(px <- ggplot(mtcars, aes(wt, mpg)) + geom_point() +
labs(title = "Fuel economy declines as weight increases",y='miles per gallon'))
The title is changed with element_text( ), the background with element_rect( ).
px + theme(plot.title = element_text(size=rel(1.5)),
plot.background=element_rect(fill="#FF6600"))
p1 <- px + geom_point(color='#FF6600',size=3) +
theme(panel.background = element_rect(fill = "#003399", colour = "#FF6600")) +
theme(panel.border = element_rect(linetype = "dashed", fill = NA),
panel.grid.major = element_line(colour = "#FF6600"))
p2 <- p1 + theme(
panel.grid.major.y = element_line(colour = "black"),
panel.grid.minor.y = element_blank()
)
grid.arrange(p1,p2,ncol=2)
For the axes, element_text( ), element_line( ) and unit( ) are used.
p1 <- px + theme(axis.line = element_line(size = 3, colour = "#FF6600")) +
theme(axis.text = element_text(colour = "#003399", size=12)) +
theme(axis.ticks.y = element_line(size = 5)) +
theme(axis.title.y = element_text(size = rel(1.5)))
p2 <- p1 + theme(
axis.ticks.length.y = unit(.25, "cm"),
axis.ticks.length.x = unit(-.25, "cm"),
axis.text.x = element_text(margin = margin(t = .3, unit = "cm"))
)
grid.arrange(p1,p2,ncol=2)
px <- ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(colour = factor(cyl), shape = factor(vs))) +
labs(
x = "Weight (1000 lbs)",
y = "Fuel economy (mpg)",
colour = "Cylinders",
shape = "Engine"
)
p1 <- px + theme(legend.position='none')
p2 <- px + theme(legend.justification = "right",legend.position = "bottom")
p3 <- px + theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
)
grid.arrange(p1,p2,p3,ncol=3)
px + theme(legend.key = element_rect(fill = "#bbbbbb", colour = "#003399")) +
theme(legend.text = element_text(size = 14, colour = "#003399")) +
theme(legend.title = element_text(face = "bold"))
px <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + facet_wrap(~ cyl)
p1 <- px + theme(strip.background = element_rect(colour = "black", fill = "white"))
p2 <- px + theme(strip.text.x = element_text(colour = "white", face = "bold"))
p3 <- px + theme(panel.spacing = unit(1, "lines"))
grid.arrange(p1,p2,p3,ncol=3)
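Typically the default cartesian coordinate system is used, coord_cartesian( ).
- limits set within the coord_*( ) function zoom in without changing the data
- limits set with xlim( ) and ylim( ) remove the observations outside of the limits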
px <- ggplot(data=mtcars,aes(y=mpg,x=gear,color=factor(vs),group=vs))
p1 <- px + geom_smooth(method='lm')
p2 <- px + geom_smooth(method='lm') + coord_cartesian(xlim=c(2.5,4.5))
p3 <- px + geom_smooth(method='lm') + xlim(2.5,4.5)
grid.arrange(p1,p2,p3,ncol=3)
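The difference between zooming and setting limits can be made explicit by counting the observations that keep a usable position. The sketch uses points rather than a smoother to keep it simple; object names are illustrative.

```r
library(ggplot2)

px_pts <- ggplot(data = mtcars, aes(y = mpg, x = gear)) + geom_point()

p_zoom <- px_pts + coord_cartesian(xlim = c(2.5, 4.5))  # zoom only
p_lim  <- px_pts + xlim(2.5, 4.5)                       # cars with gear 5 fall outside

# Count the observations that still have a usable x position in each plot
n_zoom <- sum(!is.na(layer_data(p_zoom)$x))
n_lim  <- sum(!is.na(layer_data(p_lim)$x))
c(zoom = n_zoom, lim = n_lim)
```

Because statistics such as geom_smooth( ) are computed on the data that remain, the two approaches can give different fitted lines.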
- coord_flip( ) switches x and y-axis
- coord_fixed( ) sets the ratio for x and y values
- others are coord_polar( ), coord_trans( ), and various map related functions
p1 <- px + geom_smooth(method='lm') + coord_flip()
p2 <- px + geom_smooth(method='lm') + coord_fixed(.1)
p3 <- px + geom_smooth(method='lm') + coord_polar()
grid.arrange(p1,p2,p3,ncol=3)
The remainder walks through the cheat sheet with various examples. https://rstudio.com/resources/cheatsheets/
Primitives are the basic building blocks.
- geom_point( ) is used very often
- geom_ribbon( ) can be interesting for showing intervals
It is possible to draw a blank plot, with the mtcars data to determine the limits. On that plot, using a different data frame for the coordinates, a polygon can be drawn, and finally, using another data frame for other coordinates, tiles can be included.
p1 <- ggplot(mtcars,aes(y=mpg,x=cyl)) + geom_blank()
tmp <- tribble(~x,~y,
6,17,
6,10,
5,10,
8,21)
tmp2 <- tribble(~x,~y,~w,
5,20,2,
6,30,5,
7,15,2)
p2 <- p1 + geom_polygon(data=tmp,aes(x=x,y=y),alpha=.3,color="#FF6600")
p3 <- p2 + geom_tile(data=tmp2,aes(x=x,y=y,width=2),alpha=.3,color="#003399")
p4 <- p3 + geom_segment(data=data.frame(x=5,y=15,xend=7,yend=20),aes(x=x,y=y,xend=xend,yend=yend),size=5,color="#FF6600")
grid.arrange(p1,p2,p3,p4,ncol=4)
- use the mpg dataset, part of the ggplot2 package, and have a look at it
- plot cty per hwy
- define coordinates in a data frame, with either data.frame( ) or tribble( )
tmp <- data.frame(c1=c(10,15),c2=c(15,20),c3=c(25,30),c4=c(35,40))
tmp <- tribble(~xi,~yi,~xa,~ya,
10,15,25,35,
15,20,30,40
)
- plot them on top of the scatterplot (the argument inherit.aes=FALSE will avoid inheriting the aesthetics from the default data)
- add color to each separately, use '#FF6600' and '#003399'
Various visualizations address one particular variable, mostly continuous but possibly also discrete.
p1 <- ggplot(data=mtcars,aes(mpg)) + geom_freqpoly(binwidth=2.5)
p2 <- ggplot(data=mtcars,aes(mpg)) + geom_density() +
geom_density(kernel='triangular',color="#003399") + geom_density(kernel='optcosine',color="#FF6600")
p3 <- ggplot(data=mtcars,aes(mpg)) + geom_histogram(binwidth=2.5)
grid.arrange(p1,p2,p3,ncol=3)
A continuous variable is typically ‘binned’ into groups within which frequencies are obtained.
When using geom_area( ), such binning must be explicitly included as stat argument.
Also note that geom_qq( ) requires a sample argument instead of x, as positions are determined by their size.
A final plot shows a typical discrete one-variable plot, the bar plot with geom_bar( ), which is similar to the histogram but which shows a count per distinct value instead of a count per bin.
p1 <- ggplot(data=mtcars,aes(mpg)) + geom_area(stat='bin',binwidth=2.5)
p2 <- ggplot(data=mtcars,aes(sample=mpg)) + geom_qq()
p3 <- ggplot(data=mtcars,aes(factor(cyl))) + geom_bar(fill='#FF6600',color='#003399')
grid.arrange(p1,p2,p3,ncol=3)
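As a quick check of what geom_bar( ) shows, the sketch below compares the bar heights computed by its default count statistic with the frequencies from base R. The object name p_bar is illustrative.

```r
library(ggplot2)

p_bar <- ggplot(data = mtcars, aes(x = factor(cyl))) + geom_bar()

# Bar heights as computed by the default count statistic
bar_counts <- layer_data(p_bar)$count
bar_counts

# The same frequencies from base R
table(mtcars$cyl)
```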
- use the mpg dataset
- visualize the displ variable
- visualize the class variable
- visualize the drv variable
Various visualizations address the relation between two variables, whether discrete and/or continuous.
Especially for categorical data, observations could obscure one another. Deal with this using the position argument or with geom_jitter( ). Avoid combining geom_point( ) and geom_jitter( ), as the points would be drawn twice.
p1 <- ggplot(data=mtcars,aes(cyl,gear)) + geom_point()
p2 <- ggplot(data=mtcars,aes(cyl,gear)) + geom_jitter(width=.2,height=.2)
p3 <- ggplot(data=mtcars,aes(cyl,gear)) + geom_point(color='#FF6600') + geom_jitter(width=.2,height=.2)
grid.arrange(p1,p2,p3,ncol=3)
The smooth function has been shown above. Here the confidence band is replaced by quantile lines that enclose the middle 50% (quantiles .25 and .75), and a rug is included at the axes to capture the one-dimensional distributions.
Instead of bullet indicators, the row names or any other set of labels can be shown using the label argument (within aes( ) when related to data). A bit of jitter is added to avoid overlap.
p1 <- ggplot(data=mtcars, aes(y=mpg,x=disp,color=factor(gear))) + geom_point() + geom_smooth(method='lm',se=FALSE) + geom_rug() + geom_quantile(quantiles=c(.25,.75),linetype=2)
p2 <- ggplot(mtcars, aes(wt, mpg)) + geom_text(aes(label=(rownames(mtcars))),size=2,position=position_jitter(width = .2,height=3, seed=256))
grid.arrange(p1,p2,ncol=2)
As with one-dimensional discrete data, bars can be obtained with geom_col( ), or with geom_bar(stat='identity'). Use of ‘identity’ causes the height to depend on the numbers in the data. Be careful when several rows share a position: stacking adds up their numbers.
(tmp <- mtcars %>% group_by(gear,vs) %>% count())
| # A tibble: 6 x 3
| # Groups: gear, vs [6]
| gear vs n
| <dbl> <dbl> <int>
| 1 3 0 12
| 2 3 1 3
| 3 4 0 2
| 4 4 1 10
| 5 5 0 4
| 6 5 1 1
p1 <- ggplot(tmp, aes(fill=factor(vs), y=n, x=gear)) + geom_bar(position="stack", stat="identity")
p2 <- ggplot(tmp, aes(fill=factor(vs), y=n, x=gear)) + geom_col(position="dodge")
p3 <- ggplot(mtcars, aes(fill=factor(vs), y=mpg, x=gear)) + geom_col(position="dodge") + labs(y='sums all mpg values')
grid.arrange(p1,p2,p3,ncol=3)
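When several rows share a bar position, it is often clearer to aggregate first, so that each bar shows exactly one number. The sketch below uses dplyr (part of the tidyverse) to compute the mean mpg per combination; the names tmp_mean and mean_mpg are illustrative.

```r
library(ggplot2)
library(dplyr)

# One row per gear-by-vs combination, so each bar shows exactly one number
tmp_mean <- mtcars %>%
  group_by(gear, vs) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")

ggplot(tmp_mean, aes(fill = factor(vs), y = mean_mpg, x = gear)) +
  geom_col(position = "dodge") +
  labs(y = "mean mpg", fill = "vs")
```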
An interesting visualization for continuous data, possibly for different groups, is the boxplot.
p1 <- ggplot(data=mtcars,aes(y=mpg,x=factor(cyl))) + geom_boxplot()
p2 <- ggplot(data=mtcars,aes(y=mpg,x=factor(cyl))) +
geom_boxplot(width=.25,alpha=.2,aes(fill=factor(cyl))) + geom_jitter(width=.05)
grid.arrange(p1,p2,ncol=2)
- use the mpg dataset
- plot hwy on cty, and color by cyl
- make sure that cyl is categorical
- add geom_smooth( ), use the lm method
- set the shape by class, notice the restriction on the number of shapes
- make sure the symbols do not differ by color (all black), only shape, but keep the regression lines
- show hwy for each cyl
- color the observations dependent on class
- show the hwy values in each class group
- use a coloring of the bars to signal the relative contribution of all cyl categories
Specialized functions facilitate visualization of errors.
When standard errors are obtained, along with fitted values, they are easily shown. Other intervals can be visualized this way too.
(tmp <- tribble(~set,~fit,~se,1,3,.2,2,2,.3,3,2,.4))
| # A tibble: 3 x 3
| set fit se
| <dbl> <dbl> <dbl>
| 1 1 3 0.2
| 2 2 2 0.3
| 3 3 2 0.4
px <- ggplot(data=tmp,aes(y=fit,x=set))
p1 <- px + geom_errorbar(aes(ymax=fit+2*se,ymin=fit-2*se))
p2 <- px + geom_errorbarh(aes(xmax=fit+2*se,xmin=fit-2*se))
p3 <- px + geom_crossbar(aes(ymax=fit+2*se,ymin=fit-2*se))
p4 <- px + geom_pointrange(aes(ymax=fit+2*se,ymin=fit-2*se))
grid.arrange(p1,p2,p3,p4,ncol=4)
Frequencies and densities can be obtained for two variables.
With geom_bin2d( ) the frequency of combinations is obtained, in this case for the factors cyl and gear. Because bins are used, continuous variables can be used as well. geom_density2d( ) is also used with continuous variables, with contours depending on, for example, the gear levels.
p1 <- ggplot(data=mtcars,aes(y=factor(cyl),x=factor(gear))) + geom_bin2d()
p2 <- ggplot(data=mtcars,aes(y=mpg,x=carb)) + geom_density2d(aes(colour = factor(gear)))
grid.arrange(p1,p2,ncol=2)
While typically a third variable is included using aesthetics, it can be done with the z dimension as well.
When a third dimension is linked to a combination of conditions, they are easily shown. For the example meaningless data are simulated, 100 observations of which the first 6 are shown.
tmp <- expand.grid(set1=1:10,set2=1:10); set.seed(123); tmp$score <- runif(100,0,1)
head(tmp)
| set1 set2 score
| 1 1 1 0.2875775
| 2 2 1 0.7883051
| 3 3 1 0.4089769
| 4 4 1 0.8830174
| 5 5 1 0.9404673
| 6 6 1 0.0455565
A heatmap is the most obvious use, to show for example correlations between many variables with colors instead of values. A small example is used instead. Notice that the argument for tile is fill, while for contour it is simply z.
p1 <- ggplot(data=tmp,aes(y=set2,x=set1)) + geom_tile(aes(fill=score))
p2 <- ggplot(data=tmp,aes(y=set2,x=set1)) + geom_contour(aes(z=score))
grid.arrange(p1,p2,ncol=2)
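The correlation heatmap mentioned above could be sketched as follows. The correlations among the mtcars variables are reshaped to one row per pair with base R; the column names (var1, var2, r) and the choice of palette are illustrative.

```r
library(ggplot2)

# Correlations between all mtcars variables, reshaped to one row per pair
cors <- as.data.frame(as.table(cor(mtcars)))
names(cors) <- c("var1", "var2", "r")

# One tile per pair of variables, filled by the size of the correlation
ggplot(cors, aes(x = var1, y = var2)) +
  geom_tile(aes(fill = r)) +
  scale_fill_distiller(palette = "RdBu", limits = c(-1, 1))
```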
ggplot2 is an excellent package, read more about it
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (Second edition). Springer.
or at
To learn, just use it and keep on trying when you are stuck. Check online, there is lots of help out there.