ggplot2 package
Educational Statistics and Research Methods (ESRM) Program*
University of Arkansas
2025-02-05
… the ggplot2 package.
Use aes() to change a graph’s aesthetics (e.g. colors, shapes).
Use ggplot2 to produce several kinds of visualizations (for continuous and/or discrete data).
ggplot2 may be the most powerful data visualization tool in data science.
Many Python users envy the R community for having ggplot2. See one comment on reddit r/datascience.
Source code: here
We will use the data from the Gapminder Foundation, which gives access to global data as well as many tools to help explore it.
In these lessons we’re going to explore some of these data ourselves.
There are two files with data relating to socio-economic statistics: world data for 2010 only and the same data for 1960 to 2010 (see the setup page for instructions on how to download the data).
Column | Description |
---|---|
country | country name |
world_region | 6 world regions |
year | year that each datapoint refers to |
children_per_woman | total fertility rate |
life_expectancy | average number of years a newborn child would live if current mortality patterns were to stay the same |
income_per_person | gross domestic product per person adjusted for differences in purchasing power |
is_oecd | Whether a country belongs to the “OECD” (TRUE) or not (FALSE) |
income_groups | categorical classification of income groups |
population | total number of a country’s population |
main_religion | religion of the majority of population in 2008 |
child_mortality | death of children under 5 years old per 1000 births |
life_expectancy_female | life expectancy at birth, females |
life_expectancy_male | life expectancy at birth, males |
As usual when starting an analysis on a new script, let’s start by loading the packages and reading the data:
library(tidyverse)
# Read the data, specifying how missing values are encoded
gapminder2010 <- read_csv(here::here(root_path, "gapminder2010_socioeconomic.csv"), na = "")
glimpse(gapminder2010)
Rows: 193
Columns: 13
$ country <chr> "Afghanistan", "Angola", "Albania", "Andorra", …
$ world_region <chr> "south_asia", "sub_saharan_africa", "europe_cen…
$ year <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
$ children_per_woman <dbl> 5.82, 6.16, 1.65, NA, 1.87, 2.37, 1.55, 2.13, 1…
$ life_expectancy <dbl> 59.85, 59.94, 77.64, 82.29, 72.88, 75.82, 74.05…
$ income_per_person <dbl> 1672, 6360, 9928, 38982, 55363, 18912, 6703, 20…
$ is_oecd <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ income_groups <chr> "low_income", "lower_middle_income", "upper_mid…
$ population <dbl> 29185511, 23356247, 2948029, NA, 8549998, 40895…
$ main_religion <chr> "muslim", "christian", "muslim", "christian", "…
$ child_mortality <dbl> 87.95, 120.49, 13.30, 4.18, 8.48, 14.44, 18.52,…
$ life_expectancy_female <chr> "62.459", "58.033", "79.302", "-", "77.77", "78…
$ life_expectancy_male <dbl> 59.683, 52.848, 74.145, -999.000, 75.624, 71.83…
ggplot2 package
The ggplot2 package is included in tidyverse.
ggplot2 graph
To build a ggplot2 graph you need 3 basic pieces of information:
1. The data to plot: ggplot(data = ...)
2. The variables (columns of the data.frame) that will be mapped to different aesthetics of the graph (e.g. axis, colors, shapes, etc.): mapping = aes(x = ..., y = ..., color = ...)
3. The geometry to draw on the graph, e.g. geom_point() or geom_path()
This translates into the following generic basic syntax:
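A minimal runnable sketch of that pattern follows. The template is given as a comment because its angle-bracket placeholders are not valid R; the data frame toy and its columns x_var and y_var are hypothetical stand-ins, not part of the Gapminder data.

```r
library(ggplot2)

# Template (placeholders in angle brackets must be replaced before running):
#   ggplot(data = <data.frame>,
#          mapping = aes(x = <column of the data.frame>,
#                        y = <column of the data.frame>)) +
#     geom_<type of geometry>()

# Hypothetical toy data, used only to make the template concrete
toy <- data.frame(x_var = c(1, 2, 3, 4),
                  y_var = c(2.1, 3.9, 6.2, 8.0))

# The same three pieces: data, aesthetic mapping, geometry
p_template <- ggplot(data = toy,
                     mapping = aes(x = x_var, y = y_var)) +
  geom_point()

p_template
```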
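As a first graph, we will plot the relationship between children_per_woman and life_expectancy, going step by step to see how ggplot2 works.
Start by giving the data to ggplot:
ggplot(data = gapminder2010)
Replace <data.frame> with the object name of the target data frame.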
That “worked” (as in, we didn’t get an error). But because we didn’t give ggplot() any variables/columns to be mapped to aesthetic components of the graph, we just got an empty square or a blank canvas.
Next, map columns to the axes with the aes() function:
ggplot(data = gapminder2010,
       mapping = aes(x = children_per_woman,
                     y = life_expectancy))
Replace <column of the data.frame> with the unquoted column name.
That’s better, now we have an x-axis and a y-axis, which correspond to family size (column name: children_per_woman) and life expectancy (column name: life_expectancy).
Notice how ggplot() defines the axes based on the range of the data given. But it’s still not a very interesting graph, because we haven’t told ggplot() what we want to draw on it.
Finally, add (+) geometries to our graph. For example, a scatterplot uses the “point” geometry to represent data:
ggplot(data = gapminder2010,
       mapping = aes(x = children_per_woman,
                     y = life_expectancy)) +
  geom_point()
+ adds a new layer on top of the existing layers.
Replace geom_<type of geometry>() with geom_point().
Important
Notice how geom_point() warns you that it had to remove some missing values (if the data is missing for at least one of the variables, then it cannot plot the points).
Aim: It would be useful to explore the pattern of missing data in these two variables. The naniar package provides a ggplot geometry that allows us to do this, by replacing NA values with values 10% lower than the minimum in the variable.
Try to modify the previous graph, using geom_miss_point() from this package. (Hint: don’t forget to load the package first using library(naniar).)
Questions: What can you conclude from this exploration? Are the data missing at random?
To open an “Install Packages” window below, press and then search “Install” in the dropdown menu.
library(naniar) # load the naniar package; this should be placed on top of the script
ggplot(data = gapminder2010,
mapping = aes(x = children_per_woman, y = life_expectancy)) +
geom_miss_point()
… children_per_woman than life_expectancy.
Here is a list of some frequently used geometries:
geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines). Multiple groups will be stacked on top of each other.
geom_bar(stat = "identity") makes a bar chart. We need stat = "identity" because the default stat automatically counts values (so is essentially a 1d geom, see Section 5.4). The identity stat leaves the data unchanged. Multiple bars in the same location will be stacked on top of one another.
geom_line() makes a line plot. The group aesthetic determines which observations are connected; see Chapter 4 for more detail. geom_line() connects points from left to right; geom_path() is similar but connects points in the order they appear in the data. Both geom_line() and geom_path() also understand the aesthetic linetype, which maps a categorical variable to solid, dotted and dashed lines.
geom_point() produces a scatterplot. geom_point() also understands the shape aesthetic.
geom_polygon() draws polygons, which are filled paths. Each vertex of the polygon requires a separate row in the data. It is often useful to merge a data frame of polygon coordinates with the data just prior to plotting. Chapter 6 illustrates this concept in more detail for map data.
geom_rect(), geom_tile() and geom_raster() draw rectangles. geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rect and its size, x, y, width and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.
geom_text() adds text to a plot. It requires a label aesthetic that provides the text to display, and has a number of parameters (angle, family, fontface, hjust and vjust) that control the appearance of the text.
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 15)) # Shrink plot title
p + geom_point() + ggtitle("geom_point")
p + geom_text() + ggtitle("geom_text")
p + geom_bar(stat = "identity") + ggtitle("geom_bar")
p + geom_tile() + ggtitle("geom_tile")
p + geom_line() + ggtitle("geom_line")
p + geom_area() + ggtitle("geom_area")
p + geom_path() + ggtitle("geom_path")
p + geom_polygon() + ggtitle("geom_polygon")
We can change more details about how geometries look in several ways, for example their transparency, color, size, shape, etc.
In R, we do this by adding more arguments to the geom_point() function, like geom_point(color = "red").
To know which aesthetic components can be changed in a particular geometry, look at its documentation (e.g. ?geom_point) and look under the “Aesthetics” section of the help page.
For example, the documentation for ?geom_point lists these aesthetics: x, y, alpha, colour, fill, group, shape, size and stroke.
For example, we can change the transparency of the points with the alpha argument in geom_point() (alpha varies between 0 and 1, with 0 being transparent and 1 being opaque):
ggplot(data = gapminder2010,
       mapping = aes(x = children_per_woman,
                     y = life_expectancy)) +
  geom_point(alpha = 0.5)
Arguments given to geom_point() adjust the characteristics of the points.
Adding transparency to points is useful when data is very packed, as you can then see which areas of the graph are more densely occupied with points.
Aim: Try changing the size, shape and color of the points (hint: web search “ggplot2 point shapes” to see how to make a triangle).
You can list R’s color names using the colors() function. [Figure: index of point shapes.]
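One possible answer to this exercise is sketched here. The data frame toy is a small stand-in built from six fertility and life expectancy values quoted in this lesson, so that the code runs without the CSV file; the particular size, shape and color are arbitrary choices.

```r
library(ggplot2)

# Small stand-in for gapminder2010 (six rows only)
toy <- data.frame(
  children_per_woman = c(2.37, 2.13, 1.87, 2.72, 3.20, 1.81),
  life_expectancy    = c(75.82, 76.61, 73.42, 73.59, 70.78, 74.38)
)

p_points <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point(size = 3,             # larger than the default point size
             shape = 17,           # shape 17 is a filled triangle
             color = "steelblue")  # any name returned by colors()

p_points
```

Because size, shape and color are given outside aes(), every point gets the same fixed value.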
In the above exercise we changed the color of the points by defining it ourselves. However, it would be better if we colored the points based on a variable of interest.
For example, to explore our question of how different world regions really are, we want to color the countries in our graph accordingly.
We can do this by passing this information to the color aesthetic inside the aes() function:
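A minimal sketch of this mapping, assuming a four-row stand-in data frame called toy (a hypothetical name) instead of the full gapminder2010 data:

```r
library(ggplot2)

# Four-row stand-in for gapminder2010, covering three world regions
toy <- data.frame(
  children_per_woman = c(5.82, 6.16, 2.37, 2.13),
  life_expectancy    = c(59.85, 59.94, 75.82, 76.61),
  world_region       = c("south_asia", "sub_saharan_africa",
                         "america", "america")
)

# color is inside aes(), so it is looked up in the world_region column
p_region <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point(aes(color = world_region))

p_region
```

ggplot2 picks one color per region and adds a legend automatically.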
For example, looking at the red points, where world_region == "america", geom_point() draws each point in red, using the value of children_per_woman for the x-axis and the value of life_expectancy for the y-axis:
children_per_woman | life_expectancy | world_region | color |
---|---|---|---|
2.37 | 75.82 | america | red |
2.13 | 76.61 | america | red |
1.87 | 73.42 | america | red |
2.72 | 73.59 | america | red |
3.20 | 70.78 | america | red |
1.81 | 74.38 | america | red |
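Aesthetics: inside or outside aes()?
The previous examples illustrate an important distinction between aesthetics defined inside or outside of aes():
Inside aes(), an aesthetic is mapped to a column of the data (e.g. color = world_region), so it can vary from point to point.
Outside aes(), an aesthetic is set to a single fixed value (e.g. color = "red"), which applies to every point.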
ggplot(gapminder2010) +
  geom_point(aes(x = children_per_woman,
                 y = life_expectancy,
                 color = world_region))
ggplot(gapminder2010) +
  geom_point(aes(x = children_per_woman,
                 y = life_expectancy),
             color = "red")
Question 1: What’s gone wrong with this code? Why are the points not blue?
The argument colour = "blue" is included within the mapping argument, and as such, it is treated as an aesthetic, which is a mapping between a variable and a value. In the expression colour = "blue", "blue" is interpreted as a categorical variable which only takes a single value, "blue". If this is confusing, consider how colour = 1:234 and colour = 1 are interpreted by aes().
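The pitfall described in this answer can be reproduced with a short sketch. It assumes the exercise refers to the mpg data that ships with ggplot2 (its 234 rows match the 1:234 in the answer); the columns displ and hwy are chosen here for illustration.

```r
library(ggplot2)

# "blue" inside aes(): treated as a one-level categorical variable,
# so ggplot2 assigns it the first color of its default palette
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = "blue"))

# "blue" outside aes(): a fixed value applied to every point
p_fixed <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), colour = "blue")

p_mapped
p_fixed
```

The first plot also shows a legend with a single entry labelled "blue", which is a quick visual cue that the value was mapped rather than set.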
Question 2: Make a boxplot that shows the distribution of children_per_woman (y-axis) for each world_region (x-axis). (Hint: use geom_boxplot().)
Bonus: Color the inside of the boxplots by income_groups.
ggplot2 will automatically split the data into groups and make a boxplot for each.
ggplot(data = gapminder2010,
       aes(x = world_region,
           y = children_per_woman,
           fill = income_groups)) +
  geom_boxplot()
Some groups have too few observations (possibly only 1) and so we get odd boxplots with only a line representing the median, because there isn’t enough variation in the data to have distinct quartiles.
Also, the labels on the x-axis are all overlapping each other. We will see how to solve this later.
Often, we may want to overlay several geometries on top of each other. For example, add a violin plot together with a boxplot so that we get both representations of the data in a single graph.
The shape of a violin represents the density estimate of the variable: the more data points in a specific range, the wider the violin is for that range.
To overlay geometries, we add (+) another geometry to the graph:
Note
The order in which you add the geometries defines the order they are “drawn” on the graph. For example, try swapping their order and see what happens.
ggplot(gapminder2010, aes(x = world_region, y = children_per_woman)) +
  geom_boxplot(width = 0.2) +
  geom_violin(scale = "width") +
  coord_flip()
coord_flip(), meaning “flip Cartesian coordinates”, swaps the x and y axes.
Notice how we’ve shortened our code by omitting the names of the options (data = and mapping =) inside ggplot(). Because the data is always the first argument given to ggplot() and the mapping is always identified by the function aes(), this is often written in the more compact form we just used.
Let’s say that, in the graph above, we wanted to color the violins by world region, but keep the boxplots without color.
As we’ve learned, because we want to color our geometries based on data, this goes inside the aes() part of the graph:
What if we want only the violin plot to be colored, not the boxplot?
OK, this is not what we wanted. Both geometries (boxplots and violins) got colored.
It turns out that we can control aesthetics individually in each geometry, by putting the aes() inside the geometry function itself. Like this:
ggplot(gapminder2010, aes(x = world_region,
                          y = children_per_woman)) +
  geom_violin(aes(fill = world_region), scale = "width") +
  geom_boxplot(width = 0.2)
1. Remove aes(fill = world_region) from ggplot(), where it applies to all geometries (violin and boxplot).
2. Add aes() inside the geom_violin() geometry, so that it applies to the violin only.
Exercise: modify the graph so that the violins are filled grey and the boxplots are filled by world_region (hint: set fill outside aes() for the grey color. The code name for grey color in R is "grey").
Because we want to define the fill color of the violin “manually” it goes outside aes(). Whereas for the boxplot we want the fill to depend on a column of data, so it goes inside aes().
ggplot(gapminder2010, aes(x = world_region, y = children_per_woman)) +
geom_violin(fill = "grey", scale = "width") +
geom_boxplot(aes(fill = world_region), width = 0.2)
Although this graph looks appealing, the color is redundant with the x-axis labels. So, the same information is being shown with multiple aesthetics. This is not necessarily incorrect, but we should generally avoid too much gratuitous use of color in graphs. At the very least we should remove the legend from this graph.
You can split your plot into multiple panels by using faceting. There are two types of facet functions:
facet_wrap() arranges a one-dimensional sequence of panels to fit on one page.
facet_grid() allows you to form a matrix of rows and columns of panels.
Inside a facet_<type>() function, we assign a faceting variable, which is a grouping variable with multiple levels; each level becomes one panel.
Tip
For example, if gender has two levels (male/female), facet_wrap(vars(gender)) will output two panels (left: male, right: female).
Both functions let you specify faceting variables with vars(). In general:
facet_wrap(facets = vars(facet_variable))
facet_grid(rows = vars(row_variable), cols = vars(col_variable))
For example, to facet by income_groups:
If we want to facet by two variables, income_groups and is_oecd, then we use facet_grid():
With facet_grid(), you can also organise the panels just by rows or just by columns. Try running this code yourself:
# One column, facet by rows
ggplot(gapminder2010,
aes(x = children_per_woman, y = life_expectancy, color = world_region)) +
geom_point() +
facet_grid(rows = vars(is_oecd))
# One row, facet by column
ggplot(gapminder2010,
aes(x = children_per_woman, y = life_expectancy, color = world_region)) +
geom_point() +
facet_grid(cols = vars(is_oecd))
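The one-variable and two-variable cases described above can be sketched in the same way. The data frame toy and its values are hypothetical, built only so the example runs without the CSV file; the column names follow the data dictionary at the top of the lesson.

```r
library(ggplot2)

# Hypothetical stand-in data with two faceting variables
toy <- data.frame(
  children_per_woman = c(5.8, 6.2, 1.7, 1.9, 2.4, 2.1, 3.2, 1.8),
  life_expectancy    = c(60, 60, 78, 73, 76, 77, 71, 74),
  income_groups      = rep(c("low_income", "lower_middle_income"), each = 4),
  is_oecd            = rep(c(FALSE, TRUE), times = 4)
)

p_base <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point()

# One faceting variable: one panel per income group
p_wrap <- p_base + facet_wrap(facets = vars(income_groups))

# Two faceting variables: income groups in rows, OECD membership in columns
p_grid <- p_base +
  facet_grid(rows = vars(income_groups), cols = vars(is_oecd))

p_wrap
p_grid
```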
Often you want to change how the scales of your plot are defined. For example, we want to change the default color scheme from “red/yellow/green” to “black/dark grey/grey”.
In ggplot2, scales can refer to the x and y aesthetics, but also to other aesthetics such as color, shape, fill, etc.
We modify scales using the scale family of functions. These functions always follow the same naming convention, scale_<aesthetic>_<type>, where:
<aesthetic> refers to the aesthetic for that scale function (e.g. x, y, color, fill, shape, etc.)
<type> refers to the type of scale (e.g. discrete, continuous, manual)
Let’s see some examples.
Taking the graph from the previous exercise, we can modify the x and y axis scales, for example to emphasize a particular range of the data and define the breaks of the axis ticks.
limits = takes a vector of length 2 that sets the lower and upper limits of the x or y axis.
breaks = seq(0, 3, by = 1) sets the x-axis ticks at 0, 1, 2 and 3.
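A minimal sketch of these two arguments, assuming a five-row stand-in data frame called toy (a hypothetical name) whose values all fall inside the chosen limits; the y-axis limits are an arbitrary choice for illustration.

```r
library(ggplot2)

# Five-row stand-in for gapminder2010
toy <- data.frame(
  children_per_woman = c(2.37, 2.13, 1.87, 2.72, 1.81),
  life_expectancy    = c(75.82, 76.61, 73.42, 73.59, 74.38)
)

p_axes <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point() +
  # x-axis runs from 0 to 3 with ticks at 0, 1, 2, 3
  scale_x_continuous(limits = c(0, 3), breaks = seq(0, 3, by = 1)) +
  # y-axis runs from 70 to 80
  scale_y_continuous(limits = c(70, 80))

p_axes
```

Note that, by default, points falling outside the limits are dropped from the plot with a warning, so limits should be chosen with the data range in mind.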
You can also apply transformations to the data. For example, consider the distribution of income across countries, represented using a histogram:
We can see that this distribution is highly skewed, with some countries having very large values, while others having very low values. One common data transformation to solve this issue is to log-transform our values. We can do this within the scale function:
Notice how the interval between the x-axis values is not constant anymore, we go from $1000 to $10,000 and then to $100,000. That’s because our data is now plotted on a log-scale.
You could transform the data directly in the variable given to x:
This is also fine, but in this case the x-axis scale would show you the log-transformed values, rather than the original values. (Try running the code yourself to see the difference!)
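The two approaches can be compared with a short sketch. It uses scale_x_log10(), a ggplot2 shortcut for a log10-transformed x scale, and a seven-value stand-in data frame called toy (a hypothetical name); the number of bins is an arbitrary choice.

```r
library(ggplot2)

# Seven income values standing in for the full income_per_person column
toy <- data.frame(
  income_per_person = c(1672, 6360, 9928, 38982, 55363, 18912, 6703)
)

# Histogram on the original (linear) scale
p_raw <- ggplot(toy, aes(x = income_per_person)) +
  geom_histogram(bins = 5)

# Log-transform within the scale: axis labels stay in original units
p_log_scale <- p_raw + scale_x_log10()

# Log-transform the variable itself: axis labels show log10 values
p_log_data <- ggplot(toy, aes(x = log10(income_per_person))) +
  geom_histogram(bins = 5)

p_log_scale
p_log_data
```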
Let’s get back to our initial scatterplot and color the points by income:
Because income_per_person is a continuous variable, ggplot created a gradient color scale.
We can change the default using scale_color_gradient(), defining two colors for the lowest and highest values (and we can also log-transform the data like before):
For continuous color scales we can use the viridis palette, which has been developed to be color-blind friendly and perceptually better:
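A sketch of both continuous color scales, assuming a four-row stand-in data frame called toy (a hypothetical name). The two gradient colors are arbitrary choices, and the log transformation is left out to keep the example short.

```r
library(ggplot2)

# Four-row stand-in for gapminder2010
toy <- data.frame(
  children_per_woman = c(5.82, 6.16, 1.65, 1.87),
  life_expectancy    = c(59.85, 59.94, 77.64, 72.88),
  income_per_person  = c(1672, 6360, 9928, 55363)
)

# Continuous variable mapped to color: ggplot2 builds a gradient
p_income <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point(aes(color = income_per_person))

# Custom two-color gradient from lowest to highest income
p_gradient <- p_income +
  scale_color_gradient(low = "steelblue", high = "brown")

# Viridis palette for continuous data
p_viridis <- p_income + scale_color_viridis_c()

p_gradient
p_viridis
```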
Earlier, when we did our boxplot, the x-axis was a categorical variable.
For categorical axis scales, you can use the scale_x_discrete() and scale_y_discrete() functions. For example, to limit which categories are shown and in which order:
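A minimal sketch of this, assuming a hypothetical nine-row data frame called toy with three regions; the values are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data: three regions, three countries each
toy <- data.frame(
  world_region       = rep(c("america", "south_asia", "sub_saharan_africa"),
                           each = 3),
  children_per_woman = c(2.37, 2.13, 1.87, 5.8, 2.6, 3.1, 6.2, 5.0, 4.4)
)

p_discrete <- ggplot(toy, aes(x = world_region, y = children_per_woman)) +
  geom_boxplot() +
  # Show only two regions, in this order; other regions are dropped
  scale_x_discrete(limits = c("sub_saharan_africa", "america"))

p_discrete
```

Rows belonging to categories that are not listed in limits are removed, and ggplot2 reports this with a warning.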
You can manually change each group’s color using scale_fill_manual(values = ...), in which ... should follow the same order as the levels of the grouping factor.
Important
If you specify colors for a “large area”, use fill. For example, in a boxplot we use geom_boxplot(aes(fill = ...)). This will color the whole box.
If you specify colors for points or lines, use color. For example, geom_point(aes(color = ...)).
Question: if you use geom_boxplot(aes(color = ...)), what will happen? Will the box be filled?
Taking the previous plot, let’s change the fill scale to define custom colors “manually”.
For color/fill scales there’s a very convenient variant of the scale function (“brewer”) that has some pre-defined palettes, including color-blind friendly ones:
You can see all the available palettes here. Note that some palettes only have a limited number of colors and ggplot will give a warning if it has fewer colors available than categories in the data.
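The manual and brewer variants can be sketched side by side. The data frame toy is a hypothetical nine-row stand-in, and the three grey shades and the "Dark2" palette are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in data: three regions, three countries each
toy <- data.frame(
  world_region       = rep(c("america", "south_asia", "sub_saharan_africa"),
                           each = 3),
  children_per_woman = c(2.37, 2.13, 1.87, 5.8, 2.6, 3.1, 6.2, 5.0, 4.4)
)

p_box <- ggplot(toy, aes(x = world_region, y = children_per_woman)) +
  geom_boxplot(aes(fill = world_region))

# Manual colors, given in the order of the levels of world_region
p_manual <- p_box +
  scale_fill_manual(values = c("black", "grey40", "grey80"))

# Pre-defined ColorBrewer palette
p_brewer <- p_box + scale_fill_brewer(palette = "Dark2")

p_manual
p_brewer
```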
Modify the following code so that the point size is defined by the population size. The size should be on a log scale (Hint: use the scale_size_continuous() function).
To make points change by size, we add the size aesthetic within the aes() function:
ggplot(data = gapminder2010,
mapping = aes(x = children_per_woman, y = life_expectancy)) +
geom_point(aes(color = world_region, size = population)) +
scale_color_brewer(palette = "Dark2")
In this case the scale of the point’s size is on the original (linear) scale. To transform the scale, we can use scale_size_continuous():
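A hedged sketch of one way to do this, using a four-row stand-in data frame called toy (a hypothetical name). It passes the transformation through the trans argument; this is an assumption about your installed version, since ggplot2 3.5.0 and later prefer the name transform and may print a deprecation message for trans.

```r
library(ggplot2)

# Four-row stand-in for gapminder2010
toy <- data.frame(
  children_per_woman = c(5.82, 6.16, 1.65, 1.87),
  life_expectancy    = c(59.85, 59.94, 77.64, 72.88),
  population         = c(29185511, 23356247, 2948029, 8549998)
)

p_size <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point(aes(size = population)) +
  # Point size follows log10(population) instead of raw population;
  # newer ggplot2 releases spell this argument transform = "log10"
  scale_size_continuous(trans = "log10")

p_size
```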
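To save a graph, you can use the ggsave() function, which needs two pieces of information:
1. The filename to save the graph to. The file extension determines the format (e.g. .pdf, .png, .jpeg).
2. The plot to save (by default, the last plot displayed).
You can also specify options for the size of the graph and dpi (for PNG or JPEG).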
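A minimal sketch of saving a plot. The data frame toy, the file name and the dimensions are hypothetical choices; the file is written to a temporary directory so the example leaves no trace in your project.

```r
library(ggplot2)

# Small stand-in for gapminder2010
toy <- data.frame(
  children_per_woman = c(2.37, 2.13, 1.87, 2.72, 3.20, 1.81),
  life_expectancy    = c(75.82, 76.61, 73.42, 73.59, 70.78, 74.38)
)

p_save <- ggplot(toy, aes(x = children_per_woman, y = life_expectancy)) +
  geom_point()

# Hypothetical output path; the .pdf extension selects the PDF format
out_file <- file.path(tempdir(), "fertility_vs_life_expectancy.pdf")

# width and height are in inches by default
ggsave(filename = out_file, plot = p_save, width = 6, height = 4)

# For PNG or JPEG output, change the extension and optionally set dpi,
# e.g. ggsave("plot.png", plot = p_save, dpi = 300)
```

Naming the plot explicitly with plot = avoids depending on whichever graph happened to be displayed last.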
Another easy way to save your graphs is by using RStudio’s interface. From the “Plots” panel there is an option to “Export” the graph. However, doing it with code like above ensures reproducibility, and will allow you to track which files were generated from which script.
Every single element of a ggplot can be modified. This is further covered in a future episode.
For some simple plots, like the density plot of a single variable, you may not need the ggplot2 package. Base R has some built-in plotting functions for data visualization. See this blog for an example.
Data Tip: visualizing data
Data visualization is one of the fundamental elements of data analysis. It allows you to assess variation within variables and relationships between variables.
Choosing the right type of graph to answer particular questions (or convey a particular message) can be daunting. The data-to-viz website can be a great place to get inspiration from.
Here are some common types of graph you may want to do for particular situations:
Look at variation within a single variable using histograms (geom_histogram()) or, less commonly (but quite useful), empirical cumulative distribution function plots (stat_ecdf()).
Look at variation of a variable across categorical groups using boxplots (geom_boxplot()), violin plots (geom_violin()) or frequency polygons (geom_freqpoly()).
Look at the relationship between two numeric variables using scatterplots (geom_point()).
If your x-axis is ordered (e.g. year) use a line plot (geom_line()) to convey the change on your y-variable.