ISR ggplot2 Workshop

Georgios Karamanis

A very short introduction to ggplot2

ggplot2 is a powerful and flexible R package for creating data visualizations. It’s based on the Grammar of Graphics, a systematic approach to describing the components of a graphic.

Artwork by Allison Horst

Key features of ggplot2

Consistent and intuitive syntax
Layered approach to building plots
Wide range of plot types and customization options
Excellent for both quick exploratory plots and publication-quality graphics

Basic ggplot2 structure

A typical ggplot2 command has this structure:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

ggplot(): Initializes the plot
data: The dataset you’re using
geom_function(): Determines the type of plot (e.g., geom_point(), geom_line())
aes(): Defines how variables are mapped to visual properties

Example:

ggplot(data = mtcars, aes(x = mpg, y = wt)) +
  geom_point()

For the exercises we will use two datasets, palmerpenguins and friends_info. Don’t worry about installing or downloading anything - these datasets are already loaded for you. The links are just for more information about the data.

Before plotting, let’s look at different ways to view datasets.

Typing the name of a dataset and running it displays the entire dataset, which can be overwhelming for large datasets.

head() shows the first 6 rows of the dataset, giving you a quick preview of the data structure.

colnames() lists all column names in the dataset, useful for identifying available variables.

summary() provides a statistical overview of each column, including min, max, mean, and quartiles for numeric data.
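The four approaches can be sketched as follows. This is a minimal example that assumes the palmerpenguins package is installed and that the data frame named palmerpenguins in this workshop corresponds to its penguins table; the first line recreates that name and is not needed inside the workshop environment.

```r
# Assumption: recreate the pre-loaded workshop object from the package data
palmerpenguins <- palmerpenguins::penguins

palmerpenguins            # print the dataset itself
head(palmerpenguins)      # first 6 rows
colnames(palmerpenguins)  # names of all columns
summary(palmerpenguins)   # per-column summary statistics
```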

Code examples and exercises

In the following slides, we’ll explore practical examples and exercises based on the data visualization guidelines and pitfalls we discussed. Each section will:

Demonstrate the guideline or pitfall with code examples
Give you a chance to practice with exercises

Guideline 1: Create the simplest graph that conveys the information you want to convey

A scatter plot is one of the simplest ways to show relationships between two variables:
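A minimal sketch of such a scatter plot, assuming the same two variables used in the alternatives that follow (bill_length_mm and body_mass_g). The assignment to palmerpenguins is an assumption that recreates the pre-loaded workshop dataset from the palmerpenguins package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the package data
palmerpenguins <- palmerpenguins::penguins

# One point per penguin: bill length on x, body mass on y
p_scatter <- ggplot(data = palmerpenguins,
                    aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()

p_scatter
```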

Other ways to write the same code

ggplot(palmerpenguins, aes(bill_length_mm, body_mass_g)) +
  geom_point()
  
ggplot(palmerpenguins) +
  geom_point(aes(bill_length_mm, body_mass_g))

For single variables, a histogram effectively shows their distribution:
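A minimal sketch of a histogram. The choice of body_mass_g and of 30 bins is an assumption made for illustration, as is the line that recreates the pre-loaded palmerpenguins dataset from the palmerpenguins package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the package data
palmerpenguins <- palmerpenguins::penguins

# A histogram needs only an x variable; ggplot2 counts the values in each bin
p_hist <- ggplot(palmerpenguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

p_hist
```

Changing bins (or using binwidth instead) can noticeably change the shape of the histogram, so it is worth trying a few values.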

Make a line plot using the friends_info dataset

First, write the code to view the first rows of the dataset. Then choose your variables: a time variable for the x-axis and a numeric variable for the y-axis.

Then, write the code to make the line chart using geom_line().

Hint

Use head(friends_info) to view the first few rows of the dataset.
For the x-axis, look for a date column.
For the y-axis, consider using us_views_millions.

Solution.

head(friends_info)

ggplot(friends_info, aes(x = air_date, y = us_views_millions)) +  
  geom_line()

Thanks to ggplot2’s layered structure, we can easily combine several geoms in the same plot
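As a sketch, a point layer can be added on top of the line chart from the exercise. The combination of geom_line() and geom_point() is an assumption chosen for illustration, and the first assignment assumes that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# Each geom added with + becomes a new layer drawn over the previous one
p_layers <- ggplot(friends_info, aes(x = air_date, y = us_views_millions)) +
  geom_line() +
  geom_point()

p_layers
```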

Guideline 2: Consider the type of encoding object and attribute used to create a plot

First, run the code below to create a scatter plot (spatial position for encoding).
Then, add color = species inside aes() to add color encoding and rerun the code.
Finally, add shape = sex inside aes() and rerun the code.
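After the last step, the code could look like the sketch below. The x and y variables are an assumption (the same ones as in the Guideline 1 scatter plot), and the first assignment recreates the pre-loaded palmerpenguins dataset from the palmerpenguins package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the package data
palmerpenguins <- palmerpenguins::penguins

# Position encodes the two measurements, color the species, shape the sex
p_encoding <- ggplot(palmerpenguins,
                     aes(x = bill_length_mm, y = body_mass_g,
                         color = species, shape = sex)) +
  geom_point()

p_encoding
```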

Guideline 3: Focus on visualizing patterns or on visualizing details, depending on the purpose of the plot

Visualizing details

Visualizing patterns
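The two purposes can be contrasted on the same data. In this sketch, the variables (species and body_mass_g) and the geoms (geom_jitter() for details, geom_boxplot() for patterns) are assumptions chosen for illustration, as is the line that recreates the pre-loaded palmerpenguins dataset.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the package data
palmerpenguins <- palmerpenguins::penguins

# Details: every penguin is drawn as its own (slightly jittered) point
p_details <- ggplot(palmerpenguins, aes(x = species, y = body_mass_g)) +
  geom_jitter(width = 0.2)

# Patterns: a box plot summarises each species by median and quartiles
p_patterns <- ggplot(palmerpenguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

p_details
p_patterns
```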

Make a box plot for friends_info
1. Copy the code from the previous slide and paste it in the box below
2. Replace the dataset name with friends_info, use season for both x and group, and use us_views_millions for y

Solution.

ggplot(friends_info, aes(x = season, y = us_views_millions, group = season)) +
  geom_boxplot()

Compare to the line chart we made before

A good way to easily see patterns is a heatmap
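A minimal heatmap sketch using geom_tile(). The layout (episode on x, season on y, viewers as fill) is an assumption, and the first assignment assumes that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# One tile per episode; the fill color encodes the number of viewers
p_heatmap <- ggplot(friends_info,
                    aes(x = episode, y = season, fill = us_views_millions)) +
  geom_tile()

p_heatmap
```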

Guideline 4: Select meaningful axis ranges

ggplot2 automatically sets the axis ranges to fit the values of the plotted variables

But we can change them!
1. Run the code.
2. Add limits = c(0, 100) inside scale_y_continuous() and rerun
3. Change the numbers in limits and see what happens to the line
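After step 2, the code could look like this sketch. It assumes the plot is the line chart of us_views_millions over air_date from the earlier exercise and that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# limits fixes the y-axis range instead of letting ggplot2 choose it
p_limits <- ggplot(friends_info, aes(x = air_date, y = us_views_millions)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 100))

p_limits
```

Values outside the limits are dropped before plotting, so limits that are too narrow cut pieces out of the line. To zoom in without dropping data, coord_cartesian(ylim = c(0, 100)) can be used instead.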

Guideline 5: Data transformations and carefully chosen graph aspect ratios can be used to emphasize rates of change for time-series data

While Guideline 5 focuses on time series, the same principle applies to any variables with large ranges. Let’s look at ggplot2’s mammals sleep dataset (msleep).

Run the following code and notice that it’s impossible to distinguish the points close to 0 when using the default linear scales
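A sketch of such a plot, assuming body weight (bodywt) on the x-axis and brain weight (brainwt) on the y-axis. The dataset msleep ships with ggplot2; the choice of variables is an assumption.

```r
library(ggplot2)

# A few very heavy mammals stretch both axes,
# so most points end up squeezed into the corner near 0
p_linear <- ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point()

p_linear
```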

Let’s see the distribution of body weight - notice how most values are clustered near zero.
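A minimal sketch of that distribution, using the bodywt column of msleep. The number of bins is an assumption.

```r
library(ggplot2)

# Histogram of body weight on a linear scale
p_bodywt <- ggplot(msleep, aes(x = bodywt)) +
  geom_histogram(bins = 30)

p_bodywt
```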

We can use logarithmic scales for both x and y to better reveal patterns across different magnitudes of weight.

Run the code
Remove one of the lines with scale_*_log10() and rerun the code
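A sketch of the log-scaled version, again assuming bodywt and brainwt from msleep as the plotted variables.

```r
library(ggplot2)

# Both axes are log10-transformed, so equal distances mean equal ratios
p_log <- ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

p_log
```

Removing either scale_*_log10() line puts that axis back on a linear scale, which squeezes the points together again along that axis.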

Guideline 6: Plot overlapping points in a way that density differences become apparent in scatter plots

Let’s plot imdb_rating and us_views_millions from friends_info with really big points so that they overlap a lot
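A sketch of this plot. The point size of 8 is an assumption chosen to force heavy overlap, and the first assignment assumes that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# Large opaque points hide how many episodes share a similar position
p_big <- ggplot(friends_info, aes(x = imdb_rating, y = us_views_millions)) +
  geom_point(size = 8)

p_big
```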

We can reduce the opacity of the points by using a lower alpha value
alpha ranges from 0 (fully transparent) to 1 (fully opaque, the default)
1. Add alpha = 0.5 inside geom_point() and run the code
2. Try out different values and rerun the code
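After step 1, the code could look like this sketch. The point size of 8 is an assumption, and the first assignment assumes that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# Semi-transparent points: areas where many points overlap appear darker
p_alpha <- ggplot(friends_info, aes(x = imdb_rating, y = us_views_millions)) +
  geom_point(size = 8, alpha = 0.5)

p_alpha
```

Because alpha is set outside aes(), it applies the same fixed opacity to every point.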

Pitfall: Vertical axis text

Run the code and see how the labels on the x-axis overlap
1. Change the angle in the last line to rotate the text and rerun the code
2. Set the angle back to 0 and then switch the x and y variables in line 2
3. Rerun the code
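The idea can be sketched with ggplot2’s msleep dataset. The dataset, the variables (order and sleep_total) and the box plot are assumptions chosen because order has many long category names; the workshop code may use different data.

```r
library(ggplot2)

# Long category labels on the x-axis overlap when drawn horizontally
p_overlap <- ggplot(msleep, aes(x = order, y = sleep_total)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0))

# Swapping x and y puts the labels on the y-axis, where they stay readable
p_flipped <- ggplot(msleep, aes(x = sleep_total, y = order)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 0))

p_overlap
p_flipped
```

Rotating the labels (for example angle = 90) also removes the overlap, but the reader then has to tilt their head, which is the pitfall this slide is about.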

Pitfall: Rainbow color scale

Run the code to see a heatmap with the default rainbow scale in R
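A sketch of a rainbow heatmap. It assumes the rainbow colors come from base R’s rainbow() palette passed to scale_fill_gradientn(), that the heatmap shows imdb_rating by season and episode, and that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# rainbow(7) returns 7 colors from base R's rainbow palette
p_rainbow <- ggplot(friends_info,
                    aes(x = episode, y = season, fill = imdb_rating)) +
  geom_tile() +
  scale_fill_gradientn(colours = rainbow(7))

p_rainbow
```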

Run the code to see a better rainbow scale
1. Replace turbo with one of the other scale names: viridis, magma, plasma, mako
2. Rerun the code
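A sketch of the same kind of heatmap with the turbo scale. It assumes the scale is set through the option argument of scale_fill_viridis_c(), that the heatmap shows imdb_rating by season and episode, and that friends_info comes from the friends package.

```r
library(ggplot2)

# Assumption: recreate the pre-loaded workshop object from the friends package
friends_info <- friends::friends_info

# option selects the palette: "turbo", "viridis", "magma", "plasma", "mako", ...
p_turbo <- ggplot(friends_info,
                  aes(x = episode, y = season, fill = imdb_rating)) +
  geom_tile() +
  scale_fill_viridis_c(option = "turbo")

p_turbo
```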

That’s all! But if you have time, one last challenge awaits on the next slide!

Final exercise: Putting it all together

Create a scatter plot using the palmerpenguins dataset that:
- shows bill_length_mm vs bill_depth_mm
- maps body_mass_g to point size and species to color
- uses a viridis color scale
- adds a trend line for each species

Hint

Start with ggplot(palmerpenguins, aes(...))
Use geom_point() for the scatter plot
Map species to color and body_mass_g to size in aes()
Use scale_color_viridis_d() for the color scale
Add geom_smooth() for trend lines

Solution.

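ggplot(palmerpenguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  # Map size only in the point layer so the trend lines do not inherit it
  geom_point(aes(size = body_mass_g), alpha = 0.7) +
  # One trend line per species
  geom_smooth(aes(group = species)) +
  # Discrete viridis palette for the three species
  scale_color_viridis_d() +
  theme_minimal()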