Interactive editor

Interactive code sections look like this. Make changes in the text box and click on the green “Run Code” button to see the results. Sometimes there will be a tab with a hint or solution.

Run selected code:
- macOS: ⌘ + ↩︎/Return
- Windows/Linux: Ctrl + ↩︎/Enter
To run the entire code cell, click the “Run Code” button or use the keyboard shortcut:
- Shift + ↩︎

If you’re curious how this works, each interactive code section uses the amazing {quarto-webr} package to run R directly in your browser.

Set Up

Overview

Univariate data analysis - distribution of a single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others

Numerical variables can be classified as continuous or discrete: a continuous variable can take any value within a range (infinitely many possible values), while a discrete variable takes only distinct, countable values, typically non-negative whole numbers.
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
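
As a rough illustration of these distinctions, the sketch below builds a small toy data frame and checks each column's type with base R. The column names mirror the loans data, but the values are hypothetical and chosen only for illustration.

```r
# Toy data frame; values are invented for illustration only.
toy <- data.frame(
  interest_rate = c(6.5, 12.0, 17.1),                 # numerical, continuous
  term          = c(60L, 36L, 36L),                   # numerical, discrete
  grade         = factor(c("C", "C", "D"),
                         levels = c("A", "B", "C", "D"),
                         ordered = TRUE),             # categorical, ordinal
  state         = factor(c("NJ", "HI", "WI"))         # categorical, not ordinal
)

# is.numeric() flags numerical columns; is.ordered() flags ordinal factors.
var_types <- data.frame(
  variable  = names(toy),
  numerical = vapply(toy, is.numeric, logical(1)),
  ordinal   = vapply(toy, is.ordered, logical(1))
)
print(var_types)
```

Note that R cannot tell continuous from discrete on its own: storing `term` as an integer is a convention, and the classification still relies on knowing what the variable measures.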

Data: Lending Club

Thousands of loans made through the Lending Club, which is a platform that allows individuals to lend to other individuals
Not all loans are created equal – ease of getting a loan depends on (apparent) ability to pay back the loan
The data include only loans that were made; they are not loan applications

Take a peek at data

library(openintro)
library(tidyverse)
glimpse(loans_full_schema)

Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer ", "warehouse…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
$ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
$ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
$ verified_income                  <fct> Verified, Not Verified, Source Verifi…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
$ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
$ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
$ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
$ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
$ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
$ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
$ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
$ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
$ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
$ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
$ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
$ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
$ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
$ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
$ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
$ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
$ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
$ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ loan_purpose                     <fct> moving, debt_consolidation, other, de…
$ application_type                 <fct> individual, individual, individual, i…
$ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
$ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
$ grade                            <ord> C, C, D, A, C, A, C, B, C, A, C, B, C…
$ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
$ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
$ loan_status                      <fct> Current, Current, Current, Current, C…
$ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
$ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
$ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
$ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
$ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
$ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Selected variables

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade, 
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)

Rows: 10,000
Columns: 8
$ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20…
$ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1…
$ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, 60,…
$ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E…
$ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, IL, FL, SC, CO,…
$ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000, 35000, 110000…
$ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA…
$ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3…

Selected variables

variable	type	description
`loan_amount`	numerical, continuous	Amount of the loan received, in US dollars
`interest_rate`	numerical, continuous	Interest rate on the loan, as an annual percentage
`term`	numerical, discrete	The length of the loan, which is always set as a whole number of months
`grade`	categorical, ordinal	Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid
`state`	categorical, not ordinal	US state where the borrower resides
`annual_income`	numerical, continuous	Borrower’s annual income, including any second income, in US dollars
`homeownership`	categorical, not ordinal	Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income`	numerical, continuous	Debt-to-income ratio

Visualizing Continuous data

Describing shapes of numerical distributions

shape:
- skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
- modality: unimodal, bimodal, multimodal, uniform
center: mean (`mean`), median (`median`), mode (not always useful)
spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
unusual observations
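
To make the center and spread measures concrete, here is a minimal sketch using the base R functions named above. The vector `x` is a hypothetical right-skewed sample, not part of the loans data.

```r
# Hypothetical right-skewed sample (one large value stretches the right tail).
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 24)

center <- c(mean = mean(x), median = median(x))
spread <- c(range = diff(range(x)), sd = sd(x), IQR = IQR(x))

print(center)
print(spread)

# In a right-skewed sample the mean is typically pulled above the median.
mean(x) > median(x)
```

The single large value moves the mean and the standard deviation noticeably, while the median and the IQR barely react. This is one reason to report the median and IQR for skewed distributions.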

Histogram

summary(loans$loan_amount)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1000    8000   14500   16362   24000   40000

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Your turn

Create a histogram for the interest_rate variable.

Histograms and binwidth

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
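
The three plots differ only in `binwidth`. A quick way to anticipate how coarse each histogram will be is to divide the range of the data by the binwidth. The sketch below does this with the minimum and maximum of `loan_amount` taken from the `summary()` output above. The result is approximate: the exact number of bars ggplot2 draws can differ by one, depending on how the bins are aligned.

```r
# Minimum and maximum of loan_amount, taken from the summary() output above.
loan_min <- 1000
loan_max <- 40000

binwidths <- c(1000, 5000, 20000)

# Approximate number of bins needed to cover the range at each binwidth.
approx_bins <- ceiling((loan_max - loan_min) / binwidths)
names(approx_bins) <- binwidths
print(approx_bins)
```

Roughly 39 bins show fine detail but also noise, and about 2 bins hide almost all of the shape. A binwidth of 5000 sits in between.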

Your turn

Interactive Code
Solution

Visualize the histogram of interest_rate and set the binwidth to $\frac{1}{20}$ of the range of the interest_rate variable.

ggplot(loans) +
  geom_histogram(aes(x = interest_rate), 
                 binwidth = diff(range(loans$interest_rate))/20)

Customizing labels of histograms

Plot
Code

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
  labs( #<1>
    x = "Loan amount ($)", 
    y = "Frequency", 
    title = "Amounts of Lending Club loans" 
  )

1: labs() can modify axis, legend, and plot labels. You can also use xlab() and ylab() to set the x-axis and y-axis labels, respectively.

Your turn

Interactive Code
Solution

Change the x-axis label to ‘Interest Rate (%)’.

ggplot(loans) +
  geom_histogram(aes(x = interest_rate)) +
  xlab("Interest Rate (%)")

Fill with a categorical variable

Plot
Code

ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + #<1>
  geom_histogram(binwidth = 5000,
                 alpha = 0.5) + #<2>
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  )

1: Map homeownership to fill so that the bars are colored by category
2: Use the alpha argument to set the transparency of the bars

Your turn

Interactive Code
Solution

Use grade to fill the histogram by loan grade. Set alpha to 0.8 (80% opacity).

ggplot(loans) +
  geom_histogram(aes(x = interest_rate, fill = grade), alpha = 0.8) +
  xlab("Interest Rate (%)")

Facet with a categorical variable

Plot
Code

ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  facet_wrap(~ homeownership, nrow = 3) #<<

Color of bar borders

Plot
Code

ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000, color = "white") +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
  facet_wrap(~ homeownership, nrow = 3) #<<

Position of Histogram Bars

Plot
Code

ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000, position = position_dodge()) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  )

Density plot

ggplot(loans, aes(x = loan_amount)) +
  geom_density()

Density plots and adjusting bandwidth

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 0.5)

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)
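
The `adjust` argument is a multiplier on the default bandwidth, so `adjust = 0.5` halves the smoothing and `adjust = 2` doubles it. The sketch below checks this with base R's `density()`, on the assumption that `geom_density()` applies `adjust` in the same way. The sample `x` is simulated and purely hypothetical.

```r
# Hypothetical sample; any numeric vector works here.
set.seed(1)
x <- rexp(200, rate = 1 / 15000)

bw_default <- density(x)$bw               # adjust = 1 (default)
bw_narrow  <- density(x, adjust = 0.5)$bw
bw_wide    <- density(x, adjust = 2)$bw

# Each bandwidth is the default bandwidth times `adjust`.
c(narrow = bw_narrow / bw_default, wide = bw_wide / bw_default)
```

A smaller bandwidth follows the data more closely and gives a bumpier curve. A larger one smooths over local features, much as a wider binwidth does in a histogram.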

Customizing density plots

Plot
Code

ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
  labs( #<<
    x = "Loan amount ($)", #<<
    y = "Density", #<<
    title = "Amounts of Lending Club loans" #<<
  ) #<<

Adding a categorical variable

Plot
Code

ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) + #<<
  geom_density(adjust = 2, 
               alpha = 0.5) + #<<
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans", 
    fill = "Homeownership" #<<
  )

Box plot

A box plot visualises five summary statistics (the median, two hinges and two whiskers) and shows all “outlying” points individually.
- The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).
- The whiskers extend from the hinge to the smallest and largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).

ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot()
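
The hinge and whisker rules described above can be reproduced by hand. The following sketch uses a hypothetical sample and R's default quantile definition. It is meant to show the logic of the 1.5 * IQR rule and is not a copy of ggplot2's internal code.

```r
# Hypothetical sample with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

hinges <- quantile(x, c(0.25, 0.75))   # first and third quartiles
iqr    <- IQR(x)

# Fences: 1.5 * IQR beyond each hinge.
lower_fence <- hinges[[1]] - 1.5 * iqr
upper_fence <- hinges[[2]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything outside the fences is drawn as an individual point.
outliers <- x[x < lower_fence | x > upper_fence]

list(hinges = hinges, whiskers = whiskers, outliers = outliers)
```

The upper whisker stops at 9, the largest observation inside the fence, and not at the fence value of 14.5 itself. The value 30 is plotted as a separate point.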

Box plot and outliers

ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()

Customizing box plots

Plot
Code

ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = NULL,
    title = "Interest rates of Lending Club loans"
  ) +
  theme( #<<
    axis.ticks.y = element_blank(), #<<
    axis.text.y = element_blank() #<<
  ) #<<

Adding a categorical variable

Plot
Code

ggplot(loans, aes(x = interest_rate,
                  y = grade)) + #<<
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
    subtitle = "by grade of loan" #<<
  )

Relationships between numerical variables

Scatterplot

ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()

Hex plot

ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()

Hex plot

ggplot(loans %>% filter(debt_to_income < 100), 
       aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()

Visualizing categorical variables

Bar plot

ggplot(loans, aes(x = homeownership)) +
  geom_bar()

Segmented bar plot

ggplot(loans, aes(x = homeownership, 
                  fill = grade)) + #<<
  geom_bar()

Segmented bar plot

ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar(position = "fill") #<<
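
With `position = "fill"`, each bar is rescaled so that its segments show proportions within the category on the x-axis and no longer show counts. The same numbers can be computed directly, as in this sketch. The toy data frame is hypothetical and far smaller than the loans data.

```r
# Hypothetical loans by homeownership and grade.
toy <- data.frame(
  homeownership = c("RENT", "RENT", "RENT", "RENT", "OWN", "OWN"),
  grade         = c("A", "A", "B", "C", "A", "B")
)

counts <- table(toy$homeownership, toy$grade)   # segment heights when stacked
props  <- prop.table(counts, margin = 1)        # segment heights with "fill"

print(counts)
print(props)
rowSums(props)   # every bar sums to 1
```

Because every bar has the same total height, the filled version makes the grade mix comparable across groups of very different sizes, at the cost of hiding how many loans each group contains.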

Question

Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade?

Customizing bar plots

Plot
Code

ggplot(loans, aes(y = homeownership, #<<
                  fill = grade)) +
  geom_bar(position = "fill") +
  labs( #<<
    x = "Proportion", #<<
    y = "Homeownership", #<<
    fill = "Grade", #<<
    title = "Grades of Lending Club loans", #<<
    subtitle = "and homeownership of lendee" #<<
  ) #<<

Relationships between numerical and categorical variables

Already talked about…

Colouring and faceting histograms and density plots
Side-by-side box plots

Violin plots

ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()

Ridge plots

library(ggridges)
ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)

Designing effective visualizations

Keep it simple

Use color to draw attention

Tell a story

Credit: Angela Zoss and Eric Monson, Duke DVS

Principles for effective visualizations

Order matters
Put long categories on the y-axis
Keep scales consistent
Select meaningful colors
Use meaningful and nonredundant labels

Data

In September 2019, a YouGov survey asked 1,639 adults in Great Britain the following question:

In hindsight, do you think Britain was right/wrong to vote to leave EU?

Right to leave

Wrong to leave

Don’t know

Source: YouGov Survey Results, retrieved Oct 7, 2019

Order matters

Alphabetical order is rarely ideal

Plot
Code

ggplot(brexit, aes(x = opinion)) +
  geom_bar()

Order by frequency

Plot
Code

fct_infreq: Reorder factors’ levels by frequency

ggplot(brexit, aes(x = fct_infreq(opinion))) + #<<
  geom_bar()
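
For readers curious what reordering by frequency amounts to, here is a rough base R analogue of `fct_infreq()`. The `opinion` vector is a hypothetical stand-in for the survey column, and its counts are invented.

```r
# Hypothetical responses; the counts are invented for illustration.
opinion <- factor(c("Right", "Wrong", "Wrong", "Don't know", "Wrong", "Right"))

levels(opinion)                      # default: alphabetical order

# Reorder levels from most to least frequent.
freq_order <- names(sort(table(opinion), decreasing = TRUE))
opinion_by_freq <- factor(opinion, levels = freq_order)

levels(opinion_by_freq)
```

Only the level order changes. The underlying values are untouched, so the bars keep their heights and simply appear in a different sequence.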

Clean up labels

Plot
Code

ggplot(brexit, aes(x = opinion)) +
  geom_bar() +
  labs( #<<
    x = "Opinion", #<<
    y = "Count" #<<
  ) #<<

Alphabetical order is rarely ideal

Plot
Code

ggplot(brexit, aes(x = region)) +
  geom_bar()

Use inherent level order

Relevel
Plot

fct_relevel: Reorder factor levels using a custom order

brexit <- brexit %>%
  mutate(
    region = fct_relevel( #<<
      region,
      "london", "rest_of_south", "midlands_wales", "north", "scot"
    )
  )

Clean up labels

Recode
Plot

fct_recode: Change factor levels by hand

brexit <- brexit %>%
  mutate(
    region = fct_recode( #<<
      region,
      London = "london",
      `Rest of South` = "rest_of_south",
      `Midlands / Wales` = "midlands_wales",
      North = "north",
      Scotland = "scot"
    )
  )
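
The relevel and recode steps can also be sketched in base R, which may help clarify what the forcats helpers do. The `region` vector below is a hypothetical sample that reuses the raw labels from the slides. This is an analogue for illustration and does not replace the `fct_relevel()` / `fct_recode()` code above.

```r
# Hypothetical region values using the raw labels from the slides.
region <- factor(c("scot", "london", "north", "london",
                   "midlands_wales", "rest_of_south"))

# Custom level order (all levels listed explicitly).
region <- factor(
  region,
  levels = c("london", "rest_of_south", "midlands_wales", "north", "scot")
)

# Rename levels; the new names must follow the order of levels(region).
levels(region) <- c("London", "Rest of South", "Midlands / Wales",
                    "North", "Scotland")

levels(region)
table(region)
```

One design difference is worth noting: `fct_recode()` matches old and new names explicitly, whereas assigning to `levels()` relies on position, so it is easier to mislabel a level by accident.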

Put long categories on the y-axis

Long categories can be hard to read

Move them to the y-axis

Plot
Code

ggplot(brexit, aes(y = region)) + #<<
  geom_bar()

And reverse the order of levels

Plot
Code

fct_rev: Reverse order of factor levels

ggplot(brexit, aes(y = fct_rev(region))) + #<<
  geom_bar()
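
Reversing is needed because a discrete y-axis places the first level at the bottom, so the intended top-to-bottom reading order comes out upside down. A minimal base R sketch of the reversal follows. The three-level `region` factor is a hypothetical, shortened example.

```r
# Hypothetical factor with levels in the intended reading order.
region <- factor(
  c("London", "North", "Scotland"),
  levels = c("London", "North", "Scotland")
)

# Reverse the level order without changing the values themselves.
region_rev <- factor(region, levels = rev(levels(region)))

levels(region_rev)
```

With the reversed levels on the y-axis, "London" ends up at the top of the plot, which matches the order a reader expects when scanning from top to bottom.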

Clean up labels

Plot
Code

ggplot(brexit, aes(y = fct_rev(region))) +
  geom_bar() +
  labs( #<<
    x = "Count", #<<
    y = "Region" #<<
  ) #<<

Pick a purpose

Segmented bar plots can be hard to read

Plot
Code

ggplot(brexit, aes(y = region, fill = opinion)) + #<<
  geom_bar()

Avoid redundancy?

Redundancy can help tell a story

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1)

Be selective with redundancy

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") #<<

Use informative labels

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?", #<<
    x = NULL, y = NULL
  )

A bit more info

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019", #<<
    caption = "Source: https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf", #<<
    x = NULL, y = NULL
  )

Let’s do better

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg", #<<
    x = NULL, y = NULL
  )

Fix up facet labels

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region,
    nrow = 1,
    labeller = label_wrap_gen(width = 12) #<<
  ) + 
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg",
    x = NULL, y = NULL
  )

Select meaningful colors

Rainbow colors are not always the right choice

Nicola Rennie’s Blog: Working with colours in R

Manually choose colors when needed

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c( #<<
    "Wrong" = "red", #<<
    "Right" = "green", #<<
    "Don't know" = "gray" #<<
  )) #<<

Choosing better colors

Source: colorbrewer2.org

Use better colors

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c(
    "Wrong" = "#ef8a62", #<<
    "Right" = "#67a9cf", #<<
    "Don't know" = "gray" #<<
  ))

Select theme

Plot
Code

ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c("Wrong" = "#ef8a62",
                               "Right" = "#67a9cf",
                               "Don't know" = "gray")) +
  theme_minimal() #<<

Resource

--- title: "Lecture 09: Visualizing Numerical Data" subtitle: "`ggplot2` package" date: "2025-02-05" execute: eval: true echo: true warning: false message: false output-location: default code-annotations: below format: html: code-tools: true code-line-numbers: false code-fold: false number-offset: 0 fig-align: center uark-revealjs: scrollable: true chalkboard: true embed-resources: false code-fold: false number-sections: false footer: "ESRM 64503" slide-number: c/t tbl-colwidths: auto output-file: slides-index.html out.width: "100%" filters: - webr --- ```{r packages, echo=FALSE, message=FALSE, warning=FALSE} library(tidyverse) library(openintro) loans_full_schema <- loans_full_schema %>% mutate(grade = factor(grade, ordered = TRUE)) ``` # Interactive editor Interactive code sections look like this. Make changes in the text box and click on the green “Run Code” button to see the results. Sometimes there will be a tab with a hint or solution. - Run selected code: - macOS: ⌘ + ↩︎/Return - Windows/Linux: Ctrl + ↩︎/Enter - To run the entire code cell, you can simply click the “Run code” button, or use the keyboard shortcut: - Shift + ↩︎ If you’re curious how this works, each interactive code section uses [the amazing {quarto-webr} package](https://quarto-webr.thecoatlessprofessor.com/) to run R directly in your browser. 
::: callout-caution ## Set Up ```{webr-r} library(openintro) library(tidyverse) loans <- loans_full_schema %>% select(loan_amount, interest_rate, term, grade, state, annual_income, homeownership, debt_to_income) ``` ::: # Overview ## Terminology ::: panel-tabset ## Number of variables involved - Univariate data analysis - distribution of single variable - Bivariate data analysis - relationship between two variables - Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning for others ## Types of variables - **Numerical variables** can be classified as **continuous** or **discrete** based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively. - If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering. ::: ## Data: Lending Club ::::: columns ::: column - Thousands of loans made through the Lending Club, which is a platform that allows individuals to lend to other individuals - Not all loans are created equal -- ease of getting a loan depends on (apparent) ability to pay back the loan - Data includes loans *made*, these are not loan applications ::: ::: column ```{r echo=FALSE, out.width = "100%"} knitr::include_graphics("images/lending-club.png") ``` ::: ::::: ## Take a peek at data ```{r output.lines=18} library(openintro) library(tidyverse) glimpse(loans_full_schema) ``` ## Selected variables ```{r} loans <- loans_full_schema %>% select(loan_amount, interest_rate, term, grade, state, annual_income, homeownership, debt_to_income) glimpse(loans) ``` ## Selected variables | variable | type | description | |------------------------|------------------------|------------------------| | `loan_amount` | numerical, continuous | Amount of the loan received, in US dollars | | `interest_rate` | numerical, continuous | Interest rate on the loan, in an 
annual percentage | | `term` | numerical, discrete | The length of the loan, which is always set as a whole number of months | | `grade` | categorical, ordinal | Loan grade, which takes a values A through G and represents the quality of the loan and its likelihood of being repaid | | `state` | categorical, not ordinal | US state where the borrower resides | | `annual_income` | numerical, continuous | Borrower’s annual income, including any second income, in US dollars | | `homeownership` | categorical, not ordinal | Indicates whether the person owns, owns but has a mortgage, or rents | | `debt_to_income` | numerical, continuous | Debt-to-income ratio | # Visualizing Continous data ## Describing shapes of numerical distributions - shape: - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail) - modality: unimodal, bimodal, multimodal, uniform - center: mean (`mean`), median (`median`), mode (not always useful) - spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`) - unusual observations # Histogram ## Histogram ```{r} summary(loans$loan_amount) ``` ```{r message = TRUE, out.width = "50%"} ggplot(loans, aes(x = loan_amount)) + geom_histogram() ``` ------------------------------------------------------------------------ ::: callout-caution ## Your turn Create a histogram for the `interest_rate` variable. 
```{webr-r} ggplot(loans) ``` ::: ## Histograms and binwidth ::: panel-tabset ## binwidth = 1000 ```{r out.width = "50%"} ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 1000) ``` ## binwidth = 5000 ```{r out.width = "50%"} ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 5000) ``` ## binwidth = 20000 ```{r out.width = "50%"} ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 20000) ``` ::: ------------------------------------------------------------------------ :::: callout-caution ## Your turn ::: panel-tabset ## Interactive Code Visualized the histogram and `interest_rate` and modify the binwidth to $\frac{1}{20}$ of the range of of the `interest_rate` variable. ```{webr-r} ggplot(loans) + geom_histogram(aes(x = interest_rate)) ``` ## Solution ```{r} ggplot(loans) + geom_histogram(aes(x = interest_rate), binwidth = diff(range(loans$interest_rate))/20) ``` ::: :::: ## Customizing labels of histograms ::: panel-tabset ## Plot ```{r ref.label = "hist-custom", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r hist-custom, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 5000) + labs( #<1> x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) ``` 1. `labs()` can modify axis, legend, and plot labels. You can also use `xlab` and `ylab` to modify labels for x and y axis, respectively. ::: ------------------------------------------------------------------------ :::: callout-caution ## Your turn ::: panel-tabset ## Interactive Code Change x-axis label to 'Interest Rate (%)'. 
```{webr-r} ggplot(loans) + geom_histogram(aes(x = interest_rate)) ``` ## Solution ```{r} ggplot(loans) + geom_histogram(aes(x = interest_rate)) + xlab("Interest Rate (%)") ``` ::: :::: ## Fill with a categorical variable ::: panel-tabset ## Plot ```{r ref.label = "hist-fill", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r hist-fill, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount, fill = homeownership)) + #<1> geom_histogram(binwidth = 5000, alpha = 0.5) + #<2> labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) ``` 1. Add `homeownership` to fill with certain category 2. Add `alpha=` argument to set up transparency for the figure ::: :::: callout-caution ## Your turn ::: panel-tabset ## Interactive Code Use `grade` to highlight histograms with different grades. Set up the the transparency level to 80%. ```{webr-r} ggplot(loans) + geom_histogram(aes(x = interest_rate)) ``` ## Solution ```{r} ggplot(loans) + geom_histogram(aes(x = interest_rate, fill = grade), alpha = 0.8) + xlab("Interest Rate (%)") ``` ::: :::: ## Facet with a categorical variable ::: panel-tabset ## Plot ```{r ref.label = "hist-facet", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r hist-facet, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount, fill = homeownership)) + geom_histogram(binwidth = 5000) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + facet_wrap(~ homeownership, nrow = 3) #<< ``` ::: ## Color of bar borders ::: panel-tabset ## Plot ```{r ref.label = "hist-color", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r hist-color, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount, fill = homeownership)) + geom_histogram(binwidth = 5000, color = "white") + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + facet_wrap(~ homeownership, nrow = 3) #<< ``` 
::: ## Position of Histogram Bars ::: panel-tabset ## Plot ```{r ref.label = "hist-dodge", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r hist-dodge, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount, fill = homeownership)) + geom_histogram(binwidth = 5000, position = position_dodge()) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) ``` ::: # Density plot ## Density plot ```{r} ggplot(loans, aes(x = loan_amount)) + geom_density() ``` ## Density plots and adjusting bandwidth ::: panel-tabset ## adjust = 0.5 ```{r out.width = "100%"} ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 0.5) ``` ## adjust = 1 ```{r out.width = "100%"} ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 1) # default bandwidth ``` ## adjust = 2 ```{r out.width = "100%"} ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 2) ``` ::: ## Customizing density plots ::: panel-tabset ## Plot ```{r ref.label = "density-custom", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r density-custom, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 2) + labs( #<< x = "Loan amount ($)", #<< y = "Density", #<< title = "Amounts of Lending Club loans" #<< ) #<< ``` ::: ## Adding a categorical variable ::: panel-tabset ## Plot ```{r ref.label = "density-cat", echo = FALSE, warning = FALSE, out.width = "100%"} ``` ## Code ```{r density-cat, fig.show = "hide", warning = FALSE} ggplot(loans, aes(x = loan_amount, fill = homeownership)) + #<< geom_density(adjust = 2, alpha = 0.5) + #<< labs( x = "Loan amount ($)", y = "Density", title = "Amounts of Lending Club loans", fill = "Homeownership" #<< ) ``` ::: # Box plot ## Box plot - Boxplot visualises five summary statistics (the median, two hinges and two whiskers), and all "outlying" points individually. 
- The lower and upper **hinges** correspond to the first and third quartiles (the 25th and 75th percentiles).
- The **whiskers** extend from the hinge to the smallest and largest value no further than 1.5 \* IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).

```{r}
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot()
```

## Box plot and outliers

```{r}
ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()
```

## Customizing box plots

::: panel-tabset
## Plot

```{r ref.label = "box-custom", echo = FALSE, warning = FALSE, out.width = "100%"}
```

## Code

```{r box-custom, fig.show = "hide", warning = FALSE}
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = NULL,
    title = "Interest rates of Lending Club loans"
  ) +
  theme( #<<
    axis.ticks.y = element_blank(), #<<
    axis.text.y = element_blank() #<<
  ) #<<
```
:::

## Adding a categorical variable

::: panel-tabset
## Plot

```{r ref.label = "box-cat", echo = FALSE, warning = FALSE, out.width = "100%"}
```

## Code

```{r box-cat, fig.show = "hide", warning = FALSE}
ggplot(loans, aes(x = interest_rate, y = grade)) + #<<
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
    subtitle = "by grade of loan" #<<
  )
```
:::

# Relationships between numerical variables

## Scatterplot

```{r warning = FALSE}
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

## Hex plot

```{r warning = FALSE}
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

## Hex plot

```{r warning = FALSE}
ggplot(loans %>% filter(debt_to_income < 100),
       aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

# Visualize Categorical Variable

## Bar plot

```{r}
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

## Segmented bar plot

```{r}
ggplot(loans, aes(x = homeownership, fill = grade)) + #<<
  geom_bar()
```

## Segmented bar plot

```{r}
ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar(position = "fill") #<<
```

------------------------------------------------------------------------

::: callout-note
## Question

Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade?
:::

::::: columns
::: column
```{r echo=FALSE, out.width = "100%"}
ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar()
```
:::

::: column
```{r echo=FALSE, out.width = "100%"}
ggplot(loans, aes(x = homeownership, fill = grade)) +
  geom_bar(position = "fill")
```
:::
:::::

## Customizing bar plots

::: panel-tabset
## Plot

```{r ref.label = "bar-custom", echo = FALSE, warning = FALSE, out.width="100%"}
```

## Code

```{r bar-custom, fig.show = "hide", warning = FALSE}
ggplot(loans, aes(y = homeownership, #<<
                  fill = grade)) +
  geom_bar(position = "fill") +
  labs( #<<
    x = "Proportion", #<<
    y = "Homeownership", #<<
    fill = "Grade", #<<
    title = "Grades of Lending Club loans", #<<
    subtitle = "and homeownership of lendee" #<<
  ) #<<
```
:::

# Relationships between numerical and categorical variables

## Already talked about...

- Colouring and faceting histograms and density plots
- Side-by-side box plots

## Violin plots

```{r warning = FALSE}
ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()
```

## Ridge plots

```{r warning = FALSE}
library(ggridges)
ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
```

# Designing effective visualizations

## Keep it simple

::::: columns
::: column
```{r pie-3d, echo = FALSE, out.width="100%"}
knitr::include_graphics("images/pie-3d.jpg")
```
:::

::: column
```{r pie-to-bar, echo = FALSE, out.width="100%"}
d <- tribble(
  ~category, ~value,
  "Cutting tools", 0.03,
  "Buildings and administration", 0.22,
  "Labor", 0.31,
  "Machinery", 0.27,
  "Workplace materials", 0.17
)
ggplot(d, aes(x = fct_reorder(category, value), y = value)) +
  geom_col() +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "")
```
:::
:::::

## Use color to draw attention

::::: columns
::: column
```{r echo = FALSE, out.width="100%"}
d %>%
  mutate(category = str_replace(category, " ", "\n")) %>%
  ggplot(aes(x = category, y = value, fill = category)) +
  geom_col() +
  theme_minimal() +
  labs(x = "", y = "") +
  theme(legend.position = "none")
```
:::

::: column
```{r echo = FALSE, out.width="100%"}
ggplot(d, aes(x = fct_reorder(category, value), y = value, fill = category)) +
  geom_col() +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "") +
  scale_fill_manual(values = c("red", rep("gray", 4))) +
  theme(legend.position = "none")
```
:::
:::::

## Tell a story

```{r echo = FALSE, out.width = "80%"}
knitr::include_graphics("images/time-series-story.png")
```

> Credit: Angela Zoss and Eric Monson, Duke DVS

# Principles for effective visualizations

## Principles for effective visualizations

- Order matters
- Put long categories on the y-axis
- Keep scales consistent
- Select meaningful colors
- Use meaningful and nonredundant labels

## Data

In September 2019, a YouGov survey asked 1,639 GB adults the following question:

::::: columns
::: column

> In hindsight, do you think Britain was right/wrong to vote to leave EU?
>
> - Right to leave\
> - Wrong to leave\
> - Don't know

:::

::: column
```{r echo = FALSE}
brexit <- tibble(
  opinion = c(
    rep("Right", 664), rep("Wrong", 787), rep("Don't know", 188)
  ),
  region = c(
    rep("london", 63), rep("rest_of_south", 241), rep("midlands_wales", 145), rep("north", 176), rep("scot", 39),
    rep("london", 110), rep("rest_of_south", 257), rep("midlands_wales", 152), rep("north", 176), rep("scot", 92),
    rep("london", 24), rep("rest_of_south", 49), rep("midlands_wales", 57), rep("north", 48), rep("scot", 10)
  )
)
```

```{r echo = FALSE, out.width="100%"}
ggplot(brexit, aes(x = opinion)) +
  geom_bar()
```
:::
:::::

Source: [YouGov Survey Results](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf), retrieved Oct 7, 2019

# Order matters

## Alphabetical order is rarely ideal

::: panel-tabset
## Plot

```{r ref.label="default-opinion", echo = FALSE}
```

## Code

```{r default-opinion, fig.show = "hide"}
ggplot(brexit, aes(x = opinion)) +
  geom_bar()
```
:::

## Order by frequency

::: panel-tabset
## Plot

```{r ref.label="infreq", echo = FALSE}
```

## Code

`fct_infreq`: Reorder factors' levels by frequency

```{r infreq, fig.show = "hide"}
ggplot(brexit, aes(x = fct_infreq(opinion))) + #<<
  geom_bar()
```
:::

## Clean up labels

::: panel-tabset
## Plot

```{r ref.label="labels", echo = FALSE}
```

## Code

```{r labels, fig.show = "hide"}
ggplot(brexit, aes(x = opinion)) +
  geom_bar() +
  labs( #<<
    x = "Opinion", #<<
    y = "Count" #<<
  ) #<<
```
:::

## Alphabetical order is rarely ideal

::: panel-tabset
## Plot

```{r ref.label="region-default", echo = FALSE, out.width="100%"}
```

## Code

```{r region-default, fig.show = "hide"}
ggplot(brexit, aes(x = region)) +
  geom_bar()
```
:::

## Use inherent level order

::: panel-tabset
## Relevel

`fct_relevel`: Reorder factor levels using a custom order

```{r relevel, fig.show = "hide", out.width="100%"}
brexit <- brexit %>%
  mutate(
    region = fct_relevel( #<<
      region,
      "london", "rest_of_south", "midlands_wales", "north", "scot"
    )
  )
```

## Plot

```{r echo=FALSE}
ggplot(brexit, aes(x = region)) +
  geom_bar()
```
:::

## Clean up labels

::: panel-tabset
## Recode

`fct_recode`: Change factor levels by hand

```{r recode, fig.show = "hide", out.width="100%"}
brexit <- brexit %>%
  mutate(
    region = fct_recode( #<<
      region,
      London = "london",
      `Rest of South` = "rest_of_south",
      `Midlands / Wales` = "midlands_wales",
      North = "north",
      Scotland = "scot"
    )
  )
```

## Plot

```{r recode-plot, echo=FALSE}
ggplot(brexit, aes(x = region)) +
  geom_bar()
```
:::

# Put long categories on the y-axis

## Long categories can be hard to read

```{r ref.label="recode-plot", echo = FALSE, out.width="100%"}
```

## Move them to the y-axis

::: panel-tabset
## Plot

```{r ref.label="flip", echo = FALSE, out.width="100%"}
```

## Code

```{r flip, fig.show = "hide"}
ggplot(brexit, aes(y = region)) + #<<
  geom_bar()
```
:::

## And reverse the order of levels

::: panel-tabset
## Plot

```{r ref.label="rev", echo = FALSE, out.width="100%"}
```

## Code

`fct_rev`: Reverse order of factor levels

```{r rev, fig.show = "hide"}
ggplot(brexit, aes(y = fct_rev(region))) + #<<
  geom_bar()
```
:::

## Clean up labels

::: panel-tabset
## Plot

```{r ref.label="labels-again", echo = FALSE, out.width="100%"}
```

## Code

```{r labels-again, fig.show = "hide"}
ggplot(brexit, aes(y = fct_rev(region))) +
  geom_bar() +
  labs( #<<
    x = "Count", #<<
    y = "Region" #<<
  ) #<<
```
:::

# Pick a purpose

## Segmented bar plots can be hard to read

::: panel-tabset
## Plot

```{r ref.label="segment", echo = FALSE, out.width="100%"}
```

## Code

```{r segment, fig.show = "hide"}
ggplot(brexit, aes(y = region, fill = opinion)) + #<<
  geom_bar()
```
:::

## Use facets

::: panel-tabset
## Plot

```{r ref.label="facet", echo = FALSE, fig.asp = 0.45, out.width = "100%"}
```

## Code

```{r facet, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = region)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) #<<
```
:::

## Avoid redundancy?

```{r echo = FALSE, fig.asp = 0.45, out.width = "90%"}
ggplot(brexit, aes(y = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1)
```

## Redundancy can help tell a story

::: panel-tabset
## Plot

```{r ref.label="facet-fill", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r facet-fill, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1)
```
:::

## Be selective with redundancy

::: panel-tabset
## Plot

```{r ref.label="hide-legend", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r hide-legend, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") #<<
```
:::

## Use informative labels

::: panel-tabset
## Plot

```{r ref.label="informative-label", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r informative-label, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?", #<<
    x = NULL, y = NULL
  )
```
:::

## A bit more info

::: panel-tabset
## Plot

```{r ref.label="more-info", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r more-info, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019", #<<
    caption = "Source: https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf", #<<
    x = NULL, y = NULL
  )
```
:::

## Let's do better

::: panel-tabset
## Plot

```{r ref.label="short-link", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r short-link, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg", #<<
    x = NULL, y = NULL
  )
```
:::

## Fix up facet labels

::: panel-tabset
## Plot

```{r ref.label="label-wrap", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r label-wrap, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region,
    nrow = 1,
    labeller = label_wrap_gen(width = 12) #<<
  ) +
  guides(fill = "none") +
  labs(
    title = "Was Britain right/wrong to vote to leave EU?",
    subtitle = "YouGov Survey Results, 2-3 September 2019",
    caption = "Source: bit.ly/2lCJZVg",
    x = NULL, y = NULL
  )
```
:::

# Select meaningful colors

## Rainbow colors not always the right choice

[Nicola Rennie's Blog: Working with colours in R](https://nrennie.rbind.io/blog/colours-in-r/)

```{r ref.label="label-wrap", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Manually choose colors when needed

::: panel-tabset
## Plot

```{r ref.label="red-green", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r red-green, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c( #<<
    "Wrong" = "red", #<<
    "Right" = "green", #<<
    "Don't know" = "gray" #<<
  )) #<<
```
:::

## Choosing better colors

[Source: colorbrewer2.org](https://colorbrewer2.org/)

```{r echo = FALSE, out.width = "60%"}
knitr::include_graphics("images/color-brewer.png")
```

## Use better colors

::: panel-tabset
## Plot

```{r ref.label="color-brewer", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r color-brewer, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c(
    "Wrong" = "#ef8a62", #<<
    "Right" = "#67a9cf", #<<
    "Don't know" = "gray" #<<
  ))
```
:::

## Select theme

::: panel-tabset
## Plot

```{r ref.label="theme", echo = FALSE, fig.asp = 0.45, out.width = "90%"}
```

## Code

```{r theme, fig.show = "hide"}
ggplot(brexit, aes(y = opinion, fill = opinion)) +
  geom_bar() +
  facet_wrap(~region, nrow = 1, labeller = label_wrap_gen(width = 12)) +
  guides(fill = "none") +
  labs(title = "Was Britain right/wrong to vote to leave EU?",
       subtitle = "YouGov Survey Results, 2-3 September 2019",
       caption = "Source: bit.ly/2lCJZVg",
       x = NULL, y = NULL) +
  scale_fill_manual(values = c("Wrong" = "#ef8a62",
                               "Right" = "#67a9cf",
                               "Don't know" = "gray")) +
  theme_minimal() #<<
```
:::

## Resource

1. [DataScienceBox's GitHub](https://github.com/tidyverse/datascience-box/tree/main/course-materials/_slides/u2-d03-viz-num)
2. [datasciencebox.org](https://datasciencebox.org/)
3. [Using ggplot2 in packages](https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html#best-practices-for-common-tasks)
