Data visualization for survey data

blog
ggplot2
Many tutorials online are about general data visualization. This post aims to showcase some tricks for survey data
Author

Jihong Zhang

Published

July 4, 2023

Code
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warnings=FALSE, include = TRUE)
library(here)
library(glue)
library(readr)
library(bruceR)
library(xtable)
library(formattable) # format styles of table 
library(reshape2)
library(tidyverse)
library(ggtext) 
library(kableExtra)
options(knitr.kable.NA = '')
mycolors = c("#4682B4", "#B4464B", "#B4AF46", 
             "#1B9E77", "#D95F02", "#7570B3",
             "#E7298A", "#66A61E", "#B4F60A")
softcolors = c("#B4464B", "#F3DCD4", "#ECC9C7", 
               "#D9E3DA", "#D1CFC0", "#C2C2B4")
mykbl <- function(x, ...){
  kbl(x, digits = 2, ...) |> kable_styling(bootstrap_options = c("striped", "condensed")) }
Code
datList <- readRDS(here::here("posts/2023-07-04-data-visualization-for-survey-data/Example_Data.RDS"))
str(datList)
List of 2
 $ measurement: tibble [500 × 12] (S3: tbl_df/tbl/data.frame)
  ..$ EDEQS1 : num [1:500] 1 0 3 2 0 0 1 0 2 0 ...
  ..$ EDEQS2 : num [1:500] 0 0 0 1 0 0 0 0 0 0 ...
  ..$ EDEQS3 : num [1:500] 0 0 0 2 0 0 0 0 2 0 ...
  ..$ EDEQS4 : num [1:500] 0 0 1 2 0 0 0 0 1 0 ...
  ..$ EDEQS5 : num [1:500] 3 1 3 1 0 1 1 0 2 1 ...
  ..$ EDEQS6 : num [1:500] 1 1 3 2 0 1 1 0 2 1 ...
  ..$ EDEQS7 : num [1:500] 0 0 0 1 0 0 0 0 0 0 ...
  ..$ EDEQS8 : num [1:500] 0 0 2 1 0 0 1 0 1 0 ...
  ..$ EDEQS9 : num [1:500] 0 1 2 2 0 0 2 0 0 0 ...
  ..$ EDEQS10: num [1:500] 0 0 1 1 0 0 1 0 0 0 ...
  ..$ EDEQS11: num [1:500] 1 1 2 1 0 1 0 0 1 1 ...
  ..$ EDEQS12: num [1:500] 2 1 3 2 0 1 1 0 2 1 ...
 $ description: tibble [500 × 6] (S3: tbl_df/tbl/data.frame)
  ..$ age       : num [1:500] 18 19 18 19 20 19 18 19 19 20 ...
  ..$ gender    : Factor w/ 2 levels "男","女": 2 1 2 2 2 2 2 1 1 2 ...
  ..$ race      : Factor w/ 2 levels "汉","少数": 2 1 1 1 1 1 1 1 1 1 ...
  ..$ birthplace: Factor w/ 2 levels "城市","农村": 2 2 2 1 1 2 2 1 2 2 ...
  ..$ height    : num [1:500] 160 160 160 160 163 158 170 176 183 155 ...
  ..$ weight    : num [1:500] 110 100 140 140 90 NA 105 121 180 90 ...

After finishing data preprocessing (e.g., data cleaning, missing cases dropping), it’s always a great habit to save a list of measurement data and demographic information. The example I used here is a pseudo data exact from real data. The sample size is 500.

The measurement data contains 12 items, each ranging from 0 to 3. The demographic data contains 6 variables: age, gender, race, birthplace, height, weight. The very first thing is to visualize the characteristics of the samples to have a big picture of respondents.

Code
description <- datList$description
bruceR::Freq(dplyr::select(description, gender:birthplace), 
             varname = "gender")
Frequency Statistics:
────────────
      N    %
────────────
男  172 34.4
女  328 65.6
────────────
Total N = 500
Code
mykbl(freqTableComb)
Variable Levels N %
gender 172 34.4
gender 328 65.6
race 449 89.8
race 少数 51 10.2
birthplace 城市 183 36.6
birthplace 农村 317 63.4
Code
survey = datList$measurement
survey <- survey |> 
  mutate(ID = 1:nrow(survey)) |> 
  mutate(across(starts_with("EDEQS"), \(x) factor(x, levels = 0:3))) |> 
  pivot_longer(starts_with("EDEQS"), names_to = "items", values_to = "values") |> 
  group_by(items) |> 
  dplyr::count(values) |> 
  dplyr::mutate(perc = n/sum(n) * 100)

p = ggplot(survey) +
  geom_col(aes(y = factor(items, levels = paste0("EDEQS", 1:12)),
               x = perc,
               fill = values), 
           position = position_stack(reverse = TRUE)) +
  labs(y = "", x = "Proportion (%)", title = "N and proportion of responses for items")

p = p + geom_text(aes(y = factor(items, levels = paste0("EDEQS", 1:12)),
                  x = perc, group = items,
                  label = ifelse(n >= 50, paste0(n, "(", round(perc, 1), "%)"), "")), 
              size = 3, color = "white",
              position = position_stack(reverse = TRUE, vjust = 0.5))
p = p + scale_fill_manual(values = mycolors)
p

We can clearly identify item 7 has highest proportion of level 0, and needed to be theoretically justified.

Back to top