Data visualization for survey data

blog
ggplot2
Many tutorials online are about general data visualization. This post aims to showcase some tricks for survey data
Author

Jihong Zhang

Published

July 4, 2023

1 R package - survey

1.1 What ChatGPT say about survey package

Me: I don’t know much about survey R package. Can you tell me know about core features of this package?

Sure, the survey package in R is designed specifically for analysis of data from complex surveys. It provides functions for descriptive statistics and general regression models for survey data that includes design features such as clustering, stratification, and weighting.

Here are some of the core features of the survey package:

  1. Descriptive Statistics: The package provides functions for computing descriptive statistics on survey data, including mean, total, and quantiles.

  2. Regression Models: The package provides a variety of model fitting functions for binary and multi-category response, count data, survival data, and continuous response.

  3. Design Effects: It allows calculation of design effects for complex survey designs.

  4. Post-stratification and Raking: The package allows for adjusting the sampling weights to match known population margins.

  5. Subpopulation Analysis: It includes functions for correctly handling analyses that are limited to a subset of the population (a subpopulation).

  6. Variance Estimation: The survey package supports multiple methods of variance estimation, including Taylor series linearization, replication weights, and subbootstrap.

Remember that before you can use these functions, you will need to define a survey design object that specifies the features of your survey’s design (like the sampling method, strata, clustering, and weights).

Here’s an example of how you might use it to calculate the mean of a variable from a survey:

# Load the necessary package
library(survey)

# Define the survey design
des <- svydesign(ids = ~1, data = mydata, weights = ~weight)

# Calculate the mean of a variable
mean <- svytotal(~variable, design = des)

Please replace mydata, weight, and variable with your actual data frame, weight column, and the variable you’re interested in, respectively.

Remember, working with survey data can be complex due to the design features of surveys. The survey package in R provides a robust set of tools for dealing with this complexity.

2 An empirical example

The example I used here is a tody data exacted from a real data about eating disorders. The sample size is 500.

The measurement data contains 12 items, each ranging from 0 to 3. The demographic data contains 6 variables: age, gender, race, birthplace, height, weight. The very first thing is to visualize the characteristics of the samples to have a big picture of respondents.

R Code
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warnings=FALSE, include = TRUE)
library(here)
library(glue)
library(readr)
library(bruceR)
library(xtable)
library(survey)
library(formattable) # format styles of table 
library(reshape2)
library(tidyverse)
library(ggtext) 
library(kableExtra)
options(knitr.kable.NA = '')
mycolors = c("#4682B4", "#B4464B", "#B4AF46", 
             "#1B9E77", "#D95F02", "#7570B3",
             "#E7298A", "#66A61E", "#B4F60A")
softcolors = c("#B4464B", "#F3DCD4", "#ECC9C7", 
               "#D9E3DA", "#D1CFC0", "#C2C2B4")
mykbl <- function(x, ...){
  kbl(x, digits = 2, ...) |> kable_styling(bootstrap_options = c("striped", "condensed")) }
R Code
datList <- readRDS(here::here("posts/2023-07-04-data-visualization-for-survey-data/Example_Data.RDS"))
str(datList)
List of 2
 $ measurement: tibble [500 × 12] (S3: tbl_df/tbl/data.frame)
  ..$ EDEQS1 : num [1:500] 1 0 3 2 0 0 1 0 2 0 ...
  ..$ EDEQS2 : num [1:500] 0 0 0 1 0 0 0 0 0 0 ...
  ..$ EDEQS3 : num [1:500] 0 0 0 2 0 0 0 0 2 0 ...
  ..$ EDEQS4 : num [1:500] 0 0 1 2 0 0 0 0 1 0 ...
  ..$ EDEQS5 : num [1:500] 3 1 3 1 0 1 1 0 2 1 ...
  ..$ EDEQS6 : num [1:500] 1 1 3 2 0 1 1 0 2 1 ...
  ..$ EDEQS7 : num [1:500] 0 0 0 1 0 0 0 0 0 0 ...
  ..$ EDEQS8 : num [1:500] 0 0 2 1 0 0 1 0 1 0 ...
  ..$ EDEQS9 : num [1:500] 0 1 2 2 0 0 2 0 0 0 ...
  ..$ EDEQS10: num [1:500] 0 0 1 1 0 0 1 0 0 0 ...
  ..$ EDEQS11: num [1:500] 1 1 2 1 0 1 0 0 1 1 ...
  ..$ EDEQS12: num [1:500] 2 1 3 2 0 1 1 0 2 1 ...
 $ description: tibble [500 × 6] (S3: tbl_df/tbl/data.frame)
  ..$ age       : num [1:500] 18 19 18 19 20 19 18 19 19 20 ...
  ..$ gender    : Factor w/ 2 levels "男","女": 2 1 2 2 2 2 2 1 1 2 ...
  ..$ race      : Factor w/ 2 levels "汉","少数": 2 1 1 1 1 1 1 1 1 1 ...
  ..$ birthplace: Factor w/ 2 levels "城市","农村": 2 2 2 1 1 2 2 1 2 2 ...
  ..$ height    : num [1:500] 160 160 160 160 163 158 170 176 183 155 ...
  ..$ weight    : num [1:500] 110 100 140 140 90 NA 105 121 180 90 ...

2.1 Descriptive statistics

We can use multiple R tools for descriptive statistics. bruceR is one of them.

R Code
description <- datList$description
bruceR::Freq(dplyr::select(description, gender:birthplace), 
             varname = "gender")
Frequency Statistics:
────────────
      N    %
────────────
男  172 34.4
女  328 65.6
────────────
Total N = 500
R Code
freqTblVars = c("gender", "race", "birthplace")
freqTable <- function(tbl, var) {
  tbl |> as.data.frame() |> 
    tibble::rownames_to_column("Levels") |> 
    dplyr::mutate(Variable = var)
}
freqTableComb = NULL
for (var in freqTblVars) {
  tbl = bruceR::Freq(dplyr::select(description, gender:birthplace), varname = var)
  freqTableComb = rbind(freqTableComb, freqTable(tbl = tbl, var = var))
  freqTableComb <- freqTableComb |> 
    relocate("Variable")
}
Frequency Statistics:
────────────
      N    %
────────────
男  172 34.4
女  328 65.6
────────────
Total N = 500
Frequency Statistics:
──────────────
        N    %
──────────────
汉    449 89.8
少数   51 10.2
──────────────
Total N = 500
Frequency Statistics:
──────────────
        N    %
──────────────
城市  183 36.6
农村  317 63.4
──────────────
Total N = 500
R Code
mykbl(freqTableComb)
Variable Levels N %
gender 172 34.4
gender 328 65.6
race 449 89.8
race 少数 51 10.2
birthplace 城市 183 36.6
birthplace 农村 317 63.4

Or we can use survey package for descriptive analysis

R Code
library(survey)
dexample = svydesign(ids = ~1,
                     data = datList$measurement)
Warning in svydesign.default(ids = ~1, data = datList$measurement): No weights
or probabilities supplied, assuming equal probability
R Code
summary(dexample)
Independent Sampling design (with replacement)
svydesign(ids = ~1, data = datList$measurement)
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 
Data variables:
 [1] "EDEQS1"  "EDEQS2"  "EDEQS3"  "EDEQS4"  "EDEQS5"  "EDEQS6"  "EDEQS7" 
 [8] "EDEQS8"  "EDEQS9"  "EDEQS10" "EDEQS11" "EDEQS12"
R Code
## summay statistics for all measurement indicators
vars <- colnames(datList$measurement)
svymean(make.formula(vars), design = dexample, na.rm = TRUE)
         mean     SE
EDEQS1  0.882 0.0439
EDEQS2  0.286 0.0283
EDEQS3  0.322 0.0280
EDEQS4  0.328 0.0270
EDEQS5  0.910 0.0448
EDEQS6  1.046 0.0500
EDEQS7  0.120 0.0203
EDEQS8  0.530 0.0382
EDEQS9  0.360 0.0295
EDEQS10 0.398 0.0283
EDEQS11 0.738 0.0346
EDEQS12 1.072 0.0413
R Code
svytotal(make.formula(vars), design = dexample, na.rm = TRUE)
        total     SE
EDEQS1    441 21.932
EDEQS2    143 14.160
EDEQS3    161 13.984
EDEQS4    164 13.512
EDEQS5    455 22.404
EDEQS6    523 25.004
EDEQS7     60 10.149
EDEQS8    265 19.112
EDEQS9    180 14.752
EDEQS10   199 14.149
EDEQS11   369 17.300
EDEQS12   536 20.646

2.2 Stacked barplot for survey data responses

R Code
survey = datList$measurement
survey <- survey |> 
  mutate(ID = 1:nrow(survey)) |> 
  mutate(across(starts_with("EDEQS"), \(x) factor(x, levels = 0:3))) |> 
  pivot_longer(starts_with("EDEQS"), names_to = "items", values_to = "values") |> 
  group_by(items) |> 
  dplyr::count(values) |> 
  dplyr::mutate(perc = n/sum(n) * 100)

p = ggplot(survey) +
  geom_col(aes(y = factor(items, levels = paste0("EDEQS", 1:12)),
               x = perc,
               fill = values), 
           position = position_stack(reverse = TRUE)) +
  labs(y = "", x = "Proportion (%)", title = "N and proportion of responses for items")

p = p + geom_text(aes(y = factor(items, levels = paste0("EDEQS", 1:12)),
                  x = perc, group = items,
                  label = ifelse(n >= 50, paste0(n, "(", round(perc, 1), "%)"), "")), 
              size = 3, color = "white",
              position = position_stack(reverse = TRUE, vjust = 0.5))
p = p + scale_fill_manual(values = mycolors)
p

We can clearly identify item 7 has highest proportion of level 0, and needed to be theoretically justified.

Back to top