Data visualization for survey data
1 Question
I am looking for an R package that can analyze large survey data sets.
2 R package - survey
The survey package in R is designed specifically for analyzing data from complex surveys. It provides functions for descriptive statistics and general regression models for survey data that include design features such as clustering, stratification, and weighting.
Here are some of the core features of the survey package:
Descriptive Statistics: The package provides functions for computing descriptive statistics on survey data, including mean, total, and quantiles.
Regression Models: The package provides a variety of model fitting functions for binary and multi-category response, count data, survival data, and continuous response.
Design Effects: It allows calculation of design effects for complex survey designs.
Post-stratification and Raking: The package allows for adjusting the sampling weights to match known population margins.
Subpopulation Analysis: It includes functions for correctly handling analyses that are limited to a subset of the population (a subpopulation).
Variance Estimation: The survey package supports multiple methods of variance estimation, including Taylor series linearization, replicate weights, and the subbootstrap.
Remember that before you can use these functions, you will need to define a survey design object that specifies the features of your survey’s design (like the sampling method, strata, clustering, and weights).
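As a rough sketch of how a few of these features look in code, the example below uses the `api` data sets that ship with the survey package, not the data discussed in this post. The object names `dstrat`, `d_elem`, and `drep` are illustrative choices, and the design shown is only one of many that the package can describe.

```r
library(survey)

# Example data bundled with the survey package (California school API scores)
data(api)

# Stratified design: strata = school type, pw = sampling weight, fpc = stratum size
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Descriptive statistic with its design effect
mean_all <- svymean(~api00, dstrat, deff = TRUE)

# Subpopulation analysis: subset the design object rather than the raw data
d_elem <- subset(dstrat, stype == "E")
mean_elem <- svymean(~api00, d_elem)

# Variance estimation with replicate weights (default type picks a jackknife variant)
drep <- as.svrepdesign(dstrat)
mean_rep <- svymean(~api00, drep)

mean_all
mean_elem
mean_rep
```

Subsetting the design object instead of the data frame keeps the full design information available for the variance calculation, which is the point of the subpopulation feature listed above.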
To calculate the mean of a variable from a survey, first build the design object with svydesign() and then pass it to svymean(). The main arguments of svydesign() include:
| Argument | Description |
| --- | --- |
| variables | Formula or data frame specifying the variables measured in the survey. If NULL, the data argument is used. |
| ids | Formula or data frame specifying cluster ids from largest level to smallest level; ~0 or ~1 is a formula for no clusters. |
| probs | Formula or data frame specifying cluster sampling probabilities. |
Please replace mydata, weight, and variable with your actual data frame, weight column, and the variable you’re interested in, respectively.
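A minimal sketch of such a call, assuming a made-up data frame `mydata` with a `weight` column and a `variable` column (the values are invented purely for illustration):

```r
library(survey)

# Hypothetical data: four respondents with sampling weights
mydata <- data.frame(
  variable = c(2, 4, 6, 8),
  weight   = c(10, 20, 30, 40)
)

# Design object: no clusters (ids = ~1), weights taken from the weight column
design <- svydesign(ids = ~1, weights = ~weight, data = mydata)

# Weighted mean of the variable with its standard error
est <- svymean(~variable, design)
est
```

With these toy values the weighted mean is (2*10 + 4*20 + 6*30 + 8*40) / 100 = 6, which differs from the unweighted mean of 5 and shows why the weights matter.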
Remember, working with survey data can be complex due to the design features of surveys. The survey package in R provides a robust set of tools for dealing with this complexity.
3 An empirical example
The example I use here is a toy data set extracted from a real data set about eating disorders. The sample size is 500.
The measurement data contain 12 items, each ranging from 0 to 3. The demographic data contain 6 variables: age, gender, race, birthplace, height, and weight. The first step is to visualize the characteristics of the sample to get a big picture of the respondents.
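The data themselves are not distributed with this post. For readers who want to run the code, here is a hedged sketch that simulates a stand-in with the same shape. The list name `datList`, the item names `EDEQS1` to `EDEQS12`, and the demographic column names come from the code in this post; every value and every category label (for example `"female"` or `"A"`) is invented and has nothing to do with the real data.

```r
set.seed(2024)
n <- 500

# 12 items scored 0-3 (uniform random responses, purely illustrative)
measurement <- as.data.frame(
  matrix(sample(0:3, n * 12, replace = TRUE), nrow = n,
         dimnames = list(NULL, paste0("EDEQS", 1:12)))
)

# 6 demographic variables; category labels are hypothetical
description <- data.frame(
  age        = sample(18:40, n, replace = TRUE),
  gender     = sample(c("female", "male"), n, replace = TRUE),
  race       = sample(c("A", "B", "C"), n, replace = TRUE),
  birthplace = sample(c("X", "Y"), n, replace = TRUE),
  height     = round(rnorm(n, mean = 165, sd = 8)),
  weight     = round(rnorm(n, mean = 60, sd = 10))
)

# Same list structure that the code in this post reads from
datList <- list(measurement = measurement, description = description)
str(datList, max.level = 1)
```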
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, include = TRUE)
library(here)
library(glue)
library(readr)
library(bruceR)
library(xtable)
library(survey)
library(formattable) # format styles of table
library(reshape2)
library(tidyverse)
library(ggtext)
library(kableExtra)
options(knitr.kable.NA = '')
mycolors = c("#4682B4", "#B4464B", "#B4AF46",
             "#1B9E77", "#D95F02", "#7570B3",
             "#E7298A", "#66A61E", "#B4F60A")
softcolors = c("#B4464B", "#F3DCD4", "#ECC9C7",
               "#D9E3DA", "#D1CFC0", "#C2C2B4")
# Helper: kable with two-digit rounding and striped, condensed styling
mykbl <- function(x, ...) {
  kbl(x, digits = 2, ...) |>
    kable_styling(bootstrap_options = c("striped", "condensed"))
}
3.1 Descriptive statistics
We can use multiple R tools for descriptive statistics. bruceR is one of them.
description <- datList$description
bruceR::Freq(dplyr::select(description, gender:birthplace),
             varname = "gender")
freqTblVars = c("gender", "race", "birthplace")
# Convert one frequency table to a data frame labeled with its variable name
freqTable <- function(tbl, var) {
  tbl |> as.data.frame() |>
    tibble::rownames_to_column("Levels") |>
    dplyr::mutate(Variable = var)
}
# Stack the frequency tables of all categorical variables
freqTableComb = NULL
for (var in freqTblVars) {
  tbl = bruceR::Freq(dplyr::select(description, gender:birthplace), varname = var)
  freqTableComb = rbind(freqTableComb, freqTable(tbl = tbl, var = var))
}
# Move the Variable column to the front once, after the loop
freqTableComb <- freqTableComb |>
  relocate("Variable")
mykbl(freqTableComb)
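Alternatively, we can use the survey package for descriptive analysis. Because no weights, strata, or clusters are supplied here, the design is declared with ids = ~1 only, and svydesign() assumes equal sampling probabilities.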
library(survey)
# No clusters, strata, or weights: treated as an equal-probability sample
dexample = svydesign(ids = ~1,
                     data = datList$measurement)
summary(dexample)
## summary statistics for all measurement indicators
vars <- colnames(datList$measurement)
svymean(make.formula(vars), design = dexample, na.rm = TRUE)
svytotal(make.formula(vars), design = dexample, na.rm = TRUE)
3.2 Stacked barplot for survey data responses
survey = datList$measurement
# Count responses per item and convert counts to within-item percentages
survey <- survey |>
  mutate(ID = 1:nrow(survey)) |>
  mutate(across(starts_with("EDEQS"), \(x) factor(x, levels = 0:3))) |>
  pivot_longer(starts_with("EDEQS"), names_to = "items", values_to = "values") |>
  group_by(items) |>
  dplyr::count(values) |>
  dplyr::mutate(perc = n/sum(n) * 100)
# Horizontal stacked bars: one bar per item, one segment per response level
p = ggplot(survey) +
  geom_col(aes(y = factor(items, levels = paste0("EDEQS", 1:12)),
               x = perc,
               fill = values),
           position = position_stack(reverse = TRUE)) +
  labs(y = "", x = "Proportion (%)", title = "N and proportion of responses for items")
# Label only segments with at least 50 responses to avoid clutter
p = p + geom_text(aes(y = factor(items, levels = paste0("EDEQS", 1:12)),
                      x = perc, group = items,
                      label = ifelse(n >= 50, paste0(n, "(", round(perc, 1), "%)"), "")),
                  size = 3, color = "white",
                  position = position_stack(reverse = TRUE, vjust = 0.5))
p = p + scale_fill_manual(values = mycolors)
p
The plot clearly shows that item 7 has the highest proportion of level-0 responses, which needs to be theoretically justified.
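A reading taken from a plot can also be checked numerically. The sketch below is one way to do that in base R. It runs on simulated data in which item 7 is deliberately skewed toward 0, so the numbers it prints are not the real results; the data frame name `measurement` and the response probabilities are assumptions made for illustration only.

```r
# Simulated stand-in for the 12-item measurement data (hypothetical values)
set.seed(7)
n <- 500
measurement <- as.data.frame(
  sapply(paste0("EDEQS", 1:12), function(item) {
    # Item 7 is made to favor level 0; all other items are uniform over 0-3
    p <- if (item == "EDEQS7") c(0.7, 0.1, 0.1, 0.1) else rep(0.25, 4)
    sample(0:3, n, replace = TRUE, prob = p)
  })
)

# Percentage of level-0 responses per item
prop_zero <- colMeans(measurement == 0, na.rm = TRUE) * 100
round(sort(prop_zero, decreasing = TRUE), 1)

# Item with the largest share of level-0 responses
names(which.max(prop_zero))
```

Applied to the real measurement data, the same two lines for `prop_zero` and `which.max()` would give the exact percentages behind the bars.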