Example 01: Make Friends with R

Author: Jihong Zhang*, Ph.D

Affiliation: Educational Statistics and Research Methods (ESRM) Program*, University of Arkansas

1 How to use this file

  1. This file was created with Quarto, a document format that lets you review all of the R code directly on this webpage.

  2. To test a particular chunk of code, click the “copy” icon in the upper-right corner of the chunk block (see screenshot below).

    • Try copying the following code:

      a = 1 + 1
      b = a + 1
      print(b)
  3. To review the whole file, click </> Code next to the title of this page, then find View Source and click it. You can then copy the source and paste it into a newly created Quarto document.

2 Getting Started with R

2.1 What is R?

R is a powerful programming language and environment specifically designed for statistical computing and graphics. It’s free, open-source, and has a vast ecosystem of packages for data analysis, visualization, and machine learning.

2.2 Why R?

  • Free and Open Source: No licensing costs
  • Extensive Package Ecosystem: Over 18,000 packages available
  • Excellent for Statistics: Built by statisticians, for statisticians
  • Great Visualization: ggplot2 and other packages for beautiful graphics
  • Reproducible Research: R Markdown and Quarto for literate programming
  • Active Community: Large, helpful community of users

2.3 R vs RStudio

  • R: The programming language and computing environment

  • RStudio: An integrated development environment (IDE) that makes R easier to use

3 Comments in R

# R comments begin with a # -- there are no multiline comments

# RStudio highlights syntax as you type (exact colors depend on the editor theme):
#   GREEN: Comments and character values in single or double quotes
#   BLUE: Functions and keywords
#   BLACK: Variable names and values

# You can use the tab key to complete object names, functions, and arguments

# R is case sensitive. That means R and r are two different things.

# Good naming conventions:
#   - Use descriptive names: my_data instead of x
#   - Use underscores or dots: my_data or my.data
#   - Avoid spaces and special characters (except . and _)
#   - Don't start with numbers: 1data is invalid, data1 is valid

4 Basic Data Types in R

# R has several basic data types:

# 1. Numeric (double) - decimal numbers
numeric_value <- 3.14
class(numeric_value)

# 2. Integer - whole numbers
integer_value <- 42L  # The L suffix makes it an integer
class(integer_value)

# 3. Character (string) - text
character_value <- "Hello, R!"
class(character_value)

# 4. Logical (boolean) - TRUE/FALSE
logical_value <- TRUE
class(logical_value)

# 5. Complex - complex numbers
complex_value <- 3 + 4i
class(complex_value)

# Check the type of any object
typeof(numeric_value)
is.numeric(numeric_value)
is.character(character_value)
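
The data types above interact when they are mixed or converted. The following is a minimal sketch of type coercion; the object names (mixed_num_chr, mixed_num_lgl, not_a_number) are made up for illustration and are not part of the course files.

```r
# Mixing types in c() coerces every element to the most flexible type
mixed_num_chr <- c(1, "a", TRUE)
class(mixed_num_chr)   # character: numbers and logicals become text

mixed_num_lgl <- c(1.5, TRUE, FALSE)
mixed_num_lgl          # TRUE becomes 1 and FALSE becomes 0

# Explicit conversion with the as.*() family of functions
as.integer("42")
as.numeric("3.14")
as.character(100)
as.logical("TRUE")

# Text that cannot be read as a number becomes NA (normally with a warning)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)
```

Coercion is why a single stray text value in a column can turn the whole column into character data after import, so it is worth checking class() early.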

5 R Functions

# In R, almost everything that happens is a function call

# The print function prints the contents of what is inside to the console
print(x = 10)

# The terms inside the function are called the arguments; here print takes x
#   To find help with what the arguments are use:
?print

# Each function returns an object
print(x = 10)

# You can determine what type of object is returned by using the class function
class(print(x = 10))

# Function syntax: function_name(argument1, argument2, ...)
# Examples of common functions:
sqrt(16)           # Square root
abs(-5)            # Absolute value
round(3.14159, 2)  # Round to 2 decimal places
length(c(1,2,3,4)) # Length of a vector
sum(c(1,2,3,4))    # Sum of values
mean(c(1,2,3,4))   # Mean of values
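
Besides the built-in functions above, you can write your own with function(). The sketch below defines a hypothetical helper, rescale01(), which is not part of base R or of the course materials; it only illustrates arguments, default values, and return values.

```r
# Hypothetical helper: rescale a numeric vector to the 0-1 range
# na.rm has a default value, so the caller may leave it out
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)       # smallest and largest value
  (x - rng[1]) / (rng[2] - rng[1])     # the last expression is returned
}

# Arguments matched by position
rescaled <- rescale01(c(2, 4, 6, 10))
rescaled

# Arguments matched by name; missing values stay missing
rescaled_na <- rescale01(x = c(0, 5, NA, 10), na.rm = TRUE)
rescaled_na
```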

6 Vectors - The Building Blocks

# Vectors are the most basic data structure in R
# All elements of a vector must be of the same type

# Creating vectors
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE)

# Using the colon operator for sequences
sequence <- 1:10
sequence

# Using seq() function for more control
seq(from = 1, to = 10, by = 2)
seq(1, 10, length.out = 5)

# Using rep() to repeat values
rep(5, times = 3)
rep(c(1, 2), times = 3)
rep(c(1, 2), each = 3)

# Vector operations
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

x + y    # Element-wise addition
x * y    # Element-wise multiplication
x^2      # Element-wise exponentiation
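
Two more vector ideas come up constantly: indexing and recycling. This is a small sketch with a made-up named vector (scores); the values are illustrative only.

```r
# A named numeric vector (values are made up)
scores <- c(math = 90, reading = 85, science = 78)

scores[1]                 # by position (R counts from 1, not 0)
scores["reading"]         # by name
scores[c(1, 3)]           # several positions at once
scores[-1]                # everything except the first element
high <- scores[scores > 80]   # logical indexing keeps elements where TRUE
high

# Recycling: the shorter vector is repeated to match the longer one,
# so this is the same as c(1, 2, 3, 4) + c(10, 20, 10, 20)
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled
```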

7 R Objects

# Each object can be saved into the R environment (the workspace here)
#   You can save the results of a function call to a variable of any name
MyObject = print(x = 10)
class(MyObject)

# You can view the objects you have saved in the Environment tab in RStudio
# Or type their name
MyObject

# There are literally thousands of types of objects in R (you can create them),
#   but for our course we will mostly be working with data frames (more later)

# The process of saving the results of a function to a variable is called 
#   assignment. There are several ways you can assign function results to 
#   variables:

# The equals sign takes the result from the right-hand side and assigns it to
#   the variable name on the left-hand side:
MyObject = print(x = 10)

# The <- (Alt "-" in RStudio) functions like the equals (right to left)
MyObject2 <- print(x = 10)

identical(MyObject, MyObject2)

# The -> assigns from left to right:
print(x = 10) -> MyObject3

# identical() compares exactly two objects, so check the third one separately
identical(MyObject, MyObject3)

# Best practice: Use <- for assignment (more explicit)
# Use = only for function arguments

8 Working with Data Structures

8.1 Lists

# Lists can contain elements of different types
my_list <- list(
  name = "John",
  age = 30,
  scores = c(85, 90, 78),
  passed = TRUE
)

# Accessing list elements
my_list$name
my_list[["age"]]
my_list[[3]]

# Lists are very flexible and useful for complex data structures
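
One detail that often trips up new users is the difference between single and double brackets on a list. The sketch below uses a made-up list (student) to show it, along with modifying elements in place.

```r
# A small made-up list
student <- list(name = "John", scores = c(85, 90, 78))

# [ ] returns a smaller list; [[ ]] (or $) returns the element itself
class(student["scores"])     # list
class(student[["scores"]])   # numeric

# Add a new element or modify an existing one by assignment
student$passed <- TRUE
student$scores[2] <- 95

# Apply a function to every element of the list
element_lengths <- sapply(student, length)
element_lengths

# str() gives a compact overview of the whole structure
str(student)
```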

8.2 Matrices

# Matrices are 2-dimensional arrays with the same data type
my_matrix <- matrix(1:12, nrow = 3, ncol = 4)
my_matrix

# Creating matrices from vectors
matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)

# Matrix operations
matrix1 <- matrix(1:4, nrow = 2)
matrix2 <- matrix(5:8, nrow = 2)
matrix1 + matrix2
matrix1 * matrix2  # Element-wise multiplication
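
The * operator above multiplies element by element; true matrix multiplication uses %*%. The following sketch also shows how matrix() fills values and how to index a matrix. The object names (m, a, b, ab) are chosen here for illustration.

```r
# matrix() fills column by column unless byrow = TRUE
m <- matrix(1:6, nrow = 2, ncol = 3)
m
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

m[1, 2]      # row 1, column 2
m[, 3]       # all rows of column 3
dim(m)       # number of rows, number of columns
t(m)         # transpose: rows become columns

# %*% is matrix multiplication; * is element-wise multiplication
a <- matrix(1:4, nrow = 2)
b <- matrix(5:8, nrow = 2)
ab <- a %*% b
ab
```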

9 Importing and Exporting Data

  • The data frame is an R object that holds a rectangular array of data. Each column is a variable, and variables can be of any class (e.g., numeric, character). Each row is an observation.

  • We will start by importing data from a comma-separated values (csv) file.

  • We will use the read.csv() function. Here, the argument stringsAsFactors = FALSE keeps R from converting character strings into factors (this has been the default since R 4.0.0).

  • We will use the here::here() function to build the path to the target data file relative to the project root.

# You can also set the directory using setwd(). Here, I set my directory to 
#   my root folder:
# setwd("~")

getwd()
dir()
# Method 1: Build the path with here::here(), which resolves it relative to
#   the project root regardless of the current working directory:
HeightsData = read.csv(file = here::here("teaching/2025-01-13-Experiment-Design/Lecture01", "heights.csv"), stringsAsFactors = FALSE)

# Method 2: I can use the full path to the file:
# HeightsData = 
#   read.csv(
#     file = "/Users/jihong/Documents/website-jihong/teaching/2024-07-21-applied-multivariate-statistics-esrm64503/Lecture01/data/heights.csv", 
#     stringsAsFactors = FALSE)

# Or, I can reset the current directory and use the previous syntax:
# setwd("/Users/jihong/Documents/website-jihong/teaching/2024-07-21-applied-multivariate-statistics-esrm64503/Lecture01/data/")

HeightsData = read.csv(file = "teaching/2025-01-13-Experiment-Design/Lecture01/heights.csv",
                       stringsAsFactors = FALSE)
HeightsData

# Note: Windows users must either use forward slashes (/) or double
#   backslashes (\\) between folder levels.

# To show my data in RStudio, I can either double click it in the 
#   Environment tab or use the View() function
# View(HeightsData)

# You can see the variable names and contents by using the $:
HeightsData$ID

# To read in SPSS files, we will need the foreign library. The foreign
#   library comes installed with R (so no need to use install.packages()).
library(foreign)

# The read.spss() function imports the SPSS file to an R data frame if the 
#   argument to.data.frame is TRUE
WideData = read.spss(file = "teaching/2025-01-13-Experiment-Design/Lecture01/wide.sav", 
                     to.data.frame = TRUE)
WideData
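
If you do not have the course data files at hand, the import step can still be tried end to end. The sketch below writes a small made-up data frame (toy) to a temporary csv file and reads it back; the column names and values are invented for illustration and do not come from heights.csv.

```r
# A small made-up data frame (values are illustrative only)
toy <- data.frame(
  ID = 1:3,
  Group = c("a", "b", "a"),
  Score = c(10.5, 12.0, 9.5),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in your project folder is touched
tmp_file <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp_file, row.names = FALSE)

# Confirm the file exists before trying to read it
file.exists(tmp_file)

# Read it back and check that the round trip preserved the data
toy_read <- read.csv(file = tmp_file, stringsAsFactors = FALSE)
str(toy_read)
all.equal(toy, toy_read)
```

Checking file.exists() first is a cheap way to tell a wrong path apart from a problem with the file contents.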

10 Working with Data Frames

# Data frames are the most common data structure for statistical analysis
# They are like spreadsheets with rows (observations) and columns (variables)

# Basic data frame operations
dim(HeightsData)        # Dimensions (rows, columns)
nrow(HeightsData)       # Number of rows
ncol(HeightsData)       # Number of columns
names(HeightsData)      # Column names
str(HeightsData)        # Structure of the data frame
head(HeightsData)       # First 6 rows
tail(HeightsData)       # Last 6 rows
summary(HeightsData)    # Summary statistics

# Accessing data frame elements
HeightsData[1, 2]       # Row 1, Column 2
HeightsData[1:5, ]      # Rows 1-5, all columns
HeightsData[, "ID"]     # All rows, column named "ID"
HeightsData$ID          # Same as above (preferred method)

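# Subsetting data frames (this assumes a numeric column named Height; check
#   names(HeightsData) and adjust the column name and cutoff to your data)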
subset(HeightsData, Height > 170)
HeightsData[HeightsData$Height > 170, ]
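
The same operations can be practiced without any data file. This sketch builds a made-up data frame (people) with data.frame(); the heights are invented and assumed to be in centimeters.

```r
# A made-up data frame; Height is assumed to be in centimeters
people <- data.frame(
  ID = 1:5,
  Height = c(160, 172, 181, 168, 175),
  stringsAsFactors = FALSE
)

# Filter rows and select columns in one step: [rows, columns]
tall <- people[people$Height > 170, c("ID", "Height")]
tall

# Add a new column computed from an existing one
people$HeightM <- people$Height / 100

# Sort rows by a column, tallest first
people_sorted <- people[order(people$Height, decreasing = TRUE), ]
people_sorted
```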

10.1 Exercise

  • Obtain the information from WideData
    • Dimensions (rows, columns)
    • Number of rows
    • Number of columns
    • Column names
    • Structure of the data frame
    • First 6 rows
    • Last 6 rows
    • Summary statistics

11 Merging R data frame objects

# The WideData and HeightsData have the same set of ID numbers. We can use the
#   merge() function to merge them into a single data frame. Here, x is the
#   name of the left-side data frame and y is the name of the right-side data
#   frame. The arguments by.x and by.y are the name of the variable(s) by
#   which we will merge:
AllData = merge(x = WideData, y = HeightsData, by.x = "ID", by.y = "ID")
AllData

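## Method 2: Use dplyr's left_join(). In RStudio, Cmd/Ctrl + Shift + M inserts
##   a pipe (%>% by default; |> if "Use native pipe operator" is enabled under
##   Tools > Global Options > Code)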
library(dplyr)
WideData |> 
  left_join(HeightsData, by = "ID")

# Different types of joins:
# left_join(): Keep all rows from left table
# right_join(): Keep all rows from right table  
# inner_join(): Keep only rows that appear in both tables
# full_join(): Keep all rows from both tables
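
The join types listed above only differ when the two tables do not share exactly the same IDs. The sketch below shows this with base R's merge() on two made-up data frames (left and right); the all, all.x, and all.y arguments of merge() play the role of the dplyr join types.

```r
# Two made-up tables that share only IDs 2 and 3
left  <- data.frame(ID = c(1, 2, 3), DV = c(10, 8, 14))
right <- data.frame(ID = c(2, 3, 4), Height = c(165, 170, 180))

# Inner join: only IDs present in both tables
inner <- merge(left, right, by = "ID")
inner

# Left join: every row of the left table; unmatched Height becomes NA
left_joined <- merge(left, right, by = "ID", all.x = TRUE)
left_joined

# Full join: every ID from either table
full <- merge(left, right, by = "ID", all = TRUE)
full
```

Comparing nrow() before and after a merge is a quick check that the join kept the rows you expected.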

12 Transforming Wide to Long

# Sometimes, certain packages require repeated measures data to be in a long
# format. 

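library(dplyr) # provides filter() and the starts_with() selection helper

## Wrong Way: pivoting DV and Age separately pairs every DV time point with
##   every Age time point, so each person ends up with too many rows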
AllDataLong <- AllData |> 
  tidyr::pivot_longer(starts_with("DVTime"), names_to = "DV", values_to = "DV_Value") |> 
  tidyr::pivot_longer(starts_with("AgeTime"), names_to = "Age", values_to = "Age_Value") 

OnePerson <- AllDataLong  |> 
  filter(ID == "1")

OnePerson

## Correct Way: pivot both sets of columns together, split the column names
##   into variable and time, then spread the variables back into columns
AllDataLong <- AllData |> 
  tidyr::pivot_longer(c(starts_with("DVTime"), starts_with("AgeTime"))) |> 
  tidyr::separate(name, into = c("Variable", "Time"), sep = "Time") |> 
  tidyr::pivot_wider(names_from = "Variable", values_from = "value")

OnePerson <- AllDataLong |> 
  filter(ID == "1")
OnePerson

# Understanding data reshaping:
# Wide format: Each time point has its own column
# Long format: Time points are in rows, with a time variable
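
For comparison, the same wide-to-long reshaping can be sketched with base R's reshape() function, which needs no extra packages. The data frame below (wide) is made up and only mimics the DVTime/AgeTime naming pattern used above; it is not the course data.

```r
# A made-up wide data set: one row per person, one column per time point
wide <- data.frame(
  ID = c(1, 2),
  DVTime1 = c(10, 8),   DVTime2 = c(12, 11),
  AgeTime1 = c(20, 19), AgeTime2 = c(21, 20)
)

# varying lists the wide columns for each variable, in time order;
# v.names gives the matching names of the long-format columns
long <- reshape(
  wide,
  direction = "long",
  varying = list(c("DVTime1", "DVTime2"), c("AgeTime1", "AgeTime2")),
  v.names = c("DV", "Age"),
  timevar = "Time",
  times = 1:2,
  idvar = "ID"
)

# Sort so that each person's time points sit together
long <- long[order(long$ID, long$Time), ]
long
```

The result has one row per person per time point (2 people x 2 time points = 4 rows), which is the row count to expect from a correct reshape.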

12.1 Exercise

12.1.1 Practice: Wide to Long with dplyr

In this exercise, practice reshaping repeated-measures data from wide to long using a dplyr pipeline (with tidyr helpers inside the pipe).

  1. Create the small wide data frame below.
  2. Reshape to long so that you have columns id, time, dv, and age.
  3. Compute the mean of dv by time as a quick check.
# Load packages
library(dplyr)
library(tidyr)

# 1) Start from a small wide toy data set
toy_wide <- tibble::tribble(
  ~id, ~dv_time1, ~dv_time2, ~dv_time3, ~age_time1, ~age_time2, ~age_time3,
   1,        10,        12,        15,         20,         21,         22,
   2,         8,        11,        11,         19,         20,         21,
   3,        14,        13,        16,         21,         22,         23
)

# 2) YOUR TURN: Convert to long using a single dplyr pipeline
#    Goal columns: id, time (1/2/3), dv, age
#    Hints:
#      - Use pivot_longer() on both dv_ and age_ columns together
#      - Separate the column name into variable (dv/age) and time (1/2/3)
#      - Use pivot_wider() to spread variable back into dv and age columns

12.1.1.1 Optional solution

toy_long <- toy_wide |> 
  pivot_longer(
    cols = c(starts_with("dv_"), starts_with("age_")),
    names_to = "name",
    values_to = "value"
  ) |> 
  separate(name, into = c("variable", "time"), sep = "_time") |> 
  pivot_wider(names_from = variable, values_from = value) |> 
  mutate(time = as.integer(time))

toy_long

toy_long |> 
  group_by(time) |> 
  summarize(mean_dv = mean(dv, na.rm = TRUE), .groups = "drop")

13 Data Manipulation with dplyr

# dplyr provides a grammar for data manipulation

# Select columns
AllData |> 
  select(ID, starts_with("DV"))

# Filter rows
AllData |> 
  filter(ID < 5)

# Arrange rows
AllData |> 
  arrange(ID)

# Create new variables
AllData |> 
  mutate(
    DV_avg = (DVTime1 + DVTime2 + DVTime3) / 3,
    DV_range = DVTime3 - DVTime1
  )

# Group and summarize
AllDataLong |> 
  group_by(Time) |> 
  summarize(
    mean_DV = mean(DV, na.rm = TRUE),
    sd_DV = sd(DV, na.rm = TRUE),
    n = n()
  )
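
To see what group_by() and summarize() compute, it can help to reproduce the result in base R. This is a rough sketch on a made-up long data set (toy_dv); aggregate() and tapply() are base R functions, and the values are invented.

```r
# A made-up long data set: 3 people measured at 2 time points
toy_dv <- data.frame(
  ID = rep(1:3, each = 2),
  Time = rep(c("1", "2"), times = 3),
  DV = c(10, 12, 8, 11, 14, 13)
)

# Mean of DV within each level of Time, returned as a data frame
means_by_time <- aggregate(DV ~ Time, data = toy_dv, FUN = mean)
means_by_time

# The same idea with tapply(), returned as a named vector
sd_by_time <- tapply(toy_dv$DV, toy_dv$Time, sd)
sd_by_time

# Number of rows per group
n_by_time <- table(toy_dv$Time)
n_by_time
```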

14 Gathering Descriptive Statistics

# The psych package makes getting descriptive statistics very easy.
## install.packages("psych")
library(psych)

# We can use describe() to get descriptive statistics across all cases:
DescriptivesWide = describe(AllData)
DescriptivesWide

DescriptivesLong = describe(AllDataLong)
DescriptivesLong

# We can use describeBy() to get descriptive statistics by groups:
DescriptivesLongID = describeBy(AllDataLong, group = AllDataLong$ID)
DescriptivesLongID

# Basic descriptive statistics without packages:
mean(AllDataLong$DV, na.rm = TRUE)
median(AllDataLong$DV, na.rm = TRUE)
sd(AllDataLong$DV, na.rm = TRUE)
var(AllDataLong$DV, na.rm = TRUE)
min(AllDataLong$DV, na.rm = TRUE)
max(AllDataLong$DV, na.rm = TRUE)
quantile(AllDataLong$DV, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

15 Transforming Data

# Transforming data is accomplished by the creation of new variables. 
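# Here we center Age at its mean (na.rm = TRUE ignores missing values when
#   computing the mean):
AllDataLong$AgeC = AllDataLong$Age - mean(AllDataLong$Age, na.rm = TRUE)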

# You can also use functions to create new variables. Here we create new terms
#   using the function for significant digits:
AllDataLong$AgeYear = signif(x = AllDataLong$Age, digits = 2)
AllDataLong$AgeDecade = signif(x = AllDataLong$Age, digits = 1)
head(AllDataLong)

# Common data transformations:
# Centering: subtract mean
# Standardizing: (x - mean) / sd
# Log transformation: log(x)
# Square root: sqrt(x)
# Recoding: ifelse(condition, value_if_true, value_if_false)

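# Example: Create standardized variables
# scale() returns a one-column matrix, so as.numeric() converts the result
#   back to a plain numeric vector before storing it as a column
AllDataLong$DV_z <- as.numeric(scale(AllDataLong$DV))
AllDataLong$Age_z <- as.numeric(scale(AllDataLong$Age))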
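
The transformations listed above can be checked by hand on a short vector. This sketch uses made-up values (x) and confirms that standardizing with the formula (x - mean) / sd matches what scale() returns.

```r
# Made-up values for illustration
x <- c(2, 4, 6, 8, 10)

# Centering: the centered variable has mean 0
x_centered <- x - mean(x)
mean(x_centered)

# Standardizing: the z-scores have mean 0 and standard deviation 1
x_z <- (x - mean(x)) / sd(x)
mean(x_z)
sd(x_z)

# scale() does the same thing (as.numeric() drops the matrix structure)
all.equal(x_z, as.numeric(scale(x)))

# Recoding with ifelse(): one label per element
x_group <- ifelse(x > 5, "high", "low")
x_group
```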

16 Basic Plotting

# R has excellent plotting capabilities

# Base R plotting
hist(AllDataLong$DV, main = "Distribution of DV", xlab = "DV Values")
boxplot(DV ~ Time, data = AllDataLong, main = "DV by Time")
plot(AllDataLong$Age, AllDataLong$DV, main = "DV vs Age")

# Using ggplot2 (more modern and flexible)
# If you have not installed the package yet, run install.packages("ggplot2")
library(ggplot2)

# Histogram
ggplot(AllDataLong, aes(x = DV)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of DV", x = "DV Values", y = "Count")

# Boxplot
ggplot(AllDataLong, aes(x = Time, y = DV)) +
  geom_boxplot() +
  labs(title = "DV by Time")

# Scatter plot
ggplot(AllDataLong, aes(x = Age, y = DV)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "DV vs Age")

17 Control Structures

# Conditional statements
x <- 10
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}

# Loops
for (i in 1:5) {
  print(paste("Iteration", i))
}

# While loop
i <- 1
while (i <= 5) {
  print(paste("While iteration", i))
  i <- i + 1
}

# Apply functions (more R-like than loops)
numbers <- 1:10
sapply(numbers, function(x) x^2)
lapply(numbers, function(x) x^2)
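
In many cases neither a loop nor an apply function is needed, because R operations are already vectorized. The sketch below compares the approaches on the same 1:10 vector; vapply() is a stricter relative of sapply() that checks the type of each result.

```r
numbers <- 1:10

# Vectorized: the operation is applied to every element at once
squares_vec <- numbers^2

# vapply() states the expected result type (one numeric value per element)
squares_vapply <- vapply(numbers, function(x) x^2, FUN.VALUE = numeric(1))

# Both approaches give the same answer
identical(squares_vec, squares_vapply)

# ifelse() is the vectorized counterpart of if/else
parity <- ifelse(numbers %% 2 == 0, "even", "odd")
parity
```

sapply() returns a vector when it can, while lapply() always returns a list, which is why the two calls above print differently.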

18 Working with Missing Data

# R uses NA to represent missing data
# Check for missing values
is.na(AllDataLong$DV)
sum(is.na(AllDataLong$DV))
complete.cases(AllDataLong)

# Remove rows with missing data
AllDataLong_complete <- na.omit(AllDataLong)
# or
AllDataLong_complete <- AllDataLong[complete.cases(AllDataLong), ]

# Replace missing values
AllDataLong$DV_imputed <- ifelse(is.na(AllDataLong$DV), 
                                 mean(AllDataLong$DV, na.rm = TRUE), 
                                 AllDataLong$DV)
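
The effect of NA values is easiest to see on a short vector. This sketch uses a made-up vector (dv) with two missing values; mean imputation is shown only to mirror the code above, not as a recommended way to handle missing data.

```r
# A made-up vector with two missing values
dv <- c(10, NA, 14, 8, NA)

# Any calculation that touches NA returns NA unless told otherwise
mean(dv)
mean_observed <- mean(dv, na.rm = TRUE)   # uses only 10, 14, and 8
mean_observed

# Count and locate the missing values
n_missing <- sum(is.na(dv))
which(is.na(dv))

# Mean imputation: replace each NA with the mean of the observed values
dv_imputed <- ifelse(is.na(dv), mean_observed, dv)
dv_imputed
```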

19 Best Practices and Tips

# 1. Always use meaningful variable names
# 2. Comment your code
# 3. Use consistent formatting
# 4. Check your data after importing
# 5. Save your work regularly
# 6. Use version control (Git)
# 7. Write reproducible code
# 8. Use packages for common tasks
# 9. Learn to use help documentation
# 10. Practice regularly!

# Useful keyboard shortcuts in RStudio:
# Ctrl+Enter (Cmd+Enter on Mac): Run current line/selection
# Ctrl+Shift+Enter: Run entire script
# Ctrl+Shift+M (Cmd+Shift+M on Mac): Insert pipe operator (%>% by default; |> if the native pipe option is enabled)
# Ctrl+Shift+C: Comment/uncomment lines
# Ctrl+Shift+R: Insert code section

20 Getting Help

# R has excellent help documentation
?mean                    # Help for a function
??"regression"          # Search all installed help pages for "regression"
help(mean)              # Same as ?mean
example(mean)           # Run examples for a function

# Online resources:
# - R Documentation: https://www.rdocumentation.org/
# - Stack Overflow: https://stackoverflow.com/questions/tagged/r
# - R-bloggers: https://www.r-bloggers.com/
# - RStudio Community: https://community.rstudio.com/

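# Installing and loading packages (replace package_name with a real package)
# install.packages("package_name")  # Install once
# library(package_name)             # Load each session
# require(package_name)             # Like library(), but returns FALSE with a
#                                   #   warning instead of an error if missing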
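
A common pattern is to install a package only when it is missing. The sketch below wraps base R's requireNamespace() in a hypothetical helper, is_installed(), which is not a built-in function; the install line is left commented out so that running the code does not download anything.

```r
# Hypothetical helper: TRUE if a package is installed, without attaching it
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this should be TRUE
is_installed("stats")

# Install only when needed (uncomment to use; psych is used earlier on this page)
# if (!is_installed("psych")) install.packages("psych")
```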