= 5
x <- function(x) {x = 3; return(x)}
print_x x
[1] 5
R Function
Jihong Zhang*, Ph.D
Educational Statistics and Research Methods (ESRM) Program*
University of Arkansas
February 5, 2025
[1] 5
Error: object 'result' not found
<<-
to copy the object result
to the global enviroment.Non-constant Variance Score Test
Variance formula: ~ income
Chisquare = 1.520711, Df = 1, p = 0.21751
This fits variable prestige
as a linear function of predictor education
, and tests for nonconstant variance as a function of income
, another regressor in the data set Prestige.
lm
when call it in the functionf3 <- function(meanmod, dta, varmod) {
m3 <- lm(meanmod, dta)
car::ncvTest(m3, varmod)
}
f3(meanmod=prestige ~ education, dta=carData::Prestige, varmod = varmod ~ income)
Error in eval(data, envir = environment(formula(model))): object 'dta' not found
The arguments dta
and meanmod
are defined in the environment of the function, but the call to lm
looks for them in the global environment, and they are therefore invisible when lm
is called.
The assign function copies the dta
and meanmod
arguments to the global environment and rename them as .dta
and .meanmod
where ncvTest will be evaluated, and the remove function removes them before exiting the function.
f4 <- function(meanmod, dta, varmod) {
assign(".dta", dta, envir=.GlobalEnv)
assign(".meanmod", meanmod, envir=.GlobalEnv)
m1 <- lm(.meanmod, .dta)
ans <- car::ncvTest(m1, varmod)
remove(".dta", envir=.GlobalEnv)
remove(".meanmod", envir=.GlobalEnv)
ans
}
f4(prestige ~ education, carData::Prestige, ~income)
Non-constant Variance Score Test
Variance formula: ~ income
Chisquare = 1.520711, Df = 1, p = 0.21751
In this presentation, we will learn how to import data into R and tidy it for analysis. Data is often messy, but understanding how to manipulate it is a critical skill for any data scientist.
While R packages provide a great way to work with sample data, at some point, you’ll need to apply what you’ve learned to your own data.
In today’s lecture, we’ll focus on importing rectangular data, especially CSV files.
To read data in R, we’ll use the readr
package, which is part of the tidyverse
. Before you start, make sure to load the package.
Let’s start with a common file format: CSV. Here’s a simple example:
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
You can read this CSV file into R using the read_csv()
function.
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
After loading your data, you’ll typically need to transform it for easier analysis. For instance, handling missing values is essential.
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
In some cases, column names may contain spaces, making them non-syntactic in R. You can handle this by renaming columns:
students_fixedColNames <- students |>
rename(
student_id = `Student ID`,
full_name = `Full Name`,
favourite_food = `favourite.food`,
meal_plan = `mealPlan`
)
students_fixedColNames
# A tibble: 6 × 5
student_id full_name favourite_food meal_plan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
janitor::clean_names()
function for quick renaming. clean_names()
use some heuristics to turn them all into snake case at once1.# A tibble: 6 × 5
student_id full_name favourite_food meal_plan age
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
Ensure that the variables are of the correct type. For example, the meal_plan
column is categorical and should be a factor:
students_fixedDataType <- students_fixedColNames |>
mutate(meal_plan = factor(meal_plan))
students_fixedDataType
# A tibble: 6 × 5
student_id full_name favourite_food meal_plan AGE
<dbl> <chr> <chr> <fct> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
tibble [6 × 5] (S3: tbl_df/tbl/data.frame)
$ student_id : num [1:6] 1 2 3 4 5 6
$ full_name : chr [1:6] "Sunil Huffmann" "Barclay Lynn" "Jayendra Lyne" "Leon Rossini" ...
$ favourite_food: chr [1:6] "Strawberry yoghurt" "French fries" "N/A" "Anchovies" ...
$ meal_plan : Factor w/ 2 levels "Breakfast and lunch",..: 2 2 1 2 1 2
$ AGE : chr [1:6] "4" "5" "7" NA ...
Sometimes, your data is split across multiple files. You can read them all at once and combine them into a single data frame:
#| code-summary: "01-sales.csv"
#| code-fold: true
month,year,brand,item,n
January,2019,1,1234,3
January,2019,1,8721,9
January,2019,1,1822,2
January,2019,2,3333,1
January,2019,2,2156,9
January,2019,2,3987,6
January,2019,2,3827,6
#| code-summary: "02-sales.csv"
#| code-fold: true
month,year,brand,item,n
February,2019,1,1234,8
February,2019,1,8721,2
February,2019,1,1822,3
February,2019,2,3333,1
February,2019,2,2156,3
February,2019,2,3987,6
#| code-summary: "03-sales.csv"
#| code-fold: true
month,year,brand,item,n
March,2019,1,1234,3
March,2019,1,3627,1
March,2019,1,8820,3
March,2019,2,7253,1
March,2019,2,8766,3
March,2019,2,8288,6
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
sales_data <- read_csv(sales_files, id = "file")
sales_data
# A tibble: 19 × 6
file month year brand item n
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 data/01-sales.csv January 2019 1 1234 3
2 data/01-sales.csv January 2019 1 8721 9
3 data/01-sales.csv January 2019 1 1822 2
4 data/01-sales.csv January 2019 2 3333 1
5 data/01-sales.csv January 2019 2 2156 9
6 data/01-sales.csv January 2019 2 3987 6
7 data/01-sales.csv January 2019 2 3827 6
8 data/02-sales.csv February 2019 1 1234 8
9 data/02-sales.csv February 2019 1 8721 2
10 data/02-sales.csv February 2019 1 1822 3
11 data/02-sales.csv February 2019 2 3333 1
12 data/02-sales.csv February 2019 2 2156 3
13 data/02-sales.csv February 2019 2 3987 6
14 data/03-sales.csv March 2019 1 1234 3
15 data/03-sales.csv March 2019 1 3627 1
16 data/03-sales.csv March 2019 1 8820 3
17 data/03-sales.csv March 2019 2 7253 1
18 data/03-sales.csv March 2019 2 8766 3
19 data/03-sales.csv March 2019 2 8288 6
This will add a new column identifying which file the data came from.
You can also write data back to a file. For example, save the students
data frame as a CSV:
For more complex objects, consider using write_rds()
for saving R-specific objects.
Once you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:
read_csv2()
reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.
read_tsv()
reads tab-delimited files.
read_delim()
reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.
read_fwf()
reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().
read_table()
reads a common variation of fixed-width files where columns are separated by white space.
read_log()
reads Apache-style log files.
readxl
package provides functions to read-in Microsoft Excel formats.The main functions are:
Function | Format | Typical suffix |
---|---|---|
read_excel | auto detect the format | xls, xlsx |
read_xls | original format | xls |
read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as sheets. The functions listed above read the first sheet by default, but we can also read the others. The excel_sheets function gives us the names of all the sheets in an Excel file. These names can then be passed to the sheet argument in the three functions above to read sheets other than the first.
Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script.
You can use tibble
and tribble
to create a new data frame
# A tibble: 3 × 3
x y z
<dbl> <chr> <dbl>
1 1 h 0.08
2 2 m 0.83
3 5 g 0.6
tribble()
is customized for data entry in code: column headings start with ~ and entries are separated by commas.Once you’ve loaded your data, the next step is often to tidy it for analysis. In this section, we’ll focus on transforming messy datasets into tidy ones.
Tidy data has three key rules:
Tidy data is easy to work with in R and other data analysis tools.
Tidy data allows for consistent use of tools like dplyr
and ggplot2
. The vectorized nature of R works best with tidy data because most functions expect each variable to be in its own column.
To tidy messy data, we often need to pivot it. Pivoting allows us to transform the structure of data without changing the underlying values.
The billboard
dataset records the billboard rank of songs in the year 2000:
In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song.
Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week1.
# A tibble: 317 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
We use the pivot_longer()
function to convert wide data into a longer format. This is helpful when multiple pieces of data are spread across columns.
billboard |>
pivot_longer(
cols = starts_with("wk"), # pecifies which columns need to be pivoted
names_to = "week", # names the variable stored in the column names, we named that variable week
values_to = "rank" # names the variable stored in the cell values, we named that variable rank.
)
# A tibble: 24,092 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
# ℹ 24,082 more rows
When pivoting data, you might encounter missing values. You can remove them using the values_drop_na
argument.
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
# ℹ 5,297 more rows
After tidying your data, you might need to convert data types. For instance, convert week
from a character string to a number:
billboard_longer <- billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) |>
mutate(
week = parse_number(week)
)
billboard_longer
# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <dbl> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
# ℹ 5,297 more rows
# A tibble: 3 × 3
id bp1 bp2
<chr> <dbl> <dbl>
1 A 100 120
2 B 140 115
3 C 120 125
# A tibble: 6 × 3
id measurement value
<chr> <chr> <dbl>
1 A bp1 100
2 A bp2 120
3 B bp1 140
4 B bp2 115
5 C bp1 120
6 C bp2 125
Sometimes, you need to make your data wider. The pivot_wider()
function is used when you have one observation spread across multiple rows.
If your column names contain multiple pieces of information, you can split them into separate variables during the pivot process:
who2 |>
pivot_longer(
cols = !(country:year),
names_to = c("diagnosis", "gender", "age"),
names_sep = "_",
values_to = "count"
)
# A tibble: 405,440 × 6
country year diagnosis gender age count
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 1980 sp m 014 NA
2 Afghanistan 1980 sp m 1524 NA
3 Afghanistan 1980 sp m 2534 NA
4 Afghanistan 1980 sp m 3544 NA
5 Afghanistan 1980 sp m 4554 NA
6 Afghanistan 1980 sp m 5564 NA
7 Afghanistan 1980 sp m 65 NA
8 Afghanistan 1980 sp f 014 NA
9 Afghanistan 1980 sp f 1524 NA
10 Afghanistan 1980 sp f 2534 NA
# ℹ 405,430 more rows
If a dataset has both variable names and values in its columns, you can use .value
in pivot_longer()
to separate these:
household |>
pivot_longer(
cols = !family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
# A tibble: 9 × 4
family child dob name
<int> <chr> <date> <chr>
1 1 child1 1998-11-26 Susan
2 1 child2 2000-01-29 Jose
3 2 child1 1996-06-22 Mark
4 3 child1 2002-07-11 Sam
5 3 child2 2004-04-05 Seth
6 4 child1 2004-10-10 Craig
7 4 child2 2009-08-27 Khai
8 5 child1 2000-12-05 Parker
9 5 child2 2005-02-28 Gracie
In this chapter, you’ve learned how to transform your data into a tidy format using pivot_longer()
and pivot_wider()
. These tools are fundamental for cleaning messy data and making it ready for analysis.
Imagine you have a dataset that tracks the number of visitors to a museum for each day of the week. The data is stored in wide format, where each day of the week is a separate column:
visitors_wide <- tibble(
museum = c("Museum A", "Museum B", "Museum C"),
monday = c(120, 150, 110),
tuesday = c(130, 160, 120),
wednesday = c(140, 170, 130),
thursday = c(110, 140, 100),
friday = c(160, 180, 150)
)
visitors_wide
# A tibble: 3 × 6
museum monday tuesday wednesday thursday friday
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Museum A 120 130 140 110 160
2 Museum B 150 160 170 140 180
3 Museum C 110 120 130 100 150
Task 1: Use pivot_longer() to convert this data into a long format where each row represents a single observation of a visitor count for a specific day.
Hint: The names_to
argument should be set to “day” to create a column for the days of the week, and the values_to
argument should be set to “visitors” to store the number of visitors.
Now, imagine you have a dataset in long format that tracks the average test scores of students in different subjects over several terms. The data is structured like this:
scores_long <- tibble(
student = c("Alice", "Alice", "Bob", "Bob", "Charlie", "Charlie"),
term = c("Term 1", "Term 2", "Term 1", "Term 2", "Term 1", "Term 2"),
subject = c("Math", "Math", "Math", "Math", "Math", "Math"),
score = c(85, 90, 78, 80, 92, 95)
)
scores_long
# A tibble: 6 × 4
student term subject score
<chr> <chr> <chr> <dbl>
1 Alice Term 1 Math 85
2 Alice Term 2 Math 90
3 Bob Term 1 Math 78
4 Bob Term 2 Math 80
5 Charlie Term 1 Math 92
6 Charlie Term 2 Math 95
Task 2: Use pivot_wider() to convert this data into a wide format, where each term is a separate column, and the values represent the student scores.
Hint: The names_from argument should be set to “term” to create columns for each term, and the values_from argument should be set to “score” to get the student scores.
Imagine you have a dataset that tracks the number of calls received by a customer service center over three months for various departments. The dataset is in wide format like this:
calls_wide <- tibble(
department = c("Sales", "Support", "Billing"),
jan = c(200, 150, 180),
feb = c(210, 160, 190),
mar = c(220, 170, 200)
)
calls_wide
# A tibble: 3 × 4
department jan feb mar
<chr> <dbl> <dbl> <dbl>
1 Sales 200 210 220
2 Support 150 160 170
3 Billing 180 190 200
Task 3:
First, use pivot_longer()
to convert this dataset into a long format.
Then, use pivot_wider()
to convert the data back into a wide format, but with the months as columns and the number of calls as the values.
Now that you understand data tidying, you can begin organizing your analysis in R scripts. In the next chapter, we’ll explore how to use projects and organize your code into files and directories.
---
title: "Lecture 04: Data Import and Data Tidying"
subtitle: "R Function"
author: "Jihong Zhang*, Ph.D"
institute: |
Educational Statistics and Research Methods (ESRM) Program*
University of Arkansas
date: "2025-02-05"
sidebar: false
execute:
eval: true
echo: true
warning: false
output-location: default
code-annotations: below
highlight-style: "nord"
format:
html:
code-tools: true
code-line-numbers: false
code-fold: false
number-offset: 0
uark-revealjs:
scrollable: true
chalkboard: true
embed-resources: false
code-fold: false
number-sections: false
footer: "ESRM 64503 - Lecture 03: Object/Function/Package"
slide-number: c/t
tbl-colwidths: auto
output-file: slides-index.html
---
# Nested Function
## Function Scope
- **Local Scope (Environment)**: Variables defined inside a function are local to that function.
```{r}
x = 5
print_x <- function(x) {x = 3; return(x)}
x
```
- **Global Scope (Environment)**: Variables defined outside a function are accessible throughout the script.
------------------------------------------------------------------------
### Example: Local Scope
```{r}
#| error: true
add <- function(x, y) {
result <- x + y
return(result)
}
add(2, 3) # Output: 5
result # Error: object 'result' not found
```
- We can use `<<-` to copy the object `result` to the global enviroment.
```{r}
add <- function(x, y) {
result <<- x + y
return(result)
}
add(2, 3) # Output: 5
result
```
## Practical Example of Function Scoping
```{r}
library(car)
m2 <- lm(prestige ~ education, data=carData::Prestige)
car::ncvTest(m2, ~ income)
```
This fits variable `prestige` as a linear function of predictor `education`, and tests for nonconstant variance as a function of `income`, another regressor in the data set `Prestige.`
------------------------------------------------------------------------
### Issue of `lm` when call it in the function
```{r}
#| error: true
f3 <- function(meanmod, dta, varmod) {
m3 <- lm(meanmod, dta)
car::ncvTest(m3, varmod)
}
f3(meanmod=prestige ~ education, dta=carData::Prestige, varmod = varmod ~ income)
```
The arguments `dta` and `meanmod` are defined in the environment of the function, but the call to `lm` looks for them in the **global environment**, and they are therefore invisible when `lm` is called.
------------------------------------------------------------------------
### Solution: copy to global environment
The assign function copies the `dta` and `meanmod` arguments to the global environment and rename them as `.dta` and `.meanmod` where ncvTest will be evaluated, and the remove function removes them before exiting the function.
```{r}
f4 <- function(meanmod, dta, varmod) {
assign(".dta", dta, envir=.GlobalEnv)
assign(".meanmod", meanmod, envir=.GlobalEnv)
m1 <- lm(.meanmod, .dta)
ans <- car::ncvTest(m1, varmod)
remove(".dta", envir=.GlobalEnv)
remove(".meanmod", envir=.GlobalEnv)
ans
}
f4(prestige ~ education, carData::Prestige, ~income)
```
# Data Import
## Introduction to Data Import
In this presentation, we will learn how to import data into R and tidy it for analysis. Data is often messy, but understanding how to manipulate it is a critical skill for any data scientist.
------------------------------------------------------------------------
## Working with Your Own Data
- While R packages provide a great way to work with sample data, at some point, you'll need to apply what you've learned to your own data.
- In today's lecture, we'll focus on importing rectangular data, especially CSV files.
------------------------------------------------------------------------
## Prerequisites for Reading Data
To read data in R, we’ll use the `readr` package, which is part of the `tidyverse`. Before you start, make sure to load the package.
```{r}
library(tidyverse)
# or
library(readr)
```
------------------------------------------------------------------------
## Reading Data from a CSV File
Let’s start with a common file format: CSV. Here’s a simple example:
``` csv
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
```
You can read this CSV file into R using the `read_csv()` function.
```{r}
students <- read_csv("data/students.csv")
students
```
------------------------------------------------------------------------
## Practical Tips for Data Import
After loading your data, you'll typically need to transform it for easier analysis. For instance, handling missing values is essential.
```{r}
students_fixedNA <- read_csv("data/students.csv", na = c("N/A", ""))
students_fixedNA
```
------------------------------------------------------------------------
## Handling Non-Syntactic Column Names
In some cases, column names may contain spaces, making them non-syntactic in R. You can handle this by renaming columns:
```{r}
students_fixedColNames <- students |>
rename(
student_id = `Student ID`,
full_name = `Full Name`,
favourite_food = `favourite.food`,
meal_plan = `mealPlan`
)
students_fixedColNames
```
- Alternatively, use the `janitor::clean_names()` function for quick renaming. `clean_names()` use some heuristics to turn them all into snake case at once1.
```{r}
students |> janitor::clean_names()
```
------------------------------------------------------------------------
## Fixing Data Types
Ensure that the variables are of the correct type. For example, the `meal_plan` column is categorical and should be a factor:
```{r}
students_fixedDataType <- students_fixedColNames |>
mutate(meal_plan = factor(meal_plan))
students_fixedDataType
str(students_fixedDataType)
```
```{r}
table(students_fixedDataType$meal_plan)
```
------------------------------------------------------------------------
## Reading Data from Multiple Files
Sometimes, your data is split across multiple files. You can read them all at once and combine them into a single data frame:
```csv
#| code-summary: "01-sales.csv"
#| code-fold: true
month,year,brand,item,n
January,2019,1,1234,3
January,2019,1,8721,9
January,2019,1,1822,2
January,2019,2,3333,1
January,2019,2,2156,9
January,2019,2,3987,6
January,2019,2,3827,6
```
```csv
#| code-summary: "02-sales.csv"
#| code-fold: true
month,year,brand,item,n
February,2019,1,1234,8
February,2019,1,8721,2
February,2019,1,1822,3
February,2019,2,3333,1
February,2019,2,2156,3
February,2019,2,3987,6
```
```csv
#| code-summary: "03-sales.csv"
#| code-fold: true
month,year,brand,item,n
March,2019,1,1234,3
March,2019,1,3627,1
March,2019,1,8820,3
March,2019,2,7253,1
March,2019,2,8766,3
March,2019,2,8288,6
```
```{r}
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
sales_data <- read_csv(sales_files, id = "file")
sales_data
```
This will add a new column identifying which file the data came from.
------------------------------------------------------------------------
## Writing Data to a File
You can also write data back to a file. For example, save the `students` data frame as a CSV:
```{r}
#| eval: false
write_csv(students_fixedDataType, "data/students_final.csv")
```
For more complex objects, consider using `write_rds()` for saving R-specific objects.
## Other file types
- Once you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:
- `read_csv2()` reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.
- `read_tsv()` reads tab-delimited files.
- `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.
- `read_fwf()` reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().
- `read_table()` reads a common variation of fixed-width files where columns are separated by white space.
- `read_log()` reads Apache-style log files.
## Other Importing Packages
### readxl
- The `readxl` package provides functions to read-in Microsoft Excel formats.
```{r}
#| eval: false
library(readxl)
# Specify sheet either by position or by name
read_excel(datasets, 2)
read_excel(datasets, "mtcars")
```
The main functions are:
| Function | Format | Typical suffix |
|------------|------------------------|----------------|
| read_excel | auto detect the format | xls, xlsx |
| read_xls | original format | xls |
| read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as sheets. The functions listed above read the first sheet by default, but we can also read the others. The excel_sheets function gives us the names of all the sheets in an Excel file. These names can then be passed to the sheet argument in the three functions above to read sheets other than the first.
## Data Entry
- Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script.
- You can use `tibble` and `tribble` to create a new data frame
```{r}
tibble(
x = c(1, 2, 5),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.60)
)
```
- `tribble()` is customized for data entry in code: column headings start with \~ and entries are separated by commas.
```{r}
tribble(
~x, ~y, ~z,
1, "h", 0.08,
2, "m", 0.83,
5, "g", 0.60
)
```
# Data Tidying
## Introduction to Data Tidying
Once you’ve loaded your data, the next step is often to tidy it for analysis. In this section, we’ll focus on transforming messy datasets into tidy ones.
------------------------------------------------------------------------
## What is Tidy Data?
Tidy data has three key rules:
- Each variable is a column.
- Each observation is a row.
- Each value is a cell.
Tidy data is easy to work with in R and other data analysis tools.
------------------------------------------------------------------------
## Why Ensure Your Data is Tidy?
Tidy data allows for consistent use of tools like `dplyr` and `ggplot2`. The vectorized nature of R works best with tidy data because most functions expect each variable to be in its own column.
------------------------------------------------------------------------
## Pivoting Data: An Overview
- To tidy messy data, we often need to pivot it. Pivoting allows us to transform the structure of data without changing the underlying values.
- The `billboard` dataset records the billboard rank of songs in the year 2000:
- In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song.
- Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week1.
```{r}
billboard
```
------------------------------------------------------------------------
## Pivoting Data Longer
We use the `pivot_longer()` function to convert wide data into a longer format. This is helpful when multiple pieces of data are spread across columns.
```{r}
billboard |>
pivot_longer(
cols = starts_with("wk"), # pecifies which columns need to be pivoted
names_to = "week", # names the variable stored in the column names, we named that variable week
values_to = "rank" # names the variable stored in the cell values, we named that variable rank.
)
```
------------------------------------------------------------------------
## Dropping NA Values While Pivoting
When pivoting data, you might encounter missing values. You can remove them using the `values_drop_na` argument.
```{r}
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
```
------------------------------------------------------------------------
## Converting Columns to More Useful Data Types
After tidying your data, you might need to convert data types. For instance, convert `week` from a character string to a number:
```{r}
billboard_longer <- billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) |>
mutate(
week = parse_number(week)
)
billboard_longer
```
------------------------------------------------------------------------
## How it works: I
```{r}
#| code-fold: true
df <- tribble(
~id, ~bp1, ~bp2,
"A", 100, 120,
"B", 140, 115,
"C", 120, 125
)
df
df |>
pivot_longer(
cols = bp1:bp2,
names_to = "measurement",
values_to = "value"
)
```
------------------------------------------------------------------------
## How it works: II
```{r}
#| eval: false
pivot_longer(
cols = bp1:bp2,
names_to = "measurement",
values_to = "value"
)
```

## Widening Data: Pivoting in the Other Direction
Sometimes, you need to make your data wider. The `pivot_wider()` function is used when you have one observation spread across multiple rows.
------------------------------------------------------------------------
## Pivoting with Multiple Variables in Column Names
If your column names contain multiple pieces of information, you can split them into separate variables during the pivot process:
```{r}
who2 |>
pivot_longer(
cols = !(country:year),
names_to = c("diagnosis", "gender", "age"),
names_sep = "_",
values_to = "count"
)
```
------------------------------------------------------------------------
## When Data Contains Both Variable and Value Names
If a dataset has both variable names and values in its columns, you can use `.value` in `pivot_longer()` to separate these:
```{r}
household |>
pivot_longer(
cols = !family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
```
------------------------------------------------------------------------
## Summary of Data Tidying
In this chapter, you’ve learned how to transform your data into a tidy format using `pivot_longer()` and `pivot_wider()`. These tools are fundamental for cleaning messy data and making it ready for analysis.
------------------------------------------------------------------------
## Exercise: Practicing pivot_longer() and pivot_wider()
### Scenario 1: Using pivot_longer()
Imagine you have a dataset that tracks the number of visitors to a museum for each day of the week. The data is stored in wide format, where each day of the week is a separate column:
```{r}
#| eval: true
visitors_wide <- tibble(
museum = c("Museum A", "Museum B", "Museum C"),
monday = c(120, 150, 110),
tuesday = c(130, 160, 120),
wednesday = c(140, 170, 130),
thursday = c(110, 140, 100),
friday = c(160, 180, 150)
)
visitors_wide
```
**Task 1**: Use pivot_longer() to convert this data into a long format where each row represents a single observation of a visitor count for a specific day.
**Hint**: The `names_to` argument should be set to "day" to create a column for the days of the week, and the `values_to` argument should be set to "visitors" to store the number of visitors.
```{r}
#| echo: false
#| eval: false
visitors_long <- visitors_wide |>
pivot_longer(
cols = monday:friday,
names_to = "day",
values_to = "visitors"
)
visitors_long
```
------------------------------------------------------------------------
### Scenario 2: Using pivot_wider()
Now, imagine you have a dataset in long format that tracks the average test scores of students in different subjects over several terms. The data is structured like this:
```{r}
scores_long <- tibble(
student = c("Alice", "Alice", "Bob", "Bob", "Charlie", "Charlie"),
term = c("Term 1", "Term 2", "Term 1", "Term 2", "Term 1", "Term 2"),
subject = c("Math", "Math", "Math", "Math", "Math", "Math"),
score = c(85, 90, 78, 80, 92, 95)
)
scores_long
```
**Task 2**: Use pivot_wider() to convert this data into a wide format, where each term is a separate column, and the values represent the student scores.
**Hint**: The names_from argument should be set to "term" to create columns for each term, and the values_from argument should be set to "score" to get the student scores.
```{r}
#| echo: false
#| eval: false
scores_wide <- scores_long |>
pivot_wider(
names_from = term,
values_from = score
)
scores_wide
```
## Bonus Task: Combining pivot_longer() and pivot_wider()
Imagine you have a dataset that tracks the number of calls received by a customer service center over three months for various departments. The dataset is in wide format like this:
```{r}
calls_wide <- tibble(
department = c("Sales", "Support", "Billing"),
jan = c(200, 150, 180),
feb = c(210, 160, 190),
mar = c(220, 170, 200)
)
calls_wide
```
**Task 3**:
- First, use `pivot_longer()` to convert this dataset into a long format.
- Then, use `pivot_wider()` to convert the data back into a wide format, but with the months as columns and the number of calls as the values.
```{r}
#| eval: false
#| echo: false
calls_long <- calls_wide |>
pivot_longer(
cols = c("jan", "feb", "mar"),
names_to = "month",
values_to = "calls"
)
calls_long
calls_wide_again <- calls_long |>
pivot_wider(
names_from = month,
values_from = calls
)
calls_wide_again
```
## Next Steps
Now that you understand data tidying, you can begin organizing your analysis in R scripts. In the next chapter, we’ll explore how to use projects and organize your code into files and directories.