Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Jihong Zhang
September 4, 2018
This blog is trying to elaborate steps for cleaning the data. Since datasets varied, this blog could not cover all. Depedent on the data you’re using, different methods should be used.
I use tidyverse as my main tool to clean the data. Tidyverse is a very useful package created by Hadley. It includes several sub-packages, such as dplyr (data manipulation), tidyr (data transforming), readr (data import), ggplot2 (data visulization) etc. If you haven’t installed this package yet, please run install.packages("tidyverse")
in your R console.
If your data is csv format, you could use read.csv()
to import the data into R. Be careful to add stringsAsFactors = FALSE
argument in the function, or all string variables will automate convert to factor by default. This will lead to some issues when you do further checking. # Configuration of Academic # Documentation: https://sourcethemes.com/academic/ # # This file is formatted using TOML syntax - learn more at https://learnxinyminutes.com/docs/toml/ # Each configuration section is defined by a name in square brackets (e.g. [outputs]
).
Perform some initial check before doing further manipulation. This step is to let you get familiar with you data and have a big picture on what you need to do next.
str()
is a very useful fuction in R base package which provides you sample sizes, number of variabels, variables names, variabile types and several sample responses. In the sample data, there are 81,525 observations and 23 variables.
'data.frame': 81525 obs. of 23 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ name : chr "Duce Staley" "Lamar Smith" "Tiki Barber" "Stephen Davis" ...
$ team : chr "PHI" "MIA" "NYG" "WAS" ...
$ game_year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ game_week : int 1 1 1 1 1 1 1 1 1 1 ...
$ rush_att : int 26 27 13 23 28 27 30 14 15 10 ...
$ rush_yds : int 201 145 144 133 124 119 110 109 88 87 ...
$ rush_avg : num 7.7 5.4 11.1 5.8 4.4 4.4 3.7 7.8 5.9 8.7 ...
$ rush_tds : int 1 1 2 1 1 0 1 0 0 1 ...
$ rush_fumbles: int 0 1 0 0 1 0 0 0 0 0 ...
$ rec : int 4 1 3 4 6 4 6 2 2 NA ...
$ rec_yds : int 61 12 25 37 40 32 34 3 20 NA ...
$ rec_avg : num 15.3 12 8.3 9.3 6.7 8 5.7 1.5 10 NA ...
$ rec_tds : int 0 0 0 0 1 0 1 0 0 NA ...
$ rec_fumbles : int 0 0 0 0 0 0 0 0 0 NA ...
$ pass_att : int NA NA NA NA NA NA NA NA NA 41 ...
$ pass_yds : int NA NA NA NA NA NA NA NA NA 290 ...
$ pass_tds : int NA NA NA NA NA NA NA NA NA 2 ...
$ int : int NA NA NA NA NA NA NA NA NA 0 ...
$ sck : int NA NA NA NA NA NA NA NA NA 2 ...
$ pass_fumbles: int NA NA NA NA NA NA NA NA NA 0 ...
$ rate : num NA NA NA NA NA ...
$ position : chr "RB" "RB" "RB" "RB" ...
Missing values could be checked one by one variables or in case-level. Knowing which one or more variablea have high missing values will help you think about the reasons.
X name team game_year
Min. : 1 Length:81525 Length:81525 Min. :2000
1st Qu.:20382 Class :character Class :character 1st Qu.:2004
Median :40763 Mode :character Mode :character Median :2009
Mean :40763 Mean :2009
3rd Qu.:61144 3rd Qu.:2013
Max. :81525 Max. :2017
game_week rush_att rush_yds rush_avg
Min. : 1.000 Min. : 0.00 Min. :-34.00 Min. :-34.00
1st Qu.: 4.000 1st Qu.: 1.00 1st Qu.: 4.00 1st Qu.: 2.00
Median : 9.000 Median : 4.00 Median : 15.00 Median : 3.50
Mean : 8.592 Mean : 6.96 Mean : 28.91 Mean : 4.05
3rd Qu.:13.000 3rd Qu.:10.00 3rd Qu.: 42.00 3rd Qu.: 5.30
Max. :16.000 Max. :43.00 Max. :296.00 Max. : 77.00
NA's :47710 NA's :47710 NA's :47710
rush_tds rush_fumbles rec rec_yds
Min. :0.00 Min. :0.00 Min. : 0.000 Min. :-22.00
1st Qu.:0.00 1st Qu.:0.00 1st Qu.: 1.000 1st Qu.: 8.00
Median :0.00 Median :0.00 Median : 2.000 Median : 21.00
Mean :0.21 Mean :0.08 Mean : 2.716 Mean : 31.23
3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.: 4.000 3rd Qu.: 45.00
Max. :5.00 Max. :3.00 Max. :21.000 Max. :329.00
NA's :47710 NA's :47710 NA's :16722 NA's :16722
rec_avg rec_tds rec_fumbles pass_att
Min. :-22.00 Min. :0.000 Min. :0.000 Min. : 0.00
1st Qu.: 5.10 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:21.00
Median : 9.00 Median :0.000 Median :0.000 Median :30.00
Mean : 10.15 Mean :0.188 Mean :0.032 Mean :27.64
3rd Qu.: 13.50 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:37.00
Max. : 99.00 Max. :4.000 Max. :2.000 Max. :68.00
NA's :16722 NA's :16722 NA's :16722 NA's :71044
pass_yds pass_tds int sck
Min. :-11.0 Min. :0.00 Min. :0.0 Min. : 0.00
1st Qu.:126.0 1st Qu.:0.00 1st Qu.:0.0 1st Qu.: 1.00
Median :204.0 Median :1.00 Median :1.0 Median : 2.00
Mean :193.1 Mean :1.16 Mean :0.8 Mean : 1.88
3rd Qu.:267.0 3rd Qu.:2.00 3rd Qu.:1.0 3rd Qu.: 3.00
Max. :527.0 Max. :7.00 Max. :7.0 Max. :12.00
NA's :71044 NA's :71044 NA's :71044 NA's :71044
pass_fumbles rate position
Min. :0.00 Min. : 0.00 Length:81525
1st Qu.:0.00 1st Qu.: 60.20 Class :character
Median :0.00 Median : 82.10 Mode :character
Mean :0.15 Mean : 81.57
3rd Qu.:0.00 3rd Qu.:103.20
Max. :3.00 Max. :158.30
NA's :71044 NA's :71044
In this sample, variables rush_att, rush_yds, rush_avg, rush_tds, rush_fumbles have 47,710 missing values; variables rec, rec_yds, rec_avg, rec_tds, rec_fumbles have 16,722 missing values, variables pass_att, pass_yds, pass_tds, int, sck, pass_fumbles, rate have 71,044 missing values.
Also, look at the range (Min., Max.) of variables in summary
output, sometimes the responses may exceed expected range. You may have to delete those cases in this situation.
Sometimes you may find the second row of data including some information you don’t want. For example, Qualtrics survey data will put some background log information in row 2. Thus, it’s better to check the head and tail of dataset. You can use functions below:
X name team game_year game_week rush_att rush_yds rush_avg rush_tds
1 1 Duce Staley PHI 2000 1 26 201 7.7 1
2 2 Lamar Smith MIA 2000 1 27 145 5.4 1
3 3 Tiki Barber NYG 2000 1 13 144 11.1 2
4 4 Stephen Davis WAS 2000 1 23 133 5.8 1
5 5 Edgerrin James IND 2000 1 28 124 4.4 1
6 6 Priest Holmes BAL 2000 1 27 119 4.4 0
rush_fumbles rec rec_yds rec_avg rec_tds rec_fumbles pass_att pass_yds
1 0 4 61 15.3 0 0 NA NA
2 1 1 12 12.0 0 0 NA NA
3 0 3 25 8.3 0 0 NA NA
4 0 4 37 9.3 0 0 NA NA
5 1 6 40 6.7 1 0 NA NA
6 0 4 32 8.0 0 0 NA NA
pass_tds int sck pass_fumbles rate position
1 NA NA NA NA NA RB
2 NA NA NA NA NA RB
3 NA NA NA NA NA RB
4 NA NA NA NA NA RB
5 NA NA NA NA NA RB
6 NA NA NA NA NA RB
X name team game_year game_week rush_att rush_yds rush_avg
81520 81520 Jared Goff LA 2017 15 NA NA NA
81521 81521 Andy Dalton CIN 2017 15 NA NA NA
81522 81522 Trevor Siemian DEN 2017 15 NA NA NA
81523 81523 A.J. McCarron CIN 2017 15 NA NA NA
81524 81524 Derek Anderson CAR 2017 16 NA NA NA
81525 81525 Johnny Hekker LA 2017 16 NA NA NA
rush_tds rush_fumbles rec rec_yds rec_avg rec_tds rec_fumbles pass_att
81520 NA NA NA NA NA NA NA 21
81521 NA NA NA NA NA NA NA 22
81522 NA NA NA NA NA NA NA 9
81523 NA NA NA NA NA NA NA 6
81524 NA NA NA NA NA NA NA 1
81525 NA NA NA NA NA NA NA 1
pass_yds pass_tds int sck pass_fumbles rate position
81520 120 2 1 2 0 93.4 QB
81521 113 0 2 3 0 27.3 QB
81522 67 0 1 2 0 39.8 QB
81523 19 0 0 0 0 56.9 QB
81524 0 0 0 0 0 39.6 WR/TE
81525 0 0 0 0 0 39.6 WR/TE
The R output provide first 6 cases and last 6 case. You can use head(dat1, 10)
to output first 10 cases.
After initial checks, some basic data cleaning could be done. You may want to select some variables you want and remove others. You can use select
function to do that:
I selected 3 variables (name, team, pass_att) from the dat1 and assign 3-variables data to “dat2”.
name team pass_att
1 Duce Staley PHI NA
2 Lamar Smith MIA NA
3 Tiki Barber NYG NA
4 Stephen Davis WAS NA
5 Edgerrin James IND NA
6 Priest Holmes BAL NA
To rename the variables’ names, you could use set_names()
from purrr packages.
Players Team Pass_Attribute
1 Duce Staley PHI NA
2 Lamar Smith MIA NA
3 Tiki Barber NYG NA
4 Stephen Davis WAS NA
5 Edgerrin James IND NA
6 Priest Holmes BAL NA
V1 V2 V3
1 Duce Staley PHI NA
2 Lamar Smith MIA NA
3 Tiki Barber NYG NA
4 Stephen Davis WAS NA
5 Edgerrin James IND NA
6 Priest Holmes BAL NA
If you prefer cases with no missing cases at all. You can use the function below.
name team pass_att
Length:10481 Length:10481 Min. : 0.00
Class :character Class :character 1st Qu.:21.00
Mode :character Mode :character Median :30.00
Mean :27.64
3rd Qu.:37.00
Max. :68.00
Or if you want remove the cases whose Pass_Attribute is missing, you can use:
name team pass_att
Length:10481 Length:10481 Min. : 0.00
Class :character Class :character 1st Qu.:21.00
Mode :character Mode :character Median :30.00
Mean :27.64
3rd Qu.:37.00
Max. :68.00
To be continued…
---
title: How to do Data Cleaning in R
author: Jihong Zhang
date: '2018-09-04'
categories:
- blog
- R
- tidyverse
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
> This blog is trying to elaborate steps for cleaning the data. Since datasets varied, this blog could not cover all. Depedent on the data you're using, different methods should be used.
## Libraries
I use tidyverse as my main tool to clean the data. Tidyverse is a very useful package created by Hadley. It includes several sub-packages, such as dplyr (data manipulation), tidyr (data transforming), readr (data import), ggplot2 (data visulization) etc. If you haven't installed this package yet, please run `install.packages("tidyverse")` in your R console.
```{r , results='hide', message=FALSE }
library(tidyverse)
```
## Step 1: Import Data
If your data is csv format, you could use `read.csv()` to import the data into R. Be careful to add `stringsAsFactors = FALSE` argument in the function, or all string variables will automate convert to factor by default. This will lead to some issues when you do further checking. \# Configuration of Academic \# Documentation: <https://sourcethemes.com/academic/> \# \# This file is formatted using TOML syntax - learn more at <https://learnxinyminutes.com/docs/toml/> \# Each configuration section is defined by a name in square brackets (e.g. `[outputs]`).
```{r}
dat1 <- read.csv("nfl_2010-2017.csv", stringsAsFactors = FALSE)
```
## Step 2: Initial Check
Perform some initial check before doing further manipulation. This step is to let you get familiar with you data and have a big picture on what you need to do next.
### Step 2.1: check variables
`str()` is a very useful fuction in R base package which provides you sample sizes, number of variabels, variables names, variabile types and several sample responses. In the sample data, there are 81,525 observations and 23 variables.
```{r}
str(dat1)
```
### step 2.2: check missing values and ranges
Missing values could be checked one by one variables or in case-level. Knowing which one or more variablea have high missing values will help you think about the reasons.
```{r}
summary(dat1)
```
In this sample, variables **rush_att, rush_yds, rush_avg, rush_tds, rush_fumbles** have 47,710 missing values; variables **rec, rec_yds, rec_avg, rec_tds, rec_fumbles** have 16,722 missing values, variables **pass_att, pass_yds, pass_tds, int, sck, pass_fumbles, rate** have 71,044 missing values.
Also, look at the range (Min., Max.) of variables in `summary` output, sometimes the responses may exceed expected range. You may have to delete those cases in this situation.
### step 2.3: check first and last cases
Sometimes you may find the second row of data including some information you don't want. For example, Qualtrics survey data will put some background log information in row 2. Thus, it's better to check the head and tail of dataset. You can use functions below:
```{r}
head(dat1)
tail(dat1)
```
The R output provide first 6 cases and last 6 case. You can use `head(dat1, 10)` to output first 10 cases.
## Step 3: Select and rename Variables
After initial checks, some basic data cleaning could be done. You may want to select some variables you want and remove others. You can use `select` function to do that:
```{r}
dat2 <- dat1 %>% select(name, team, pass_att)
```
I selected 3 variables (name, team, pass_att) from the dat1 and assign 3-variables data to "dat2".
```{r}
head(dat2)
```
To rename the variables' names, you could use `set_names()` from purrr packages.
```{r}
dat2 %>% purrr::set_names(nm = "Players", "Team", "Pass_Attribute") %>% head()
dat2 %>% purrr::set_names(nm = "V1", "V2", "V3") %>% head()
```
## Step 4: Remove missing values
If you prefer cases with no missing cases at all. You can use the function below.
```{r}
dat_complete_cases <- dat2 %>%
filter_all(all_vars(!is.na(.)))
summary(dat_complete_cases)
```
Or if you want remove the cases whose **Pass_Attribute** is missing, you can use:
```{r}
dat_partialcomplete_cases <- dat2 %>%
filter(!is.na(pass_att))
summary(dat_complete_cases)
```
To be continued...