How to do Data Cleaning in R

blog

tidyverse

Author

Jihong Zhang

Published

September 4, 2018

This blog is trying to elaborate steps for cleaning the data. Since datasets varied, this blog could not cover all. Depedent on the data you’re using, different methods should be used.

1 Libraries

I use tidyverse as my main tool to clean the data. Tidyverse is a very useful package created by Hadley. It includes several sub-packages, such as dplyr (data manipulation), tidyr (data transforming), readr (data import), ggplot2 (data visulization) etc. If you haven’t installed this package yet, please run install.packages("tidyverse") in your R console.

⌘+C

library(tidyverse)

Warning: package 'dplyr' was built under R version 4.2.3

Warning: package 'stringr' was built under R version 4.2.3

2 Step 1: Import Data

If your data is csv format, you could use read.csv() to import the data into R. Be careful to add stringsAsFactors = FALSE argument in the function, or all string variables will automate convert to factor by default. This will lead to some issues when you do further checking. # Configuration of Academic # Documentation: https://sourcethemes.com/academic/ # # This file is formatted using TOML syntax - learn more at https://learnxinyminutes.com/docs/toml/ # Each configuration section is defined by a name in square brackets (e.g. [outputs]).

⌘+C

dat1 <- read.csv("nfl_2010-2017.csv", stringsAsFactors = FALSE)

3 Step 2: Initial Check

Perform some initial check before doing further manipulation. This step is to let you get familiar with you data and have a big picture on what you need to do next.

3.1 Step 2.1: check variables

str() is a very useful fuction in R base package which provides you sample sizes, number of variabels, variables names, variabile types and several sample responses. In the sample data, there are 81,525 observations and 23 variables.

⌘+C

str(dat1)

'data.frame':   81525 obs. of  23 variables:
 $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ name        : chr  "Duce Staley" "Lamar Smith" "Tiki Barber" "Stephen Davis" ...
 $ team        : chr  "PHI" "MIA" "NYG" "WAS" ...
 $ game_year   : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ game_week   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ rush_att    : int  26 27 13 23 28 27 30 14 15 10 ...
 $ rush_yds    : int  201 145 144 133 124 119 110 109 88 87 ...
 $ rush_avg    : num  7.7 5.4 11.1 5.8 4.4 4.4 3.7 7.8 5.9 8.7 ...
 $ rush_tds    : int  1 1 2 1 1 0 1 0 0 1 ...
 $ rush_fumbles: int  0 1 0 0 1 0 0 0 0 0 ...
 $ rec         : int  4 1 3 4 6 4 6 2 2 NA ...
 $ rec_yds     : int  61 12 25 37 40 32 34 3 20 NA ...
 $ rec_avg     : num  15.3 12 8.3 9.3 6.7 8 5.7 1.5 10 NA ...
 $ rec_tds     : int  0 0 0 0 1 0 1 0 0 NA ...
 $ rec_fumbles : int  0 0 0 0 0 0 0 0 0 NA ...
 $ pass_att    : int  NA NA NA NA NA NA NA NA NA 41 ...
 $ pass_yds    : int  NA NA NA NA NA NA NA NA NA 290 ...
 $ pass_tds    : int  NA NA NA NA NA NA NA NA NA 2 ...
 $ int         : int  NA NA NA NA NA NA NA NA NA 0 ...
 $ sck         : int  NA NA NA NA NA NA NA NA NA 2 ...
 $ pass_fumbles: int  NA NA NA NA NA NA NA NA NA 0 ...
 $ rate        : num  NA NA NA NA NA ...
 $ position    : chr  "RB" "RB" "RB" "RB" ...

3.2 step 2.2: check missing values and ranges

Missing values could be checked one by one variables or in case-level. Knowing which one or more variablea have high missing values will help you think about the reasons.

⌘+C

summary(dat1)

       X             name               team             game_year   
 Min.   :    1   Length:81525       Length:81525       Min.   :2000  
 1st Qu.:20382   Class :character   Class :character   1st Qu.:2004  
 Median :40763   Mode  :character   Mode  :character   Median :2009  
 Mean   :40763                                         Mean   :2009  
 3rd Qu.:61144                                         3rd Qu.:2013  
 Max.   :81525                                         Max.   :2017  
                                                                     
   game_week         rush_att        rush_yds         rush_avg     
 Min.   : 1.000   Min.   : 0.00   Min.   :-34.00   Min.   :-34.00  
 1st Qu.: 4.000   1st Qu.: 1.00   1st Qu.:  4.00   1st Qu.:  2.00  
 Median : 9.000   Median : 4.00   Median : 15.00   Median :  3.50  
 Mean   : 8.592   Mean   : 6.96   Mean   : 28.91   Mean   :  4.05  
 3rd Qu.:13.000   3rd Qu.:10.00   3rd Qu.: 42.00   3rd Qu.:  5.30  
 Max.   :16.000   Max.   :43.00   Max.   :296.00   Max.   : 77.00  
                  NA's   :47710   NA's   :47710    NA's   :47710   
    rush_tds      rush_fumbles        rec            rec_yds      
 Min.   :0.00    Min.   :0.00    Min.   : 0.000   Min.   :-22.00  
 1st Qu.:0.00    1st Qu.:0.00    1st Qu.: 1.000   1st Qu.:  8.00  
 Median :0.00    Median :0.00    Median : 2.000   Median : 21.00  
 Mean   :0.21    Mean   :0.08    Mean   : 2.716   Mean   : 31.23  
 3rd Qu.:0.00    3rd Qu.:0.00    3rd Qu.: 4.000   3rd Qu.: 45.00  
 Max.   :5.00    Max.   :3.00    Max.   :21.000   Max.   :329.00  
 NA's   :47710   NA's   :47710   NA's   :16722    NA's   :16722   
    rec_avg          rec_tds       rec_fumbles       pass_att    
 Min.   :-22.00   Min.   :0.000   Min.   :0.000   Min.   : 0.00  
 1st Qu.:  5.10   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:21.00  
 Median :  9.00   Median :0.000   Median :0.000   Median :30.00  
 Mean   : 10.15   Mean   :0.188   Mean   :0.032   Mean   :27.64  
 3rd Qu.: 13.50   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:37.00  
 Max.   : 99.00   Max.   :4.000   Max.   :2.000   Max.   :68.00  
 NA's   :16722    NA's   :16722   NA's   :16722   NA's   :71044  
    pass_yds        pass_tds          int             sck       
 Min.   :-11.0   Min.   :0.00    Min.   :0.0     Min.   : 0.00  
 1st Qu.:126.0   1st Qu.:0.00    1st Qu.:0.0     1st Qu.: 1.00  
 Median :204.0   Median :1.00    Median :1.0     Median : 2.00  
 Mean   :193.1   Mean   :1.16    Mean   :0.8     Mean   : 1.88  
 3rd Qu.:267.0   3rd Qu.:2.00    3rd Qu.:1.0     3rd Qu.: 3.00  
 Max.   :527.0   Max.   :7.00    Max.   :7.0     Max.   :12.00  
 NA's   :71044   NA's   :71044   NA's   :71044   NA's   :71044  
  pass_fumbles        rate          position        
 Min.   :0.00    Min.   :  0.00   Length:81525      
 1st Qu.:0.00    1st Qu.: 60.20   Class :character  
 Median :0.00    Median : 82.10   Mode  :character  
 Mean   :0.15    Mean   : 81.57                     
 3rd Qu.:0.00    3rd Qu.:103.20                     
 Max.   :3.00    Max.   :158.30                     
 NA's   :71044   NA's   :71044

In this sample, variables rush_att, rush_yds, rush_avg, rush_tds, rush_fumbles have 47,710 missing values; variables rec, rec_yds, rec_avg, rec_tds, rec_fumbles have 16,722 missing values, variables pass_att, pass_yds, pass_tds, int, sck, pass_fumbles, rate have 71,044 missing values.

Also, look at the range (Min., Max.) of variables in summary output, sometimes the responses may exceed expected range. You may have to delete those cases in this situation.

3.3 step 2.3: check first and last cases

Sometimes you may find the second row of data including some information you don’t want. For example, Qualtrics survey data will put some background log information in row 2. Thus, it’s better to check the head and tail of dataset. You can use functions below:

⌘+C

head(dat1)

  X           name team game_year game_week rush_att rush_yds rush_avg rush_tds
1 1    Duce Staley  PHI      2000         1       26      201      7.7        1
2 2    Lamar Smith  MIA      2000         1       27      145      5.4        1
3 3    Tiki Barber  NYG      2000         1       13      144     11.1        2
4 4  Stephen Davis  WAS      2000         1       23      133      5.8        1
5 5 Edgerrin James  IND      2000         1       28      124      4.4        1
6 6  Priest Holmes  BAL      2000         1       27      119      4.4        0
  rush_fumbles rec rec_yds rec_avg rec_tds rec_fumbles pass_att pass_yds
1            0   4      61    15.3       0           0       NA       NA
2            1   1      12    12.0       0           0       NA       NA
3            0   3      25     8.3       0           0       NA       NA
4            0   4      37     9.3       0           0       NA       NA
5            1   6      40     6.7       1           0       NA       NA
6            0   4      32     8.0       0           0       NA       NA
  pass_tds int sck pass_fumbles rate position
1       NA  NA  NA           NA   NA       RB
2       NA  NA  NA           NA   NA       RB
3       NA  NA  NA           NA   NA       RB
4       NA  NA  NA           NA   NA       RB
5       NA  NA  NA           NA   NA       RB
6       NA  NA  NA           NA   NA       RB

⌘+C

tail(dat1)

          X           name team game_year game_week rush_att rush_yds rush_avg
81520 81520     Jared Goff   LA      2017        15       NA       NA       NA
81521 81521    Andy Dalton  CIN      2017        15       NA       NA       NA
81522 81522 Trevor Siemian  DEN      2017        15       NA       NA       NA
81523 81523  A.J. McCarron  CIN      2017        15       NA       NA       NA
81524 81524 Derek Anderson  CAR      2017        16       NA       NA       NA
81525 81525  Johnny Hekker   LA      2017        16       NA       NA       NA
      rush_tds rush_fumbles rec rec_yds rec_avg rec_tds rec_fumbles pass_att
81520       NA           NA  NA      NA      NA      NA          NA       21
81521       NA           NA  NA      NA      NA      NA          NA       22
81522       NA           NA  NA      NA      NA      NA          NA        9
81523       NA           NA  NA      NA      NA      NA          NA        6
81524       NA           NA  NA      NA      NA      NA          NA        1
81525       NA           NA  NA      NA      NA      NA          NA        1
      pass_yds pass_tds int sck pass_fumbles rate position
81520      120        2   1   2            0 93.4       QB
81521      113        0   2   3            0 27.3       QB
81522       67        0   1   2            0 39.8       QB
81523       19        0   0   0            0 56.9       QB
81524        0        0   0   0            0 39.6    WR/TE
81525        0        0   0   0            0 39.6    WR/TE

The R output provide first 6 cases and last 6 case. You can use head(dat1, 10) to output first 10 cases.

4 Step 3: Select and rename Variables

After initial checks, some basic data cleaning could be done. You may want to select some variables you want and remove others. You can use select function to do that:

⌘+C

dat2 <- dat1 %>% select(name, team, pass_att)

I selected 3 variables (name, team, pass_att) from the dat1 and assign 3-variables data to “dat2”.

⌘+C

head(dat2)

            name team pass_att
1    Duce Staley  PHI       NA
2    Lamar Smith  MIA       NA
3    Tiki Barber  NYG       NA
4  Stephen Davis  WAS       NA
5 Edgerrin James  IND       NA
6  Priest Holmes  BAL       NA

To rename the variables’ names, you could use set_names() from purrr packages.

⌘+C

dat2 %>% purrr::set_names(nm = "Players", "Team", "Pass_Attribute") %>% head()

         Players Team Pass_Attribute
1    Duce Staley  PHI             NA
2    Lamar Smith  MIA             NA
3    Tiki Barber  NYG             NA
4  Stephen Davis  WAS             NA
5 Edgerrin James  IND             NA
6  Priest Holmes  BAL             NA

⌘+C

dat2 %>% purrr::set_names(nm = "V1", "V2", "V3") %>% head()

              V1  V2 V3
1    Duce Staley PHI NA
2    Lamar Smith MIA NA
3    Tiki Barber NYG NA
4  Stephen Davis WAS NA
5 Edgerrin James IND NA
6  Priest Holmes BAL NA

5 Step 4: Remove missing values

If you prefer cases with no missing cases at all. You can use the function below.

⌘+C

dat_complete_cases <- dat2 %>% 
  filter_all(all_vars(!is.na(.)))

summary(dat_complete_cases)

     name               team              pass_att    
 Length:10481       Length:10481       Min.   : 0.00  
 Class :character   Class :character   1st Qu.:21.00  
 Mode  :character   Mode  :character   Median :30.00  
                                       Mean   :27.64  
                                       3rd Qu.:37.00  
                                       Max.   :68.00

Or if you want remove the cases whose Pass_Attribute is missing, you can use:

⌘+C

dat_partialcomplete_cases <- dat2 %>% 
  filter(!is.na(pass_att))

summary(dat_complete_cases)

     name               team              pass_att    
 Length:10481       Length:10481       Min.   : 0.00  
 Class :character   Class :character   1st Qu.:21.00  
 Mode  :character   Mode  :character   Median :30.00  
                                       Mean   :27.64  
                                       3rd Qu.:37.00  
                                       Max.   :68.00

To be continued…