Lecture 01: Basics of R

Getting Started

Jihong Zhang*, Ph.D

Educational Statistics and Research Methods (ESRM) Program*

University of Arkansas

2024-10-09

Today’s Class

  1. Why use R?
    1. Brief history of R
    2. Main features of R
  2. Installation of R
  3. How to use RStudio

Why R?

Brief History

  • 1975-1976: S (Book: A Brief History of S) grew up in the statistics research departments (John Chambers and others) at Bell Laboratories

    • To bring interactive computing to bear on statistics and data analysis problems
  • 1993: Prof. Ross Ihaka and Robert Gentleman from the University of Auckland posted the first binary file of R, which they developed to teach introductory statistics

  • 1995: Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU General Public License to make R free software

  • 1997: The Comprehensive R Archive Network (CRAN) was founded by Kurt Hornik and Friedrich Leisch to host R’s source code, executable files, documentation, and user-created packages

  • 2000: the first official 1.0 version of R was released

  • 2024: R ver. 4.4.1

Example of S Language

X = 1:5 # A vector of numbers from 1 to 5
X[c(TRUE, TRUE, TRUE, FALSE, FALSE)]
[1] 1 2 3
X[1:3]
[1] 1 2 3
X[-1:3]
Error in X[-1:3]: only 0's may be mixed with negative subscripts
X[-(1:3)]
[1] 4 5
X[NULL]
integer(0)
X[NA]
[1] NA NA NA NA NA
X[]
[1] 1 2 3 4 5
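
The error from X[-1:3] comes from operator precedence: unary minus binds more tightly than the colon operator, so -1:3 means (-1):3. A minimal sketch of the difference (the variable names mixed_index and negative_index are chosen here for illustration only):

```r
X <- 1:5

# Unary minus is applied before ":", so this is the sequence -1, 0, 1, 2, 3.
# It mixes negative and positive subscripts, which R refuses to index with.
mixed_index <- -1:3
print(mixed_index)

# Parentheses build 1:3 first and then negate it: -1, -2, -3.
negative_index <- -(1:3)
print(negative_index)

# Negative subscripts drop elements, leaving the 4th and 5th values.
print(X[negative_index])
```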

Main Feature of R

  1. R was developed by statisticians as an interactive environment for data analysis, unlike C or Java, which were created for software development.
  2. The interactivity of R is an indispensable feature in data science.
  3. However, like in other programming languages, you can save your work in R as scripts that can be easily executed at any moment.
  4. If you are an expert programmer, you should not expect R to follow the conventions you are used to since you will be disappointed.

Attractive Features of R

  • R is free and open source.
  • It runs on all major platforms: Windows, MacOS, UNIX/Linux.
  • Scripts and data objects can be shared seamlessly across platforms.
  • There is a large, growing, and active community of R users and, as a result, there are numerous resources for learning and asking questions.
  • It is easy for others to contribute add-ons, which enables developers to share software implementations of new data science methodologies. This gives R users early access to the latest methods and to tools developed for a wide variety of disciplines, including ecology, molecular biology, social sciences, and geography, to name a few.

Get started

R console

  • One way of using R is to simply start the R console on your computer (PC).

    • On a Mac, after installing R, simply type “R” in the terminal to get started

As a quick example, try using the console to calculate a 15% tip on a meal that cost $19.71:

0.15 * 19.71

RStudio (its developer is now called Posit)

Installation of R and RStudio

  • You can download and install R base via r-project.org (currently R-4.4.1)

  • Then, after the installation of R, you can download RStudio via posit.co (currently)

  • After installing R and RStudio, you can open RStudio to start programming in R.

    • However, a fresh installation of R only includes the base packages

    • To extend its utility, most users install additional R packages for specific purposes

R packages

  • R packages are uploaded to platforms such as CRAN or GitHub by researchers or companies

    • Those R packages have version numbers. A function may be available in one version (e.g., Ver. 1.1) but not in other versions.

    • Do not upgrade your packages if your code is running well

  • R users are free to download and use those R packages

    • To download a package, you need to know its name

    • For example, if you want to download the latest version of the tidyverse package, you can type the following command in the Console pane of RStudio

    install.packages("tidyverse")
    • Or, if you want to install an older version of a package
    library(devtools) # provides install_version()
    install_version("tidyverse", version = "1.3.0", repos = "http://cran.us.r-project.org")

More about R packages

  • CRAN (Comprehensive R Archive Network) is a network of servers around the world that store identical, up-to-date, versions of code and documentation for R.
    • It contains the most stable versions of packages.
    • Most of the time, we download packages from CRAN
  • GitHub is used for the fast-moving development of R packages
    • It hosts the latest development versions of packages, which may be unstable

    • You can download a package from GitHub using the pak package

      pak::pak("tidyverse/ggplot2")
    • You can update the package and its dependencies

      pak::pkg_install("ggplot2", upgrade = TRUE)

R functions

  • To perform certain tasks, you need to use functions contained in R packages

    • There are two ways of using R functions

    • Direct way: call the function with the package::function syntax, so you don’t have to load the package first

    • Use-after-load way: load the package into your session first, then call the function by name without specifying the package name

      library("ggplot2")
      ggplot() +
        geom_point(aes(x = 1:100, y = 100:1), color = "tomato")
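
The two calling styles can be compared side by side. The sketch below uses the tools package only because it ships with R but is not attached by default; the file name analysis.R is made up for illustration.

```r
# Direct way: package::function works without calling library() first.
ext_direct <- tools::file_ext("analysis.R")
print(ext_direct)

# Use-after-load way: attach the package, then call the function by name.
library("tools")
ext_loaded <- file_ext("analysis.R")
print(ext_loaded)
```

The direct form is handy for a one-off call and makes it explicit which package a function comes from; the use-after-load form is shorter when you call many functions from the same package.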

R functions (Cont.)

  • How do you know whether you have already loaded a package?

  • You can use the sessionInfo function

    sessionInfo()
  • It outputs several pieces of information:

    • R version, operating system, matrix operation libraries, locale

    • Attached packages (you can call the functions of those packages directly by name)

    • Packages loaded via a namespace (and not attached): you cannot call their functions by name alone, so you need to attach them with library or require first
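
The difference between "attached" and "loaded via a namespace" can also be checked programmatically. This is a minimal sketch: is_attached() and is_loaded() are hypothetical helper names defined here, built on the base functions search() and isNamespaceLoaded().

```r
# Hypothetical helpers, not part of base R.
# search() lists attached packages as "package:<name>".
is_attached <- function(pkg) paste0("package:", pkg) %in% search()
is_loaded <- function(pkg) isNamespaceLoaded(pkg)

# Load the namespace of 'tools' without attaching it.
loadNamespace("tools")
print(is_loaded("tools"))

# Its functions are still reachable with the package::function syntax.
print(tools::file_ext("notes.qmd"))

# After library(), the package is attached and functions work by name alone.
library("tools")
print(is_attached("tools"))
```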

Execute R code

  • After you finish an R script, there are multiple ways to execute the code and print the results to the Console:

    • Method 1: click the Run button at the top right of the RStudio editor pane

    • Method 2: select some code and press Ctrl + Enter (Win) or Command + Return (Mac)

    • Method 3: run Rscript [FILENAME].R in a terminal to execute the whole script

    • Method 4: use an R notebook to run R code interactively

                        Script file is .R      Script file is .rmd or .qmd
Run the whole script    Method 1, Method 3     Method 4
Run the partial script  Method 2               Method 4
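
As a related sketch (not one of the four methods listed above), a whole script can also be run from inside an R session with the base function source(). The script below is written to a temporary file purely for illustration and reuses the tip calculation from earlier.

```r
# Write a two-line script to a temporary .R file (illustrative content).
script_path <- tempfile(fileext = ".R")
writeLines(
  c("tip <- 0.15 * 19.71",
    "print(tip)"),
  script_path
)

# Run the whole script; objects it creates end up in the global environment.
source(script_path)

# The object created by the script is now available in the session.
print(tip)
```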

Introduce Rstudio

  • RStudio will be our launching pad for data science projects. It not only provides an editor for us to create and edit our scripts but also provides many other useful tools.

  • When you start RStudio for the first time, you will see three panes:

    • The left pane shows the Code editor (will show when you create a new file) and R console.

    • On the right, the top pane includes tabs such as Environment and History, while the bottom pane shows five tabs: Files, Plots, Packages, Help, and Viewer.

  • To start a new script in Code editor, you can click on File > New File > R Script.

Key Binding

  • For efficient coding, we highly recommend that you memorize the key bindings for the operations you use most.

  • RStudio provides a useful cheat sheet with the most widely used commands

  • To open the cheat sheet, go to Help > Cheat Sheets > RStudio IDE Cheat Sheet

Global Option

  • You can change the look and functionality of RStudio quite a bit.

  • To change the global options, click on Tools, then Global Options.

  • As an example we show how to make a change that we highly recommend:

    • General > Basic > Workspace: Change Save workspace to .RData on exit to Never.

    • General > Basic > Workspace: Uncheck Restore .RData into workspace at startup.

    • Code > Editing: Check Use native pipe operator, |>
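
For context on the last option, the native pipe |> (available since R 4.1.0) passes the value on its left as the first argument of the function call on its right. A minimal sketch with made-up numbers:

```r
# Nested call: read from the inside out.
nested_total <- sum(sqrt(c(1, 4, 9)))

# Same computation with the native pipe: read from left to right.
piped_total <- c(1, 4, 9) |> sqrt() |> sum()

print(nested_total)
print(piped_total)
```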

.RData file

  • By default, when you exit, R saves all the objects you have created into a file called .RData.

  • This is done so that when you restart the session in the same folder, it will load these objects.

  • We find that this causes confusion especially when we share code with colleagues and assume they have this .RData file.

Installing R Packages: from CRAN

  • For example, to install the dslabs package, you would type the following in your console:

    install.packages("dslabs") # the package name must be in quotes
  • We can then load the package into our R session using the library function in the R script file:

    library("dslabs")
    head(admissions)
      major gender admitted applicants
    1     A    men       62        825
    2     B    men       63        560
    3     C    men       37        325
    4     D    men       33        417
    5     E    men       28        191
    6     F    men        6        373
  • As you go through this class, you will see that we load packages without installing them. This is because once you install a package, it remains installed and only needs to be loaded with library.

  • We can install more than one package at once by feeding a character vector to this function:

    install.packages(c("dplyr", "dslabs"))
  • You can see all the packages you have installed using the following function:

    installed.packages()
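
Because a package only needs to be installed once, it can be useful to check what is missing before calling install.packages(). The sketch below assumes a hypothetical helper name, find_missing_packages(); the install call itself is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: return the names in 'pkgs' that are not yet installed.
find_missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Packages used in this lecture.
to_install <- find_missing_packages(c("dplyr", "dslabs"))
print(to_install)

# Install only what is missing (uncomment to actually run the installation).
# if (length(to_install) > 0) install.packages(to_install)
```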

Installing R Packages: from GitHub

  • You can also install user-built packages from GitHub

  • I built a package for this course: link

install.packages("remotes") # install one package called "remotes"
library("remotes") # load the package into your R session
install_github(repo = "JihongZ/ESRM6990V") # install one GitHub package from my GitHub repository
library(ESRM6990V) # load the package into your R session
jihong(details = TRUE) # call one function called "jihong" from the package

Let’s Practice

  1. Finish Exercise 1

R Package Structure

Basic Information

  1. What: An R package is a structured collection of R functions, data, and compiled code that is bundled together according to a specific format.
    • They can be thought of as libraries or modules in other programming languages.
  2. Why: R Packages are designed to add functionality to R, allowing users to perform specific tasks or analyses that are not covered by the basic installation of R.
  3. How: You can install/uninstall, create, load, and use R packages.
    • If you want to build or publish your own package, the Comprehensive R Archive Network (CRAN), Bioconductor, and GitHub are popular repositories where R packages are commonly published and maintained.

What an R package includes

  • Functions: A set of R functions that perform specific tasks, which are not available in the default R environment.

  • Data: Some packages include datasets that are useful for demonstrating functions within the package or for use in specific types of analysis.

  • Documentation: Every package comes with documentation that explains how the functions work, the data included (if any), and examples of how to use the package. This is often accessible via R help pages.

  • Vignettes: Many packages include vignettes, which are long-form documentation that shows how to use the package functions in a more detailed and contextual way, often in the form of tutorials.

  • Namespace: A namespace file that manages how functions from the package are imported and exported, helping avoid naming conflicts between different packages.

  • Meta-information: A DESCRIPTION file containing metadata about the package, such as its name, version, dependencies (other packages it requires to function), author, and license information.
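
Several of these components can be inspected from the console. The sketch below looks at the stats and datasets packages, chosen only because they ship with every R installation; the object names are illustrative.

```r
# Meta-information: fields read from the package's DESCRIPTION file.
desc <- packageDescription("stats")
print(desc$Package)
print(desc$Version)

# Namespace: the functions the package exports for users to call.
exported <- getNamespaceExports("stats")
print("median" %in% exported)

# Data: datasets bundled with a package (here, the 'datasets' package).
bundled_data <- data(package = "datasets")$results[, "Item"]
print(head(bundled_data))

# Documentation and vignettes are reachable via help(package = "stats")
# and vignette(package = "stats") in an interactive session.
```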

R Package states

  1. When you create or modify a package, you work on its “source code” or “source files”. You interact with the in-development package in its source form.
  2. To better understand packages, we need to know the five states of an R package:
    1. source
    2. bundled
    3. binary
    4. installed
    5. in-memory
  3. We already know two functions:
    1. install.packages() can move a package from source/bundled/binary into installed state.
    2. library() can load a package from the installed state into memory (in-memory state)
  4. What are the source/bundled/binary states, then? Why do they differ?

Source package

  1. A source package is just a directory of files with a specific structure, including:
    1. DESCRIPTION file
    2. R/ folder containing all .R files
  2. Many R packages on GitHub are in source state
    1. networkscore: https://github.com/JihongZ/networkscore
    2. esrm64503: https://github.com/JihongZ/ESRM64503
  3. You may also find a tar.gz file on a package’s CRAN landing page via the “Package source” field (this is the bundled state of the package). Decompressing the tar.gz file gives you the source directory, including R/ and DESCRIPTION
    1. forcats: https://cran.r-project.org/web/packages/forcats/index.html
    2. You can decompress it using commands in the terminal like:
tar xvf forcats_0.4.0.tar.gz
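
To make the source structure concrete, the sketch below builds a bare-bones source directory by hand in a temporary folder. The package name toypkg, the function hello(), and the DESCRIPTION field values are all made up for illustration; real packages usually contain more files.

```r
# Create the package directory with its R/ subfolder (illustrative name).
pkg_dir <- file.path(tempdir(), "toypkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# Minimal DESCRIPTION file holding the package metadata.
writeLines(
  c("Package: toypkg",
    "Title: A Toy Package for Illustration",
    "Version: 0.0.1",
    "Description: Shows the minimal layout of a source package.",
    "License: GPL-3"),
  file.path(pkg_dir, "DESCRIPTION")
)

# One .R file containing a function definition.
writeLines("hello <- function() 'Hello from toypkg'",
           file.path(pkg_dir, "R", "hello.R"))

# List the resulting source layout.
pkg_files <- list.files(pkg_dir, recursive = TRUE)
print(pkg_files)

# DESCRIPTION uses the DCF format, which read.dcf() can parse.
meta <- read.dcf(file.path(pkg_dir, "DESCRIPTION"))
print(meta[1, "Package"])
```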

Bundled package

  1. A bundled package is a package that has been compressed into a single file (this process is called building the package).

  2. By convention, package bundles in R use the extension .tar.gz and are sometimes referred to as “source tarballs”. In computing, this is known as the gzipped tar file format.

  3. A “source tarball” is not simply a compressed copy of the source directory. When a source directory is built into a bundle (.tar.gz), a few diagnostic checks and cleanups are performed. See more details here.

Binary package

  1. If you want to distribute your package to an R user who doesn’t have package development tools, you’ll need to provide a binary package. The main distributor of binary packages is CRAN.

  2. Like a package bundle, a binary package is a single file. Unlike a bundled package, a binary package is platform specific and there are two basic flavors: Windows and macOS.

  3. CRAN packages are usually available in binary form:

    • forcats for macOS: forcats_0.4.0.tgz
    • readxl for Windows: readxl_1.3.1.zip
  4. This is, indeed, part of what’s usually going on behind the scenes when you call install.packages().

  5. Uncompressing a binary file gives you a totally different file structure from that of a source/bundled package.

    • There are no .R files in the R/ directory - instead there are three files that store the parsed functions in an efficient file format.

Installed package

  1. An installed package is a binary package that has been decompressed into a package library.

  2. In practice, you don’t need to care about these states if you install popular packages, unless you have issues installing an R package via install.packages() or you install in-development packages.

In-memory package

  1. When we use the library() function, we load an installed package into the memory of R.
  2. This is the last step of using the package in our R task.
  3. When you call library(somepackage), R looks through the current libraries for an installed package named “somepackage” and, if successful, it makes somepackage available for use.
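
The "current libraries" that library() searches can be inspected directly. A minimal sketch using base functions; stats is used only because it is always installed, and notARealPackage12345 is a made-up name that should not exist.

```r
# Library directories R searches for installed packages, in order.
libs <- .libPaths()
print(libs)

# Where an installed package lives on disk.
stats_dir <- find.package("stats")
print(stats_dir)

# An installed package keeps its DESCRIPTION file.
has_description <- file.exists(file.path(stats_dir, "DESCRIPTION"))
print(has_description)

# system.file() returns an empty string when a package is not installed.
missing_path <- system.file(package = "notARealPackage12345")
print(nzchar(missing_path))
```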

Next Week

Preparation: Contribute to a GitHub R Package

  1. Make sure you have set up a GitHub account
  2. Make sure you have downloaded GitHub Desktop