Lecture 01: Basics of R

Getting Started

Jihong Zhang, Ph.D

Educational Statistics and Research Methods (ESRM) Program

University of Arkansas

2024-10-09

Today’s Class

  1. Why use R?
    1. Brief history of R
    2. Main features of R
  2. Installation of R
  3. How to use RStudio

Why R?

Brief History

  • 1975-1976: S (Book: A Brief History of S) grew up in the statistics research department (John Chambers and others) at Bell Laboratories

    • Its goal was to bring interactive computing to bear on statistics and data analysis problems
  • 1993: Prof. Ross Ihaka and Robert Gentleman from the University of Auckland posted the first binary file of R, developed for teaching introductory statistics

  • 1995: Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU General Public License to make R free software

  • 1997: The Comprehensive R Archive Network (CRAN) was founded by Kurt Hornik and Friedrich Leisch to host R’s source code, executable files, documentation, and user-created packages

  • 2000: the first official 1.0 version of R was released

  • 2024: I am using ver. 4.2.1

Example of S Language: the Subscripting Operator / Indexing

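X = 1:5 # A vector of numbers from 1 to 5
X[c(TRUE, TRUE, TRUE, FALSE, FALSE)] # logical index: keeps elements 1 to 3
X[1:3]       # positive index: elements 1 to 3
try(X[-1:3]) # error: -1:3 is c(-1, 0, 1, 2, 3), and negative and positive subscripts cannot be mixed
X[-(1:3)]    # negative index: drops elements 1 to 3, leaving 4 5
X[NULL]      # integer(0): nothing is selected
X[NA]        # NA NA NA NA NA: the logical NA is recycled over all five positions
X[]          # empty index: returns the whole vector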

Main Features of R

  1. R was developed by statisticians as an interactive environment for data analysis, unlike C or Java, which were created for software development.
  2. The interactivity of R is an indispensable feature in data science.
  3. However, as in other programming languages, you can save your work in R as scripts that can be easily executed at any moment.
  4. If you are an expert programmer, do not expect R to follow the conventions you are used to, or you will be disappointed.

Attractive Features of R

  • R is free and open source.
  • It runs on all major platforms: Windows, macOS, UNIX/Linux.
  • Scripts and data objects can be shared seamlessly across platforms.
  • There is a large, growing, and active community of R users and, as a result, there are numerous resources for learning and asking questions.
  • It is easy for others to contribute add-ons, which enables developers to share software implementations of new data science methodologies. This gives R users early access to the latest methods and to tools developed for a wide variety of disciplines, including ecology, molecular biology, social sciences, and geography, to name a few.

Get started

R console

  • One way of using R is to simply start the R console on your computer.

    • On a Mac, after installing R, simply type “R” in the terminal to get started

As a quick example, try using the console to calculate a 15% tip on a meal that cost $19.71:

0.15 * 19.71
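The same calculation can also be written with named objects, which makes it easy to rerun with a different bill. This is only a small sketch; the object names meal_cost, tip_rate, and tip are illustrative and not part of the lecture.

```r
# Illustrative object names (not from the lecture)
meal_cost <- 19.71   # cost of the meal in dollars
tip_rate  <- 0.15    # a 15% tip

tip <- meal_cost * tip_rate   # same value as typing 0.15 * 19.71
round(tip, 2)                 # tip rounded to cents: 2.96
round(meal_cost + tip, 2)     # total bill rounded to cents
```

Changing meal_cost and rerunning the last three lines gives the tip for a new bill without retyping the formula.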

RStudio (developed by Posit, the company formerly called RStudio)

Introducing RStudio

  • RStudio will be our launching pad for data science projects. It not only provides an editor for us to create and edit our scripts but also provides many other useful tools.

  • When you start RStudio for the first time, you will see three panes:

    • The left pane shows the R console; the code editor appears above it once you create a new file.

    • On the right, the top pane includes tabs such as Environment and History, while the bottom pane shows five tabs: Files, Plots, Packages, Help, and Viewer.

  • To start a new script in Code editor, you can click on File > New File > R Script.

Key Binding

  • For efficient coding, we highly recommend that you memorize key bindings for the operations you use most.

  • RStudio provides a useful cheat sheet with the most widely used commands.

  • To open the cheat sheet, go to Help > Cheat Sheets > RStudio IDE Cheat Sheet.

Global Options

  • You can change the look and functionality of RStudio quite a bit.

  • To change the global options, click on Tools, then Global Options.

  • As an example, here are the changes that we highly recommend:

    • General > Basic > Workspace: Change Save workspace to .RData on exit to Never.

    • General > Basic > Workspace: Uncheck Restore .RData into workspace at startup.

    • Code > Editing: Check Use native pipe operator, |>.

.RData file

  • By default, when you exit, R saves all the objects you have created into a file called .RData.

  • This is done so that when you restart the session in the same folder, it will load these objects.

  • We find that this causes confusion especially when we share code with colleagues and assume they have this .RData file.
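To see why a restored workspace can be confusing, here is a minimal sketch of the mechanism, using save() and load(), the same functions that sit behind the .RData file. The object name scores and the file name example.RData are made up for illustration, and the file is written to a temporary folder rather than to your project.

```r
# Hypothetical object and file name, for illustration only
scores <- c(90, 85, 77)
rdata_file <- file.path(tempdir(), "example.RData")

save(scores, file = rdata_file)   # write the object to an .RData-style file
rm(scores)                        # remove it, as if we had started a new session

load(rdata_file)                  # the object reappears...
scores                            # ...although no line of code in a script created it
```

A colleague who receives only your script, and not the file, would not have scores at all, which is why we recommend creating every object in the script itself.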

Installing R Packages

  • The functionality provided by a fresh install of R is only a small fraction of what is possible.

  • We refer to what you get after your first install as base R.

    • In a fresh session, you can see which packages are already loaded by typing loadedNamespaces()
  • The extra functionality comes from add-ons available from developers. We call these add-ons packages (similar to Python modules)

    • Currently, thousands of these packages are available from CRAN, and many others are shared via other repositories such as GitHub
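As a quick check of what a fresh install provides, the following sketch lists the namespaces loaded in the current session and the packages that ship with R itself. The object names loaded and base_pkgs are illustrative; the exact list you see may differ slightly between R and RStudio.

```r
# Names of the packages whose namespaces are loaded right now
loaded <- loadedNamespaces()
sort(loaded)

# Packages that ship with R itself (priority "base")
base_pkgs <- rownames(installed.packages(priority = "base"))
base_pkgs

# The stats package is one of them
"stats" %in% base_pkgs
```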

Installing R Packages: Code

  • For example, to install the dslabs package, you would type the following in your console:

    install.packages("dslabs") # DON'T FORGET DOUBLE QUOTE
  • We can then load the package into our R session using the library function in an R script file:

    library(dslabs) # DOUBLE QUOTE IS NOT NEEDED HERE
  • As you go through this class, you will see that we load packages without installing them. This is because once you install a package, it remains installed and only needs to be loaded with library.

  • We can install more than one package at once by feeding a character vector to this function:

    install.packages(c("dplyr", "dslabs"))
  • You can see all the packages you have installed using the following function:

    installed.packages()
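Because an installed package stays installed, a script can check what is missing before calling install.packages(). The sketch below is one possible way to do this; missing_packages() is a hypothetical helper written for this example, not a function from R or from the lecture, and the install line is commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())  # names of all installed packages
  setdiff(pkgs, installed)                     # keep only the ones not found
}

to_install <- missing_packages(c("dplyr", "dslabs"))
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```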

R Package Structure

You will learn:

  1. the various states a package can be in and the difference between a package and a library (and why you should care).

R Package states

  1. When you create or modify a package, you work on its “source code” or “source files”. You interact with the in-development package in its source form.
  2. To better understand packages, we need to know the five states of an R package:
    1. source
    2. bundled
    3. binary
    4. installed
    5. in-memory
  3. We already know two functions:
    1. install.packages() moves a package from the source, bundled, or binary state into the installed state.
    2. library() loads a package from the installed state into memory (the in-memory state).
  4. What are the source, bundled, and binary states then? Why do they differ?

Source package

  1. A source package is just a directory of files with a specific structure, including:
    1. DESCRIPTION file
    2. R/ folder containing all .R files
  2. Many R packages on GitHub are in source state
    1. networkscore: https://github.com/JihongZ/networkscore
    2. esrm64503: https://github.com/JihongZ/ESRM64503
  3. You can also find a tar.gz file on a package’s CRAN landing page via the “Package source” field (this is the bundled state of the package). Decompressing the tar.gz file gives you the source directory, including R/ and DESCRIPTION
    1. forcats: https://cran.r-project.org/web/packages/forcats/index.html
    2. You can decompress it using a terminal command like:
tar xvf forcats_0.4.0.tar.gz
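To make the source structure concrete, here is a minimal sketch that builds the skeleton of a source package in a temporary folder. The package name toypkg, its DESCRIPTION fields, and the hello() function are all invented for illustration; a real package needs more fields and files than this.

```r
# Hypothetical toy package, created in a temporary folder
pkg_dir <- file.path(tempdir(), "toypkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# A very small DESCRIPTION file (real packages have more fields)
writeLines(
  c("Package: toypkg", "Title: A Toy Package", "Version: 0.0.1"),
  file.path(pkg_dir, "DESCRIPTION")
)

# One .R file inside the R/ folder
writeLines("hello <- function() 'hello'", file.path(pkg_dir, "R", "hello.R"))

# The source state is just this directory of files
pkg_files <- list.files(pkg_dir, recursive = TRUE)
pkg_files

# DESCRIPTION is a plain text file that R can read field by field
pkg_version <- read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = "Version")[1, "Version"]
pkg_version
```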

Bundled package

  1. A bundled package is a package that has been compressed into a single file (this process is called building the package).

  2. By convention, package bundles in R use the extension .tar.gz and are sometimes referred to as “source tarballs”. More generally, this is known as the gzipped tar file format.

  3. A “source tarball” is not simply a compressed copy of the source directory. When a source directory is built into a bundle (.tar.gz), a few diagnostic checks and cleanups are performed.

Binary package

  1. If you want to distribute your package to an R user who doesn’t have package development tools, you’ll need to provide a binary package. The main distributor of binary packages is CRAN.

  2. Like a package bundle, a binary package is a single file. Unlike a bundled package, a binary package is platform specific and there are two basic flavors: Windows and macOS.

  3. CRAN packages are usually available in binary form:

    • forcats for macOS: forcats_0.4.0.tgz
    • readxl for Windows: readxl_1.3.1.zip
  4. This is, indeed, part of what’s usually going on behind the scenes when you call install.packages().

  5. Uncompressing a binary file gives you a totally different file structure from that of a source or bundled package.

    • There are no .R files in the R/ directory - instead there are three files that store the parsed functions in an efficient file format.

Installed package

  1. An installed package is a binary package that’s been decompressed into a package library.

  2. In practice, you don’t need to care about these states if you install popular packages, unless you have issues installing an R package via install.packages() or you install in-development packages.
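You can look at an installed package on your own machine to see how it differs from the source form. The sketch below inspects the stats package, chosen only because it ships with every R installation; the object names are illustrative, and the exact files listed may vary between R versions.

```r
# Where is the installed copy of stats? (stats ships with R)
stats_dir <- system.file(package = "stats")
stats_dir

# Top-level contents of the installed package
list.files(stats_dir)

# Contents of its R/ folder: parsed functions in a compact format, not .R scripts
r_files <- list.files(file.path(stats_dir, "R"))
r_files
```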

In-memory package

  1. When we use the library() function, we load an installed package into the memory of R.
  2. This is the last step of using the package in our R task.
  3. When you call library(somepackage), R looks through the current libraries for an installed package named “somepackage” and, if successful, it makes somepackage available for use.
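The following sketch traces the move from the installed state to the in-memory state. It uses splines only because that package ships with R and is usually not attached at startup; the object names are illustrative.

```r
pkg <- "splines"   # ships with R, so no installation is needed

# The libraries (folders of installed packages) that R searches
.libPaths()

# Where the installed package was found
find.package(pkg)

# Is the package attached to the search path yet?
attached_before <- paste0("package:", pkg) %in% search()
attached_before

# Load it into memory; character.only = TRUE because pkg holds the name as a string
library(pkg, character.only = TRUE)

attached_after <- paste0("package:", pkg) %in% search()
attached_after   # TRUE: the package is now available for use
```

This also shows the difference between a library and a package: .libPaths() returns libraries, which are folders, while library() attaches a single package found in one of those folders.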

Exercises

Exercise 01: Edit R script (10 mins)

Let’s try the following steps:

  1. Opening a new script as we did before
  2. Give the script a name by saving the new, unnamed script (Ctrl + S on Windows and Cmd + S on Mac)
  3. A good convention is to use a descriptive name, with lower case letters, no spaces, only hyphens to separate words, and then followed by the suffix .R. We will call this script my-first-script.R.
  4. Now we are ready to start editing our first script.
  5. We install an R package called tidyverse in the Console.
  6. We add the code to load the tidyverse package to the script.
  7. Finally, we save the script.