coef_a <- 1
coef_b <- 1
coef_c <- -1
Lecture 01: Basics of R
Getting Started
0.1 Today’s Class
- R objects and prebuilt functions
- Data types
- Vectors
- Coercion
- Not available (NA)
- Sorting
- Vector arithmetics
- Indexing
- Basic Plot
1 R Objects and Prebuilt Functions
1.1 Objects
To clean data, analyze data, or do statistics, we need to store information and manipulate it in R. A piece of information that we can create, change, or remove is called an R object.
Suppose we want to solve several quadratic equations of the form ax^2 + bx + c = 0, for example x^2 + x - 1 = 0. We know that the quadratic formula gives us the solutions:
\frac{-b\pm\sqrt{b^2 -4ac}}{2a}
The solutions depend on the values of a, b, and c. One advantage of programming languages is that we can define variables and write expressions with these variables.
We use <- to assign values to variables, as in coef_a <- 1. We can also assign values using = instead of <-, but we recommend against using = to avoid confusion.
To see the value stored in a variable, we simply ask R to evaluate coef_a and it shows the stored value:
coef_a
[1] 1
- A more explicit way to ask R to show us the value stored in coef_a is to use the print function like this (print is a prebuilt function in R, which we will explain later):
print(coef_a)
[1] 1
1.2 Workspace
Now that we have objects, where are they stored in R? The workspace is the place that stores the objects we can use.
You can see all the variables saved in your workspace by typing:
ls()
[1] "coef_a" "coef_b" "coef_c"
In RStudio, the Environment tab shows the values. We should see coef_a, coef_b, and coef_c.
Missing R object in workspace: If you try to recover the value of a variable that is not in your workspace, you receive an error. For example, if you type some_random_object, you will receive the following message: Error: object 'some_random_object' not found.
print(some_random_object)
Error in print(some_random_object): object 'some_random_object' not found
Now since these values are saved in variables, to obtain a solution to our equation, we use the quadratic formula:
(-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
[1] 0.618034
(-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
[1] -1.618034
- -: the negation operator, which switches the sign of an object
- +, *, and /: addition, multiplication, and division
- sqrt: a prebuilt R function that calculates the square root of an object
- ^: the exponent operator, which raises a “base” to a “power”; a^3 is a to the 3rd power
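As a quick sanity check on the formula, the sketch below plugs each solution back into the equation. It assumes the coefficients of the running example (x^2 + x - 1 = 0); the names root_plus, root_minus, and the residual variables are ours and are not part of the lecture code.

```r
# Coefficients of x^2 + x - 1 = 0 (the running example)
coef_a <- 1
coef_b <- 1
coef_c <- -1

# The two solutions from the quadratic formula
root_plus  <- (-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
root_minus <- (-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)

# Plug each solution back into a*x^2 + b*x + c
residual_plus  <- coef_a*root_plus^2  + coef_b*root_plus  + coef_c
residual_minus <- coef_a*root_minus^2 + coef_b*root_minus + coef_c

# Computers store decimals approximately, so we compare against a
# small tolerance instead of testing for exactly zero
abs(residual_plus) < 1e-12
abs(residual_minus) < 1e-12
```

Both comparisons should print TRUE, which confirms that the two values really are solutions.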
1.3 Prebuilt functions
Functions: Once we have defined the objects, the data analysis process can usually be described as a series of functions applied to the data.
In other words, we can think of a “function” as a set of pre-specified operations (e.g., a macro in SAS).
R includes several predefined functions, and most of the analysis pipelines we construct make extensive use of these.
We already used or discussed the install.packages, library, and ls functions. We also used the function sqrt to solve the quadratic equation above.
Evaluation: In general, we need to follow a function name with parentheses to evaluate the function.
- If you type ls, the function is not evaluated and instead R shows you the code that defines the function. If you type ls(), the function is evaluated and, as seen above, we see the objects in the workspace.
Function Arguments: Unlike ls, most functions require one or more arguments to specify the settings of the function.
For example, we can pass different objects to the argument of the function log. Remember that we earlier defined coef_a to be 1:
log(8)
[1] 2.079442
log(coef_a)
[1] 0
Help: You can find out what a function expects and what it does by reviewing the very useful manuals included in R. You can get help by using the help function or the ? shorthand like this:
help("log")
?log
The help page will show you what arguments the function is expecting. For example, log takes the arguments x and base.
The base argument of log defaults to base = exp(1), making log the natural log by default.
You can also use args to look at the arguments:
args(log)
function (x, base = exp(1))
NULL
To use a different base, pass the base argument explicitly:
log(x = 8, base = 2)
[1] 3
If we specify the arguments’ names, we can include them in whatever order we want:
log(base = 2, x = 8)
[1] 3
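The short sketch below contrasts named and unnamed arguments for log. It only uses base R functions; log2 and log10 are shortcuts that the lecture does not mention, shown here as an aside.

```r
# Without names, R matches arguments by position: x first, then base
log(8, 2)

# Named and positional arguments can be mixed
log(8, base = 2)

# Without names, order matters: this is the log of 2 in base 8
log(2, 8)

# log2() and log10() are base R shortcuts for two common bases
log2(8)
log10(100)
```

Naming arguments costs a few keystrokes but protects against silently swapping x and base, as in the third call.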
1.4 Prebuilt objects
There are several datasets or values that are included for users to practice and test out functions. For example, you can use \pi in your calculation directly:
pi
[1] 3.141593
Or the infinity value \infty:
Inf + 1
[1] Inf
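A few more calculations show how Inf behaves. This is a small illustration using base R only; is.finite and NaN are not introduced in the lecture and appear here as assumptions about what a reader may find useful.

```r
1 / 0           # dividing a positive number by zero gives Inf
-1 / 0          # -Inf
Inf - Inf       # undefined, so R returns NaN ("not a number")
is.finite(Inf)  # FALSE
is.finite(pi)   # TRUE
```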
You can see all the available datasets by typing:
data()
For example, if you type iris, R prints the famous (Fisher’s or Anderson’s) iris data set, which gives the measurements in centimeters of the variables sepal length and width and petal length and width:
iris
?iris
You can check the detailed help page of iris using ? as we did for functions.
1.5 Variable Names
When writing code in R, it’s important to choose variable names that are meaningful and that avoid conflicts with existing functions or reserved words in the language.
For example, avoid using c as a variable name because R has an existing prebuilt function c():
c(1, 2)
[1] 1 2
Some basic rules in R are that variable names have to start with a letter, can’t contain spaces, and should not be variables that are predefined in R, such as c.
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces.
r_1 <- (-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
r_2 <- (-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
r_1
[1] 0.618034
r_2
[1] -1.618034
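One way to check a candidate name before using it is sketched below. It relies on the base R functions make.names and exists; the vector candidate_names and its entries are made up for illustration.

```r
# Hypothetical candidate names, some valid and some not
candidate_names <- c("coef_a", "r_1", "2nd_root", "my value")

# make.names() rewrites each string into a syntactically valid name;
# a string that comes back unchanged was already valid
make.names(candidate_names)
is_valid <- make.names(candidate_names) == candidate_names
is_valid

# exists() reports whether a name is already taken, for example by base R
exists("c")   # TRUE: c() is a prebuilt function
```

The last two candidates fail because one starts with a digit and the other contains a space.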
1.6 Saving workspace
Objects and functions remain in the workspace until you end your session or erase them with the function rm.
Autosave: Your current workspace can also be saved for later use.
- In fact, when you quit R, RStudio asks you if you want to save your workspace as .RData. If you do save it, the next time you start R, the program will restore the workspace.
ManualSave: We actually recommend against saving the workspace this way because, as you start working on different projects, it will become harder to keep track of what is saved. Instead, save the workspace under a specific name with save.image and restore it with load:
save.image(file = "project_lecture_2.RData") # Save workspace as project_lecture_2.RData
load("project_lecture_2.RData") # Load workspace
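The sketch below walks through a full save, remove, and restore cycle. It is an illustration under a few assumptions: it saves only two named objects with save (rather than the whole workspace), and it writes to a temporary file via tempfile so that no project folder is touched. The name rdata_path is ours.

```r
coef_a <- 1
coef_b <- 1

# tempfile() returns a throwaway path, so the example leaves no clutter
rdata_path <- tempfile(fileext = ".RData")

# save() writes only the objects we name
save(coef_a, coef_b, file = rdata_path)

# Erase the objects from the workspace with rm()
rm(coef_a, coef_b)

# load() restores the saved objects under their original names
load(rdata_path)
coef_a
coef_b
```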
1.7 Commenting your code
If a line of R code starts with the symbol #, it is a comment and is not evaluated.
We can use this to write reminders of why we wrote particular code:
## Code to compute solution to quadratic equation
## Define the variables
coef_a <- 3
coef_b <- 2
coef_c <- -1
## Now compute the solution
(-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
[1] 0.3333333
(-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
[1] -1
2 Exercises (30 minutes)
What is the sum of the first 100 positive integers? The formula for the sum of integers 1 through n is n(n+1)/2. Define n <- 100 and then use R to compute the sum of 1 through 100 using the formula. What is the sum?
Now use the same formula to compute the sum of the integers from 1 through 1000.
Look at the result of typing the following code into R:
n <- 1000
x <- seq(1, n)
sum(x)
[1] 500500
- Based on the result, what do you think the functions seq and sum do? You can use help.
In math and programming, we say that we evaluate a function when we replace the argument with a given value. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.
Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.
3 Data types
3.1 Check types of object
Variables in R can be of different types.
- For example, we need to distinguish numbers from character strings and tables from simple lists of numbers.
The function class helps us determine what type of object we have:
a <- 2
class(a)
[1] "numeric"
b <- "Jihong"
class(b)
[1] "character"
c <- TRUE
class(c)
[1] "logical"
class(iris)
[1] "data.frame"
- To work efficiently in R, it is important to learn the different types of variables and what we can do with these.
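Alongside class, base R offers is.* functions that answer yes/no questions about an object's type. The sketch below is a small illustration; the integer literal 2L is not covered in the lecture and is included only to show that "numeric" spans more than one storage type.

```r
class(2)                # "numeric": numbers are stored as doubles by default
class(2L)               # "integer": the L suffix requests an integer
is.numeric(2L)          # TRUE: integers also count as numeric
is.character("Jihong")  # TRUE
is.logical(TRUE)        # TRUE
is.data.frame(iris)     # TRUE
```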
3.2 Data frames
The most common way of storing a dataset in R is in a data frame.
A data frame is a table with rows representing observations and the different variables reported for each observation defining the columns.
A data frame carries several pieces of information:
We can check the number of rows and number of columns:
nrow(iris)
[1] 150
ncol(iris)
[1] 5
We can check the column names:
colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
The function str is useful for finding out more about the structure of a data.frame:
str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can show the first 3 or last 3 rows using the functions head and tail:
head(iris, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
tail(iris, n = 3)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
3.2.1 Accessing a column
We can access the different variables represented by columns included in this data frame using $ or [["column_name"]].
head(iris$Sepal.Width)
[1] 3.5 3.0 3.2 3.1 3.6 3.9
head(iris[["Sepal.Length"]])
[1] 5.1 4.9 4.7 4.6 5.0 5.4
Note that if we use ["column_name"], it extracts a single-column data frame rather than a vector:
class(iris["Sepal.Length"])
[1] "data.frame"
ncol(iris["Sepal.Length"])
[1] 1
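The three access styles can be compared side by side. The sketch below uses only the iris data and base R; the variable names column_name and two_columns are hypothetical helpers introduced for the example.

```r
# $ and [[ ]] return the same numeric vector
identical(iris$Sepal.Length, iris[["Sepal.Length"]])

# [[ ]] also accepts a column name stored in a variable, which $ does not
column_name <- "Petal.Width"
head(iris[[column_name]], n = 3)

# [ ] keeps the data frame structure, so it can select several columns at once
two_columns <- iris[c("Sepal.Length", "Species")]
class(two_columns)
dim(two_columns)
```

A rule of thumb: use $ or [[ ]] when you want the values themselves, and [ ] when you want a smaller data frame.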
3.3 Vectors
The object iris$Sepal.Width is not one number but several. We call these types of objects vectors.
We use the term vectors to refer to objects with several entries. The function length tells you how many entries are in the vector:
Sepal.Width_values <- iris$Sepal.Width
class(Sepal.Width_values)
[1] "numeric"
length(Sepal.Width_values)
[1] 150
We can also calculate some descriptive statistics using max, min, mean, and sd if the vector contains numeric values only:
max(Sepal.Width_values)
[1] 4.4
min(Sepal.Width_values)
[1] 2
mean(Sepal.Width_values)
[1] 3.057333
sd(Sepal.Width_values)
[1] 0.4358663
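Several of these statistics can be obtained in a single call. The sketch below uses the base R functions range, median, and summary, which the lecture does not cover; sepal_width is our own lower-case name for the same column.

```r
sepal_width <- iris$Sepal.Width

range(sepal_width)    # smallest and largest value together
median(sepal_width)   # middle value of the sorted entries
summary(sepal_width)  # minimum, quartiles, mean, and maximum in one call
```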
Vectors can also be of several types: numeric, character (string), logical, and factor.
You cannot calculate min/max/mean for a factor vector containing string labels. mean returns NA (not available) with a warning, while min and max stop with an error:
Species_names <- iris$Species
head(Species_names)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
class(Species_names)
[1] "factor"
mean(Species_names)
[1] NA
- mean also returns NA for a character vector:
student_names <- c("Tom", "Jimmy", "Emily")
class(student_names)
[1] "character"
mean(student_names)
[1] NA
You can calculate the mean of a logical vector, which is the proportion of TRUE values in the vector:
is_student_male <- c(TRUE, TRUE, FALSE)
class(is_student_male)
[1] "logical"
mean(is_student_male)
[1] 0.6666667
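The mean works because R treats TRUE as 1 and FALSE as 0 in arithmetic. The sketch below makes that conversion visible; the vector is_wide and the cutoff of 3 are made up for illustration.

```r
is_student_male <- c(TRUE, TRUE, FALSE)

as.numeric(is_student_male)   # 1 1 0
sum(is_student_male)          # counts the TRUE values: 2

# Dividing the count by the length reproduces the mean
sum(is_student_male) / length(is_student_male)

# A comparison on a numeric vector produces a logical vector,
# so mean() gives the proportion of entries that pass the test
is_wide <- iris$Sepal.Width > 3
mean(is_wide)
```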
You can also tabulate the distribution of a factor vector using the table function:
table(Species_names)
Species_names
    setosa versicolor  virginica 
        50         50         50 
3.4 Factor
Factors are useful for storing categorical data:
class(iris$Species)
[1] "factor"
We can see that there are only 3 types of iris by using the levels function:
levels(iris$Species)
[1] "setosa" "versicolor" "virginica"
In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
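The integer codes behind a factor can be inspected directly. The sketch below builds a small factor of its own; shirt_sizes and its values are invented for the example, and as.integer and nlevels are base R functions not covered in the lecture.

```r
# A hypothetical categorical variable
shirt_sizes <- factor(c("small", "large", "small", "medium"))

# By default the levels are the unique values in sorted order
levels(shirt_sizes)

# as.integer() reveals the code stored for each entry:
# each code is the position of that entry's label in levels()
as.integer(shirt_sizes)

# nlevels() counts the distinct categories
nlevels(shirt_sizes)
nlevels(iris$Species)
```

Here "small" is the third level, so both "small" entries are stored as the integer 3.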
Stop at https://rafalab.dfci.harvard.edu/dsbook-part-1/R/R-basics.html#sec-factors