Getting Started
Educational Statistics and Research Methods (ESRM) Program*
University of Arkansas
2024-10-09
To do data clean, data analysis, or statistics, we need to store some information and manipulate it in R. The information that we can create/change/remove is called R object.
Suppose we want to solve several quadratic equations of the form \(x^2 + x - 1 = 0\). We know that the quadratic formula gives us the solutions:
\[ \frac{-b\pm\sqrt{b^2 -4ac}}{2a} \]
The solution depend on the values of a, b, and c. One advantage of programming languages is that we can define variables and write expressions with these variables
We use <-
to assign values to the variables. We can also assign values using =
instead of <-
, but we recommend against using =
to avoid confusion.
To see the value stored in a variable, we simply ask R to evaluate coef_a
and it shows the stored value:
coef_a
is using print
function like this (print
is a prebuilt function in R, we will explain later):So we have object, then where they are stored in R. The workspace is the place storing the objects we can use:
You can see all the variables saved in your workspace by typing:
In RStudio, the Environment tab shows the values:
We should see coef_a
, coef_b
, and coef_c
.
Missing R object in workspace: If you try to recover the value of a variable that is not in your workspace, you receive an error. For example, if you type some_random_object
you will receive the following message: Error: object 'some_random_object' not found
.
Now since these values are saved in variables, to obtain a solution to our equation, we use the quadratic formula:
Operators
-
: is a negative operator which switches the sign of object
+
and *
and /
: addition, multiplication, and division
sqrt
: a prebuilt R function of calculating the squared root of the object
^
: exponent operator to calculate the “power” of the “base”; a^3
: a to the 3rd power
Functions: Once we defined the objects, the data analysis process can usually be described as a series of functions applied to the data.
In other words, we considered “function” as a set of pre-specified operations (e.g., macro in SAS)
R includes several predefined functions and most of the analysis pipelines we construct make extensive use of these.
We already used or discussed the install.packages
, library
, and ls
functions. We also used the function sqrt
to solve the quadratic equation above.
Evaluation: In general, we need to use parentheses followed by a function name to evaluate a function.
ls
, the function is not evaluated and instead R shows you the code that defines the function. If you type ls()
the function is evaluated and, as seen above, we see objects in the workspace.Function Arguments: Unlike ls
, most functions require one or more arguments to specify the settings of the function.
For example, we assign different object to the argument of the function log
. Remember that we earlier defined coef_a
to be 1:
Help: You can find out what the function expects and what it does by reviewing the very useful manuals included in R. You can get help by using the help
function like this:
The help page will show you what arguments the function is expecting. For example, log
needs x
and base
to run.
The base of the function log
defaults to base = exp(1)
making log
the natural log by default.
You can also use args
to look at the arguments
If specifying the arguments’ names, then we can include them in whatever order we want:
There are several datasets or values that are included for users to practice and test out functions. For example, you can use \(\pi\) in your calculation directly:
Or infinity value \(\infty\):
You can see all the available datasets by typing:
For example, if you type iris
, it will output the famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width:
You can check the detailed help page of iris
using ?
as we did for functions.
When writing code in R, it’s important to choose variable names that are both meaningful and avoid conflicts with existing functions or reserved words in the language.
For example, avoid using c
as variable name because R has a existing prebuilt function c()
:
Some basic rules in R are that variable names have to start with a letter, can’t contain spaces, and should not be variables that are predefined in R, such as c
.
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces.
Objects and functions remain in the workspace until you end your session or erase them with the function rm
.
Autosave: Your current workspaces also can be saved for later use.
.RData
. If you do save it, the next time you start R, the program will restore the workspace.ManualSave: We actually recommend against saving the workspace this way because, as you start working on different projects, it will become harder to keep track of what is saved.
If a line of R code starts with the symbol #
, it is a comment and is not evaluated.
We can use this to write reminders of why we wrote particular code:
What is the sum of the first 100 positive integers? The formula for the sum of integers 1 through n is \(n(n+1)/2\). Define
\(n=100\) and then use R to compute the sum of 1 through 100 using the formula. What is the sum?
Now use the same formula to compute the sum of the integers from 1 through 1000.
Look at the result of typing the following code into R:
seq
and sum
do? You can use help.In math and programming, we say that we evaluate a function when we replace the argument with a given value. So if we type sqrt(4)
, we evaluate the sqrt
function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.
Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.
Variables in R can be of different types.
The function class
helps us determine what type of object we have:
The most common way of storing a dataset in R is in a data frame.
A data frame is a table with rows representing observations and the different variables reported for each observation defining the columns.
Data frames has multiple informatiom:
We can check the number of rows and number of columns:
We can check the columns names
The function str
is useful for finding out more about the structure of an data.frame
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can show the first 3 or last 3 lines using the function head
and tail
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
we can access the different variables represented by columns included in this data frame using $
or [["column_name"]]
.
Note that if we use ["column_name"]
, it will extract single-column data frame
The object iris$Sepal.Width
is not one number but several. We call these types of objects vectors.
We use the term vectors to refer to objects with several entries. The function length
tells you how many entries are in the vector:
We can also calculate some descriptive statistics using max
, min
, sd
if the vector contains numeric values only
Vector also can have multiple types: numeric, character (string), logistic, factor
You cannot calculate min
/max
/mean
for the factor vector containing string values. It will return not applied (NA
)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
[1] "factor"
[1] NA
You can calculate the mean of the logical vector, which is the proportion of TRUE
values in the vector:
You can also calculate the distribution of the factor vector using table
function:
Factors are useful for storing categorical data
We can see that there are only 3 types of iris by using the levels
function:
In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
Stop at https://rafalab.dfci.harvard.edu/dsbook-part-1/R/R-basics.html#sec-factors
ESRM 64503 - Lecture 08: Multivariate Analysis