Getting Started
Educational Statistics and Research Methods (ESRM) Program*
University of Arkansas
2024-10-09
Install and load the dslabs package
character/numeric/logical variable
vector
list (data.frame)
matrix
The most common way of storing a dataset in R is in a data frame.
We use the murders data in the dslabs package as the example:
We can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns.
Most pre-built datasets are stored in R packages. For example, you can access the murders data by loading the dslabs package.
Some helpful functions tell you more about the structure of a data frame.
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
state abb region population total
46 Vermont VT Northeast 625741 2
47 Virginia VA South 8001024 250
48 Washington WA West 6724540 93
49 West Virginia WV South 1852994 27
50 Wisconsin WI North Central 5686986 97
51 Wyoming WY West 563626 5
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
NULL
NULL
[1] "state" "abb" "region" "population" "total"
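Output of the kind shown above typically comes from functions such as str, head, tail, names, and the $ accessor. The following is a minimal sketch on a small stand-in data frame, assuming the hypothetical name mini and using three rows copied from the murders output above, so that it runs without the dslabs package.

```r
# Hypothetical stand-in for murders: three rows copied from the output above
mini <- data.frame(
  state      = c("Vermont", "Virginia", "Wyoming"),
  abb        = c("VT", "VA", "WY"),
  region     = factor(c("Northeast", "South", "West")),
  population = c(625741, 8001024, 563626),
  total      = c(2, 250, 5)
)

str(mini)       # compact summary: dimensions, column types, first values
head(mini, 2)   # first two rows
tail(mini, 2)   # last two rows
names(mini)     # column names

pop  <- mini$population       # accessor: extract one column as a vector
pop2 <- mini[["population"]]  # the same column via double square brackets
identical(pop, pop2)          # both approaches return the same vector
```

With dslabs installed, the same calls can be applied to murders directly.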
The object murders$population is not one number but several. We call these types of objects vectors.
A vector is a combination of values of the same type (except NA).
A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries.
Some helpful functions tell you more about the vector
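As a rough illustration of such helper functions, the sketch below uses length and class on short vectors. The names pop, abb, and single are hypothetical; the values of pop and abb are the first three entries shown in the murders output above.

```r
pop <- c(4779736, 710231, 6392017)  # first three population values shown above
length(pop)   # number of entries
class(pop)    # "numeric"

abb <- c("AL", "AK", "AZ")          # first three state abbreviations
class(abb)    # "character"

single <- 3
length(single)  # a single number is a vector of length 1
```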
One special vector is the logical vector. Each entry must be either TRUE (1) or FALSE (0), which are special values in R.
[1] TRUE
[1] FALSE
[1] FALSE FALSE TRUE TRUE TRUE
[1] 7 8 10
[1] "logical"
[1] FALSE FALSE TRUE FALSE FALSE
Here, > is a relational operator. You can use relational operators such as <, <=, >=, == (is equal to), and != (is not equal to) to compare variables in vectors element-wise. The identical function can be used to determine whether two objects are the same.
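The relational operators and identical can be sketched as follows. The vector x and its values are hypothetical, chosen only for illustration.

```r
x <- c(1, 3, 7, 8, 10)   # hypothetical values

big <- x > 5             # element-wise comparison returns a logical vector
big                      # FALSE FALSE TRUE TRUE TRUE
x[big]                   # keep only the entries where big is TRUE: 7 8 10
class(big)               # "logical"

x == 7                   # FALSE FALSE TRUE FALSE FALSE
x != 7                   # TRUE TRUE FALSE TRUE TRUE

# identical() compares whole objects and returns a single TRUE or FALSE
same <- identical(x, c(1, 3, 7, 8, 10))

sum(big)                 # TRUE counts as 1 and FALSE as 0, so this is 3
```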
Factors are another type of vectors that are useful for storing categorical data with a few levels.
For example, in the murders dataset, we can see there are only 4 regions:
[1] "factor"
[1] "Northeast" "South" "North Central" "West"
Northeast South North Central West
9 17 12 13
reorder function
[1] "Northeast" "South" "North Central" "West"
region <- murders$region # the region factor from the murders data
value <- murders$total # the total number of murders in each state
region_ordered <- reorder(region, value, FUN = sum)
levels(region_ordered)
[1] "Northeast" "North Central" "West" "South"
south_as_reference <- factor(region, levels = c("South", "Northeast", "North Central", "West"))
levels(south_as_reference)
[1] "South" "Northeast" "North Central" "West"
Matrices are similar to data frames in that they are two-dimensional: they have rows and columns.
However, like numeric, character and logical vectors, entries in matrices have to be all the same type.
Yet matrices have a major advantage over data frames: we can perform matrix algebra operations, a powerful type of mathematical technique.
We can define a matrix using the matrix function. We need to specify the data in the matrix as well as the number of rows and columns.
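A minimal sketch of defining a matrix, assuming the hypothetical names m and m_byrow and the values 1 through 12:

```r
# 4 rows, 3 columns; matrix() fills column by column by default
m <- matrix(1:12, nrow = 4, ncol = 3)
m
dim(m)   # 4 3

# byrow = TRUE fills row by row instead
m_byrow <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
m_byrow

# Matrix algebra: the 3 x 3 product of the transpose of m with m
cross <- t(m) %*% m
dim(cross)   # 3 3
```

All entries here are integers, which satisfies the requirement that a matrix holds a single type.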
You can access specific entries in a matrix using square brackets ([). If you want the second row, third column, you use:
If you want the entire second row, you leave the column spot empty:
Similarly, if you want the entire third column, or the second and third columns, you leave the row spot empty:
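The indexing rules above can be sketched on a small example. The matrix mat is hypothetical and is filled with the values 1 through 12.

```r
mat <- matrix(1:12, nrow = 4, ncol = 3)  # hypothetical 4 x 3 matrix

mat[2, 3]    # second row, third column: 10
mat[2, ]     # entire second row: 2 6 10
mat[, 3]     # entire third column: 9 10 11 12
mat[, 2:3]   # columns 2 to 3, returned as a 4 x 2 matrix
```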
Make sure the US murders dataset is loaded. Use the function str to examine the structure of the murders object. Which of the following best describes the variables represented in this data frame?
str shows no relevant information.
What are the column names used by the data frame for these five variables?
Use the accessor $ to extract the state abbreviations and assign them to the object a. What is the class of this object?
Now use the square brackets to extract the state abbreviations and assign them to the object b. Use the identical function to determine if a and b are the same.
We saw that the region column stores a factor. You can corroborate this by typing:
With one line of code, use the functions levels and length to determine the number of regions defined by this dataset.
table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of number of states per region.
When writing code in R, it’s important to choose variable names that are both meaningful and avoid conflicts with existing functions or reserved words in the language.
Some basic rules in R are that variable names have to start with a letter, can’t contain spaces, and should not be variables that are predefined in R, such as c.
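# ------------------
# A regular name: starts with a letter, no spaces
this.is.a.numer <- 3
print(this.is.a.numer)
# ------------------
# 2 <- this.is.a.numer   # error: a bare number is not a valid name to assign to
`2` <- this.is.a.numer   # backticks allow a name that breaks the usual rules
print(2)                 # prints the number 2, not the variable
print(`2`)               # prints 3, the value stored in the variable `2`
# ------------------
`@_@` <- "funny"
paste0("You look like ", `@_@`)
# ------------------
`@this` <- 3
print(`@this`)
# ------------------
# Names with spaces also work inside backticks, but are best avoided
`I hate R programming` <- "No, you don't"
print(`I hate R programming`)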
We use <- to assign values to the variables. We can also assign values using = instead of <-, but we recommend against using = to avoid confusion.
So we have objects in the global environment of the R session. The workspace (global environment) is the place that stores the objects we can use.
You can see all the variables saved in your workspace by typing:
In RStudio, the Environment tab shows the values:
We should see coef_a, coef_b, and coef_c.
Missing R object in workspace: If you try to recover the value of a variable that is not in your workspace, you receive an error. For example, if you type some_random_object, you will receive the following message: Error: object 'some_random_object' not found.
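The behavior described above can be sketched as follows. The values assigned to coef_a, coef_b, and coef_c are assumptions made only for illustration, and tryCatch is used here so that the error is captured as text instead of stopping the script.

```r
# Hypothetical coefficient values
coef_a <- 1
coef_b <- 1
coef_c <- -1

ls()                          # lists the objects currently in the workspace
exists("coef_a")              # TRUE: the object has been defined
exists("some_random_object")  # FALSE, assuming nothing by that name was defined

# Asking for an undefined object raises an error; capture its message
msg <- tryCatch(some_random_object,
                error = function(e) conditionMessage(e))
msg
```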
Now since these values are saved in variables, to obtain a solution to our equation, we use the quadratic formula:
Operators
-: a negation operator that switches the sign of an object
+, *, and /: addition, multiplication, and division
sqrt: a prebuilt R function that calculates the square root of an object
^: exponent operator that calculates the “power” of the “base”; a^3 is a to the 3rd power
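Using the operators listed above, the quadratic formula can be sketched as follows. The coefficient values are assumptions (they correspond to the equation x^2 + x - 1 = 0), and the names discriminant, root_1, root_2, and residual are hypothetical.

```r
# Hypothetical coefficients of a*x^2 + b*x + c = 0
coef_a <- 1
coef_b <- 1
coef_c <- -1

# Quadratic formula: (-b +/- sqrt(b^2 - 4ac)) / (2a)
discriminant <- coef_b^2 - 4 * coef_a * coef_c
root_1 <- (-coef_b + sqrt(discriminant)) / (2 * coef_a)
root_2 <- (-coef_b - sqrt(discriminant)) / (2 * coef_a)
root_1
root_2

# Check: plugging a root back into the equation should give roughly 0
residual <- coef_a * root_1^2 + coef_b * root_1 + coef_c
residual
```

Note the parentheses around 2 * coef_a: without them, R would divide by 2 and then multiply by coef_a.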
Help: You can find out what a function expects and what it does by reviewing the very useful manuals included in R. You can get help by using the help function like this:
The help page will show you what arguments the function is expecting. For example, log needs x and base to run.
The base of the function log defaults to base = exp(1), making log the natural log by default.
You can also use args to look at the arguments:
If specifying the arguments’ names, then we can include them in whatever order we want:
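A short sketch of these points, using log with the illustrative values 8 and 2:

```r
args(log)             # shows the arguments: x and base (default exp(1))

log(8, base = 2)      # arguments matched by position and by name
log(x = 8, base = 2)  # both arguments named
log(base = 2, x = 8)  # named arguments can come in any order

natural <- log(exp(1))  # with the default base, log is the natural log
natural
```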
There are several datasets or values that are included for users to practice and test out functions. For example, you can use \(\pi\) in your calculation directly:
Or infinity value \(\infty\):
You can see all the available datasets by typing:
For example, if you type iris, it will output the famous (Fisher’s or Anderson’s) iris data set, which gives the measurements in centimeters of the variables sepal length and width and petal length and width:
You can check the detailed help page of iris using ?, as we did for functions.
Objects and functions remain in the workspace until you end your session or erase them with the function rm.
Autosave: Your current workspace can also be saved for later use in a .RData file. If you do save it, the next time you start R, the program will restore the workspace.
ManualSave: We actually recommend against saving the workspace this way because, as you start working on different projects, it will become harder to keep track of what is saved.
If a line of R code starts with the symbol #, it is a comment and is not evaluated.
But be careful not to confuse the single quote ' with the back quote `.
Without the quotes, you receive an error because the variables italy, canada, and egypt are not defined: R looks for variables with those names and does not find them.
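The difference between quoted and unquoted entries can be sketched as follows. The object names country and result are hypothetical, and tryCatch is used only so the error is captured as text instead of stopping the script.

```r
# With quotes: a character vector of three strings
country <- c("italy", "canada", "egypt")
class(country)   # "character"

# Without quotes: R looks for variables named italy, canada, and egypt.
# Assuming none of them is defined, this raises an error, captured here.
result <- tryCatch(c(italy, canada, egypt),
                   error = function(e) conditionMessage(e))
result
```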
Note
The first argument defines the start, and the second defines the end which is included.
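A brief sketch of this behavior with seq; the names s1, s2, and s3 are hypothetical.

```r
s1 <- seq(1, 10)            # start at 1, end at 10; the end value is included
s1

s2 <- seq(1, 10, by = 2)    # a third argument sets the step: 1 3 5 7 9
s2

s3 <- 1:10                  # shorthand for consecutive integers
s3
```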
What is the sum of the first 100 positive integers? The formula for the sum of integers 1 through n is \(n(n+1)/2\). Define \(n=100\) and then use R to compute the sum of 1 through 100 using the formula. What is the sum?
Now use the same formula to compute the sum of the integers from 1 through 1000.
Look at the result of typing the following code into R:
What do the functions seq and sum do? You can use help.
When we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log (use the log() function), in base 10, of the square root of 100.
Note
R coerced the data into characters. It guessed that because you put a character string in the vector, you meant the 1 and 3 to actually be character strings "1" and "3".
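The coercion described in the note, and the NA value that appears when a conversion is impossible, can be sketched as follows. The vector contents are hypothetical, and suppressWarnings is used only to keep the expected coercion warning out of the output.

```r
x <- c(1, "canada", 3)   # mixing numbers and a string
x                        # "1" "canada" "3"
class(x)                 # "character"

# Converting back to numbers: "canada" has no numeric value, so it becomes NA
y <- suppressWarnings(as.numeric(x))
y                        # 1 NA 3
is.na(y)                 # FALSE TRUE FALSE

mean(y)                  # NA, because one entry is missing
mean(y, na.rm = TRUE)    # 2, computed after removing the NA
sum(y, na.rm = TRUE)     # 4
```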
R uses the special value NA for “not available”:
You can also calculate summary statistics, such as sum or mean, of a vector that includes NA.
Use the na.rm argument to remove the NA from the vector.
sort
[1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
[16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
[31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
[46] 413 457 517 669 805 1257
order:
[1] 46 35 30 51 12 42 20 13 27 40 2 16 45 49 28 38 8 24 17 6 32 29 4 48 7
[26] 50 9 37 18 22 25 1 15 41 43 3 31 47 34 21 36 26 19 14 11 23 39 33 10 44
[51] 5
state abb region population total
46 Vermont VT Northeast 625741 2
35 North Dakota ND North Central 672591 4
30 New Hampshire NH Northeast 1316470 5
51 Wyoming WY West 563626 5
12 Hawaii HI West 1360301 7
42 South Dakota SD North Central 814180 8
20 Maine ME Northeast 1328361 11
13 Idaho ID West 1567582 12
27 Montana MT West 989415 12
40 Rhode Island RI Northeast 1052567 16
2 Alaska AK West 710231 19
16 Iowa IA North Central 3046355 21
45 Utah UT West 2763885 22
49 West Virginia WV South 1852994 27
28 Nebraska NE North Central 1826341 32
38 Oregon OR West 3831074 36
8 Delaware DE South 897934 38
24 Minnesota MN North Central 5303925 53
17 Kansas KS North Central 2853118 63
6 Colorado CO West 5029196 65
32 New Mexico NM West 2059179 67
29 Nevada NV West 2700551 84
4 Arkansas AR South 2915918 93
48 Washington WA West 6724540 93
7 Connecticut CT Northeast 3574097 97
50 Wisconsin WI North Central 5686986 97
9 District of Columbia DC South 601723 99
37 Oklahoma OK South 3751351 111
18 Kentucky KY South 4339367 116
22 Massachusetts MA Northeast 6547629 118
25 Mississippi MS South 2967297 120
1 Alabama AL South 4779736 135
15 Indiana IN North Central 6483802 142
41 South Carolina SC South 4625364 207
43 Tennessee TN South 6346105 219
3 Arizona AZ West 6392017 232
31 New Jersey NJ Northeast 8791894 246
47 Virginia VA South 8001024 250
34 North Carolina NC South 9535483 286
21 Maryland MD South 5773552 293
36 Ohio OH North Central 11536504 310
26 Missouri MO North Central 5988927 321
19 Louisiana LA South 4533372 351
14 Illinois IL North Central 12830632 364
11 Georgia GA South 9920000 376
23 Michigan MI North Central 9883640 413
39 Pennsylvania PA Northeast 12702379 457
33 New York NY Northeast 19378102 517
10 Florida FL South 19687653 669
44 Texas TX South 25145561 805
5 California CA West 37253956 1257
rank is also related to order and can be useful:
order and sort are functions for sorting the data frame. rank is easier to use for filtering cases with a specific ranking.
You can summarize a numeric vector using some familiar math terms
min, max, mean, median, and sd are R functions that calculate summary statistics of a vector.
You can get the position of the largest number, the lowest number, or a certain number
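The relationship between sort, order, and rank, together with the summary and position functions, can be sketched on a short vector. The vector x and its values are hypothetical.

```r
x <- c(31, 4, 15, 92, 65)   # hypothetical values

sort(x)        # values in increasing order: 4 15 31 65 92
order(x)       # indexes that would sort x: 2 3 1 5 4
x[order(x)]    # same result as sort(x)
rank(x)        # rank of each entry in place: 3 1 2 5 4

# Summary statistics
c(min = min(x), max = max(x), mean = mean(x), median = median(x), sd = sd(x))

# Positions
which.max(x)     # position of the largest number: 4
which.min(x)     # position of the smallest number: 2
which(x == 15)   # position of a certain number: 3
```

Because order returns indexes, it can be used to reorder other vectors or the rows of a data frame in the same way.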
Arithmetic operations on vectors occur element-wise. For a quick example, suppose we have height in inches:
To convert to centimeters, multiply inches by 2.54:
[1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
\[ \begin{bmatrix}a\\b\\c\\d\end{bmatrix} + \begin{bmatrix}e\\f\\g\\h\end{bmatrix} = \begin{bmatrix}a+e\\b+f\\c+g\\d+h\end{bmatrix} \]
The same holds for other mathematical operations, such as -, *, and /.
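Element-wise arithmetic can be sketched as follows. The heights in inches are an assumption: they are back-calculated from the centimeter output shown above by dividing by 2.54. The name cm is hypothetical.

```r
# Heights in inches, inferred from the centimeter values shown above
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

cm <- inches * 2.54   # each entry is multiplied by 2.54
cm

inches - 69           # each entry minus 69: difference from 69 inches

# Two vectors of the same length are combined entry by entry
c(1, 2, 3) + c(10, 20, 30)   # 11 22 33
```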
[1] "Vermont" "New Hampshire" "Hawaii"
[4] "North Dakota" "Iowa" "Idaho"
[7] "Utah" "Maine" "Wyoming"
[10] "Oregon" "South Dakota" "Minnesota"
[13] "Montana" "Colorado" "Washington"
[16] "West Virginia" "Rhode Island" "Wisconsin"
[19] "Nebraska" "Massachusetts" "Indiana"
[22] "Kansas" "New York" "Kentucky"
[25] "Alaska" "Ohio" "Connecticut"
[28] "New Jersey" "Alabama" "Illinois"
[31] "Oklahoma" "North Carolina" "Nevada"
[34] "Virginia" "Arkansas" "Texas"
[37] "New Mexico" "California" "Florida"
[40] "Tennessee" "Pennsylvania" "Arizona"
[43] "Georgia" "Mississippi" "Michigan"
[46] "Delaware" "South Carolina" "Maryland"
[49] "Missouri" "Louisiana" "District of Columbia"
Warning
So if the vectors don’t match in length, it is natural to assume that we should get an error. But we don’t. Notice what happens: the shorter vector x recycles. Another common source of unnoticed errors in R is the use of recycling.
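Recycling can be sketched with two vectors of different lengths. The values of x and y are hypothetical; tryCatch and suppressWarnings are used only to show the warning and the result separately.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30, 40, 50, 60, 70)

# R does not stop with an error; it only issues a warning
warn_msg <- tryCatch(x + y, warning = function(w) conditionMessage(w))
warn_msg

# The result reuses the entries of x in order: 1, 2, 3, 1, 2, 3, 1
z <- suppressWarnings(x + y)
z   # 11 22 33 41 52 63 71
```

When the longer length is an exact multiple of the shorter one, R recycles without any warning, which is why this kind of mistake is easy to miss.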
R provides a powerful and convenient way of indexing vectors. We can, for example, subset a vector based on properties of another vector.
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[49] FALSE FALSE FALSE
[1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
[5] "Vermont"
[1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
[5] "Vermont"
Note
Note that we get back a logical vector with TRUE for each entry smaller than or equal to 0.71. To see which states these are, we can leverage the fact that vectors can be indexed with logicals.
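Indexing with logicals can be sketched on a small subset. The five rows below are copied from the murders output shown earlier; the vector names, the rate per 100,000 calculation, and the second cutoff of 1 are assumptions made for illustration.

```r
# Five rows copied from the murders output above
state      <- c("Vermont", "Hawaii", "Wyoming", "Texas", "California")
region     <- c("Northeast", "West", "West", "South", "West")
population <- c(625741, 1360301, 563626, 25145561, 37253956)
total      <- c(2, 7, 5, 805, 1257)

# Assumed definition: murders per 100,000 people
murder_rate <- total / population * 100000

ind <- murder_rate <= 0.71   # logical vector, one entry per state
ind
state[ind]                   # states with TRUE: "Vermont" "Hawaii"
sum(ind)                     # number of TRUE entries: 2

# Two conditions combined with &: in the West and rate of at most 1
west <- region == "West"
safe <- murder_rate <= 1
state[west & safe]           # "Hawaii" "Wyoming"
```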
We can use & to get a vector of logicals that tells us which states satisfy both conditions:
which tells us which entries of a logical vector are TRUE. So we can type:
[1] 33 10 44
[1] 2.667960 3.398069 3.201360
%in%: Imagine you are not sure if Boston, Dakota, and Washington are states.
[1] FALSE FALSE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[49] FALSE FALSE FALSE
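The functions which and match and the operator %in% can be sketched as follows. As an assumption, this uses state.name, a vector of the 50 state names that ships with base R, in place of murders$state, so the index values differ from the output shown above (murders also includes the District of Columbia).

```r
# which: positions where a logical vector is TRUE
ind <- which(state.name == "Washington")
state.name[ind]    # "Washington"

# match: positions of several entries at once, in the order requested
wanted <- c("New York", "Florida", "Texas")
idx <- match(wanted, state.name)
state.name[idx]    # "New York" "Florida" "Texas"

# %in%: is each element of the first vector present in the second?
is_state <- c("Boston", "Dakota", "Washington") %in% state.name
is_state           # FALSE FALSE TRUE
```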
1 We will use the US murders dataset for these exercises. Make sure you load it prior to starting. Use the $ operator to access the population size data and store it as the object pop. Then use the sort function to redefine pop so that it is sorted. Finally, use the [ operator to report the smallest population size.
2 Now instead of the smallest population size, find the index of the entry with the smallest population size. Hint: use order instead of sort.
3 We can actually perform the same operation as in the previous exercise using the function which.min. Write one line of code that does this.
4 Now we know how small the smallest state is and we know which row represents it. Which state is it? Define a variable states to be the state names from the murders data frame. Report the name of the state with the smallest population.
Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.
Repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most populous. Hint: create an object ind that stores the indexes needed to order the population values. Then use the bracket operator [ to re-order each column in the data frame.
The na_example vector represents a series of counts. You can quickly examine the object using:
However, when we compute the average with the function mean, we obtain an NA:
The is.na function returns a logical vector that tells us which entries are NA. Assign this logical vector to an object called ind and determine how many NAs na_example has.
ESRM 64503 - Lecture 02: Object/Function/Package