The main purpose of R is to manipulate ‘objects’ to accomplish tasks.
Your goal is to assign objects and then use functions to manipulate
them.
There are many types (or classes) of objects. Many functions are
specifically tailored to deal with specific types of objects. Therefore,
it is critical that you understand the distinctions between different
types of objects, and how to best make use of each. Some packages
generate special types of objects, which can then be manipulated or
analyzed in special ways. Here, we will cover some of the most common
types of objects you will encounter.
Object Type | Detail |
---|---|
Numeric | Numbers |
Character | Text |
Factor | A set of characters with finite levels |
Logical | TRUE or FALSE |
Date | Dates and times can take on special formats |
Vector | A variable with multiple values of the same type (numeric, character, factor, logical, etc.) |
Matrix | A two-dimensional array of values of a single type, most often numbers |
Array | A set of values of a single type arranged in any number of dimensions. For example, you can have a three-dimensional array, which is essentially a stack of matrices. |
Data frame | A two-dimensional object in which each column is a vector (numeric, character, etc.) of the same length. What you typically think of as a spreadsheet. |
List | A bundle of any set of components. Each element in a list can be any kind of object. Once you get used to them, lists are very useful. |
Other types of objects
Aside from these common types of objects, there are all sorts of other specialized objects that are outputs of specific functions. For example, the output of a specific statistical analysis (say, linear models, using the function lm()). But at the end of the day, even these are typically customized lists composed of the objects described above.
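To make the point about specialized objects concrete, here is a minimal sketch that fits a linear model and inspects what comes back. The choice of the built-in cars dataset and the object name fit are illustrative assumptions, not part of the original example.

```r
# Fit a simple linear model on the built-in 'cars' dataset (illustrative choice)
fit = lm(dist ~ speed, data = cars)

class(fit)        # "lm": a specialized class
is.list(fit)      # TRUE: underneath, it is a list
names(fit)        # the named components stored inside the list
fit$coefficients  # a component can be pulled out like any list item
```

So even though fit prints in its own special way, it can be taken apart with the same tools used for ordinary lists.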
Vectors are essentially a one-dimensional set of elements. The elements can be numbers (numeric vectors), characters, etc.
Let’s try making a numeric vector using a function called c() (for ‘combine’):
v=c(4,3,5,3,2,3,1)
v
## [1] 4 3 5 3 2 3 1
Objects can also be text. Text objects are called character strings. In R, all text needs to be contained within quotes (single or double quotes are allowed). Otherwise, R will look for an object with that name.
We can combine multiple character strings into a vector. Each element can be a single letter, word, phrase, or entire sentences.
chars=c("a", "word", "or a phrase")
chars
## [1] "a" "word" "or a phrase"
If you try to combine letters and numbers into a single vector, it will turn into a character vector, with numbers treated as text:
numbersletters=c(1,2,3, "one", "two", "three")
numbersletters
## [1] "1" "2" "3" "one" "two" "three"
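The conversion described above can be checked with class(), and numbers stored as text can be converted back with as.numeric(). This is a small sketch; the object names mixed and digits are made up for illustration.

```r
mixed = c(1, 2, 3, "one")
class(mixed)            # "character": the numbers were converted to text

digits = c("1", "2", "3")
as.numeric(digits) + 1  # 2 3 4: back to numbers, so arithmetic works again

# Text that does not look like a number becomes NA (R also gives a warning,
# which is silenced here to keep the output clean)
suppressWarnings(as.numeric("one"))  # NA
```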
Factors are different from characters in that they have levels, i.e., a fixed set of allowed values. This will become a bit more important later when we start playing with dataframes.
factors=as.factor(numbersletters) #convert the vector above to factors
factors
## [1] 1 2 3 one two three
## Levels: 1 2 3 one three two
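Notice that the levels above are sorted alphabetically by default. If the order matters, you can set it yourself. The sketch below uses a hypothetical vector called sizes to show how to set the level order and how to see the integer codes that R stores behind each level.

```r
sizes = c("low", "high", "medium", "low")

# Specify the order of the levels explicitly instead of alphabetically
f = factor(sizes, levels = c("low", "medium", "high"))

levels(f)      # "low" "medium" "high"
as.integer(f)  # 1 3 2 1: the position of each value among the levels
table(f)       # counts per level, reported in level order
```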
Objects can also be logical objects, i.e., TRUE or FALSE. Note all capitals. This class can be really important and useful.
logic=c(TRUE, TRUE, FALSE, FALSE)
logic
## [1] TRUE TRUE FALSE FALSE
One cool thing to note is that we can convert logical objects into numerics by adding a number:
logic+0
## [1] 1 1 0 0
You can see that TRUE becomes 1 and FALSE becomes 0.
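Because TRUE counts as 1 and FALSE as 0, you can count or average logical values directly. Here is a short sketch using the numeric vector v from earlier; the name big is just an illustrative choice.

```r
v = c(4, 3, 5, 3, 2, 3, 1)

big = v > 3  # a comparison returns a logical vector
big          # TRUE FALSE TRUE FALSE FALSE FALSE FALSE

sum(big)     # 2: how many values are greater than 3
mean(big)    # 2/7, about 0.29: the proportion of values greater than 3
```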
You can measure various attributes of a vector. For example, let’s go back to the numeric vector v, find out how many numbers it contains, and add up all of the numbers. Try:
length(v)
## [1] 7
sum(v)
## [1] 21
From this, we can calculate the mean.
sum(v)/length(v)
## [1] 3
Of course, there is a pre-packaged function that calculates the mean of a vector, so this is simpler:
mean(v)
## [1] 3
Here are some more mathematical functions you can try out. Try typing these, and also try looking at the details of each function by typing ? followed by the function name (for example, ?mean):
function | meaning |
---|---|
max() | maximum value |
min() | minimum value |
sum() | sum |
mean() | average |
median() | median |
range() | returns vector of min and max values |
var() | sample variance |
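As a quick check, here is what several of these functions return for the vector v defined earlier (it is redefined here so the example runs on its own):

```r
v = c(4, 3, 5, 3, 2, 3, 1)

max(v)     # 5
min(v)     # 1
median(v)  # 3
range(v)   # 1 5
var(v)     # 10/6, about 1.67
```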
We can manipulate vectors as a whole. For example, let’s multiply every element of the vector by 10.
v*10
## [1] 40 30 50 30 20 30 10
For multi-element objects (i.e., anything that is a combination of numbers, letters, etc.), we can locate specific elements within objects using square brackets []. For example, we can ask what is the 6th number in the numeric vector v, or the second element in the character vector chars from above.
v[6]
## [1] 3
chars[2]
## [1] "word"
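Square brackets can do more than pick out a single element. The sketch below shows a few common variations on the same vector v: several positions at once, a range, dropping an element with a negative index, and selecting by a logical condition.

```r
v = c(4, 3, 5, 3, 2, 3, 1)

v[c(1, 3)]  # 4 5: the 1st and 3rd elements
v[2:4]      # 3 5 3: elements 2 through 4
v[-1]       # 3 5 3 2 3 1: everything except the 1st element
v[v > 3]    # 4 5: only the elements greater than 3
```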
Ok, now let’s try a matrix. This is a two-dimensional set of numbers, so when we create a matrix, we also need to specify the dimensions. Let’s demonstrate the difference between vectors and matrices:
1:9 #the colon creates a vector of consecutive integers
## [1] 1 2 3 4 5 6 7 8 9
vec=1:9
mat=matrix(1:9,nrow=3)
Now look at the objects vec and mat:
vec
## [1] 1 2 3 4 5 6 7 8 9
mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Note that R fills the matrix column by column, with the number series running from top to bottom. This is important to remember when you are creating matrices. You can make R construct matrices by rows (which is more intuitive to me) by:
mat2=matrix(1:9,nrow=3,byrow=TRUE)
mat2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Now, try a slight variation:
mat3=matrix(1:10,nrow=2,byrow=TRUE)
rownames(mat3)=c("row1","row2")
colnames(mat3)=c("A","B","C","D","E")
mat3
## A B C D E
## row1 1 2 3 4 5
## row2 6 7 8 9 10
You can see that matrices can be “rectangular”, and also that you can name the dimensions (rows & columns) of the matrix using rownames() and colnames().
Indexing in a matrix requires two values inside the square brackets: [row, column]. You can also use this to look at entire rows or columns. For example:
mat3[2,3] #what is the number in row 2, column 3?
## [1] 8
mat3[2,] #what are the values of row 2?
## A B C D E
## 6 7 8 9 10
mat3[,4] #what are the values of column 4?
## row1 row2
## 4 9
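Once rows and columns have names, you can also index with those names instead of numbers. A short sketch, rebuilding mat3 so that it runs on its own:

```r
mat3 = matrix(1:10, nrow = 2, byrow = TRUE)
rownames(mat3) = c("row1", "row2")
colnames(mat3) = c("A", "B", "C", "D", "E")

dim(mat3)            # 2 5: number of rows, then number of columns
mat3["row2", "C"]    # 8: same element as mat3[2,3]
mat3[, c("A", "E")]  # columns A and E only, as a 2 x 2 matrix
```

Indexing by name tends to be safer than indexing by position when the order of columns might change.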
You can conduct mathematical operations on matrices:
mat3*10 #multiply all values in mat3 by 10
## A B C D E
## row1 10 20 30 40 50
## row2 60 70 80 90 100
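Beyond element-by-element arithmetic, a few built-in functions work on whole rows or columns of a matrix. This sketch uses rowSums(), colSums() and t() (transpose) on the same numbers as mat3:

```r
mat3 = matrix(1:10, nrow = 2, byrow = TRUE)

rowSums(mat3)  # 15 40: the total of each row
colSums(mat3)  # 7 9 11 13 15: the total of each column
dim(t(mat3))   # 5 2: transposing swaps rows and columns
```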
Arrays
Technically, a matrix is simply a two-dimensional array (and a vector behaves much like a one-dimensional array). More generally, an array can have any number of dimensions. A three-dimensional array would be a stack of matrices, and a four-dimensional array would be yet another stack of those… Arrays can be very useful for fast computing, but they can also be very confusing, so I’m going to avoid the issue here. We may come back to the idea of three-dimensional arrays later in the course.
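Just to give a flavor of what a “stack of matrices” looks like, here is a minimal sketch of a three-dimensional array. The object name arr and its dimensions are arbitrary choices for illustration.

```r
# 24 numbers arranged as 2 rows x 3 columns x 4 "layers"
arr = array(1:24, dim = c(2, 3, 4))

dim(arr)      # 2 3 4
arr[, , 1]    # the first layer: a 2 x 3 matrix holding 1 through 6
arr[1, 2, 3]  # 15: row 1, column 2 of the third layer
```

Indexing works just like a matrix, with one extra position inside the square brackets for each extra dimension.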
For most cases, your data will be organized in the form of a dataframe. A dataframe is an object with rows and columns in which each row represents an observation (sometimes called a case), and each column is a measurement of a variable (sometimes called a field). Whereas all values in a matrix must be of the same type (typically numbers), each column in a dataframe can be of a different type: numeric, character, factor, or other formats (e.g., dates, logical variables such as TRUE and FALSE).
Let’s try creating a dataframe by combining a categorical variable (stored here as a character vector) and a numeric vector.
sex=c(rep("M",5), rep("F",5))
size=c(9,8,8,9,7,5,4,4,3,4)
dat=data.frame(sex, size)
dat
## sex size
## 1 M 9
## 2 M 8
## 3 M 8
## 4 M 9
## 5 M 7
## 6 F 5
## 7 F 4
## 8 F 4
## 9 F 3
## 10 F 4
Notice that the columns already have names. The data.frame function uses the object names as the default column names. However, you can also assign column names using arguments inside the function:
dat=data.frame(Sex=sex, Size=size) #Notice the capitalization
dat
## Sex Size
## 1 M 9
## 2 M 8
## 3 M 8
## 4 M 9
## 5 M 7
## 6 F 5
## 7 F 4
## 8 F 4
## 9 F 3
## 10 F 4
We can refer to each row or column in the dataframe using square brackets, just as with the other objects we have learned already.
dat[1,] #first row
## Sex Size
## 1 M 9
dat[,2] #second column
## [1] 9 8 8 9 7 5 4 4 3 4
You can also get the columns of the dataframe using the $ operator:
dat$Sex
## [1] "M" "M" "M" "M" "M" "F" "F" "F" "F" "F"
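Here, the output is a character vector, which is why each value appears in quotes. In R 4.0.0 and later, data.frame() keeps text columns as character by default. In earlier versions, text columns were converted to factors, and the output would also list the “levels” available in the column.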
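If you want a text column to be treated as a factor (a categorical variable with levels), you can ask for that explicitly. The sketch below shows two ways of doing it; dat2 is a hypothetical second copy of the dataframe used only for illustration.

```r
sex = c(rep("M", 5), rep("F", 5))
size = c(9, 8, 8, 9, 7, 5, 4, 4, 3, 4)

# Option 1: convert text columns to factors when creating the dataframe
dat = data.frame(Sex = sex, Size = size, stringsAsFactors = TRUE)
class(dat$Sex)   # "factor"
levels(dat$Sex)  # "F" "M"

# Option 2: convert a single column after the dataframe has been created
dat2 = data.frame(Sex = sex, Size = size)
dat2$Sex = as.factor(dat2$Sex)
class(dat2$Sex)  # "factor"
```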
You can find out the type of variable for each column using the function class():
class(dat$Sex)
## [1] "character"
class(dat$Size)
## [1] "numeric"
Two more useful functions: str() gives you the structure of the object, and summary() gives you some basic info on each column.
str(dat)
## 'data.frame': 10 obs. of 2 variables:
## $ Sex : chr "M" "M" "M" "M" ...
## $ Size: num 9 8 8 9 7 5 4 4 3 4
summary(dat)
## Sex Size
## Length:10 Min. :3.0
## Class :character 1st Qu.:4.0
## Mode :character Median :6.0
## Mean :6.1
## 3rd Qu.:8.0
## Max. :9.0
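A few other quick checks are handy alongside str() and summary(). The sketch below looks at the dimensions and column names of the dataframe, and then uses a logical condition inside the square brackets to keep only some of the rows. The name large is an illustrative choice.

```r
sex = c(rep("M", 5), rep("F", 5))
size = c(9, 8, 8, 9, 7, 5, 4, 4, 3, 4)
dat = data.frame(Sex = sex, Size = size)

dim(dat)    # 10 2: rows, then columns
names(dat)  # "Sex" "Size"

# Keep only the rows where Size is greater than 7
large = dat[dat$Size > 7, ]
nrow(large)  # 4
```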
The base R program comes with a bunch of datasets as part of the program. To load a specific data set, you simply use the function data(). For example, to load the data set called ‘iris’:
data("iris")
Now let’s look at this dataset. Here, I’m going to use the function head(), which will display only the first 6 lines of the dataset:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
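You can control how many lines head() shows, and there are a few companion functions for getting a quick sense of a dataset’s size. A small sketch using the same iris data:

```r
data("iris")

head(iris, 3)  # only the first 3 rows
tail(iris, 3)  # the last 3 rows
dim(iris)      # 150 5: 150 rows and 5 columns
names(iris)    # the column names
```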
Built-in datasets are often useful for learning how functions work. You will often see examples within help files make use of built-in data sets to demonstrate how something works. Some R packages also include their own built-in data sets for this same reason.
A List object is a powerful and flexible tool in R. Dataframes, matrices and arrays have many constraints: for example, each row must have the same number of columns. In contrast, you can combine any set of objects together into a list.
As an example, let’s create three vectors that are of different lengths
with different types of elements (number, logical, and character).
apples=c(1,2,3,4,5)
oranges=c(TRUE, FALSE)
grapes=c("grape", "Grape", "GRAPE")
We can try to combine these objects into a dataframe, but we won’t be able to because the vectors are different lengths:
data.frame(apples, oranges, grapes)
## Error in data.frame(apples, oranges, grapes): arguments imply differing number of rows: 5, 2, 3
However, we can combine these into a list:
mylist=list(apples, oranges, grapes)
mylist
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] TRUE FALSE
##
## [[3]]
## [1] "grape" "Grape" "GRAPE"
Lists are structured differently than other objects. In a list, each component or item is indexed using a double bracket [[]].
So the first item in the list (i.e., apples) is:
mylist[[1]]
## [1] 1 2 3 4 5
… and the second element within the third item (i.e., grapes) would be:
mylist[[3]][2]
## [1] "Grape"
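A common source of confusion is the difference between single and double brackets on a list. As a rough sketch: single brackets return a smaller list, while double brackets return the item itself. The list is rebuilt here so the example runs on its own.

```r
mylist = list(c(1, 2, 3, 4, 5), c(TRUE, FALSE), c("grape", "Grape", "GRAPE"))

class(mylist[1])     # "list": still a list, containing one item
class(mylist[[1]])   # "numeric": the vector stored in the first item

length(mylist[1])    # 1: one item in the smaller list
length(mylist[[1]])  # 5: five numbers in the vector
```

This is why mylist[[3]][2] works: the double bracket first pulls out the character vector, and the single bracket then picks its second element.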
You can name the items within a list when creating it, or afterwards:
#These do the same thing
mylist=list(apples=apples, oranges=oranges, grapes=grapes)
names(mylist)=c("apples", "oranges", "grapes")
mylist
## $apples
## [1] 1 2 3 4 5
##
## $oranges
## [1] TRUE FALSE
##
## $grapes
## [1] "grape" "Grape" "GRAPE"
Once you name the items in a list, you can use the $ operator to call a specific item:
mylist$grapes
## [1] "grape" "Grape" "GRAPE"
You can even combine different dataframes into a list. Let’s do this by loading several built-in data sets and then combining them into a list (output hidden):
data("iris")
data("trees")
data("Loblolly")
mydata=list(iris, trees, Loblolly)
mydata
Lists may not be intuitive to you yet, but you will see how convenient this type of object can be when we get around to more complex tasks such as batch processing and apply functions.
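As a small preview of that convenience, here is a sketch that applies one function to every dataframe in a list using sapply() and lapply(). Naming the list items is an optional choice made here so the output is labeled.

```r
# Built-in datasets, bundled into a named list
mydata = list(iris = iris, trees = trees, Loblolly = Loblolly)

# Apply nrow() to each item; the result is a named vector
sapply(mydata, nrow)   # iris 150, trees 31, Loblolly 84

# lapply() always returns a list, here holding each dataset's column names
lapply(mydata, names)
```

One line of code handles all three datasets, and it would work the same way with thirty.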