Data Structures
Last updated on 2024-11-12
Estimated time: 12 minutes
Overview
Questions
- What are the basic data types in R?
- How do I represent categorical information in R?
Objectives
After completing this episode, participants should be able to…
- Be aware of the different types of data.
- Begin exploring data frames, and understand how they are related to vectors, factors and lists.
- Ask R about the type, class, and structure of an object.
Vectors
So far we’ve looked at individual values. Now we will move to a data structure called a vector. Vectors are arrays of values of the same data type.
Data types
Data type refers to the type of information that is stored by a value. It can be:
- numeric (a number)
- integer (a number without information about decimal points)
- logical (a boolean: is a value TRUE or FALSE?)
- character (a text/string of characters)
- complex (a complex number)
- raw (raw bytes)
We won’t discuss the complex or raw data types in the workshop.
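To check which data type a value has, you can ask R directly. The sketch below uses the base R function class(); the values are arbitrary illustrations, and the L suffix is R's notation for an integer literal.

```r
# class() reports the data type of a value
class(2.5)          # "numeric"
class(2L)           # "integer" (the L suffix marks an integer literal)
class(TRUE)         # "logical"
class("Amsterdam")  # "character"

# A number typed without the L suffix is stored as numeric, not integer
class(2)            # "numeric"
```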
Data structures
Vectors are the most common and basic data structure in R but you will come across other data structures such as data frames, lists and matrices as well. In short:
- data frames are two-dimensional data structures in which columns are vectors of the same length that can have different data types. We will use this data structure in this lesson;
- lists can have an arbitrary structure and can mix data types;
- matrices are two-dimensional data structures containing elements of the same data type.
For a more detailed description, see Data Types and Structures.
Note that vector data in the geospatial context is different from vector data types. More about vector data in a later lesson!
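As a rough illustration of the three structures listed above, the sketch below builds one of each with base R functions. All object names and values (the cities, the visited column, and so on) are made up for illustration only.

```r
# A data frame: columns are equal-length vectors that may differ in type
cities_df <- data.frame(
  city = c("Amsterdam", "London", "Delft"),
  visited = c(TRUE, FALSE, TRUE)
)
str(cities_df)

# A list: elements can have different types and lengths
mixed_list <- list(name = "Delft", scores = c(2, 6, 3), capital = FALSE)
str(mixed_list)

# A matrix: two dimensions, a single data type
number_matrix <- matrix(1:6, nrow = 2, ncol = 3)
number_matrix
```

Calling str() on each object prints a compact summary of its structure, which is a quick way to see how the three differ.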
You can create a vector with the c() function.
R
# vector of numbers - numeric data type.
numeric_vector <- c(2, 6, 3)
numeric_vector
OUTPUT
[1] 2 6 3
R
# vector of words - or strings of characters- character data type
character_vector <- c('Amsterdam', 'London', 'Delft')
character_vector
OUTPUT
[1] "Amsterdam" "London" "Delft"
R
# vector of logical values (is something true or false?)- logical data type.
logical_vector <- c(TRUE, FALSE, TRUE)
logical_vector
OUTPUT
[1] TRUE FALSE TRUE
Combining vectors
The combine function, c(), will also append things to an existing vector:
R
ab_vector <- c('a', 'b')
ab_vector
OUTPUT
[1] "a" "b"
R
abcd_vector <- c(ab_vector, 'c', 'd')
abcd_vector
OUTPUT
[1] "a" "b" "c" "d"
Challenge: combining vectors
Combine the abcd_vector with the numeric_vector in R. What is the data type of this new vector and why?
R
combined_vector <- c(abcd_vector, numeric_vector)
combined_vector
The combined vector is a character vector. Because vectors can only hold one data type and abcd_vector cannot be interpreted as numbers, the numbers in numeric_vector are coerced into characters.
Missing values
A common operation you want to perform is to remove all the missing values (in R denoted as NA). Let’s have a look at how to do it:
R
with_na <- c(1, 2, 1, 1, NA, 3, NA ) # vector including missing value
First, let’s try to calculate the mean of the values in this vector:
R
mean(with_na) # mean() function cannot interpret the missing values
OUTPUT
[1] NA
R
# You can add the argument na.rm = TRUE to calculate the result while
# ignoring the missing values.
mean(with_na, na.rm = TRUE)
OUTPUT
[1] 1.6
However, sometimes you would like to have the NA values permanently removed from your vector. For this you need to identify which elements of the vector hold missing values with the is.na() function.
R
is.na(with_na) # returns a logical vector: TRUE where an element is a missing value
OUTPUT
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
R
!is.na(with_na) # the ! operator means negation, i.e. not is.na(with_na)
OUTPUT
[1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE
We know which elements in the vector are NA. Now we need to retrieve the subset of the with_na vector that is not NA. Subsetting in R is done with square brackets [ ].
R
without_na <- with_na[ !is.na(with_na) ] # this notation will return only
# the elements that have TRUE on their respective positions
without_na
OUTPUT
[1] 1 2 1 1 3
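The logical vector returned by is.na() can also answer related questions, such as how many values are missing and where they sit. The sketch below uses only base R functions; na.omit() is shown as an alternative shortcut, not as the approach this lesson relies on.

```r
with_na <- c(1, 2, 1, 1, NA, 3, NA)

# TRUE counts as 1, so summing the logical vector counts the missing values
sum(is.na(with_na))    # 2

# which() returns the positions of the TRUE elements
which(is.na(with_na))  # 5 7

# na.omit() is a base R shortcut for dropping missing values;
# as.vector() strips the bookkeeping attributes it attaches
as.vector(na.omit(with_na))  # 1 2 1 1 3
```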
Factors
Another important data structure is called a factor. Factors look like character data, but are used to represent categorical information.
Factors create a structured relation between the different levels (values) of a categorical variable, such as days of the week or responses to a question in a survey. While factors look (and often behave) like character vectors, they are actually stored as integer codes by R, which is useful for computing summary statistics about their distribution, running regression analysis, etc. So you need to be very careful when treating them as strings.
Create factors
Once created, factors can only contain a pre-defined set of values, known as levels.
R
nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
nordic_str # regular character vectors printed out
OUTPUT
[1] "Norway" "Sweden" "Norway" "Denmark" "Sweden"
R
# factor() function converts a vector to factor data type
nordic_cat <- factor(nordic_str)
nordic_cat # With factors, R prints out additional information - 'Levels'
OUTPUT
[1] Norway Sweden Norway Denmark Sweden
Levels: Denmark Norway Sweden
Inspect factors
R will treat each unique value from a factor vector as a level and (silently) assign numerical values to it. This can come in handy when performing statistical analysis. You can inspect and adapt levels of the factor.
R
levels(nordic_cat) # returns all levels of a factor vector.
OUTPUT
[1] "Denmark" "Norway" "Sweden"
R
nlevels(nordic_cat) # returns the number of levels of a factor
OUTPUT
[1] 3
Reorder levels
Note that R sorts the levels in alphabetical order, not in the order of occurrence in the vector. R assigns a value of:
- 1 to level ‘Denmark’,
- 2 to ‘Norway’,
- 3 to ‘Sweden’.
This is important as it can affect e.g. the order in which categories are displayed in a plot or which category is taken as a baseline in a statistical model.
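The mapping between levels and integer codes can be inspected with as.integer(). The sketch below also shows a common pitfall when a factor holds digits; the year_cat values are hypothetical and only serve to illustrate it.

```r
nordic_cat <- factor(c("Norway", "Sweden", "Norway", "Denmark", "Sweden"))

# Integer codes stored behind the labels (Denmark = 1, Norway = 2, Sweden = 3)
as.integer(nordic_cat)    # 2 3 2 1 3

# Convert back to plain text before doing string operations
as.character(nordic_cat)

# Pitfall: a factor built from digits converts to its codes, not the digits
year_cat <- factor(c("2020", "2010", "2020"))   # hypothetical values
as.integer(year_cat)                  # 2 1 2
as.integer(as.character(year_cat))    # 2020 2010 2020
```

Converting through as.character() first is the usual way to recover the original numbers from such a factor.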
You can reorder the categories using the factor() function. This can be useful, for instance, to select a reference category (first level) in a regression model or for ordering legend items in a plot, rather than using the default category systematically (i.e. based on alphabetical order).
R
nordic_cat <- factor(
  nordic_cat,
  levels = c(
    "Norway",
    "Denmark",
    "Sweden"
  )
)
# now Norway will be the first category, Denmark second and Sweden third
nordic_cat
OUTPUT
[1] Norway Sweden Norway Denmark Sweden
Levels: Norway Denmark Sweden
Callout
There is more than one way to reorder factors. Later in the lesson, we will use the fct_relevel() function from the forcats package to do the reordering.
R
library(forcats)
nordic_cat <- fct_relevel(
  nordic_cat,
  "Norway",
  "Denmark",
  "Sweden"
) # With this, Norway will be the first category,
# Denmark second and Sweden third
nordic_cat
OUTPUT
[1] Norway Sweden Norway Denmark Sweden
Levels: Norway Denmark Sweden
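If only the reference (first) level needs to change, base R also offers relevel(), from the stats package that is loaded by default. This is a minimal sketch of an alternative, not the method used in this lesson; nordic_ref is a name chosen here for illustration.

```r
nordic_cat <- factor(c("Norway", "Sweden", "Norway", "Denmark", "Sweden"))
levels(nordic_cat)                 # "Denmark" "Norway" "Sweden"

# relevel() moves the chosen reference level to the front
# and keeps the remaining levels in their existing order
nordic_ref <- relevel(nordic_cat, ref = "Sweden")
levels(nordic_ref)                 # "Sweden" "Denmark" "Norway"
```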
You can also inspect vectors with the str() function. In factor vectors, it shows the underlying values of each category. You can also see the structure in the environment tab of RStudio.
R
str(nordic_cat)
OUTPUT
Factor w/ 3 levels "Norway","Denmark",..: 1 3 1 2 3
Note of caution
Remember that once created, factors can only contain a pre-defined set of values, known as levels. It means that whenever you try to add something to the factor outside of this set, it will become an unknown/missing value, denoted by R as NA.
R
nordic_str
OUTPUT
[1] "Norway" "Sweden" "Norway" "Denmark" "Sweden"
R
nordic_cat2 <- factor(
  nordic_str,
  levels = c("Norway", "Denmark")
)
# because we did not include Sweden in the list of
# factor levels, it has become NA.
nordic_cat2
OUTPUT
[1] Norway <NA> Norway Denmark <NA>
Levels: Norway Denmark
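The same thing happens when you assign a new value to an existing factor. The sketch below shows the resulting NA and one possible way around it, extending the levels first. "Finland" and the object names attempt and extended are illustrative choices, not part of the lesson data.

```r
nordic_cat <- factor(c("Norway", "Sweden", "Norway", "Denmark", "Sweden"))

# Assigning a value that is not a level yields NA (R also emits a warning)
attempt <- nordic_cat
attempt[1] <- "Finland"   # warning: invalid factor level, NA generated
is.na(attempt[1])         # TRUE

# Extend the set of levels first, then assign
extended <- nordic_cat
levels(extended) <- c(levels(extended), "Finland")
extended[1] <- "Finland"
levels(extended)          # "Denmark" "Norway" "Sweden" "Finland"
```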
Key Points
- The most commonly used basic data types in R are numeric, integer, logical, and character.
- Use factors to represent categories in R.