R introduction: R fundamentals
Overview
Teaching: min
Exercises: min
Questions
What are the basic features of R?
Objectives
Understand basic R commands relating to simple arithmetic
Be able to create and manipulate data structures in R, such as vectors
Use built-in functions and access their corresponding help file
Use R functions to generate data
Understand the basic R data types
Be able to sub-set data
Simple arithmetic
R is basically a very fancy calculator. You can use arithmetic operators as you would expect, such as,
Addition
3+4
[1] 7
Exponentiation
2^3
[1] 8
Use of brackets
(5+5) / 5
[1] 2
Comparisons
Use greater than/less than symbols such as,
3 > 4
[1] FALSE
and,
5 == (2.5 * 2)
[1] TRUE
Here == means ‘equal to’. The above command asks whether 5 is equal to 2.5 * 2, which evaluates to TRUE.
Assigning variables
We can assign values to named variables and then use them,
sample1 = 6
sample2 = 4
sample1 + sample2
[1] 10
Here we’ve added 6 and 4 via the assigned variables.
Note that R is case sensitive. For example, ‘sample’ is not the same as ‘Sample’. Also keep in mind that variables should be named in a sensible and clear way in order to enhance reproducibility. For example, if you return to your code after a long break and see a variable called ‘b’, you’ll have some work to do to figure out what you meant. Life will be easier if you call it, for example, ‘biomarker_CRP_mgml’.
You can also add comments to your R code using the hash symbol #. Any text after this symbol will be ignored.
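As a minimal sketch of commenting, the snippet below annotates a variable assignment. The value 4.2 is a made-up measurement used purely for illustration.

```r
# A comment on its own line: R skips everything after the hash
biomarker_CRP_mgml = 4.2   # hypothetical value, for illustration only
doubled = biomarker_CRP_mgml * 2   # comments can also follow code on the same line
doubled
```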
We can also assign words to variables,
patient_name = "Jim"
If we then call that variable, the data we’ve stored will be printed to the screen,
patient_name
[1] "Jim"
If we try to mix types, we’ll get an error,
patient_name + sample1
Error in patient_name + sample1: non-numeric argument to binary operator
Why? Because we’re trying to add a number to a word!
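One way to avoid this kind of error is to check what a variable holds before using it in arithmetic. The sketch below uses the built-in class() and is.numeric() functions on the two variables defined above.

```r
patient_name = "Jim"
sample1 = 6

class(patient_name)        # "character"
class(sample1)             # "numeric"
is.numeric(patient_name)   # FALSE, so it cannot be used in arithmetic
is.numeric(sample1)        # TRUE
```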
Vectors
Vector is just a fancy word for a collection of items. For example,
patients_ages = c(12,5,6,3,7,13)
patients_names = c('Jim', 'John', 'Brian', 'Susan', 'Keith', 'Geoff')
Note that we string them together with the letter c, which stands for combine.
Once we have a vector, we can pick out certain elements. For example,
patients_names[3]
[1] "Brian"
Use a colon to indicate items between two indices,
patients_names[3:5]
[1] "Brian" "Susan" "Keith"
Remember using == above? Let’s try it here,
patients_names == 'Jim'
[1] TRUE FALSE FALSE FALSE FALSE FALSE
We’ve just said, ‘do the elements of our vector, patients_names, equal Jim?’
We can do something similar with the numeric vector,
patients_ages >= 10
[1] TRUE FALSE FALSE FALSE FALSE TRUE
We’ve just said ‘are the elements of our vector, patients_ages, greater than or equal to 10?’
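The TRUE/FALSE results of a comparison can themselves be passed to other functions. As a short sketch, sum() counts the TRUE values (TRUE is treated as 1 and FALSE as 0) and which() returns their positions. The variable name is_ten_or_over is introduced here only for illustration.

```r
patients_ages = c(12, 5, 6, 3, 7, 13)

is_ten_or_over = patients_ages >= 10   # logical vector, one value per patient
sum(is_ten_or_over)                    # 2 patients are aged 10 or over
which(is_ten_or_over)                  # they are at positions 1 and 6
```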
Built-in Functions
R comes with a huge number of built-in functions, covering a wide range of data analysis and statistical tasks.
Exercise: Built-in functions
Create the following vector and then investigate the following functions: round(), sqrt(), median(), max(), sort()
numbers = c(2.34, 4.53, 1.42, 1.42, 9.41, 8.54)
Solution
- round(): Rounds the numbers (to whole numbers by default)
- sqrt(): Takes the square root of each number
- median(): Returns the median
- max(): Returns the maximum
- sort(): Sorts into ascending order
Getting help
The easiest way to get help is to write the name of the function you’re interested in, preceded by a question mark. For example,
?plot
This brings up the help file for the function (a function is a package of code that performs a particular task). Beyond this, there are many websites with guides and user forums with guidance. Often, just searching on Google for the problem you’re trying to solve (e.g. ‘How do I plot a histogram in R?’) or searching for the error message (e.g. ‘non-numeric argument to binary operator’) will lead you to the answer.
There are also a number of websites that may be of use, including,
The Comprehensive R Archive Network
Exercise: Getting help
Type and run the following. What happens?
mean(numbers)
Solution
[1] 4.61
We see the mean value of the 'numbers' vector.
Exercise: Built-in functions
Now type and run the following. What happens this time?
numbers = c(1,2,3,4,NA)
mean(numbers)
Solution
[1] NA
We get an 'NA' instead of a mean value. By default, mean() returns NA whenever the vector contains any NA values.
Exercise: Built-in functions
Type the following and see if you can figure out how to amend the code in the above exercise to give you a sensible answer. Note, you may notice RStudio’s auto-complete feature when typing the answer.
?mean
Solution
mean(numbers, na.rm = TRUE)
We're telling the mean function to remove the NA values before the calculation using the 'na.rm' argument.
The above functions are useful for existing numerical data. Other functions are useful for non-numerical data.
Exercise: Built-in functions
Create the following vector, and then investigate the following functions: sort(), order(), table()
terms = c('Serum', 'Blood', 'Plasma', 'Blood', 'Blood', 'Plasma')
Solution
- sort(): Sorts into alphabetical order
- order(): Returns the positions (indices) of the elements in the order that would sort the vector alphabetically
- table(): Gives a table containing how many of each item are in the vector
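The difference between sort() and order() is easiest to see side by side. The sketch below shows that indexing the vector by the result of order() gives the same result as sort(). The variable name sorted_via_order is introduced only for illustration.

```r
terms = c('Serum', 'Blood', 'Plasma', 'Blood', 'Blood', 'Plasma')

sort(terms)     # the values themselves, in alphabetical order
order(terms)    # the positions of those values in the original vector

# Using the positions from order() as an index reproduces sort()
sorted_via_order = terms[order(terms)]
identical(sorted_via_order, sort(terms))   # TRUE

table(terms)    # counts: Blood 3, Plasma 2, Serum 1
```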
There are also plenty of functions for creating data. For example, rnorm() will give you random numbers from the normal distribution.
Exercise: rnorm()
Investigate rnorm() from the help files and create 100 data-points with a mean of 10 and standard deviation of 2. Run your code several times. What do you notice?
?rnorm
Solution
rnorm(n = 100, mean = 10, sd = 2)
The numbers change each time you run the code. This is because R is generating a new set of random data-points each time.
Exercise: Setting the seed
Type and run the following above your code. Now, run both lines together several times. What do you notice?
set.seed(123)
Solution
The output from rnorm() is the same each time. The set.seed() function forces R to choose the same 'random' data-points each time. This is essential if you want your results to be reproducible.
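A quick way to confirm this behaviour is to draw the same numbers twice and compare them. This is a minimal sketch, and the names first_draw and second_draw are introduced only for illustration.

```r
set.seed(123)
first_draw = rnorm(n = 3, mean = 10, sd = 2)

set.seed(123)                  # resetting the seed restarts the random sequence
second_draw = rnorm(n = 3, mean = 10, sd = 2)

identical(first_draw, second_draw)   # TRUE: both draws contain the same values
```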
Other useful functions are seq() and rep(), which generate a sequence of numbers and repeated numbers (or words), respectively.
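As a brief sketch of both functions, the example below builds a sequence with a fractional step and repeats values in two different ways. The variable names are chosen only for illustration.

```r
quarter_steps = seq(from = 0, to = 1, by = 0.25)
quarter_steps    # 0.00 0.25 0.50 0.75 1.00

# 'times' repeats the whole vector; 'each' repeats every element in turn
repeated_whole = rep(c('Serum', 'Plasma'), times = 3)
repeated_each = rep(c(1, 2), each = 2)
repeated_whole   # "Serum" "Plasma" "Serum" "Plasma" "Serum" "Plasma"
repeated_each    # 1 1 2 2
```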
Exercise: seq()
Create a vector of numbers from 1 to 10 using the seq() function
Solution
numbers = seq(from = 1, to = 10, by = 1) or, more briefly, numbers = seq(1, 10)
Exercise: sqrt()
Create a second vector of the square root of the first vector
Solution
numbers_sqrt = sqrt(numbers)
Exercise: sample()
Pick 4 random values from the second vector using the ‘sample()’ function
Solution
sample(numbers_sqrt, size = 4)
Data types and structures
Data types in R mainly consist of the following,
- numeric
- integer
- character
- logical
The type of data you’re dealing with will limit the sort of things that you can do with it, so it’s important to know what you have. There are several ways to check, one of which is ‘typeof()’. We’ve already encountered each of these data types to some extent. The ‘logical’ data type is extremely important when dealing with missing values, and a very useful function when checking for missing values is is.na(). For example,
na_example = c(2, 5, 7, 7, NA, 3, 10, NA, 9, 2)
is.na(na_example)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
This tells you where the missing values are, which you can then use in your code (as we’ll see later).
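To illustrate typeof() on each of the data types listed above, here is a minimal sketch with made-up values. Note that typeof() reports ordinary numeric values as "double", and that whole numbers are only stored as integers when written with an L suffix.

```r
typeof(3.5)      # "double" (the storage type of ordinary numeric values)
typeof(3L)       # "integer"
typeof("Jim")    # "character"
typeof(TRUE)     # "logical"

# is.na() returns a logical vector
typeof(is.na(c(2, NA)))   # "logical"
```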
Exercise: Counting missing values
Using one of the above functions, how could you check how many missing values you have?
Solution
table(na_example, useNA = 'always')
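An alternative sketch, not part of the solution above, is to add up the logical values returned by is.na(). Because TRUE counts as 1 and FALSE as 0, the sum is the number of missing values. The name n_missing is introduced only for illustration.

```r
na_example = c(2, 5, 7, 7, NA, 3, 10, NA, 9, 2)

n_missing = sum(is.na(na_example))   # each TRUE contributes 1 to the total
n_missing                            # 2
```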
Data structures are collections of the above, and we’ve already seen one of these, too (vectors). The other main ones that you’re likely to encounter are,
- list
- matrix
- data-frame
Lists are like vectors, but they can contain a mixture of data types. Matrices are 2-dimensional structures used in various areas of science and statistics. The data structure that most resembles a spreadsheet is the data-frame, which we’ll see in the next section on loading data.
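As a small sketch of the first two structures, the example below creates a list holding mixed types and a matrix of whole numbers. The names patient_record and small_matrix, and the values in them, are hypothetical.

```r
# A list can hold a character, a number and a logical value together
patient_record = list(name = "Jim", age = 12, fasted = TRUE)
patient_record$age     # 12

# A matrix is filled column by column by default
small_matrix = matrix(1:6, nrow = 2, ncol = 3)
dim(small_matrix)      # 2 rows, 3 columns
small_matrix[2, 3]     # 6: the element in row 2, column 3
```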
Sub-setting
Whatever your data type, often you’ll want to subset the data in order to use certain parts. For example, you may want the first element of a vector, or the last 3 elements, or the first column from a matrix. You may even wish to subset a character vector and pick out, say, the first letter.
Subsetting typically works by indicating the position or index of the data that you want.
Exercise: Subsetting vectors
Preceded by set.seed(123), create the following vector and type ‘numbers[3]’. What do you get? Now try ‘numbers[3:5]’. Now what do you get?
numbers = rnorm(n=5)
Solution
[1] 1.558708
[1] 1.55870831 0.07050839 0.12928774
You have picked out the 3rd value, and then the 3rd to 5th values, respectively.
Exercise: Subsetting matrices by column
Run the two following lines of code. What happens?
matrix_example = matrix(rnorm(n=50), nrow = 10, ncol = 5)
matrix_example[,c(1:2)]
Solution
            [,1]       [,2]
 [1,]  1.7150650  1.7869131
 [2,]  0.4609162  0.4978505
 [3,] -1.2650612 -1.9666172
 [4,] -0.6868529  0.7013559
 [5,] -0.4456620 -0.4727914
 [6,]  1.2240818 -1.0678237
 [7,]  0.3598138 -0.2179749
 [8,]  0.4007715 -1.0260044
 [9,]  0.1106827 -0.7288912
[10,] -0.5558411 -0.6250393
All rows of the first two columns are returned.
Exercise: Subsetting matrices by row
Now type the following. What happens?
matrix_example[c(1:2),]
Solution
          [,1]      [,2]      [,3]      [,4]       [,5]
[1,] 1.7150650 1.7869131 -1.686693 0.6886403 -1.1231086
[2,] 0.4609162 0.4978505  0.837787 0.5539177 -0.4028848
All columns of the first two rows are returned.
Exercise: Subsetting strings
Run the two following lines of code. What happens?
patient = 'Mrs. C. Ode'
substr(patient, 1,3)
Solution
[1] "Mrs"
Subsetting works not only by specifying index values; it can also be based on logical (or Boolean) values, effectively picking out the rows, columns or cells where the value is TRUE. Earlier we created a vector called na_example. Let’s see how you could impute the missing values using this subsetting idea,
imputed = mean(na_example, na.rm = TRUE) #Determine the imputed value based upon the mean values
na_boolean = is.na(na_example) #Find the positions where 'na' occurs
na_example[na_boolean] = imputed #Set the na values to the imputed mean value
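To check the result, the sketch below repeats the three steps on a fresh copy of the vector and then inspects it. The mean of the eight non-missing values is 45 / 8 = 5.625, so that is the value written into the two NA positions.

```r
na_example = c(2, 5, 7, 7, NA, 3, 10, NA, 9, 2)

imputed = mean(na_example, na.rm = TRUE)   # 5.625
na_boolean = is.na(na_example)             # TRUE at positions 5 and 8
na_example[na_boolean] = imputed           # overwrite only the TRUE positions

na_example          # 2 5 7 7 5.625 3 10 5.625 9 2
anyNA(na_example)   # FALSE: no missing values remain
```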
Key Points
Base R features and techniques