This lesson is in the early stages of development (Alpha version)

R introduction: R fundamentals

Overview

Teaching: min
Exercises: min
Questions
  • What are the basic features of R?

Objectives
  • Understand basic R commands relating to simple arithmetic

  • Be able to create and manipulate data structures in R, such as vectors

  • Use built-in functions and access their corresponding help file

  • Use R functions to generate data

  • Understand the basic R data types

  • Be able to sub-set data

Simple arithmetic

R is basically a very fancy calculator. You can use arithmetic operators as you would expect, such as,

Addition

3+4
[1] 7

Exponentiation

2^3
[1] 8

Use of brackets

(5+5) / 5
[1] 2

Comparisons

Use greater than/less than symbols such as,

3 > 4
[1] FALSE

and,

5 == (2.5 * 2)
[1] TRUE

Here == means ‘equal to’. The command above asks whether 5 is equal to 2.5 * 2 (that is, 5), which evaluates to TRUE.
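A short sketch of a caveat worth knowing when using == with decimal numbers. Computers store most decimal fractions only approximately, so an exact comparison can give a surprising answer. The variable names exact_match and close_match are illustrative only.

```r
# Exact comparison of decimal fractions can be surprising
exact_match = (0.1 + 0.2 == 0.3)
exact_match                                      # FALSE, due to floating-point rounding

# all.equal() compares within a small tolerance instead
close_match = isTRUE(all.equal(0.1 + 0.2, 0.3))
close_match                                      # TRUE

# Other comparison operators
3 != 4                                           # TRUE ('not equal to')
3 <= 4                                           # TRUE ('less than or equal to')
```

For whole numbers, such as the example above, == behaves as you would expect.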

Assigning variables

We can assign values to named variables and subsequently use them,

sample1 = 6
sample2 = 4
sample1 + sample2
[1] 10

Here we’ve added 6 and 4 via the assigned variables.

Note that R is case sensitive. For example, ‘sample’ is not the same as ‘Sample’. Also keep in mind that variables should be named in a sensible and clear way in order to enhance reproducibility. For example, if you return to your code after a long break, and see a variable called ‘b’, you’ll have some work to do to figure out what you meant. Life will be easier if you call it, for example, ‘biomarker_CRP_mgml’.

You can also add comments to your R code using the hash symbol #. Any text after this symbol will be ignored.
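As a minimal illustration of comments, the snippet below uses a made-up variable, dose_mg, which is not part of the lesson data.

```r
# This whole line is a comment and is ignored by R
dose_mg = 50      # A comment can also follow code on the same line
dose_mg * 2       # Only the code before the hash symbol is run
```

The last line prints 100, because the text after each hash symbol plays no part in the calculation.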

We can also assign words to variables,

patient_name = "Jim"

If we then call that variable, the data we’ve stored will be printed to the screen,

patient_name
[1] "Jim"

If we try to mix types, we’ll get an error,

patient_name + sample1
Error in patient_name + sample1: non-numeric argument to binary operator

Why? Because we’re trying to add a number to a word!
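If the aim is to join a word and a number into one piece of text, rather than to add them, one option is the built-in paste() function. This is a sketch of that alternative and not a fix for the arithmetic above. The variable name label is illustrative.

```r
patient_name = "Jim"
sample1 = 6

# paste() converts its arguments to text and joins them with a space
label = paste(patient_name, sample1)
label
```

This prints "Jim 6", which is a piece of text, not a number.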

Vectors

Vector is just a fancy word for a collection of items. For example,

patients_ages = c(12,5,6,3,7,13)
patients_names = c('Jim', 'John', 'Brian', 'Susan', 'Keith', 'Geoff')

Note that we string them together with the function c(), which stands for combine.

Once we have a vector, we can pick out certain elements. For example,

patients_names[3]
[1] "Brian"

Use a colon to indicate items between two indices,

patients_names[3:5]
[1] "Brian" "Susan" "Keith"

Remember using == above? Let’s try it here,

patients_names == 'Jim'
[1]  TRUE FALSE FALSE FALSE FALSE FALSE

We’ve just said, ‘do the elements of our vector, patients_names, equal Jim?’

We can do something similar with the numeric vector,

patients_ages >= 10
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE

We’ve just said ‘are the elements of our vector, patients_ages, greater than or equal to 10?’
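The TRUE/FALSE results of a comparison can be used in further calculations. The sketch below uses the built-in functions sum() and which(). The variable name is_ten_or_over is an illustrative choice.

```r
patients_ages = c(12,5,6,3,7,13)

# Store the result of the comparison
is_ten_or_over = patients_ages >= 10

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is_ten_or_over)      # 2

# which() returns the positions of the TRUE values
which(is_ten_or_over)    # 1 6
```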

Built-in Functions

R comes with a huge number of built-in functions, covering a wide range of data analysis and statistical tasks.

Exercise: Built-in functions

Create the following vector and then investigate the following functions: round(), sqrt(), median(), max(), sort()

numbers = c(2.34, 4.53, 1.42, 1.42, 9.41, 8.54)

Solution

round(): Rounds the numbers, sqrt(): Takes the square root, median(): Returns the median, max(): Returns the max, sort(): Sorts into ascending order
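For reference, the calls from this exercise might look like the following, with all arguments left at their defaults.

```r
numbers = c(2.34, 4.53, 1.42, 1.42, 9.41, 8.54)

round(numbers)     # 2 5 1 1 9 9 (rounds to whole numbers by default)
sqrt(numbers)      # the square root of each element
median(numbers)    # 3.435
max(numbers)       # 9.41
sort(numbers)      # 1.42 1.42 2.34 4.53 8.54 9.41
```

Note that round() and sqrt() return one value per element, whereas median() and max() summarise the whole vector as a single value.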

Getting help

The easiest way to get help is to write the name of the function you’re interested in, preceded by a question mark. For example,

?plot

This brings up the help file for the function (a function is a package of code that performs a particular task). Beyond this, there are many websites with guides and user forums with guidance. Often, just searching on Google for the problem you’re trying to solve (e.g. ‘How do I plot a histogram in R?’) or searching for the error message (e.g. ‘non-numeric argument to binary operator’) will lead you to the answer.

There are also a number of websites that may be of use, including,

The Comprehensive R Archive Network

Stack Overflow

Stat Methods - Quick R

Exercise: Getting help

Type and run the following. What happens?

mean(numbers)

Solution

[1] 4.61
We see the mean value of the 'numbers' vector

Exercise: Built-in functions

Now type and run the following. What happens this time?

numbers = c(1,2,3,4,NA)
mean(numbers)

Solution

[1] NA
We get an 'NA' instead of a mean value. By default, mean() returns NA whenever the vector contains any NA values

Exercise: Built-in functions

Type the following and see if you can figure out how to amend the code in the above exercise to give you a sensible answer. Note, you may notice RStudio’s auto-complete feature when typing the answer.

?mean

Solution

mean(numbers, na.rm = TRUE)
We're telling the mean function to remove the NA values in the calculation using the 'na.rm' parameter

The above functions are useful for working with existing numerical data. There are also functions that are useful for non-numerical data.

Exercise: Built-in functions

Create the following vector, and then investigate the following functions: sort(), order(), table()

terms = c('Serum', 'Blood', 'Plasma', 'Blood', 'Blood', 'Plasma')

Solution

sort(): Sorts into alphabetical order, order(): Returns the positions (indices) of the elements in the order that would sort the vector alphabetically, table(): Gives a table containing how many of each item are in the vector
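The difference between sort() and order() is easiest to see side by side. The sketch below uses the same 'terms' vector. The variable name term_counts is illustrative.

```r
terms = c('Serum', 'Blood', 'Plasma', 'Blood', 'Blood', 'Plasma')

sort(terms)             # "Blood" "Blood" "Blood" "Plasma" "Plasma" "Serum"
order(terms)            # 2 4 5 3 6 1 (positions, not values)

# Using the positions from order() to subset gives the same result as sort()
terms[order(terms)]

# table() counts how many times each item appears
term_counts = table(terms)
term_counts             # Blood 3, Plasma 2, Serum 1
```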

There are also plenty of functions for creating data. For example, rnorm() will give you random numbers from the normal distribution.

Exercise: rnorm()

Investigate rnorm() from the help files and create 100 data-points with a mean of 10 and standard deviation of 2. Run your code several times. What do you notice?

?rnorm()

Solution

rnorm(n = 100, mean = 10, sd = 2)
The numbers change each time you run the code. This is because R is generating a new set of random data-points each time

Exercise: Setting the seed

Type and run the following above your code. Now, run both lines together several times. What do you notice?

set.seed(123)

Solution

The output from rnorm() is the same each time. This 'set seed' function forces R to choose the same 'random' data-points each time. This is essential if you want to ensure your outputs are the same each time
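One way to confirm this behaviour, sketched here with the illustrative variable names first_run and second_run, is to store two runs and compare them with the built-in identical() function.

```r
# Same seed before each call, so the 'random' numbers match
set.seed(123)
first_run = rnorm(n = 100, mean = 10, sd = 2)

set.seed(123)
second_run = rnorm(n = 100, mean = 10, sd = 2)

identical(first_run, second_run)    # TRUE

# Without resetting the seed, the next call gives different numbers
third_run = rnorm(n = 100, mean = 10, sd = 2)
identical(first_run, third_run)     # FALSE
```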

Other useful functions are seq() and rep(), which generate a sequence of numbers and repeated numbers (or words), respectively.

Exercise: seq()

Create a vector of numbers from 1 to 10 using the seq() function

Solution

numbers = seq(from = 1, to = 10, by = 1) or, more briefly, numbers = seq(1, 10)
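A few further illustrations of seq() and rep(), using their standard arguments (by, times and each). The values chosen are arbitrary examples.

```r
# A sequence with a step size other than 1
seq(from = 0, to = 1, by = 0.25)      # 0.00 0.25 0.50 0.75 1.00

# Repeat a word three times
rep('Blood', times = 3)               # "Blood" "Blood" "Blood"

# Repeat a whole vector twice
rep(c(1, 2), times = 2)               # 1 2 1 2

# Repeat each element twice
rep(c(1, 2), each = 2)                # 1 1 2 2
```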

Exercise: sqrt()

Create a second vector of the square root of the first vector

Solution

numbers_sqrt = sqrt(numbers)

Exercise: sample()

Pick 4 random values from the second vector using the ‘sample()’ function

Solution

sample(numbers_sqrt, size = 4)

Data types and structures

Data types in R mainly consist of the following,

The type of data you’re dealing with will limit the sort of things that you can do with them, so it’s important to know what you have. There are several ways to check, one of which is ‘typeof()’. We’ve already encountered each of these data types to some extent. The ‘logical’ data type is extremely important when dealing with missing values, and a very useful function when checking for missing values is is.na(). For example,

na_example = c(2, 5, 7, 7, NA, 3, 10, NA, 9, 2)
is.na(na_example)
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

This tells you where the missing values are, which you can then use in your code (as we’ll see later).
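As a quick sketch of checking types with typeof(), the snippet below applies it to the kinds of values used earlier in this lesson.

```r
typeof(6)              # "double" (the default type for numbers)
typeof(3:5)            # "integer" (whole numbers created with a colon)
typeof("Jim")          # "character"
typeof(3 > 4)          # "logical"

# is.na() returns a logical vector
typeof(is.na(c(2, NA)))    # "logical"
```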

Exercise: Counting missing values

Using one of the above functions, how could you check how many missing values you have?

Solution

table(na_example, useNA = 'always')
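An alternative sketch, not part of the original solution, combines is.na() with sum(). Because TRUE counts as 1, the sum is the number of missing values. The variable name n_missing is illustrative.

```r
na_example = c(2, 5, 7, 7, NA, 3, 10, NA, 9, 2)

# Count the TRUE values returned by is.na()
n_missing = sum(is.na(na_example))
n_missing    # 2
```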

Data structures are collections of the above, and we’ve already seen one of these, too (vectors). The other main ones that you’re likely to encounter are lists, matrices and data-frames.

Lists are like vectors, but they can contain a mixture of data types. Matrices are 2-dimensional structures used in various areas of science and statistics. The data structure that most resembles a spreadsheet is the data-frame, which we’ll see in the next section on loading data.
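A minimal sketch of creating a list and a matrix with the built-in list() and matrix() functions. The names patient_record and readings, and the values in them, are made up for illustration.

```r
# A list can hold different data types together
patient_record = list(name = "Jim", age = 12, fasted = TRUE)
patient_record$age         # 12

# A matrix holds one data type, arranged in rows and columns
readings = matrix(1:6, nrow = 2, ncol = 3)
dim(readings)              # 2 3 (rows, columns)
readings[2, 3]             # 6 (row 2, column 3; values fill column by column)
```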

Sub-setting

Whatever your data type, often you’ll want to subset the data in order to use certain parts. For example, you may want the first element of a vector, or the last 3 elements, or the first column from a matrix. You may even wish to subset a character vector and pick out, say, the first letter.

Subsetting typically works by indicating the position or index of the data that you want.

Exercise: Subsetting vectors

Preceded by set.seed(123), create the following vector and type ‘numbers[3]’. What do you get? Now try ‘numbers[3:5]’. Now what do you get?

numbers = rnorm(n=5)

Solution

[1] 1.558708
[1] 1.55870831 0.07050839 0.12928774
You have picked out the 3rd value, and then the 3rd - 5th values, respectively
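Some other common ways of picking out elements by position, sketched with the same vector. tail() and negative indices are standard base R features. The choice of elements is arbitrary.

```r
set.seed(123)
numbers = rnorm(n=5)

tail(numbers, n = 3)     # the last 3 values (the same as numbers[3:5] here)
numbers[c(1, 5)]         # the 1st and 5th values
numbers[-1]              # everything except the 1st value
```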

Exercise: Subsetting matrices by column

Run the two following lines of code. What happens?

matrix_example = matrix(rnorm(n=50), nrow = 10, ncol = 5)
matrix_example[,c(1:2)]

Solution

            [,1]       [,2]
 [1,]  1.7150650  1.7869131
 [2,]  0.4609162  0.4978505
 [3,] -1.2650612 -1.9666172
 [4,] -0.6868529  0.7013559
 [5,] -0.4456620 -0.4727914
 [6,]  1.2240818 -1.0678237
 [7,]  0.3598138 -0.2179749
 [8,]  0.4007715 -1.0260044
 [9,]  0.1106827 -0.7288912
[10,] -0.5558411 -0.6250393

Exercise: Subsetting matrices by row

Now type the following. What happens?

matrix_example[c(1:2),]

Solution

          [,1]      [,2]      [,3]      [,4]       [,5]
[1,] 1.7150650 1.7869131 -1.686693 0.6886403 -1.1231086
[2,] 0.4609162 0.4978505  0.837787 0.5539177 -0.4028848

Exercise: Subsetting strings

Run the two following lines of code. What happens?

patient = 'Mrs. C. Ode'
substr(patient, 1,3)

Solution

[1] "Mrs"

Subsetting not only works by specifying index values; it can also be done using logical (or Boolean) values. In effect, you supply a vector of TRUE and FALSE values, and R picks out the rows, columns or cells where the value is TRUE. Earlier we created a vector called na_example. Let’s see how you could impute the missing values using this subsetting idea,

imputed = mean(na_example, na.rm = TRUE) # Value to impute: the mean of the non-missing values
na_boolean = is.na(na_example) # TRUE at the positions where NA occurs
na_example[na_boolean] = imputed # Replace the NA values with the imputed mean
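To check the result, the sketch below repeats the imputation from scratch and inspects the output. It also applies the same logical subsetting idea to the patient vectors from earlier. The variable name older_patients is illustrative.

```r
na_example = c(2, 5, 7, 7, NA, 3, 10, NA, 9, 2)
imputed = mean(na_example, na.rm = TRUE)
na_example[is.na(na_example)] = imputed

imputed              # 5.625
na_example[5]        # 5.625 (previously NA)
anyNA(na_example)    # FALSE, no missing values remain

# Logical subsetting with two related vectors
patients_ages = c(12,5,6,3,7,13)
patients_names = c('Jim', 'John', 'Brian', 'Susan', 'Keith', 'Geoff')
older_patients = patients_names[patients_ages >= 10]
older_patients       # "Jim" "Geoff"
```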

Key Points

  • Base R features and techniques