Introduction to data analysis with R and Bioconductor: All Images

Data organisation with spreadsheets

Figure 1

Screenshot of a spreadsheet and a README.txt file. — Spreadsheet example setup.

Figure 2

A table showing the blood and rhesus types encoded together in one column. — Multiple variables in a single column.

Figure 3

A table showing the blood and rhesus types encoded in separate columns. — Variables encoded in a single column each.

Figure 4

Figure 5

A table extracted from a paper showing commonly used null values. — Commonly used null values.

Figure 6

A table showing several cells highlighted in yellow used to encode sample contamination. — Colours used to encode information.

Figure 7

Same table as above with a new variable encoding contamination. — A new variable encoding sample contamination.

Figure 8

Excel menu to export data in CSV format on macOS. — Saving an Excel file to CSV.

Figure 9

A table illustrating incorrectly imported data that contained a comma in a cell. That cell has been split into two cells. — The risks of having commas inside comma-separated data.

Figure 10

The tidy data analysis circle, iterating over data transformation, visualisation and modelling. — A typical data analysis workflow.

Manipulating and analysing data with dplyr

Figure 1

Figure showing the long table format on the left and wide table format on the right and arrows illustrating how the values in the 'sample' column on the left have become column names on the right and how the values in the 'expression' column on the left have become values on the right. Below is the call to 'pivot_wider()' with annotations pointing to the 'sample' and 'expression' function arguments. — Wide pivot of the `rna` data.

Figure 2

Figure showing the long table format on the left and wide table format on the right and arrows illustrating how the column names on the right have become a new column 'sample' on the left and the values in the wide table on the right have become a new column 'expression' on the left. Below is the call to 'pivot_longer()' with annotations pointing to the 'sample', 'expression' and the '-gene' arguments. — Long pivot of the `rna` data.

Figure 3

A two-by-two table with X and Y row names and Female and Male column names showing the total counts for each X/Y and Female/Male combination. The Y/Female combination shows a count of 3, while all other counts are above 2000.

Data visualization

Figure 1

Default histogram produced by ggplot() and geom_histogram() for the expression data.

Figure 2

Histograms produced by ggplot() and geom_histogram() for the expression data with bins set to 15 (top) and binwidth set to 2000 (bottom).

Figure 3

Figure 4

Histogram produced by ggplot() and geom_histogram() for the pre-computed log of expression.

Figure 5

Histogram produced by ggplot(), geom_histogram() and scale_x_log10() for the log of expression.

Figure 6

Scatter plot produced by ggplot() and geom_point() comparing the log-foldchanges computed above. All dots are black.

Figure 7

Scatter plot produced by ggplot() and geom_point() comparing the log-foldchanges computed above. All dots are semi-transparent black.

Figure 8

Scatter plot produced by ggplot() and geom_point() comparing the log-foldchanges computed above. All dots are semi-transparent blue.

Figure 9

Scatter plot produced by ggplot() and geom_point() comparing the log-foldchanges computed above. Dots are colour-coded based on the gene's biotype.

Figure 10

Figure 11

Scatter plot produced by ggplot() and geom_point() comparing the log-foldchanges computed above. Dots are colour-coded based on the gene's biotype. A black line of slope 1 crossing the origin was added by geom_abline().

Figure 12

Figure 13

Scatter plot produced by ggplot() and geom_hex() comparing the log-foldchanges computed above shows hexagons coloured based on the underlying dot density. A black line of slope 1 crossing the origin was added by geom_abline().

Figure 14

Figure showing a stretch of overlapping points indicating the log of expression + 1 for each sample. The points are coloured with different shades of blue for samples collected at different time points.

Figure 15

Boxplot showing the distribution of log expression + 1 values for each sample, as produced by geom_boxplot(). Each boxplot is filled with white colour.

Figure 16

Boxplot and dots showing the distribution of log expression + 1 values for each sample, as produced by geom_boxplot(). Each boxplot is transparent and the jittered dots are semi-transparent tomato-coloured and behind the boxplots.

Figure 17

Figure 18

Figure 19

Boxplot and dots showing the distribution of log expression + 1 values for each sample, as produced by geom_boxplot(). On the first figure, each boxplot is transparent and the jittered dots are semi-transparent and coloured in different shades of blue. On the second figure, each boxplot is transparent and the jittered dots are semi-transparent and coloured red, green and blue.