In this document, we will analyse RNAseq data (counts resulting from Salmon, aggregated at gene level using Tximport) using DESeq2. Data consists of 6 Pelobates Cultripes populations (from 2 geographic regions), each population being subjected to benign conditions (constant high water level) and stressful conditions (low water level).
We load and store the data into a variable called txi.
txi<-readRDS("salmon_gene_counts.rds")
We will load the libraries needed for this script.
# use pacman to load libraries
pacman::p_load(DESeq2, tidyverse, GGally)
The txi object is a list with 4 entries, three matrices (abundance, counts, length) and 1 character vector (countsFromAbundance).
The dimensions of each matrix are 106236 rows as the genes and 48 columns as the sample_id. This means the transcripts have been aggregated into 106236 genes.
Here is a brief description of each element in the txi object: * Counts: Estimate of the number of reads mapping to each transcript. * Abundance: Normalized counts in Transcripts Per Million (TPM). This means that per sample, the total counts add up to 1 million. Raw counts cannot be compared across sample_id because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths.
apply(X=txi$abundance, FUN=sum, MARGIN = 2)
## Bui1H9_nonrrna Bui1L9_nonrrna Bui2H12_nonrrna Bui2L1_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Bui3H10_nonrrna Bui3L8_nonrrna Bui4H14_nonrrna Bui4L4_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Can1H4_nonrrna Can1L1_nonrrna Can2H12_nonrrna Can2Lextra_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Can3H8_nonrrna Can3L6_nonrrna Can4Hextra_nonrrna Can4L7_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Esp1H7_nonrrna Esp1L7_nonrrna Esp2H2_nonrrna Esp2L3_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Esp4H1_nonrrna Esp4L1_nonrrna Esp5H5_nonrrna Esp5L2_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Jab1H6_nonrrna Jab1L2_nonrrna Jab2H5_nonrrna Jab2L3_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Jab4H6_nonrrna Jab4L5_nonrrna Jab5H6_nonrrna Jab5L4_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Lla1H4_nonrrna Lla1L7_nonrrna Lla2H3_nonrrna Lla2L4_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Lla4H5_nonrrna Lla4L6_nonrrna Lla5H7_nonrrna Lla5L3_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Tur1H12_nonrrna Tur1L12_nonrrna Tur2H6_nonrrna Tur2L6_nonrrna
## 1e+06 1e+06 1e+06 1e+06
## Tur3H6_nonrrna Tur3L3_nonrrna Tur4H2_nonrrna Tur4L13_nonrrna
## 1e+06 1e+06 1e+06 1e+06
Moreover, we can appreciate that the two first rows correspond to transcripts that mapped against mithocondrial RNA and non-coding RNA (nuclear ribosomal RNA) decoys introduced in the reference transcriptome.
We will remove these.
Regarding the non-coding RNA, we can see high counts in the counts matrix (without any normalization). Non-coding RNA molecules are abundant in the cell and are often not fully removed during RNA purification processes. All the transcripts were aggregated into one only “gene” or row called nrRNA, therefore showing a higher length in the length matrix. Longer transcripts are expected to garner more reads simply because they provide more binding sites for the sequencing reads. Thus, in the abundance matrix their abundance is adjusted downward to account for their length.
We have to filter out the first two rows of every matrix. These two rows correspond to the mitochondrial RNA and nuclear ribosomal RNA that mapped against the decoys introduced in the reference transcriptome.
Moreover, there are genes named with the prefix “PECUL23nc” that are also non-coding, and non codinn RNA baits (nr baits). We have to find these genes and remove them. too.
# Remove mtDNA, non-coding and nr baits
# make a list of genes we want to keep
whitelist<-txi$counts %>%
as_tibble(rownames = "gene_id") %>%
filter(!str_detect(gene_id, pattern = "mt|nr|nc")) %>%
pull(gene_id)
length(whitelist);head(whitelist) # we are keeping 32531 genes
## [1] 32531
## [1] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008"
## [5] "PECUL23A000011" "PECUL23A000012"
# filter txi tables
txi$abundance<-txi$abundance[whitelist,]
txi$counts<-txi$counts[whitelist,]
txi$length<-txi$length[whitelist,]
Load the design matrix for the dds object.
des.mat<-read_csv("./design_matrix.csv")
## Rows: 48 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): sample_id, treatment, population, region
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Re-order factor levels
des.mat <- des.mat %>%
mutate(population=factor(population, levels=c("Bui","Can","Tur","Esp","Jab","Lla"))) %>% # re-order factors for easy plotting later
mutate(pop_n = factor(rep(rep(1:3,each=8),2))) %>% # The new variable pop_n created in the code does not depend on the levels of the population factor and is created independently by repeating a sequence. Hence, pop_n is non-nested with respect to population.
mutate_if(is.character, as.factor) # convert characters to factor
We will filter samples that had substantially lower library size (Jab) and a potential outlier (Bui).
des.mat<-des.mat %>%
filter(!sample_id %in% c("Bui4H14_nonrrna","Jab5H6_nonrrna"))
# filter txi tables
txi$abundance<-txi$abundance[,as.character(des.mat$sample_id)]
txi$counts<-txi$counts[,as.character(des.mat$sample_id)]
txi$length<-txi$length[,as.character(des.mat$sample_id)]
# get column order of counts matrix and re-order des.mat to match
col_order<-match(colnames(txi$counts),des.mat$sample_id)
des.mat<-des.mat[col_order,]
des.mat$sample_id==colnames(txi$counts)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE
des.mat
## # A tibble: 46 × 5
## sample_id treatment population region pop_n
## <fct> <fct> <fct> <fct> <fct>
## 1 Bui1H9_nonrrna H Bui central 1
## 2 Bui1L9_nonrrna L Bui central 1
## 3 Bui2H12_nonrrna H Bui central 1
## 4 Bui2L1_nonrrna L Bui central 1
## 5 Bui3H10_nonrrna H Bui central 1
## 6 Bui3L8_nonrrna L Bui central 1
## 7 Bui4L4_nonrrna L Bui central 1
## 8 Can1H4_nonrrna H Can central 2
## 9 Can1L1_nonrrna L Can central 2
## 10 Can2H12_nonrrna H Can central 2
## # ℹ 36 more rows
Our objective is to test whether the effect of dropping water level on tadpoles from plastic and less plastic populations. We therefore want to contrast High vs. Low water for each of the regions and so we have to include a region:treatment interaction effect. However, we are expecting region alone (genetic variations) to be an important factor affecting gene expression and we also want to correct for differences of having sampled different populations within each region.
In the experimental design, populations are treated as replicates per region, being numbered from 1 to 3 within each region. Otherwise, each population would be unique to each region and would return an error.
We also drop the intercept (first term in the formula is 0) to allow for easier comparison between contrasts. If not, the base level expression would be that of the intercept and we would have to re-level the baseline level for each comparison.
dds <- DESeqDataSetFromTximport(txi,
colData = des.mat,
design = ~0+region + region:pop_n + region:treatment)
## using counts and average transcript lengths from tximport
# making sure metadata is in matching order
des.mat$sample_id==colnames(assay(dds))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE
The results function of the DESeq2 package that we will be using afterwards, performs its own independent filtering by default using the mean of normalized counts as a filter statistic. However, it might be worth applying a pre-filtering step with the aim of reducing the size of the matrix, speeding up downstream calculations and reducing memory requirements. This is also recommended in DESeq2 documentation.
# Looking at the distribution without any filtering. Representing the sum of counts for each gene across all samples.
rowSums(counts(dds)) %>%
enframe() %>%
ggplot(aes(x=value)) +
geom_histogram() +
scale_x_log10()
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8426 rows containing non-finite outside the scale range
## (`stat_bin()`).
# We will apply 2 different filters, a mild one and a strict one
dds1<-dds[rowSums(counts(dds) >= 1) >= 12,] # genes that have at least 1 count for 12 samples (e.g. one per treatment per population)
dim(dds1) # we are left with ~18k genes
## [1] 17987 46
dds2<-dds[rowSums(counts(dds) >= 50) >= 12,] # genes that have at least 10 counts for 12 samples (e.g. one per treatment per population)
dim(dds2) # we are left with ~11k genes
## [1] 11657 46
Although the Differential Expression Analysis utilizes raw count data, for visualization or clustering, it makes more sense to use normalized data. DESeq authors argue that Variance Stabilized Transformation (VST) or regularized logarithm (rlog) have certain advantages. VST is faster, i.e. good for large datasets, rlog is more often used for small datasets.
# VST transformation
vst_dds<-vst(dds, blind = T) # no filtering
## using 'avgTxLength' from assays(dds), correcting for library size
vst_dds1<-vst(dds1, blind = T) # mild filtering
## using 'avgTxLength' from assays(dds), correcting for library size
vst_dds2<-vst(dds2, blind = T) # hard filtering
## using 'avgTxLength' from assays(dds), correcting for library size
# Perform PCA on transposed scaled, centered data
vst_pca<- prcomp(t(assay(vst_dds)), center = T)
vst_pca1<- prcomp(t(assay(vst_dds1)), center = T)
vst_pca2<- prcomp(t(assay(vst_dds2)), center = T)
vst_pca$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(des.mat) %>% # add the experimental design information
ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"), aes(color=region, shape=treatment)) +
ggtitle("VST PCA (no filtering)")
## Joining with `by = join_by(sample_id)`
vst_pca1$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(des.mat) %>% # add the experimental design information
ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"), aes(color=region, shape=treatment)) +
ggtitle("VST PCA (mild liftering)")
## Joining with `by = join_by(sample_id)`
vst_pca2$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(des.mat) %>% # add the experimental design information
ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"), aes(color=region, shape=treatment)) +
ggtitle("VST PCA (hard liftering)")
## Joining with `by = join_by(sample_id)`
The diagonal displays the density distributions of a specific individual PC. The scatterplots show the pairwise relationships between different PCs, where each point represents a sample, colored by treatment and shaped by region. The X axis corresponds to the PC on the right and the Y axis corresponds to the PC on the top. Correlation coefficients quantify the linear relationship between the pair of PCs for each region. PC1 separates regions and PC3 separates treatments.
Filtering doesn’t have a noticeable effect.
vst_pca1$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(des.mat) %>% # add the experimental design information
ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"), aes(color=treatment, shape=region)) +
ggtitle("VST PCA")
## Joining with `by = join_by(sample_id)`
Same PCA but colouring treatments instead of regions. It is not until susequent PCs that we get a treatment effect. PC1 vs PC3 may give us the best separation of both region and treatment.
Plot specific axes again to highlight some of this more clearly.
First, we will simplify the names of the samples:
vst_pca1$x %>%
as_tibble(rownames = "sample_id")
## # A tibble: 46 × 47
## sample_id PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bui1H9_no… -54.0 45.5 -7.06 3.88 -10.8 1.33 -20.9 4.90 -7.64
## 2 Bui1L9_no… -59.0 1.11 -12.4 -14.2 14.3 20.5 -21.5 -12.1 3.62
## 3 Bui2H12_n… -31.9 -30.6 -7.51 5.56 5.47 0.858 -21.8 -1.74 -8.28
## 4 Bui2L1_no… -38.5 -16.4 -19.1 -1.66 19.0 -5.99 -20.7 -11.9 -4.25
## 5 Bui3H10_n… 11.7 11.2 14.4 -30.4 1.54 -7.96 1.90 -17.8 -26.3
## 6 Bui3L8_no… -2.91 5.75 -14.6 -28.5 10.3 -12.2 5.91 -12.4 -9.39
## 7 Bui4L4_no… -8.49 30.5 1.34 -9.35 13.4 -21.9 -0.873 -8.54 -17.1
## 8 Can1H4_no… -35.7 -22.1 0.258 -18.3 -13.7 -6.96 4.32 20.4 1.06
## 9 Can1L1_no… -45.9 -9.68 -7.46 -23.3 -9.10 -2.16 3.24 14.6 8.42
## 10 Can2H12_n… -30.1 -4.31 22.9 -18.5 -19.0 0.0976 5.17 4.09 -3.85
## # ℹ 36 more rows
## # ℹ 37 more variables: PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>,
## # PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>,
## # PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>, PC25 <dbl>,
## # PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>, PC30 <dbl>, PC31 <dbl>,
## # PC32 <dbl>, PC33 <dbl>, PC34 <dbl>, PC35 <dbl>, PC36 <dbl>, PC37 <dbl>,
## # PC38 <dbl>, PC39 <dbl>, PC40 <dbl>, PC41 <dbl>, PC42 <dbl>, PC43 <dbl>, …
# mapping of old sample IDs to new sample IDs
sample_id_mapping <- c("Bui1H9_nonrrna" = "Bui1H",
"Bui1L9_nonrrna" = "Bui1L",
"Bui2H12_nonrrna" = "Bui2H",
"Bui2L1_nonrrna" = "Bui2L",
"Bui3H10_nonrrna" = "Bui3H",
"Bui3L8_nonrrna" = "Bui3L",
"Bui4L4_nonrrna" = "Bui4L",
"Can1H4_nonrrna" = "Can1H",
"Can1L1_nonrrna" = "Can1L",
"Can2H12_nonrrna" = "Can2H",
"Can2Lextra_nonrrna" = "Can2L",
"Can3H8_nonrrna" = "Can3H",
"Can3L6_nonrrna" = "Can3L",
"Can4Hextra_nonrrna" = "Can4H",
"Can4L7_nonrrna" = "Can4L",
"Esp1H7_nonrrna" = "Esp1H",
"Esp1L7_nonrrna" = "Esp1L",
"Esp2H2_nonrrna" = "Esp2H",
"Esp2L3_nonrrna" = "Esp2L",
"Esp4H1_nonrrna" = "Esp3H",
"Esp4L1_nonrrna" = "Esp3L",
"Esp5H5_nonrrna" = "Esp4H",
"Esp5L2_nonrrna" = "Esp4L",
"Jab1H6_nonrrna" = "Jab1H",
"Jab1L2_nonrrna" = "Jab1L",
"Jab2H5_nonrrna" = "Jab2H",
"Jab2L3_nonrrna" = "Jab2L",
"Jab4H6_nonrrna" = "Jab3H",
"Jab4L5_nonrrna" = "Jab3L",
"Jab5L4_nonrrna" = "Jab4L",
"Lla1H4_nonrrna" = "Lla1H",
"Lla1L7_nonrrna" = "Lla1L",
"Lla2H3_nonrrna" = "Lla2H",
"Lla2L4_nonrrna" = "Lla2L",
"Lla4H5_nonrrna" = "Lla3H",
"Lla4L6_nonrrna" = "Lla3L",
"Lla5H7_nonrrna" = "Lla4H",
"Lla5L3_nonrrna" = "Lla4L",
"Tur1H12_nonrrna" = "Tur1H",
"Tur1L12_nonrrna" = "Tur1L",
"Tur2H6_nonrrna" = "Tur2H",
"Tur2L6_nonrrna" = "Tur2L",
"Tur3H6_nonrrna" = "Tur3H",
"Tur3L3_nonrrna" = "Tur3L",
"Tur4H2_nonrrna" = "Tur4H",
"Tur4L13_nonrrna" = "Tur4L")
# Create a named vector for population full names
population_names <- c(
"Bui" = "Buitrago",
"Can" = "Canencia",
"Tur" = "Turrubuelo",
"Esp" = "Espajosas",
"Jab" = "Jabata",
"Lla" = "Llano"
)
# Modify the sample IDs in vst_pca1$x
vst_pca1$x <- vst_pca1$x %>%
as_tibble(rownames = "sample_id") %>%
mutate(sample_id = recode(sample_id, !!!sample_id_mapping)) %>%
column_to_rownames(var = "sample_id")
# Modify the sample IDs in des.mat
des.mat <- des.mat %>%
mutate(sample_id = recode(sample_id, !!!sample_id_mapping))
# calculate hulls
pca_hull <-
vst_pca1$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(des.mat) %>%
group_by(population, treatment) %>%
dplyr::slice(chull(PC1, PC3))
## Joining with `by = join_by(sample_id)`
# plot PCA and facet into different populations
vst_pca1$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(des.mat) %>% # add the experimental design information
ggplot(aes(x=PC1, y=PC3, color=population, shape=treatment)) +
geom_hline(yintercept = 0, linewidth=0.1) +
geom_vline(xintercept = 0, linewidth=0.1) +
geom_point(size=3) +
geom_polygon(data = pca_hull,
aes(fill = population,
colour = population,
linetype=treatment),
alpha = 0.1,
show.legend = FALSE) +
ggrepel::geom_text_repel(aes(label=sample_id),size=2) +
scale_color_manual(values=c("Bui"="#0FBA3D", "Can"="#0FBA93", "Tur"="#0F8CBA",
Esp="#900C3F", "Jab"="#FF5733", "Lla"="#FFC300"))+
scale_fill_manual(values=c("Bui"="#0FBA3D", "Can"="#0FBA93", "Tur"="#0F8CBA",
Esp="#900C3F", "Jab"="#FF5733", "Lla"="#FFC300"))+
facet_wrap(~population, labeller = labeller(population = population_names)) +
theme_minimal() +
theme(
strip.text = element_text(hjust = 0.6, vjust = 0.5, margin = margin(t = 5, b = 5)))
## Joining with `by = join_by(sample_id)`
ggsave("PCA_bet_quality.jpeg", width= 8, height = 5, dpi=600)
Warm colours (southern populations) are found on the left side of the plotting area and cold colours (central populations) on the right side. So PC1 is splitting regions quite well. the Y axis (PC3) on the other hand is fairly consistently splitting the triangles from the squares.