In this document, we will analyse RNAseq data (counts resulting from Salmon, aggregated at gene level using Tximport) using DESeq2. Data consists of 6 Pelobates Cultripes populations (from 2 geographic regions), each population being subjected to benign conditions (constant high water level) and stressful conditions (low water level).

Load the data

We load and store the data into a variable called txi.

txi<-readRDS("salmon_gene_counts.rds")

We will load the libraries needed for this script.

# use pacman to load libraries
pacman::p_load(DESeq2, tidyverse, GGally)

Exploring the data

The txi object is a list with 4 entries, three matrices (abundance, counts, length) and 1 character vector (countsFromAbundance).

The dimensions of each matrix are 106236 rows as the genes and 48 columns as the sample_id. This means the transcripts have been aggregated into 106236 genes.

Here is a brief description of each element in the txi object: * Counts: Estimate of the number of reads mapping to each transcript. * Abundance: Normalized counts in Transcripts Per Million (TPM). This means that per sample, the total counts add up to 1 million. Raw counts cannot be compared across sample_id because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths.

apply(X=txi$abundance, FUN=sum, MARGIN = 2)

##     Bui1H9_nonrrna     Bui1L9_nonrrna    Bui2H12_nonrrna     Bui2L1_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##    Bui3H10_nonrrna     Bui3L8_nonrrna    Bui4H14_nonrrna     Bui4L4_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Can1H4_nonrrna     Can1L1_nonrrna    Can2H12_nonrrna Can2Lextra_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Can3H8_nonrrna     Can3L6_nonrrna Can4Hextra_nonrrna     Can4L7_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Esp1H7_nonrrna     Esp1L7_nonrrna     Esp2H2_nonrrna     Esp2L3_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Esp4H1_nonrrna     Esp4L1_nonrrna     Esp5H5_nonrrna     Esp5L2_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Jab1H6_nonrrna     Jab1L2_nonrrna     Jab2H5_nonrrna     Jab2L3_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Jab4H6_nonrrna     Jab4L5_nonrrna     Jab5H6_nonrrna     Jab5L4_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Lla1H4_nonrrna     Lla1L7_nonrrna     Lla2H3_nonrrna     Lla2L4_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Lla4H5_nonrrna     Lla4L6_nonrrna     Lla5H7_nonrrna     Lla5L3_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##    Tur1H12_nonrrna    Tur1L12_nonrrna     Tur2H6_nonrrna     Tur2L6_nonrrna 
##              1e+06              1e+06              1e+06              1e+06 
##     Tur3H6_nonrrna     Tur3L3_nonrrna     Tur4H2_nonrrna    Tur4L13_nonrrna 
##              1e+06              1e+06              1e+06              1e+06

Length: Effective length of the target transcript.
countsFromAbundance: A character vector indicating whether counts were taken directly from the quantifier (salmon) or whether they have been calculated from the abundances by tximport. Default is no (counts are from salmon).

Moreover, we can appreciate that the two first rows correspond to transcripts that mapped against mithocondrial RNA and non-coding RNA (nuclear ribosomal RNA) decoys introduced in the reference transcriptome.

We will remove these.

Regarding the non-coding RNA, we can see high counts in the counts matrix (without any normalization). Non-coding RNA molecules are abundant in the cell and are often not fully removed during RNA purification processes. All the transcripts were aggregated into one only “gene” or row called nrRNA, therefore showing a higher length in the length matrix. Longer transcripts are expected to garner more reads simply because they provide more binding sites for the sequencing reads. Thus, in the abundance matrix their abundance is adjusted downward to account for their length.

Filtering out mtRNA and nrRNA

We have to filter out the first two rows of every matrix. These two rows correspond to the mitochondrial RNA and nuclear ribosomal RNA that mapped against the decoys introduced in the reference transcriptome.

Moreover, there are genes named with the prefix “PECUL23nc” that are also non-coding, and non codinn RNA baits (nr baits). We have to find these genes and remove them. too.

# Remove mtDNA, non-coding and nr baits

# make a list of genes we want to keep
whitelist<-txi$counts %>%
  as_tibble(rownames = "gene_id") %>%
  filter(!str_detect(gene_id, pattern = "mt|nr|nc")) %>%
  pull(gene_id)

length(whitelist);head(whitelist) # we are keeping 32531 genes

## [1] 32531

## [1] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008"
## [5] "PECUL23A000011" "PECUL23A000012"

# filter txi tables
txi$abundance<-txi$abundance[whitelist,]
txi$counts<-txi$counts[whitelist,]
txi$length<-txi$length[whitelist,]

Load design matrix

Load the design matrix for the dds object.

des.mat<-read_csv("./design_matrix.csv")

## Rows: 48 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): sample_id, treatment, population, region
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Re-order factor levels
des.mat <- des.mat %>%
  mutate(population=factor(population, levels=c("Bui","Can","Tur","Esp","Jab","Lla"))) %>% # re-order factors for easy plotting later
  mutate(pop_n = factor(rep(rep(1:3,each=8),2))) %>% # The new variable pop_n created in the code does not depend on the levels of the population factor and is created independently by repeating a sequence. Hence, pop_n is non-nested with respect to population. 
  mutate_if(is.character, as.factor) # convert characters to factor

Filtering out samples

We will filter samples that had substantially lower library size (Jab) and a potential outlier (Bui).

des.mat<-des.mat %>%
  filter(!sample_id %in% c("Bui4H14_nonrrna","Jab5H6_nonrrna")) 

# filter txi tables
txi$abundance<-txi$abundance[,as.character(des.mat$sample_id)]
txi$counts<-txi$counts[,as.character(des.mat$sample_id)]
txi$length<-txi$length[,as.character(des.mat$sample_id)]

# get column order of counts matrix and re-order des.mat to match
col_order<-match(colnames(txi$counts),des.mat$sample_id)
des.mat<-des.mat[col_order,]
des.mat$sample_id==colnames(txi$counts)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE

des.mat

## # A tibble: 46 × 5
##    sample_id       treatment population region  pop_n
##    <fct>           <fct>     <fct>      <fct>   <fct>
##  1 Bui1H9_nonrrna  H         Bui        central 1    
##  2 Bui1L9_nonrrna  L         Bui        central 1    
##  3 Bui2H12_nonrrna H         Bui        central 1    
##  4 Bui2L1_nonrrna  L         Bui        central 1    
##  5 Bui3H10_nonrrna H         Bui        central 1    
##  6 Bui3L8_nonrrna  L         Bui        central 1    
##  7 Bui4L4_nonrrna  L         Bui        central 1    
##  8 Can1H4_nonrrna  H         Can        central 2    
##  9 Can1L1_nonrrna  L         Can        central 2    
## 10 Can2H12_nonrrna H         Can        central 2    
## # ℹ 36 more rows

Make DESeq objects

Our objective is to test whether the effect of dropping water level on tadpoles from plastic and less plastic populations. We therefore want to contrast High vs. Low water for each of the regions and so we have to include a region:treatment interaction effect. However, we are expecting region alone (genetic variations) to be an important factor affecting gene expression and we also want to correct for differences of having sampled different populations within each region.

In the experimental design, populations are treated as replicates per region, being numbered from 1 to 3 within each region. Otherwise, each population would be unique to each region and would return an error.

We also drop the intercept (first term in the formula is 0) to allow for easier comparison between contrasts. If not, the base level expression would be that of the intercept and we would have to re-level the baseline level for each comparison.

dds <- DESeqDataSetFromTximport(txi, 
                                colData = des.mat, 
                                design = ~0+region + region:pop_n + region:treatment)

## using counts and average transcript lengths from tximport

# making sure metadata is in matching order
des.mat$sample_id==colnames(assay(dds))

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE

Pre-filtering

The results function of the DESeq2 package that we will be using afterwards, performs its own independent filtering by default using the mean of normalized counts as a filter statistic. However, it might be worth applying a pre-filtering step with the aim of reducing the size of the matrix, speeding up downstream calculations and reducing memory requirements. This is also recommended in DESeq2 documentation.

# Looking at the distribution without any filtering. Representing the sum of counts for each gene across all samples.

rowSums(counts(dds)) %>%
          enframe() %>%
  ggplot(aes(x=value)) +
  geom_histogram() +
  scale_x_log10()

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 8426 rows containing non-finite outside the scale range
## (`stat_bin()`).

# We will apply 2 different filters, a mild one and a strict one

dds1<-dds[rowSums(counts(dds) >= 1) >= 12,] # genes that have at least 1 count for 12 samples (e.g. one per treatment per population)
dim(dds1) # we are left with ~18k genes

## [1] 17987    46

dds2<-dds[rowSums(counts(dds) >= 50) >= 12,] # genes that have at least 10 counts for 12 samples (e.g. one per treatment per population)
dim(dds2) # we are left with ~11k genes

## [1] 11657    46

Principal Component Analysis

Although the Differential Expression Analysis utilizes raw count data, for visualization or clustering, it makes more sense to use normalized data. DESeq authors argue that Variance Stabilized Transformation (VST) or regularized logarithm (rlog) have certain advantages. VST is faster, i.e. good for large datasets, rlog is more often used for small datasets.

# VST transformation
vst_dds<-vst(dds, blind = T) # no filtering

## using 'avgTxLength' from assays(dds), correcting for library size

vst_dds1<-vst(dds1, blind = T) # mild filtering

## using 'avgTxLength' from assays(dds), correcting for library size

vst_dds2<-vst(dds2, blind = T) # hard filtering

## using 'avgTxLength' from assays(dds), correcting for library size

# Perform PCA on transposed scaled, centered data
vst_pca<- prcomp(t(assay(vst_dds)),  center = T)
vst_pca1<- prcomp(t(assay(vst_dds1)), center = T)
vst_pca2<- prcomp(t(assay(vst_dds2)), center = T)

PCA emphasizing regions

vst_pca$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(des.mat) %>% # add the experimental design information
  ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"), aes(color=region, shape=treatment)) +
  ggtitle("VST PCA (no filtering)")

## Joining with `by = join_by(sample_id)`

vst_pca1$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(des.mat) %>% # add the experimental design information
  ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"),  aes(color=region, shape=treatment)) +
  ggtitle("VST PCA (mild liftering)")

## Joining with `by = join_by(sample_id)`

vst_pca2$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(des.mat) %>% # add the experimental design information
  ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"),  aes(color=region, shape=treatment)) +
  ggtitle("VST PCA (hard liftering)")

## Joining with `by = join_by(sample_id)`

The diagonal displays the density distributions of a specific individual PC. The scatterplots show the pairwise relationships between different PCs, where each point represents a sample, colored by treatment and shaped by region. The X axis corresponds to the PC on the right and the Y axis corresponds to the PC on the top. Correlation coefficients quantify the linear relationship between the pair of PCs for each region. PC1 separates regions and PC3 separates treatments.

Filtering doesn’t have a noticeable effect.

PCA emphasizing treatment

vst_pca1$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(des.mat) %>% # add the experimental design information
  ggpairs(columns = c("PC1", "PC2", "PC3", "PC4","PC5","PC6"),  aes(color=treatment, shape=region)) +
  ggtitle("VST PCA")

## Joining with `by = join_by(sample_id)`

Same PCA but colouring treatments instead of regions. It is not until susequent PCs that we get a treatment effect. PC1 vs PC3 may give us the best separation of both region and treatment.

Plot specific axes again to highlight some of this more clearly.

First, we will simplify the names of the samples:

vst_pca1$x %>%
  as_tibble(rownames = "sample_id")

## # A tibble: 46 × 47
##    sample_id     PC1    PC2     PC3    PC4    PC5      PC6     PC7    PC8    PC9
##    <chr>       <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl>
##  1 Bui1H9_no… -54.0   45.5   -7.06    3.88 -10.8    1.33   -20.9     4.90  -7.64
##  2 Bui1L9_no… -59.0    1.11 -12.4   -14.2   14.3   20.5    -21.5   -12.1    3.62
##  3 Bui2H12_n… -31.9  -30.6   -7.51    5.56   5.47   0.858  -21.8    -1.74  -8.28
##  4 Bui2L1_no… -38.5  -16.4  -19.1    -1.66  19.0   -5.99   -20.7   -11.9   -4.25
##  5 Bui3H10_n…  11.7   11.2   14.4   -30.4    1.54  -7.96     1.90  -17.8  -26.3 
##  6 Bui3L8_no…  -2.91   5.75 -14.6   -28.5   10.3  -12.2      5.91  -12.4   -9.39
##  7 Bui4L4_no…  -8.49  30.5    1.34   -9.35  13.4  -21.9     -0.873  -8.54 -17.1 
##  8 Can1H4_no… -35.7  -22.1    0.258 -18.3  -13.7   -6.96     4.32   20.4    1.06
##  9 Can1L1_no… -45.9   -9.68  -7.46  -23.3   -9.10  -2.16     3.24   14.6    8.42
## 10 Can2H12_n… -30.1   -4.31  22.9   -18.5  -19.0    0.0976   5.17    4.09  -3.85
## # ℹ 36 more rows
## # ℹ 37 more variables: PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>,
## #   PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>,
## #   PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>, PC25 <dbl>,
## #   PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>, PC30 <dbl>, PC31 <dbl>,
## #   PC32 <dbl>, PC33 <dbl>, PC34 <dbl>, PC35 <dbl>, PC36 <dbl>, PC37 <dbl>,
## #   PC38 <dbl>, PC39 <dbl>, PC40 <dbl>, PC41 <dbl>, PC42 <dbl>, PC43 <dbl>, …

# mapping of old sample IDs to new sample IDs
sample_id_mapping <- c("Bui1H9_nonrrna" = "Bui1H",
                       "Bui1L9_nonrrna" = "Bui1L",
                       "Bui2H12_nonrrna" = "Bui2H",
                       "Bui2L1_nonrrna" = "Bui2L",
                       "Bui3H10_nonrrna" = "Bui3H",
                       "Bui3L8_nonrrna" = "Bui3L",
                       "Bui4L4_nonrrna" = "Bui4L",
                       "Can1H4_nonrrna" = "Can1H",
                       "Can1L1_nonrrna" = "Can1L",
                       "Can2H12_nonrrna" = "Can2H",
                       "Can2Lextra_nonrrna" = "Can2L",
                       "Can3H8_nonrrna" = "Can3H",
                       "Can3L6_nonrrna" = "Can3L",
                       "Can4Hextra_nonrrna" = "Can4H",
                       "Can4L7_nonrrna" = "Can4L",
                       "Esp1H7_nonrrna" = "Esp1H",
                       "Esp1L7_nonrrna" = "Esp1L",
                       "Esp2H2_nonrrna" = "Esp2H",
                       "Esp2L3_nonrrna" = "Esp2L",
                       "Esp4H1_nonrrna" = "Esp3H",
                       "Esp4L1_nonrrna" = "Esp3L",
                       "Esp5H5_nonrrna" = "Esp4H",
                       "Esp5L2_nonrrna" = "Esp4L",
                       "Jab1H6_nonrrna" = "Jab1H",
                       "Jab1L2_nonrrna" = "Jab1L",
                       "Jab2H5_nonrrna" = "Jab2H",
                       "Jab2L3_nonrrna" = "Jab2L",
                       "Jab4H6_nonrrna" = "Jab3H",
                       "Jab4L5_nonrrna" = "Jab3L",
                       "Jab5L4_nonrrna" = "Jab4L",
                       "Lla1H4_nonrrna" = "Lla1H",
                       "Lla1L7_nonrrna" = "Lla1L",
                       "Lla2H3_nonrrna" = "Lla2H",
                       "Lla2L4_nonrrna" = "Lla2L",
                       "Lla4H5_nonrrna" = "Lla3H",
                       "Lla4L6_nonrrna" = "Lla3L",
                       "Lla5H7_nonrrna" = "Lla4H",
                       "Lla5L3_nonrrna" = "Lla4L",
                       "Tur1H12_nonrrna" = "Tur1H",
                       "Tur1L12_nonrrna" = "Tur1L",
                       "Tur2H6_nonrrna" = "Tur2H",
                       "Tur2L6_nonrrna" = "Tur2L",
                       "Tur3H6_nonrrna" = "Tur3H",
                       "Tur3L3_nonrrna" = "Tur3L",
                       "Tur4H2_nonrrna" = "Tur4H",
                       "Tur4L13_nonrrna" = "Tur4L")

# Create a named vector for population full names
population_names <- c(
  "Bui" = "Buitrago",
  "Can" = "Canencia",
  "Tur" = "Turrubuelo",
  "Esp" = "Espajosas",
  "Jab" = "Jabata",
  "Lla" = "Llano"
)

# Modify the sample IDs in vst_pca1$x
vst_pca1$x <- vst_pca1$x %>% 
  as_tibble(rownames = "sample_id") %>%
  mutate(sample_id = recode(sample_id, !!!sample_id_mapping)) %>%
  column_to_rownames(var = "sample_id")

# Modify the sample IDs in des.mat
des.mat <- des.mat %>%
  mutate(sample_id = recode(sample_id, !!!sample_id_mapping))

# calculate hulls
pca_hull <- 
  vst_pca1$x %>% 
  as_tibble(rownames = "sample_id") %>%
  left_join(des.mat) %>%
  group_by(population, treatment) %>% 
  dplyr::slice(chull(PC1, PC3))

## Joining with `by = join_by(sample_id)`

# plot PCA and facet into different populations
vst_pca1$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(des.mat) %>% # add the experimental design information
  ggplot(aes(x=PC1, y=PC3, color=population, shape=treatment)) +
  geom_hline(yintercept = 0, linewidth=0.1) +
  geom_vline(xintercept = 0, linewidth=0.1) +
  geom_point(size=3) +
  geom_polygon(data = pca_hull,
               aes(fill = population,
                   colour = population,
                   linetype=treatment),
               alpha = 0.1,
               show.legend = FALSE) +
  ggrepel::geom_text_repel(aes(label=sample_id),size=2) +
  scale_color_manual(values=c("Bui"="#0FBA3D", "Can"="#0FBA93", "Tur"="#0F8CBA",
                              Esp="#900C3F", "Jab"="#FF5733", "Lla"="#FFC300"))+
  scale_fill_manual(values=c("Bui"="#0FBA3D", "Can"="#0FBA93", "Tur"="#0F8CBA",
                              Esp="#900C3F", "Jab"="#FF5733", "Lla"="#FFC300"))+
  facet_wrap(~population, labeller = labeller(population = population_names)) +
  theme_minimal() +
   theme(
    strip.text = element_text(hjust = 0.6, vjust = 0.5, margin = margin(t = 5, b = 5)))

## Joining with `by = join_by(sample_id)`

ggsave("PCA_bet_quality.jpeg", width= 8, height = 5, dpi=600)

Warm colours (southern populations) are found on the left side of the plotting area and cold colours (central populations) on the right side. So PC1 is splitting regions quite well. the Y axis (PC3) on the other hand is fairly consistently splitting the triangles from the squares.

Data_Exploratory_Analysis

clagrabel

2024-06-24