RNA-seq downstream analyses hands-on

INSERM

Before visiting this page and running the RNA-seq downstream analyses, you need to make sure everything is set up correctly for computation.

Since RNA-seq downstream analyses are not very resource demanding, they will be run on the output of the GS RNA-seq pipeline obtained with the complete dataset (100% of the reads from the 4 samples, complete GRCh38 genome).

The assembly and quantification sub-directories provided by the pipeline on the complete dataset can be found at /data/course_material/rnaseq/pipeline/wg.100pcent.

Here we will be using analysis scripts located in /data/course_material/rnaseq/code/analysis.scripts that essentially come from my Script github repository

1   Make the proper directory structure for the analyses

mkdir ~/rnaseq/analysis
mkdir ~/rnaseq/analysis/visualisation
mkdir ~/rnaseq/analysis/visualisation/gene.expression
mkdir ~/rnaseq/analysis/visualisation/gene.expression/hierachical.clustering
mkdir ~/rnaseq/analysis/visualisation/gene.expression/summary.statistics
mkdir ~/rnaseq/analysis/differential.expression
mkdir ~/rnaseq/analysis/gene.ontology

2   Create and load the rnaseq conda environment

env=/data/course_material/rnaseq/code/analysis.scripts/environment.yml
conda env create --force --name rnaseq -f $env
conda activate rnaseq

3   Look at the metadata file for the 4 samples

cat /data/course_material/rnaseq/metadata/bamfile.metadata.tsv

will output:

path        labExpId        celltype        biorep
/home/sdjebali/rnaseq/pipeline/maps/ctGM12878_bn1_1pcent.bam        ctGM12878_bn1   GM12878 1
/home/sdjebali/rnaseq/pipeline/maps/ctGM12878_bn2_1pcent.bam        ctGM12878_bn2   GM12878 2
/home/sdjebali/rnaseq/pipeline/maps/ctfibroblastoflung_bn1_1pcent.bam       ctfibroblastoflung_bn1  fibroblastoflung        1
/home/sdjebali/rnaseq/pipeline/maps/ctfibroblastoflung_bn2_1pcent.bam       ctfibroblastoflung_bn2  fibroblastoflung        2

4   Launch the different analysis steps

4.1   Gene expression visualisation and quality control

cd ~/rnaseq/analysis/visualisation/gene.expression/summary.statistics
gntpm=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/reference_genes_TPM.tsv
gncount=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/reference_genes_counts.tsv
meta=/data/course_material/rnaseq/metadata/bamfile.metadata.tsv
time plot_gene_expression.sh $gntpm $gncount $meta labExpId $PWD

Look at the different pdf files that have been produced.

4.2   Heatmap and hierachical clustering of samples

4.2.1   Make the pairwise sample similarity matrix and look at it (understand the command)

cd ~/rnaseq/analysis/visualisation/gene.expression/hierachical.clustering
gntpm=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/refgene.tpm.eachsample.tsv
time matrix_to_dist.R -i $gntpm --log10 -c pearson -p 0.001 -o similarity.matrix.tsv
cat similarity.matrix.tsv

4.2.2   Make the pairwise sample similarity heatmap and sample hierachical clustering

cd ~/rnaseq/analysis/visualisation/gene.expression/hierachical.clustering
meta=/data/course_material/rnaseq/metadata/bamfile.metadata.tsv
palette=/data/course_material/rnaseq/code/analysis.scripts/palettes/terrain.colors.3.txt
time ggheatmap.R -i similarity.matrix.tsv -d i --row_metadata $meta --col_dendro --row_dendro -B 10 \
--matrix_palette=$palette --rowSide_by celltype --matrix_legend_title "Sample pairwise similarity" \
--matrix_fill_limits 0.85,1 -v -o 4sample.heatmap.pdf

Look at the 4sample.heatmap.pdf file

Look at the help of the ggheatmap.R script using the -h option

4.3   Differential gene expression analysis

4.3.1   Run the script with -h to see its options

edgeR.analysis.R -h

4.3.2   Run the script on the data and look at the outputs

cd ~/rnaseq/analysis/differential.expression
gncount=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/refgene.count.eachsample.tsv
meta=/data/course_material/rnaseq/metadata/bamfile.metadata.tsv
time edgeR.analysis.R -i $gncount -m $meta -f celltype > refgene.edgeR.out 2> refgene.edgeR.err

c edgeR.cpm1.n2.out.tsv
# GM12878   fibroblastoflung        logFC   logCPM  PValue  FDR
# ENSG00000111799.21        -21.23  -5.41   -22.82  11.13   7.1e-113        1.1e-108
# 1 (6 fields)
# 15573 (7 fields)

Look at the edgeR.cpm1.n2.out.pdf file

Note that a nice and complete differential expression tutorial and hands-on based on EdgeR is also available from Nathalie Vialaneix, INRAE here.

4.3.3   Make files of genes over-expressed in GM12878 in ensembl and entrez nomenclatures

awk '$NF<0.01 && $4>2{print $1"\toverGM12878"}' edgeR.cpm1.n2.out.tsv > edgeR.0.01.overGM12878.tsv
c edgeR.0.01.overGM12878.tsv
awk -v fileRef=/data/references/entrez.ensembl.gnids.tsv  'BEGIN{while (getline < fileRef >0){entrez[$2]=$1}} {split($1,a,"."); if(entrez[a[1]]!=""){print entrez[a[1]]}}' edgeR.0.01.overGM12878.tsv > edgeR.0.01.overGM12878.entrez.txt
c edgeR.0.01.overGM12878.entrez.txt

4.3.4   Make files of genes over-expressed in lung fibroblasts in ensembl and entrez nomenclatures

awk '$NF<0.01 && $4<-2{print $1"\toverfibroblast_of_lung"}' edgeR.cpm1.n2.out.tsv > edgeR.0.01.overfibroblast_of_lung.tsv
c edgeR.0.01.overfibroblast_of_lung.tsv
awk -v fileRef=/data/references/entrez.ensembl.gnids.tsv  'BEGIN{while (getline < fileRef >0){entrez[$2]=$1}} {split($1,a,"."); if(entrez[a[1]]!=""){print entrez[a[1]]}}' edgeR.0.01.overfibroblast_of_lung.tsv > edgeR.0.01.overfibroblast_of_lung.entrez.txt
c edgeR.0.01.overfibroblast_of_lung.entrez.txt

4.4   GO term enrichment analysis on differentially expressed genes

4.4.1   Run the script with -h to see its options

GO_enrichment.R -h

4.4.2   Run the script on the data and look at the outputs

cd ~/rnaseq/analysis/gene.ontology
univ=/data/references/gene_ids.entrez.txt
for ct in GM12878 fibroblast_of_lung
do
for go in BP MF CC
do
    GO_enrichment.R -u $univ -G ../differential.expression/edgeR.0.01.over$ct.entrez.txt -c $go -v -o edgeR.0.01.$ct > edgeR.0.01.$ct.GO.$go.out 2> edgeR.0.01.$ct.GO.$go.err
head edgeR.0.01.$ct.$go.tsv
done
done

Do you think the terms that pop up for each cell type make sense, i.e. are indeed related to the cell type ?