Before visiting this page and running the RNA-seq downstream analyses, you need to make sure everything is set up correctly for computation.
Since RNA-seq downstream analyses are not very resource demanding, they will be run on the output of the GS RNA-seq pipeline
obtained with the complete dataset (100% of the reads from the 4 samples, complete GRCh38
genome).
The assembly and quantification sub-directories provided by the pipeline on the complete dataset can be found at /data/course_material/rnaseq/pipeline/wg.100pcent
.
Here we will be using analysis scripts located in /data/course_material/rnaseq/code/analysis.scripts
that essentially come from my Script github repository
mkdir ~/rnaseq/analysis mkdir ~/rnaseq/analysis/visualisation mkdir ~/rnaseq/analysis/visualisation/gene.expression mkdir ~/rnaseq/analysis/visualisation/gene.expression/hierachical.clustering mkdir ~/rnaseq/analysis/visualisation/gene.expression/summary.statistics mkdir ~/rnaseq/analysis/differential.expression mkdir ~/rnaseq/analysis/gene.ontology
env=/data/course_material/rnaseq/code/analysis.scripts/environment.yml conda env create --force --name rnaseq -f $env conda activate rnaseq
cat /data/course_material/rnaseq/metadata/bamfile.metadata.tsv
will output:
path labExpId celltype biorep /home/sdjebali/rnaseq/pipeline/maps/ctGM12878_bn1_1pcent.bam ctGM12878_bn1 GM12878 1 /home/sdjebali/rnaseq/pipeline/maps/ctGM12878_bn2_1pcent.bam ctGM12878_bn2 GM12878 2 /home/sdjebali/rnaseq/pipeline/maps/ctfibroblastoflung_bn1_1pcent.bam ctfibroblastoflung_bn1 fibroblastoflung 1 /home/sdjebali/rnaseq/pipeline/maps/ctfibroblastoflung_bn2_1pcent.bam ctfibroblastoflung_bn2 fibroblastoflung 2
cd ~/rnaseq/analysis/visualisation/gene.expression/summary.statistics gntpm=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/reference_genes_TPM.tsv gncount=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/reference_genes_counts.tsv meta=/data/course_material/rnaseq/metadata/bamfile.metadata.tsv time plot_gene_expression.sh $gntpm $gncount $meta labExpId $PWD
Look at the different pdf files that have been produced.
cd ~/rnaseq/analysis/visualisation/gene.expression/hierachical.clustering gntpm=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/refgene.tpm.eachsample.tsv time matrix_to_dist.R -i $gntpm --log10 -c pearson -p 0.001 -o similarity.matrix.tsv cat similarity.matrix.tsv
cd ~/rnaseq/analysis/visualisation/gene.expression/hierachical.clustering meta=/data/course_material/rnaseq/metadata/bamfile.metadata.tsv palette=/data/course_material/rnaseq/code/analysis.scripts/palettes/terrain.colors.3.txt time ggheatmap.R -i similarity.matrix.tsv -d i --row_metadata $meta --col_dendro --row_dendro -B 10 \ --matrix_palette=$palette --rowSide_by celltype --matrix_legend_title "Sample pairwise similarity" \ --matrix_fill_limits 0.85,1 -v -o 4sample.heatmap.pdf
Look at the 4sample.heatmap.pdf
file
Look at the help of the ggheatmap.R
script using the -h
option
edgeR.analysis.R -h
cd ~/rnaseq/analysis/differential.expression gncount=/data/course_material/rnaseq/pipeline/wg.100pcent/quantification/refgene.count.eachsample.tsv meta=/data/course_material/rnaseq/metadata/bamfile.metadata.tsv time edgeR.analysis.R -i $gncount -m $meta -f celltype > refgene.edgeR.out 2> refgene.edgeR.err c edgeR.cpm1.n2.out.tsv # GM12878 fibroblastoflung logFC logCPM PValue FDR # ENSG00000111799.21 -21.23 -5.41 -22.82 11.13 7.1e-113 1.1e-108 # 1 (6 fields) # 15573 (7 fields)
Look at the edgeR.cpm1.n2.out.pdf
file
Note that a nice and complete differential expression tutorial and hands-on based on EdgeR is also available from Nathalie Vialaneix, INRAE here.
awk '$NF<0.01 && $4>2{print $1"\toverGM12878"}' edgeR.cpm1.n2.out.tsv > edgeR.0.01.overGM12878.tsv c edgeR.0.01.overGM12878.tsv awk -v fileRef=/data/references/entrez.ensembl.gnids.tsv 'BEGIN{while (getline < fileRef >0){entrez[$2]=$1}} {split($1,a,"."); if(entrez[a[1]]!=""){print entrez[a[1]]}}' edgeR.0.01.overGM12878.tsv > edgeR.0.01.overGM12878.entrez.txt c edgeR.0.01.overGM12878.entrez.txt
awk '$NF<0.01 && $4<-2{print $1"\toverfibroblast_of_lung"}' edgeR.cpm1.n2.out.tsv > edgeR.0.01.overfibroblast_of_lung.tsv c edgeR.0.01.overfibroblast_of_lung.tsv awk -v fileRef=/data/references/entrez.ensembl.gnids.tsv 'BEGIN{while (getline < fileRef >0){entrez[$2]=$1}} {split($1,a,"."); if(entrez[a[1]]!=""){print entrez[a[1]]}}' edgeR.0.01.overfibroblast_of_lung.tsv > edgeR.0.01.overfibroblast_of_lung.entrez.txt c edgeR.0.01.overfibroblast_of_lung.entrez.txt
GO_enrichment.R -h
cd ~/rnaseq/analysis/gene.ontology univ=/data/references/gene_ids.entrez.txt for ct in GM12878 fibroblast_of_lung do for go in BP MF CC do GO_enrichment.R -u $univ -G ../differential.expression/edgeR.0.01.over$ct.entrez.txt -c $go -v -o edgeR.0.01.$ct > edgeR.0.01.$ct.GO.$go.out 2> edgeR.0.01.$ct.GO.$go.err head edgeR.0.01.$ct.$go.tsv done done
Do you think the terms that pop up for each cell type make sense, i.e. are indeed related to the cell type ?