Before visiting this page and running the RNA-seq pipeline, you need to make sure everything is set up correctly for computation.
This page provides instructions on how to run the GENE-SWitCH (GS) RNA-seq pipeline on a dataset made of 4 samples:
2 bioreplicates from the adult female GM12878 lymphoblastoid cell line
2 bioreplicates from adult male lung fibroblasts
After selection of long (+200 bp) polyadenylated RNAs from the above samples, deep RNA sequencing (+100 million PE 75 bp reads) was performed in a directional/stranded way, and reads were downloaded from the ENCODE website.
In order for the pipeline to finish in a reasonable time and not to be too resource demanding, we will only consider for this part:
A randomly selected set of 1% of the reads from each of the 4 samples (located in /data/reads/rnaseq/1pcent
)
A chromosome 22 (50 Mb long) genome and annotated transcriptome (locared in /data/references
)
Here is a cheatsheet with basic Linux commands
Here is a complete STAR manual
Here is a stringtie manual
mkdir -p ~/rnaseq/code cd ~/rnaseq/code git clone https://github.com/FAANG/proj-gs-rna-seq.git --branch 0.1.0 --depth 1 cd proj-gs-rna-seq ll
Look at how the code is organised.
Look at the nextflow rnaseq.nf
file which includes the different pipeline steps (process
keyword).
mkdir -p ~/rnaseq/singularity cd ~/rnaseq/singularity cp /home/sdjebali/SIB_august2020/code/containers/singularity/registry.gitlab.com-chbk-rnaseq.img .
It may take long as the file is about 1Gb
I downloaded the data from the ENCODE website
Samples are two bioreplicates of two cell lines: GM12878 and fibroblast of lung
Polyadenylated and long RNAs were selected
Sequencing was done in a paired end (2x76bp) and directional way on an Illumina Genome Analyse IIx machine
Data were produced by Thomas Gingeras' laboratory
cat /data/references/gencode.v34.annotation.gtf
Look at the format of the file (GTF), how many tabs per row? how many spaces?
What is the number of genes / transcripts / exons / coding exons ?
What is the longest gene?
What is/are the transcript(s) with the more exons and what is its associated gene?
How many different gene biotypes / transcript biotypes?
mkdir -p ~/rnaseq/pipeline export SINGULARITY_PULLFOLDER=~/rnaseq/singularity export SINGULARITY_CACHEDIR=~/rnaseq/singularity export SINGULARITY_TMPDIR=~/rnaseq/singularity code=~/rnaseq/code/proj-gs-rna-seq datadir=/data outdir=~/rnaseq/pipeline time $code/run $code/rnaseq.nf \ --profile singularity --output $outdir \ --reads $datadir/reads/rnaseq/1pcent/*.fastq.gz \ --annotation $datadir/references/gencode.v34.chr22.gtf \ --genome $datadir/references/GRCh38.p13.chr22.fa \ --max-cpus 4 --max-memory 6 --keep-temp \ --resume > $outdir/nextflow.out
Since the pipeline may still be running when you want to look at its results,
you can look at a previous run on the same data located in the /home/sdjebali/rnaseq/pipeline
directory.
The $outdir/nextflow.out
output file of the pipeline tells you where to find the inputs, outputs and command of each type of
processes run by the pipeline.
For example this row:
[15/06a98f] process > map (4) [100%] 4 of 4 ✔
means one of the map processes was run in a directory starting with $outdir/temp/work/15/06a98f
So go there:
cd $outdir/temp/work/15/06a98f*
and you will see inputs and outputs of this process.
Then type:
ls -a
and you will see a .command.sh
file that includes the command that was run for this process.
The .command.sh
file content is:
#!/bin/bash -ue STAR --runThreadN 16 \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --genomeDir index \ --readFilesIn ctGM12878_bn2_1pcent_R1_trimmed.fq.gz ctGM12878_bn2_1pcent_R2_trimmed.fq.gz \ --outFileNamePrefix "ctGM12878_bn2_1pcent". mv *.bam "ctGM12878_bn2_1pcent".bam
The .command.sh
file for this step looks like this:
#!/bin/bash -ue stringtie ctGM12878_bn2_1pcent.bam --rf -G gencode.v34.chr22.gtf -o "ctGM12878_bn2_1pcent".gff
The .command.sh
file for this step looks like this:
#!/bin/bash -ue stringtie --merge \ ctfibroblastoflung_bn2_1pcent.gff ctfibroblastoflung_bn1_1pcent.gff ctGM12878_bn1_1pcent.gff ctGM12878_bn2_1pcent.gff \ -G gencode.v34.chr22.gtf -o assembly.gff
The .command.sh
file for this step starts like this:
#!/bin/bash -ue stringtie ctGM12878_bn2_1pcent.bam \ --rf \ -e \ -B \ -G assembly.gff \ -A "ctGM12878_bn2_1pcent"."assembly"_genes_TPM.tsv \ -o "ctGM12878_bn2_1pcent"."assembly".gtf
Using a web browser from your laptop, open the server remote file /home/sdjebali/rnaseq/pipeline/logs/report.html
(or the corresponding one in your home ~/rnaseq/pipeline/logs/report.html
if the pipeline has finished running).
The next and last part of the rnaseq hands-on are the downstream analyses.