RNA-seq pipeline hands-on

INSERM

Before running the RNA-seq pipeline described on this page, make sure your computing environment is set up correctly.

This page provides instructions on how to run the GENE-SWitCH (GS) RNA-seq pipeline on a dataset made of 4 samples:

After selection of long (>200 bp) polyadenylated RNAs from the above samples, deep RNA sequencing (>100 million paired-end (PE) 75 bp reads) was performed in a directional/stranded way, and the reads were downloaded from the ENCODE website.

In order for the pipeline to finish in a reasonable time and not to be too resource demanding, we will only consider for this part:

1   Get the latest version (0.1.0) of the GS RNA-seq pipeline on the server

mkdir -p ~/rnaseq/code
cd ~/rnaseq/code
git clone https://github.com/FAANG/proj-gs-rna-seq.git --branch 0.1.0 --depth 1
cd proj-gs-rna-seq
ls -l

Look at how the code is organised.

Look at the Nextflow script rnaseq.nf, which defines the different pipeline steps (each one introduced by the process keyword).
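
One quick way to get an overview of the steps is to list the process declarations. The sketch below is only an illustration: the helper name list_processes is made up here, and it assumes each process is declared on its own line as "process NAME {". The demo runs on a tiny made-up file, not on the real rnaseq.nf.

```shell
# List the process names declared in a Nextflow script.
# Assumption: each process is declared on its own line as "process NAME {".
list_processes() {
    grep -E '^[[:space:]]*process[[:space:]]+[A-Za-z_][A-Za-z0-9_]*' "$1" \
        | awk '{ sub(/[{].*/, "", $2); print $2 }'
}

# Demo on a tiny made-up script (not the real rnaseq.nf).
demo=$(mktemp)
printf 'process index {\n}\nprocess map {\n}\n' > "$demo"
list_processes "$demo"
rm -f "$demo"

# On the server you would run instead:
#   list_processes ~/rnaseq/code/proj-gs-rna-seq/rnaseq.nf
```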

2   Copy the Singularity image to your home directory

mkdir -p ~/rnaseq/singularity
cd ~/rnaseq/singularity
cp /home/sdjebali/SIB_august2020/code/containers/singularity/registry.gitlab.com-chbk-rnaseq.img .

This may take a while, as the file is about 1 GB.

3   (Opt) Can you find the data used in our hands-on?

4   (Opt) Look at the Gencode gene annotation file and answer some questions

less -S /data/references/gencode.v34.annotation.gtf
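
Many questions about the annotation (for example, how many genes, transcripts or exons it contains) come down to counting the values of the third GTF column. A minimal sketch, assuming a standard tab-separated GTF with "#" header lines; the helper name count_features is made up for this example and the demo uses a tiny synthetic file:

```shell
# Count how many lines of each feature type (GTF column 3) a file contains.
# Header lines starting with "#" are skipped.
count_features() {
    awk -F'\t' '!/^#/ { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Demo on a tiny synthetic GTF (not the real Gencode file).
demo=$(mktemp)
printf '##description: demo\n' > "$demo"
printf 'chr22\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n' >> "$demo"
printf 'chr22\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1";\n' >> "$demo"
printf 'chr22\tHAVANA\texon\t1\t40\t.\t+\t.\tgene_id "G1";\n' >> "$demo"
printf 'chr22\tHAVANA\texon\t60\t100\t.\t+\t.\tgene_id "G1";\n' >> "$demo"
count_features "$demo"
rm -f "$demo"

# On the server you would run instead:
#   count_features /data/references/gencode.v34.annotation.gtf
```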

5   Make an output directory for the GS RNA-seq pipeline and run it on the server

mkdir -p ~/rnaseq/pipeline
export SINGULARITY_PULLFOLDER=~/rnaseq/singularity
export SINGULARITY_CACHEDIR=~/rnaseq/singularity
export SINGULARITY_TMPDIR=~/rnaseq/singularity
code=~/rnaseq/code/proj-gs-rna-seq
datadir=/data
outdir=~/rnaseq/pipeline
time $code/run $code/rnaseq.nf \
--profile singularity --output $outdir \
--reads $datadir/reads/rnaseq/1pcent/*.fastq.gz \
--annotation $datadir/references/gencode.v34.chr22.gtf \
--genome $datadir/references/GRCh38.p13.chr22.fa \
--max-cpus 4 --max-memory 6  --keep-temp \
--resume > $outdir/nextflow.out

6   For each step, look at the command line, inputs, outputs, and parameters

Since the pipeline may still be running when you want to look at its results, you can look at the results of a previous run on the same data, located in the /home/sdjebali/rnaseq/pipeline directory.

The pipeline's $outdir/nextflow.out output file tells you where to find the inputs, outputs and command of each type of process run by the pipeline.

For example this row:

[15/06a98f] process > map (4)             [100%] 4 of 4 ✔

means one of the map processes was run in a directory starting with $outdir/temp/work/15/06a98f

So go there:

cd $outdir/temp/work/15/06a98f*

and you will see inputs and outputs of this process.

Then type:

ls -a

and you will see a .command.sh file that includes the command that was run for this process.
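
The lookup described above can be scripted. The sketch below assumes that the rows of nextflow.out look exactly like the example row shown earlier ("[hash] process > name ...") and that work directories live under $outdir/temp/work; the helper name work_dirs is made up for this example, and the demo uses a synthetic log.

```shell
# Print the work directory prefix recorded for a given process name.
# Arguments: process name, nextflow.out file, pipeline output directory.
# Assumption: rows look like "[15/06a98f] process > map (4) [100%] 4 of 4".
work_dirs() {
    local process=$1 log=$2 outdir=$3
    awk -v p="$process" -v d="$outdir" '
        $2 == "process" && $3 == ">" && $4 == p {
            hash = substr($1, 2, length($1) - 2)   # "[15/06a98f]" -> "15/06a98f"
            print d "/temp/work/" hash
        }' "$log" | sort -u
}

# Demo on a synthetic log with a single row.
demo=$(mktemp)
printf '[15/06a98f] process > map (4)             [100%%] 4 of 4\n' > "$demo"
work_dirs map "$demo" /tmp/pipeline
rm -f "$demo"

# On the server you would run instead (the trailing * completes the hash):
#   cd "$(work_dirs map $outdir/nextflow.out $outdir)"*
```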

6.1   Read mapping with STAR

The .command.sh file content is:

#!/bin/bash -ue
STAR --runThreadN 16 \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--genomeDir index \
--readFilesIn ctGM12878_bn2_1pcent_R1_trimmed.fq.gz ctGM12878_bn2_1pcent_R2_trimmed.fq.gz \
--outFileNamePrefix "ctGM12878_bn2_1pcent".

mv *.bam "ctGM12878_bn2_1pcent".bam
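
STAR also writes a summary of the mapping next to the BAM file; with the prefix used above it should be named ctGM12878_bn2_1pcent.Log.final.out. The sketch below pulls one statistic out of such a file. It assumes the usual "label | value" layout of Log.final.out; the helper name star_stat is made up, and the numbers in the demo are invented, not results from this run.

```shell
# Print the value of one statistic from a STAR Log.final.out file.
# Arguments: exact label (left of the "|"), Log.final.out file.
star_stat() {
    awk -F'|' -v key="$1" '
        { label = $1; gsub(/^[ \t]+|[ \t]+$/, "", label) }
        label == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); print v }' "$2"
}

# Demo on a synthetic two-line summary (values are invented).
demo=$(mktemp)
printf '      Number of input reads |\t1000\n' > "$demo"
printf '    Uniquely mapped reads %% |\t89.50%%\n' >> "$demo"
star_stat 'Uniquely mapped reads %' "$demo"
rm -f "$demo"

# In the work directory of a map process you would run instead:
#   star_stat 'Uniquely mapped reads %' ctGM12878_bn2_1pcent.Log.final.out
```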

6.2   Transcript assembly with StringTie

The .command.sh file for this step looks like this:

#!/bin/bash -ue
stringtie ctGM12878_bn2_1pcent.bam --rf -G gencode.v34.chr22.gtf -o "ctGM12878_bn2_1pcent".gff
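
To get a feel for the assembly, you can count how many assembled transcripts correspond to an annotated transcript and how many are new. The sketch below assumes that StringTie, when run with -G, adds a reference_id attribute to transcripts matching the annotation; please check this on your own output. The helper name count_known_novel is made up and the demo file is synthetic.

```shell
# Count assembled transcripts with and without a reference_id attribute.
# Assumption: transcripts matching the -G annotation carry reference_id "...".
count_known_novel() {
    awk -F'\t' '
        !/^#/ && $3 == "transcript" {
            if ($9 ~ /reference_id "/) known++; else novel++
        }
        END { printf "known\t%d\nnovel\t%d\n", known, novel }' "$1"
}

# Demo on a synthetic StringTie-like file with one transcript of each kind.
demo=$(mktemp)
printf '# demo\n' > "$demo"
printf 'chr22\tStringTie\ttranscript\t1\t90\t1000\t-\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "ENST_demo";\n' >> "$demo"
printf 'chr22\tStringTie\texon\t1\t90\t1000\t-\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n' >> "$demo"
printf 'chr22\tStringTie\ttranscript\t200\t300\t1000\t+\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";\n' >> "$demo"
count_known_novel "$demo"
rm -f "$demo"

# In the work directory of an assembly process you would run instead:
#   count_known_novel ctGM12878_bn2_1pcent.gff
```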

6.3   Transcript merge/combining with StringTie

The .command.sh file for this step looks like this:

#!/bin/bash -ue
stringtie --merge \
ctfibroblastoflung_bn2_1pcent.gff ctfibroblastoflung_bn1_1pcent.gff ctGM12878_bn1_1pcent.gff ctGM12878_bn2_1pcent.gff \
-G gencode.v34.chr22.gtf -o assembly.gff

6.4   Reference and assembly transcript/gene expression quantification with StringTie

The .command.sh file for this step starts like this:

#!/bin/bash -ue
stringtie ctGM12878_bn2_1pcent.bam \
--rf \
-e \
-B \
-G assembly.gff \
-A "ctGM12878_bn2_1pcent"."assembly"_genes_TPM.tsv \
-o "ctGM12878_bn2_1pcent"."assembly".gtf
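
The -A option writes a tab-separated gene abundance table (here ctGM12878_bn2_1pcent.assembly_genes_TPM.tsv). As a sketch of how to inspect it, the helper below (top_genes_by_tpm, a made-up name) lists the most highly expressed genes. It assumes the table has a header row containing a column named TPM and the gene identifier in the first column; the demo values are invented.

```shell
# Print the N genes with the highest TPM from a StringTie gene abundance table.
# Arguments: abundance file, number of genes to print (default 10).
# Assumption: header row with a "TPM" column, gene identifier in column 1.
top_genes_by_tpm() {
    local tab
    tab=$(printf '\t')
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "TPM") col = i; next }
        col     { print $1 "\t" $col }' "$1" \
        | LC_ALL=C sort -t "$tab" -k2,2nr | head -n "${2:-10}"
}

# Demo on a synthetic table (values are invented).
demo=$(mktemp)
printf 'Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n' > "$demo"
printf 'geneA\tA\tchr22\t+\t1\t900\t2.0\t4.0\t8.5\n' >> "$demo"
printf 'geneB\tB\tchr22\t-\t1\t900\t9.0\t50.0\t120.25\n' >> "$demo"
top_genes_by_tpm "$demo" 1
rm -f "$demo"

# In the work directory of a quantification process you would run instead:
#   top_genes_by_tpm ctGM12878_bn2_1pcent.assembly_genes_TPM.tsv 20
```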

7   Look at the resources needed by each step

Using a web browser on your laptop, open the file /home/sdjebali/rnaseq/pipeline/logs/report.html located on the server (or your own copy, ~/rnaseq/pipeline/logs/report.html, if your pipeline run has finished).

8   Go to the downstream analysis part of the hands-on

The next and last part of the RNA-seq hands-on is the downstream analysis.