RNA-seq pipeline hands-on

Before visiting this page and running the RNA-seq pipeline, you need to make sure everything is set up correctly for computation.

This page provides instructions on how to run the GENE-SWitCH (GS) RNA-seq pipeline on a dataset made of 4 samples:

2 bioreplicates from the adult female GM12878 lymphoblastoid cell line

2 bioreplicates from adult male lung fibroblasts

After selection of long (+200 bp) polyadenylated RNAs from the above samples, deep RNA sequencing (+100 million PE 75 bp reads) was performed in a directional/stranded way, and reads were downloaded from the ENCODE website.

In order for the pipeline to finish in a reasonable time and not to be too resource demanding, we will only consider for this part:

A randomly selected set of 1% of the reads from each of the 4 samples (located in /data/reads/rnaseq/1pcent)

A chromosome 22 (50 Mb long) genome and annotated transcriptome (locared in /data/references)

Here is a cheatsheet with basic Linux commands
Here is a complete STAR manual
Here is a stringtie manual

1 Get the latest version (0.1.0) of the GS RNA-seq pipeline on the server

mkdir -p ~/rnaseq/code
cd ~/rnaseq/code
git clone https://github.com/FAANG/proj-gs-rna-seq.git --branch 0.1.0 --depth 1
cd proj-gs-rna-seq
ll

Look at how the code is organised.

Look at the nextflow rnaseq.nf file which includes the different pipeline steps (process keyword).

2 Copy the singularity image under your home directory

mkdir -p ~/rnaseq/singularity
cd ~/rnaseq/singularity
cp /home/sdjebali/SIB_august2020/code/containers/singularity/registry.gitlab.com-chbk-rnaseq.img .

It may take long as the file is about 1Gb

3 (Opt) Can you find the data used in our hands-on?

I downloaded the data from the ENCODE website
Samples are two bioreplicates of two cell lines: GM12878 and fibroblast of lung
Polyadenylated and long RNAs were selected
Sequencing was done in a paired end (2x76bp) and directional way on an Illumina Genome Analyse IIx machine
Data were produced by Thomas Gingeras' laboratory

4 (Opt) Look at the Gencode gene annotation file and answer some questions

cat /data/references/gencode.v34.annotation.gtf

Look at the format of the file (GTF), how many tabs per row? how many spaces?
What is the number of genes / transcripts / exons / coding exons ?
What is the longest gene?
What is/are the transcript(s) with the more exons and what is its associated gene?
How many different gene biotypes / transcript biotypes?

5 Make an output directory for the GS RNA-seq pipeline and run it on the server

mkdir -p ~/rnaseq/pipeline
export SINGULARITY_PULLFOLDER=~/rnaseq/singularity
export SINGULARITY_CACHEDIR=~/rnaseq/singularity
export SINGULARITY_TMPDIR=~/rnaseq/singularity
code=~/rnaseq/code/proj-gs-rna-seq
datadir=/data
outdir=~/rnaseq/pipeline
time $code/run $code/rnaseq.nf \
--profile singularity --output $outdir \
--reads $datadir/reads/rnaseq/1pcent/*.fastq.gz \
--annotation $datadir/references/gencode.v34.chr22.gtf \
--genome $datadir/references/GRCh38.p13.chr22.fa \
--max-cpus 4 --max-memory 6  --keep-temp \
--resume > $outdir/nextflow.out

6 For each step, look at the command line, inputs, outputs, and parameters

Since the pipeline may still be running when you want to look at its results, you can look at a previous run on the same data located in the /home/sdjebali/rnaseq/pipeline directory.

The $outdir/nextflow.out output file of the pipeline tells you where to find the inputs, outputs and command of each type of processes run by the pipeline.

For example this row:

[15/06a98f] process > map (4)             [100%] 4 of 4 ✔

means one of the map processes was run in a directory starting with $outdir/temp/work/15/06a98f

So go there:

cd $outdir/temp/work/15/06a98f*

and you will see inputs and outputs of this process.

Then type:

ls -a

and you will see a .command.sh file that includes the command that was run for this process.

6.1 Read mapping with STAR

The .command.sh file content is:

#!/bin/bash -ue
STAR --runThreadN 16 \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--genomeDir index \
--readFilesIn ctGM12878_bn2_1pcent_R1_trimmed.fq.gz ctGM12878_bn2_1pcent_R2_trimmed.fq.gz \
--outFileNamePrefix "ctGM12878_bn2_1pcent".

mv *.bam "ctGM12878_bn2_1pcent".bam

6.2 Transcript assembly with stringtie

The .command.sh file for this step looks like this:

#!/bin/bash -ue
stringtie ctGM12878_bn2_1pcent.bam --rf -G gencode.v34.chr22.gtf -o "ctGM12878_bn2_1pcent".gff

6.3 Transcript merge/combining with stringtie

The .command.sh file for this step looks like this:

#!/bin/bash -ue
stringtie --merge \
ctfibroblastoflung_bn2_1pcent.gff ctfibroblastoflung_bn1_1pcent.gff ctGM12878_bn1_1pcent.gff ctGM12878_bn2_1pcent.gff \
-G gencode.v34.chr22.gtf -o assembly.gff

6.4 Reference and assembly transcript/gene expression quantification with stringtie

The .command.sh file for this step starts like this:

#!/bin/bash -ue
stringtie ctGM12878_bn2_1pcent.bam \
--rf \
-e \
-B \
-G assembly.gff \
-A "ctGM12878_bn2_1pcent"."assembly"_genes_TPM.tsv \
-o "ctGM12878_bn2_1pcent"."assembly".gtf

7 Look at the ressources needed by each step

Using a web browser from your laptop, open the server remote file /home/sdjebali/rnaseq/pipeline/logs/report.html (or the corresponding one in your home ~/rnaseq/pipeline/logs/report.html if the pipeline has finished running).

8 Go to downstream analysis part of the hands-on

The next and last part of the rnaseq hands-on are the downstream analyses.