This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data using the following steps:
Sarek pre-processes raw FASTQ files or unmapped BAM files, based on GATK best practices.
bwa is a software package for mapping low-divergent sequences against a large reference genome.
Such files are intermediate and not kept in the final files delivered to users.
BWA-mem2 is a software package for mapping low-divergent sequences against a large reference genome.
Such files are intermediate and not kept in the final files delivered to users.
By default, Sarek will use GATK MarkDuplicatesSpark, the Spark implementation of GATK MarkDuplicates, which locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
Specify --no_gatk_spark to use GATK MarkDuplicates instead.
This directory is the location for the BAM files delivered to users.
Besides the duplicates-marked BAM files, the recalibration tables (*.recal.table) are also stored, and can be used to create recalibrated BAM files.
For all samples:
Output directory: results/Preprocessing/[SAMPLE]/DuplicatesMarked
[SAMPLE].md.bam and [SAMPLE].md.bai: BAM file and index
For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.
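The directory layout above can be checked programmatically. The following is a minimal sketch, assuming the file names follow the patterns listed in this document; the helper names `duplicates_marked_files` and `missing_files` are hypothetical and not part of Sarek.

```python
from pathlib import Path


def duplicates_marked_files(results_dir, sample):
    """Build the expected DuplicatesMarked paths for one sample.

    File name patterns are taken from this document; the function itself
    is an illustrative helper, not part of the pipeline.
    """
    out_dir = Path(results_dir) / "Preprocessing" / sample / "DuplicatesMarked"
    return {
        "bam": out_dir / f"{sample}.md.bam",
        "index": out_dir / f"{sample}.md.bai",
        "recal_table": out_dir / f"{sample}.recal.table",
    }


def missing_files(paths):
    """Return the keys of expected files that do not exist on disk."""
    return sorted(name for name, path in paths.items() if not path.exists())
```

Calling `missing_files(duplicates_marked_files("results", "SAMPLE1"))` after a run would list any of the three expected files that are absent.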
GATK BaseRecalibrator generates a recalibration table based on various covariates.
For all samples:
Output directory: results/Preprocessing/[SAMPLE]/DuplicatesMarked
[SAMPLE].recal.table: recalibration table for the duplicates-marked BAM file.
GATK ApplyBQSR recalibrates the base qualities of the input reads based on the recalibration table produced by the GATK BaseRecalibrator tool.
This directory is the location for the final recalibrated BAM files.
Recalibrated BAM files are usually 2-3 times larger than the duplicates-marked BAM files.
To re-generate a recalibrated BAM file, apply the recalibration table delivered to the DuplicatesMarked folder, either using Sarek (--step recalibrate) or by doing the recalibration yourself.
For all samples:
Output directory: results/Preprocessing/[SAMPLE]/Recalibrated
[SAMPLE].recal.bam and [SAMPLE].recal.bam.bai: BAM file and index
For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.
The TSV files are auto-generated and can be used by Sarek for further processing and/or variant calling.
For further reading and documentation see the --input section in the usage documentation.
For all samples:
Output directory: results/Preprocessing/TSV
duplicates_marked_no_table.tsv, duplicates_marked.tsv and recalibrated.tsv: TSV files to start Sarek from the prepare_recalibration, recalibrate or variantcalling steps.
duplicates_marked_no_table_[SAMPLE].tsv, duplicates_marked_[SAMPLE].tsv and recalibrated_[SAMPLE].tsv: TSV files to start Sarek from the prepare_recalibration, recalibrate or variantcalling steps for a specific sample.
--skip_markduplicates
WARNING Only with --skip_markduplicates
For all samples:
Output directory: results/Preprocessing/TSV
mapped.tsv, mapped_no_duplicates_marked.tsv and recalibrated.tsv: TSV files to start Sarek from the prepare_recalibration, recalibrate or variantcalling steps.
mapped_[SAMPLE].tsv, mapped_no_duplicates_marked_[SAMPLE].tsv and recalibrated_[SAMPLE].tsv: TSV files to start Sarek from the prepare_recalibration, recalibrate or variantcalling steps for a specific sample.
--sentieon
WARNING Only with --sentieon
For all samples:
Output directory: results/Preprocessing/TSV
sentieon_deduped.tsv and recalibrated_sentieon.tsv: TSV files to start Sarek from the variantcalling step.
sentieon_deduped_[SAMPLE].tsv and recalibrated_sentieon_[SAMPLE].tsv: TSV files to start Sarek from the variantcalling step for a specific sample.
All the results regarding Variant Calling are collected in this directory.
If some results from a variant caller do not appear here, please check out the --tools section in the usage documentation.
Recalibrated BAM files can be used as input to start the Variant Calling.
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/FreeBayes
FreeBayes_[SAMPLE].vcf.gz and FreeBayes_[SAMPLE].vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the FreeBayes manual.
GATK HaplotypeCaller calls germline SNPs and indels via local re-assembly of haplotypes.
Germline calls are provided for all samples, so that tumor and normal can be compared to detect a possible sample mix-up.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/HaploTypeCaller
HaplotypeCaller_[SAMPLE].vcf.gz and HaplotypeCaller_[SAMPLE].vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the HaplotypeCaller manual.
GATK GenotypeGVCFs performs joint genotyping on one or more samples pre-called with HaplotypeCaller.
Germline calls are provided for all samples, so that tumor and normal can be compared to detect a possible sample mix-up.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/HaplotypeCallerGVCF
HaplotypeCaller_[SAMPLE].g.vcf.gz and HaplotypeCaller_[SAMPLE].g.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the GenotypeGVCFs manual.
GATK Mutect2 calls somatic SNVs and indels via local assembly of haplotypes.
For further reading and documentation see the Mutect2 manual.
It is recommended to have a panel of normals (PON) for this version of GATK Mutect2, built from at least 40 normal samples.
Additionally, you can add your PON file to get filtered somatic calls.
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Mutect2
Files created:
Mutect2_unfiltered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz and Mutect2_unfiltered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi: VCF with Tabix index
Mutect2_filtered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz and Mutect2_filtered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi: VCF with Tabix index; these entries have a PASS filter, and you can get them by supplying a panel of normals using the --pon option
[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.stats
[TUMORSAMPLE]_contamination.table
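To see how many records carry a PASS filter in one of these VCF files, the FILTER column (the seventh tab-separated column in the VCF format) can be tallied. This is a minimal sketch using only the Python standard library; the function name `count_by_filter` is hypothetical, and it assumes the compressed VCF can be read with Python's `gzip` module.

```python
import gzip


def count_by_filter(vcf_gz_path):
    """Count records per FILTER value in a gzip-compressed VCF file."""
    counts = {}
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines and the column header.
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # not a complete VCF record
            filter_value = fields[6]  # FILTER is the 7th column
            counts[filter_value] = counts.get(filter_value, 0) + 1
    return counts
```

The returned dictionary maps each FILTER value found in the file to the number of records carrying it, so `counts.get("PASS", 0)` gives the number of passing calls.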
samtools mpileup generates a pileup of a BAM file.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/mpileup
[SAMPLE].pileup.gz: pileup file; alignment records are grouped by sample (SM) identifiers in @RG header lines.
For further reading and documentation see the samtools manual.
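As an illustration of what can be read from a pileup file, the sketch below computes the mean read depth over the positions listed in it. It assumes a single-sample pileup in the standard tab-separated layout (chromosome, position, reference base, read count, read bases, base qualities); the function name `mean_depth` is hypothetical. Positions absent from the file are not counted, so the value is a mean over reported positions only.

```python
import gzip


def mean_depth(pileup_gz_path):
    """Mean of the read-count column over all positions in a gzipped pileup."""
    total_depth = 0
    positions = 0
    with gzip.open(pileup_gz_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip malformed or empty lines
            total_depth += int(fields[3])  # 4th column: number of reads
            positions += 1
    return total_depth / positions if positions else 0.0
```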
Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/Strelka
Strelka_Sample_genome.vcf.gz and Strelka_Sample_genome.vcf.gz.tbi: VCF with Tabix index
Strelka_Sample_variants.vcf.gz and Strelka_Sample_variants.vcf.gz.tbi: VCF with Tabix index
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Strelka
Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz and Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz.tbi: VCF with Tabix index
Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz and Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz.tbi: VCF with Tabix index
Using Strelka Best Practices with the candidateSmallIndels from Manta:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Strelka
StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz and StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz.tbi: VCF with Tabix index
StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz and StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the Strelka2 user guide.
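The naming scheme for the Tumor/Normal Strelka outputs listed above can be expressed as a small helper. This is a sketch that only reproduces the file name patterns from this document; `strelka_somatic_files` is a hypothetical name, not a Sarek function.

```python
def strelka_somatic_files(tumor, normal, best_practices=False):
    """Return the expected somatic VCF and index names for a Tumor/Normal pair.

    With best_practices=True the StrelkaBP_ prefix is used, matching the
    files produced when Manta's candidateSmallIndels are supplied.
    """
    prefix = "StrelkaBP" if best_practices else "Strelka"
    pair = f"{tumor}_vs_{normal}"
    names = []
    for kind in ("indels", "snvs"):
        vcf = f"{prefix}_{pair}_somatic_{kind}.vcf.gz"
        names.extend([vcf, vcf + ".tbi"])
    return names
```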
WARNING Only with --sentieon
Sentieon DNAseq implements the same mathematics used in the Broad Institute's BWA-GATK HaplotypeCaller 3.3-4.1 Best Practices Workflow pipeline.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/SentieonDNAseq
DNAseq_Sample.vcf.gz and DNAseq_Sample.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the Sentieon DNAseq user guide.
WARNING Only with --sentieon
Sentieon DNAscope calls SNPs and small indels.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/SentieonDNAscope
DNAscope_Sample.vcf.gz and DNAscope_Sample.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the Sentieon DNAscope user guide.
WARNING Only with --sentieon
Sentieon TNscope calls SNPs and small indels on a Tumor/Normal pair.
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/SentieonTNscope
TNscope_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz and TNscope_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the Sentieon TNscope user guide.
Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads.
It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs.
Manta provides a candidate list for small indels that can be fed to Strelka following Strelka Best Practices.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/Manta
Manta_[SAMPLE].candidateSmallIndels.vcf.gz and Manta_[SAMPLE].candidateSmallIndels.vcf.gz.tbi: VCF with Tabix index
Manta_[SAMPLE].candidateSV.vcf.gz and Manta_[SAMPLE].candidateSV.vcf.gz.tbi: VCF with Tabix index
For a Normal sample only:
Manta_[NORMALSAMPLE].diploidSV.vcf.gz and Manta_[NORMALSAMPLE].diploidSV.vcf.gz.tbi: VCF with Tabix index
For a Tumor sample only:
Manta_[TUMORSAMPLE].tumorSV.vcf.gz and Manta_[TUMORSAMPLE].tumorSV.vcf.gz.tbi: VCF with Tabix index
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/Manta
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz.tbi: VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz.tbi: VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz.tbi: VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the Manta user guide.
TIDDIT identifies intra- and inter-chromosomal translocations, deletions, tandem duplications and inversions.
Germline calls are provided for all samples, so that tumor and normal can be compared to detect a possible sample mix-up.
Low-quality calls are removed internally to simplify processing of variant calls, but they are still saved by Sarek.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/TIDDIT
TIDDIT_[SAMPLE].vcf.gz and TIDDIT_[SAMPLE].vcf.gz.tbi: VCF with Tabix index
TIDDIT_[SAMPLE].signals.tab
TIDDIT_[SAMPLE].ploidy.tab
TIDDIT_[SAMPLE].old.vcf: VCF including the low-quality calls
TIDDIT_[SAMPLE].wig
TIDDIT_[SAMPLE].gc.wig
For further reading and documentation see the TIDDIT manual.
WARNING Only with --sentieon
Sentieon DNAscope can perform structural variant calling in addition to calling SNPs and small indels.
For all samples:
Output directory: results/VariantCalling/[SAMPLE]/SentieonDNAscope
DNAscope_SV_Sample.vcf.gz and DNAscope_SV_Sample.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the Sentieon DNAscope user guide.
Running ASCAT on NGS data requires that the BAM files are converted into BAF and LogR values.
This can be done using the software AlleleCount followed by the provided ConvertAlleleCounts R-script.
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/ASCAT
[TUMORSAMPLE].BAF and [NORMALSAMPLE].BAF
[TUMORSAMPLE].LogR and [NORMALSAMPLE].LogR
ASCAT is software for performing allele-specific copy number analysis of tumor samples and for estimating tumor ploidy and purity (normal contamination).
It infers tumor purity and ploidy and calculates whole-genome allele-specific copy number profiles.
ASCAT is written in R and available here: github.com/Crick-CancerGenomics/ascat.
The ASCAT process gives several images as output, described in detail in this book chapter.
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/ASCAT
[TUMORSAMPLE].aberrationreliability.png
[TUMORSAMPLE].ASCATprofile.png
[TUMORSAMPLE].ASPCF.png
[TUMORSAMPLE].rawprofile.png
[TUMORSAMPLE].sunrise.png
[TUMORSAMPLE].tumour.png
[TUMORSAMPLE].cnvs.txt
[TUMORSAMPLE].LogR.PCFed.txt
[TUMORSAMPLE].purityploidy.txt
The text file [TUMORSAMPLE].cnvs.txt contains predictions about copy number state for all the segments.
The output is a tab-delimited text file with the following columns:
The file [TUMORSAMPLE].cnvs.txt contains all segments predicted by ASCAT, both those with normal copy number (nMinor = 1 and nMajor = 1) and those corresponding to copy number aberrations.
For further reading and documentation see the ASCAT manual.
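The distinction between normal and aberrant segments described above can be sketched in a few lines. The example assumes each segment is available as a mapping with `nMajor` and `nMinor` entries (for instance a row read with `csv.DictReader` using a tab delimiter); the helper names are hypothetical, and no other column names are assumed.

```python
def is_aberrant(segment):
    """A segment is normal only when nMajor == 1 and nMinor == 1."""
    return not (int(segment["nMajor"]) == 1 and int(segment["nMinor"]) == 1)


def split_segments(segments):
    """Split segments into (normal, aberrant) lists, preserving order."""
    normal = [seg for seg in segments if not is_aberrant(seg)]
    aberrant = [seg for seg in segments if is_aberrant(seg)]
    return normal, aberrant
```

Values are converted with `int` so that rows read from a text file, where every field is a string, are handled the same way as numeric input.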
Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including loss of heterozygosity (LOH)) using deep-sequencing data.
Control-FREEC automatically computes, normalizes and segments copy number and beta allele frequency profiles, then calls copy number alterations and LOH.
It also detects subclonal gains and losses and evaluates the most likely average ploidy of the sample.
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMOR_vs_NORMAL]/ControlFREEC
[TUMORSAMPLE]_vs_[NORMALSAMPLE].config.txt
[TUMORSAMPLE].pileup.gz_CNVs and [TUMORSAMPLE].pileup.gz_normal_CNVs
[TUMORSAMPLE].pileup.gz_ratio.txt and [TUMORSAMPLE].pileup.gz_normal_ratio.txt
[TUMORSAMPLE].pileup.gz_BAF.txt and [NORMALSAMPLE].pileup.gz_BAF.txt
For further reading and documentation see the Control-FREEC manual.
Microsatellite instability is a genetic condition associated with deficiencies in the mismatch repair (MMR) system, which causes a tendency to accumulate a high number of mutations (SNVs and indels). An altered distribution of microsatellite length is associated with a missed replication slippage which would be corrected under normal MMR conditions.
MSIsensor is a tool to detect the MSI status of a tumor by scanning the length of the microsatellite regions. It requires a normal sample for each tumor to differentiate the somatic and germline cases.
For a Tumor/Normal pair:
Output directory: results/VariantCalling/[TUMORSAMPLE]_vs_[NORMALSAMPLE]/MSIsensor
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor_dis
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor_germline
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor_somatic
For further reading see the MSIsensor paper.
This directory contains results from the final annotation steps: two tools are used for annotation, snpEff and VEP.
Only a subset of the VCF files are annotated, and only variants that have a PASS filter.
Currently, FreeBayes results are not annotated as we are lacking a decent somatic filter.
snpEff is a genetic variant annotation and effect prediction toolbox.
It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations.
The generated VCF header contains the software version and the command line used.
For all samples:
Output directory: results/Annotation/[SAMPLE]/snpEff
VariantCaller_Sample_snpEff.ann.vcf.gz and VariantCaller_Sample_snpEff.ann.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the snpEff manual.
VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs.
The generated VCF header contains the software version, as well as the version numbers for additional databases like Clinvar or dbSNP used in the VEP line.
The format of the consequence annotations is also in the VCF header describing the INFO field.
Currently, it contains:
For all samples:
Output directory: results/Annotation/[SAMPLE]/VEP
VariantCaller_Sample_VEP.ann.vcf.gz and VariantCaller_Sample_VEP.ann.vcf.gz.tbi: VCF with Tabix index
For further reading and documentation see the VEP manual.
FastQC gives general quality metrics about your sequenced reads.
It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences.
For all samples:
Output directory: results/Reports/[SAMPLE]/fastqc
sample_R1_XXX_fastqc.html and sample_R2_XXX_fastqc.html: FastQC report containing quality metrics for your untrimmed raw FASTQ files
sample_R1_XXX_fastqc.zip and sample_R2_XXX_fastqc.zip
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality.
For further reading and documentation see the FastQC help pages.
Qualimap bamqc reports information for the evaluation of the quality of the provided alignment data. In short, the basic statistics of the alignment (number of reads, coverage, GC-content, etc.) are summarized and a number of useful graphs are produced.
Plot will show:
For all samples:
Output directory: results/Reports/[SAMPLE]/bamQC
VariantCaller_[SAMPLE].bcf.tools.stats.out: raw statistics used by MultiQC
For further reading and documentation see the Qualimap bamqc manual.
More information in the GATK MarkDuplicates section
Duplicates can arise during sample preparation e.g. library construction using PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.
For all samples:
Output directory: results/Reports/[SAMPLE]/MarkDuplicates
[SAMPLE].bam.metrics: raw statistics used by MultiQC
For further reading and documentation see the MarkDuplicates manual.
samtools stats collects statistics from BAM files and outputs them in a text format.
Plots will show:
For all samples:
Output directory: results/Reports/[SAMPLE]/SamToolsStats
[SAMPLE].bam.samtools.stats.out: raw statistics used by MultiQC
For further reading and documentation see the samtools manual.
bcftools is a program for variant calling and manipulating VCF files.
Plot will show:
For all samples:
Output directory: results/Reports/[SAMPLE]/BCFToolsStats
VariantCaller_[SAMPLE].bcf.tools.stats.out: raw statistics used by MultiQC
For further reading and documentation see the bcftools stats manual.
VCFtools is a program package designed for working with VCF files.
Plots will show statistics for each FILTER category.
For all samples:
Output directory: results/Reports/[SAMPLE]/VCFTools
VariantCaller_[SAMPLE].FILTER.summary: raw statistics used by MultiQC
VariantCaller_[SAMPLE].TsTv.count: raw statistics used by MultiQC
VariantCaller_[SAMPLE].TsTv.qual: raw statistics used by MultiQC
For further reading and documentation see the VCFtools manual.
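The TsTv files relate to the transition/transversion ratio. As a reminder of what that ratio measures, here is a minimal sketch that computes it from a list of (reference, alternate) base pairs: transitions are A<->G and C<->T changes, and every other single-base substitution is a transversion. The function name `ts_tv_ratio` is hypothetical, and this is not how VCFtools itself is implemented.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) bases.

    Entries that are not single-base A/C/G/T substitutions are ignored.
    Returns None when there are no transversions to divide by.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None
```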
snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations.
Plots will show:
For all samples:
Output directory: results/Reports/[SAMPLE]/snpEff
VariantCaller_Sample_snpEff.csv: statistics used by MultiQC
VariantCaller_Sample_snpEff.html
VariantCaller_Sample_snpEff.genes.txt
For further reading and documentation see the snpEff manual.
VEP (Variant Effect Predictor), based on Ensembl, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs.
For all samples:
Output directory: results/Reports/[SAMPLE]/VEP
VariantCaller_Sample_VEP.summary.html
For further reading and documentation see the VEP manual
MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability.
Output files:
multiqc/
multiqc_report.html
multiqc_data/
multiqc_plots/
For more information about how to use MultiQC reports, see https://multiqc.info.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
Output files:
pipeline_info/
execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
pipeline_report.html, pipeline_report.txt and software_versions.csv.
results_description.html.
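The execution trace is a convenient starting point for troubleshooting. The sketch below tallies tasks by status, assuming execution_trace.txt is a tab-separated file with a header row that includes a `status` column (as in Nextflow's default trace fields); if the trace was configured with different fields, the column name would need adjusting. The function name `task_status_counts` is hypothetical.

```python
import csv
from collections import Counter


def task_status_counts(trace_path):
    """Count tasks per status in a tab-separated Nextflow trace file.

    Assumes a header row containing a 'status' column; raises KeyError
    if that column is absent.
    """
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return dict(Counter(row["status"] for row in reader))
```

A result containing anything other than completed or cached tasks points to the rows worth inspecting in the trace file.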