Gene prediction: filling in the form
The query sequence and a probabilistic model for the target organism
must be indicated here.
Two example sequences are provided: a GC-rich noisy sequence with a
frameshift in the noisy region (region with 'N's) and a finished
AT-rich sequence. Clicking the corresponding "Example" button fills
the frame with that sequence.
The prediction of FrameD can be controlled by several parameters.
Default values for all these parameters are provided.
- Compute mean expected prediction: when this flag is set,
FrameD computes not only one optimal prediction but also a "mean"
prediction. This "mean" prediction gives, for each nucleotide, the
probability that the nucleotide is coding, non coding, etc.,
considering all possible predictions. This helps identify positions
where FrameD's optimal prediction may be unreliable: alternative
start choices, possible frameshifts...
- Increased sensitivity: by penalizing intergenic regions
in the prediction score, this flag artificially increases FrameD
sensitivity. This can be useful to analyze sequences with non
standard coding statistics (e.g. containing genes obtained through
horizontal transfer), which usually appear in the FrameD
graphical interface as open reading frames with an unstable coding
potential.
- Frameshift penalty: score penalty for predicting a
frameshift. For finished sequences, this should take a high value,
typically around 20 (the default value is 18). Lower penalties
increase the probability that a frameshift will be predicted and are
well suited to, e.g., EST cluster analysis and unfinished sequences.
- Stop penalty: score penalty for using a translation
STOP in the prediction. The default value is 4. Lower penalties
allow the prediction of smaller genes.
- Generate corrected sequence: when a frameshift is
predicted, FrameD will generate a sequence in which frameshifts have
been corrected. Each detected inserted nucleotide is followed by
two 'N's to correct the phase without losing information, and each
deleted nucleotide is inserted back as an 'N'. The corrected
sequence is available for download on the prediction page.
- Generate translated sequence: for every gene predicted
by FrameD that contains no predicted frameshift, the corresponding
amino acid sequence is generated in a FASTA file. Each amino-acid
sequence name is built from the DNA sequence name followed by the
position of the gene in the original sequence. The phase of the gene
follows as a comment. The translated sequences are available for
download on the prediction page.
- GC content: adjust internal FrameD parameters for the
GC content of your genome. Default parameters are adequate for
GC-rich and medium-GC genomes.
- Matured eukaryotic sequences analysis (ATG start only):
this flag must be activated for intronless eukaryotic sequence
analysis. It restricts Start codon prediction to ATG codons,
deactivates the ribosome binding site search, changes the a
priori probabilities of being coding or not, and disables gene
overlapping.
- TGA is not a STOP codon: this option should be activated for
the analysis of sequences from organisms that use a non standard
genetic code where TGA codes for an amino acid. This option is also
available on the page for learning new probabilistic models for new
organisms.
- Default RBS pattern: indicates the RBS pattern used for
RBS detection. The default value is ATTCCTCCA, from E.
coli.
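The 'N' padding described under "Generate corrected sequence" can be
illustrated with a minimal sketch. It assumes that frameshift events are
given as (position, kind) pairs with 0-based positions; the function name
correct_frameshifts and this event representation are hypothetical and
do not describe FrameD's actual interface or file format.

```python
def correct_frameshifts(seq, events):
    """Return seq with predicted frameshifts padded by 'N's.

    events is an iterable of (pos, kind) pairs (a hypothetical
    representation, 0-based positions within seq):
      - "insertion": seq[pos] is an extra nucleotide; it is kept and
        followed by two 'N's so that the reading phase is restored.
      - "deletion": a nucleotide is missing just before seq[pos];
        a single 'N' is inserted at that place.
    """
    by_pos = {}
    for pos, kind in events:
        if kind not in ("insertion", "deletion"):
            raise ValueError(f"unknown frameshift kind: {kind!r}")
        if not 0 <= pos < len(seq):
            raise ValueError(f"position out of range: {pos}")
        by_pos[pos] = kind

    out = []
    for i, base in enumerate(seq):
        kind = by_pos.get(i)
        if kind == "deletion":
            out.append("N")   # put the lost nucleotide back as 'N'
        out.append(base)      # original nucleotides are never dropped
        if kind == "insertion":
            out.append("NN")  # 1 extra nucleotide + 2 'N's = one full codon
    return "".join(out)


# ATG GCC TAA with a spurious extra G at index 3
print(correct_frameshifts("ATGGGCCTAA", [(3, "insertion")]))
# ATG GCC TAA with the first G of GCC missing
print(correct_frameshifts("ATGCCTAA", [(3, "deletion")]))
```

In both cases the corrected sequence has a length that is a multiple of
three again, and no original nucleotide is lost.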
To enhance prediction quality, FrameD can take into account protein
similarities detected using the NCBI BlastX program. You can paste
(or upload) here information about similarities using the so-called
"tabulated" format of NCBI BlastX (obtained using the -m8
flag in NCBI BlastX rel. 2.2.5). The choice of the expectation
threshold (BlastX -e flag) and of the protein databases used is left
to the user. The hits for the two example sequences can be directly
inserted in the text field by clicking on the corresponding
"Example" buttons.
These similarities can be given more or less confidence (see the
Options frame). With a default 0 confidence, similarities are
visualized in FrameD output but do not affect the prediction process.
Higher confidences increase the likelihood of being coding for regions
with strong similarities.
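Before pasting hits into the form, it can help to check them locally.
The following is a minimal sketch of reading the tabulated BlastX format,
which has 12 tab-separated columns per hit (query id, subject id, percent
identity, alignment length, mismatches, gap openings, query start, query
end, subject start, subject end, e-value, bit score). The names BlastHit
and parse_blast_tabular and the optional max_evalue filter are
illustrative assumptions, not part of FrameD or BLAST.

```python
from typing import List, NamedTuple, Optional


class BlastHit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity
    length: int       # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_blast_tabular(text: str,
                        max_evalue: Optional[float] = None) -> List[BlastHit]:
    """Parse tabulated BLAST output; optionally keep only hits with
    an e-value at or below max_evalue (hypothetical convenience filter)."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
        hit = BlastHit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                       int(f[5]), int(f[6]), int(f[7]), int(f[8]),
                       int(f[9]), float(f[10]), float(f[11]))
        if max_evalue is None or hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


# Made-up hit line, for illustration only
sample = "seq1\tprotA\t45.50\t200\t109\t3\t10\t609\t1\t198\t2e-30\t125\n"
for h in parse_blast_tabular(sample, max_evalue=1e-10):
    print(h.query, h.subject, h.q_start, h.q_end, h.evalue)
```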
The parameters in this frame configure the output format and contents
but do not affect prediction in itself.
- Verbose: when set and the 'Text' format is requested, display
running information (files read, cost of the optimal prediction...)
before the prediction output.
- Format: define the contents of the output: 'Text' for
textual output, 'Image' for graphical output or 'Both' for both.
- Length per image: number of nucleotides presented per
image. The default value is 6,000 nuc. (or the size of the sequence
if it is shorter).
- Overlap between images: number of nucleotides that
overlap between successive images (for long sequences). The default
value is heuristically determined.
- First nucleotide: specify the position of the first
nucleotide visualized in the graphical output. The default value is
1; larger values allow zooming in.
- Last nucleotide: specify the position of the last
nucleotide visualized in the graphical output. The default value is
equal to the sequence length.
- X-resolution (pixels): define the horizontal definition
of the graphical representation. The larger the resolution, the
finer the image but the larger the files. The default value is 900.
A maximum of 1200 is enforced on the Web interface.
- Y-resolution (pixels): define the vertical definition
of the graphical representation. The larger the resolution, the
finer the image but the larger the files. The default value is 300.
A maximum of 500 is enforced on the Web interface.
- Score smoothing window: all the statistics (coding/non
coding curves) presented in the graphical representation are smoothed
and normalized using a sliding window. You can specify the half-size
of the window here. By default the half-size is 48, so the window is
97=1+2*48 nucleotides wide. The larger the window, the smoother the
curves that represent the coding score. The prediction itself uses
raw, non-smoothed scores.
- Score normalization: FrameD always compares the
relative likelihood of being non coding or coding in the 6 phases.
By default, the graphical output normalizes these 7 scores, but you
can ask for no normalization or for each coding phase to be
normalized independently w.r.t. the non coding hypothesis. This does
not affect the prediction, only the graphical output.
- Select graphical elements: if you think the graphical
output is a bit too rich, you can select which graphical elements
appear in the graphical output.
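The effect of the score smoothing window can be sketched as a centered
moving average over 1+2*half_size positions. This is only an
illustration of the window arithmetic: the function name smooth and the
handling of sequence ends (the window is simply truncated) are
assumptions, and FrameD's own smoothing and normalization may differ.

```python
def smooth(scores, half_size=48):
    """Centered moving average over a window of 1 + 2*half_size values.

    Near the ends of the sequence the window is truncated to the
    available positions (an assumption made for this sketch).
    """
    if half_size < 0:
        raise ValueError("half_size must be non-negative")
    n = len(scores)
    # Prefix sums make each window sum a single subtraction.
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)
    out = []
    for i in range(n):
        lo = max(0, i - half_size)
        hi = min(n, i + half_size + 1)  # exclusive upper bound
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


print(1 + 2 * 48)                               # default window width
print(smooth([0, 3, 6, 9, 12], half_size=1))    # window of 3 values
```

A larger half_size averages over more neighbouring positions, which is
why the plotted curves become smoother as the window grows.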