Gene prediction: filling in the form

Sequence frame

The query sequence and a probabilistic model for the target organism must be indicated here.

FASTA file: click the button on the right (usually labelled "Browse...", although the label depends on your language and browser) to open a file selector. Select the FASTA (or raw DNA) file that you want to submit; the filename will appear in the text field. Then click any "Submit" button to get a result. Alternatively, paste your sequence, in either FASTA or raw DNA format, into the large text area below.
Probabilistic model: this indicates to FrameD the probabilistic model to use for coding/non coding score computation. Several models may be available for a single organism (for example to take into account isochores in eukaryotic sequences or non standard codon usage in prokaryotes).
If no satisfactory probabilistic model appears in the list, you may try to build a model of the coding regions yourself by clicking on the link just below the organism selection (labelled "learn a new one and add it to the system"). See section *.

Two example sequences are provided: a GC-rich noisy sequence with a frameshift in the noisy region (the region with 'N's) and a finished AT-rich sequence. Clicking the corresponding "Example" button fills the frame with that sequence.
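
The two accepted input formats can be told apart by the leading ">" of a FASTA header. The following is a minimal Python sketch for preparing a sequence before pasting it into the form; it is not FrameD's own input code. The function name read_sequence and the "unnamed" placeholder are hypothetical, and the sketch assumes that only the first FASTA record matters.

```python
def read_sequence(text):
    """Return (name, sequence) from FASTA or raw DNA text.

    Hypothetical helper: only the first FASTA record is read, and raw
    DNA input gets the placeholder name "unnamed".
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    name = "unnamed"
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].split()
        if header:
            name = header[0]
        lines = lines[1:]
    residues = []
    for ln in lines:
        if ln.startswith(">"):
            break  # stop at the start of a second record
        residues.append(ln.replace(" ", ""))
    return name, "".join(residues).upper()


if __name__ == "__main__":
    print(read_sequence(">seq1 example\nacgtnn\nacgt\n"))
    print(read_sequence("acgt acgt\nttga\n"))
```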

Options frame

The prediction of FrameD can be controlled by several parameters. Default values for all these parameters are provided.

Compute mean expected prediction: when this flag is set, FrameD computes a "mean" prediction in addition to the single optimal prediction. The "mean" prediction gives, for each nucleotide, the probability that this nucleotide is coding, non coding... considering all possible predictions. This helps identify positions where the optimal prediction may be unreliable: alternative start choices, possible frameshifts...
Increased sensitivity: by penalizing intergenic regions in the prediction score, this flag artificially increases FrameD sensitivity. This can be useful for analyzing sequences with non standard coding statistics (e.g. containing genes obtained through horizontal transfer), which usually appear in the FrameD graphical interface as open reading frames with an unstable coding potential.
Frameshift penalty: score penalty for predicting a frameshift. For finished sequences, this should take a high value, typically around 20 (the default value is 18). Lower penalties increase the probability that a frameshift will be predicted and are well suited to, e.g., EST cluster analysis and unfinished sequences.
Stop penalty: score penalty for using a translation STOP in the prediction. The default value is 4. Lower penalties allow the prediction of smaller genes.
Generate corrected sequence: when a frameshift is predicted, FrameD will generate a sequence in which the frameshifts have been corrected. Each detected inserted nucleotide is followed by two 'N's, which restores the phase without losing information, and each deleted nucleotide is inserted back as an 'N'. The corrected sequence is available for download on the prediction page.
Generate translated sequence: for every gene predicted by FrameD that contains no predicted frameshift, the corresponding amino acid sequence is generated in a FASTA file. Each amino-acid sequence name is built from the DNA sequence name followed by the position of the gene in the original sequence. The phase of the gene follows as a comment. The translated sequences are available for download on the prediction page.
GC content: adjust internal FrameD parameters for the GC content of your genome. Default parameters are adequate for GC-rich and medium-GC genomes.
Matured eukaryotic sequences analysis (ATG start only): this flag must be activated for intronless eukaryotic sequence analysis. It restricts Start codon prediction to ATG codons, deactivates the ribosome binding site search, changes the a priori probabilities of being coding or not, and disables gene overlapping.
TGA is not a STOP codon: this option should be activated for the analysis of sequences from organisms that use a non standard genetic code in which TGA codes for an amino acid. It is also available on the page for learning new probabilistic models for new organisms.
Default RBS pattern: the pattern used for RBS detection. The default value is ATTCCTCCA, from E. coli.
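
The padding rule used by "Generate corrected sequence" can be illustrated with a small Python sketch. It is not FrameD's code: the function name correct_frameshifts and the event format are hypothetical. The sketch assumes that each frameshift is given as a 1-based position with a kind, where an "insertion" event points at the extra nucleotide and a "deletion" event points at the nucleotide just before the missing one.

```python
def correct_frameshifts(seq, events):
    """Pad a sequence so that predicted frameshifts no longer break the phase.

    events: list of (position, kind) pairs, 1-based (hypothetical format).
      "insertion": the nucleotide at `position` is extra; it is kept and
                   followed by "NN", so 1 + 2 = 3 positions keep the phase.
      "deletion":  a nucleotide is missing right after `position`; an "N"
                   is put back in its place.
    """
    kinds = dict(events)
    out = []
    for pos, nuc in enumerate(seq, start=1):
        out.append(nuc)
        kind = kinds.get(pos)
        if kind == "insertion":
            out.append("NN")
        elif kind == "deletion":
            out.append("N")
    return "".join(out)


if __name__ == "__main__":
    # True gene ATGAAACCC with an extra G inserted at position 4.
    print(correct_frameshifts("ATGGAAACCC", [(4, "insertion")]))
    # True gene ATGAAACCC with the nucleotide after position 3 lost.
    print(correct_frameshifts("ATGAACCC", [(3, "deletion")]))
```

In both cases the padded length is again a multiple of three, so the downstream codons are read in their original phase.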

Protein similarities frame

To enhance prediction quality, FrameD can take into account protein similarities detected using the NCBI-BlastX program. You can paste (or upload) here information about similarities in the so-called "tabulated" format of NCBI-BlastX (obtained using the -m8 flag in NCBI BlastX rel. 2.2.5). The choice of the expectation threshold (BlastX -e flag) and of the protein databases is left to the user. The hits for the two example sequences can be inserted directly in the text field by clicking on the corresponding "Example" buttons.

These similarities can be given more or less confidence (see the Options frame). With a default 0 confidence, similarities are visualized in FrameD output but do not affect the prediction process. Higher confidences increase the likelihood of being coding for regions with strong similarities.
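
The tabulated BlastX format has twelve tab-separated columns per hit: query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value and bit score. The Python sketch below may help to inspect or filter hits before pasting them into the form. The names Hit and parse_m8 and the max_evalue parameter are hypothetical, the sample line is made up, and none of this describes how FrameD itself reads the data.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    identity: float
    length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float

    @property
    def reverse_strand(self):
        # BlastX reports query coordinates on the nucleotide sequence;
        # a start larger than the end denotes a hit on the reverse strand.
        return self.q_start > self.q_end


def parse_m8(text, max_evalue=None):
    """Parse tabulated BLAST output, optionally keeping only low e-values."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
                  int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                  float(f[10]), float(f[11]))
        if max_evalue is None or hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Made-up hit, for illustration only.
    sample = "seq1\tprotA\t45.50\t200\t106\t3\t601\t2\t1\t200\t1e-40\t165\n"
    for h in parse_m8(sample, max_evalue=1e-5):
        print(h.subject, h.q_start, h.q_end, h.reverse_strand)
```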

Output parameters frame

The parameters in this frame configure the output format and contents but do not affect prediction in itself.

Verbose: when set and the 'Text' format is requested, display run information (files read, cost of the optimal prediction...) before the prediction output.
Format: define the contents of the output: 'Text' for textual output, 'Image' for graphical output, or 'Both' for both.
Length per image: number of nucleotides presented per image. The default value is 6,000 nucleotides (or the length of the sequence if it is shorter).
Overlap between images: number of nucleotides that overlap between successive images (for long sequences). The default value is heuristically determined.
First nucleotide: specify the position of the first nucleotide visualized in the graphical output. The default value is 1; larger values let you zoom in on a region.
Last nucleotide: specify the position of the last nucleotide visualized in the graphical output. The default value is equal to the sequence length.
X-resolution (pixels): define the horizontal definition of the graphical representation. The larger the resolution, the finer the image but the larger the files. The default value is 900. A maximum of 1200 is enforced on the Web interface.
Y-resolution (pixels): define the vertical definition of the graphical representation. The larger the resolution, the finer the image but the larger the files. The default value is 300. A maximum of 500 is enforced on the Web interface.
Score smoothing window: all the statistics (coding/non coding curves) presented in the graphical representation are smoothed and normalized using a sliding window. You can specify the half-size of the window here. By default the window is 97 = 1 + 2*48 nucleotides wide, i.e. a half-size of 48. The larger the window, the smoother the curves that represent the coding score. The prediction itself uses raw, non-smoothed scores.
Score normalization: FrameD always compares the relative likelihood of being non coding or coding in the 6 phases. By default, the graphical output normalizes these 7 scores but you can ask for non normalization or to normalize independently each coding phase w.r.t. the non coding hypothesis. This does not affect the prediction, only the graphical output.
Select graphical elements: if you think the graphical output is a bit too rich, you can select which graphical elements appear in it.
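
The relation between the half-size and the width of the score smoothing window can be sketched in Python as a centred moving average. This is an illustration under stated assumptions only: the exact smoothing and normalization performed by FrameD are not described here, the function name smooth is hypothetical, and truncating the window at the ends of the sequence is an assumption.

```python
def smooth(scores, half_size=48):
    """Centred moving average over a window of 1 + 2*half_size positions.

    Near the ends of the sequence the window is truncated to the
    positions that exist (an assumption made for this sketch).
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_size)
        hi = min(n, i + half_size + 1)
        window = scores[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    half_size = 48
    print("window width:", 1 + 2 * half_size)
    # A single peak is spread over its neighbours by a half-size of 1.
    print(smooth([0, 0, 3, 0, 0], half_size=1))
```

With a half-size of 1, the single peak of height 3 becomes a plateau of height 1 over three positions, which shows why larger windows give smoother but less localized curves.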
