Create a DDBJ annotation file from GFF3 and FASTA files
GFF3-to-DDBJ
The Japanese version is available here.
Overview
GFF3-to-DDBJ converts GFF3 and FASTA files into the DDBJ annotation format required for submission. It is the DDBJ-specific equivalent of tools like table2asn (NCBI) or EMBLmyGFF3 (ENA).
View the tests/golden directory for example output (.ann files).
Accuracy and Validation
Since "perfect" GFF3-to-DDBJ conversion is not formally defined, this tool uses RefSeq GFF3-GenBank correspondence as a gold standard. We validate output by:
- Comparing `gff3-to-ddbj` results against GenBank-sourced annotations via an internal `genbank-to-ddbj` tool.
- Passing all output through the DDBJ BioProject/BioSample/Sequence Data (MSS) Parser.
Installation
Via Bioconda
conda create -n ddbj -c conda-forge -c bioconda gff3toddbj
conda activate ddbj
Via PyPI
conda create -n ddbj -c conda-forge -c bioconda pip samtools
conda activate ddbj
python -m pip install gff3toddbj
Via GitHub (Nightly)
conda create -n ddbj pip
conda activate ddbj
python -m pip install 'git+https://github.com/yamaton/gff3toddbj'
Usage
# --fasta: required
# --gff3: strongly recommended (output is bare-minimum if absent)
# --metadata: optional
# --locus_tag_prefix: required for BioSample
# --transl_table: default 1 (Standard)
# --output: optional (stdout by default)
gff3-to-ddbj \
  --fasta myfile.fa \
  --gff3 myfile.gff3 \
  --metadata mymetadata.toml \
  --locus_tag_prefix PREFIX_ \
  --transl_table 1 \
  --output output.ann
Argument Details
- `--locus_tag_prefix`: The prefix assigned by BioSample.
- `--transl_table`: Genetic code index (e.g., 11 for Bacteria). See DDBJ Genetic Codes.
Under the Hood
GFF3-to-DDBJ processes your data through the following pipeline:
1. Data Preparation
- FASTA Compression: If the input is standard Gzip, the tool re-compresses it using `bgzip` (e.g., creating `myfile_bgzip.fa.gz`). This enables indexing and reduces memory usage; the resulting file remains compatible with standard `gzip` tools.
- Gap Detection: Scans FASTA sequences for `N` runs and automatically generates `assembly_gap` features.
- Topology Handling: If GFF3 has `Is_circular=true`, the tool inserts a `TOPOLOGY` feature and manages origin-spanning features.
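As a rough illustration of the gap-detection step, the sketch below finds runs of `N` in a sequence and reports them as 1-based inclusive coordinates. The function name `find_n_runs`, the `min_len` parameter, and the case-insensitive matching are assumptions made for this example; they are not taken from the tool's source.

```python
import re
from typing import Iterator, Tuple


def find_n_runs(seq: str, min_len: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (start, end) for each run of N/n in seq.

    Hypothetical helper: the real tool may use a different threshold
    or treat lowercase bases differently.
    """
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_len:
            # re uses 0-based half-open spans; convert to 1-based inclusive.
            yield match.start() + 1, match.end()


if __name__ == "__main__":
    for start, end in find_n_runs("ACGTNNNNACGTNNAC"):
        print(f"assembly_gap\t{start}..{end}")
```

Each reported span would correspond to one `assembly_gap` feature location.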
2. Feature & Qualifier Mapping
- SO-to-INSDC Translation: Maps GFF3 "types" to DDBJ "Features" based on Sequence Ontology.
  - Example: `transcript` (SO:0000673) is translated to a `misc_RNA` feature.
- Qualifier Renaming: Converts GFF3 attributes to DDBJ-compliant qualifiers based on renaming rules.
  - Example: `ID=foobar` becomes `/note="ID:foobar"`.
- Genetic Code Assignment: Automatically adds the `/transl_table` qualifier to every `CDS` feature based on the user-provided index (default: 1).
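The qualifier-renaming idea can be sketched as a small lookup. The rule below mirrors the `ID=foobar` example; the `RULES` dictionary layout and the `rename_attribute` function are hypothetical and only illustrate the concept, not the tool's internal data structures.

```python
from typing import Dict, Tuple

# Hypothetical in-memory form of one renaming rule:
# any feature's "ID" attribute becomes a /note with an "ID:" prefix.
RULES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("__ANY__", "ID"): {"qualifier_key": "note", "qualifier_value_prefix": "ID:"},
}


def rename_attribute(feature_type: str, key: str, value: str,
                     rules: Dict[Tuple[str, str], Dict[str, str]] = RULES
                     ) -> Tuple[str, str]:
    """Return the (qualifier_key, qualifier_value) for a GFF3 attribute."""
    # A feature-specific rule wins over the __ANY__ wildcard (an assumption).
    rule = rules.get((feature_type, key)) or rules.get(("__ANY__", key))
    if rule is None:
        return key, value  # no rule: pass the attribute through unchanged
    prefix = rule.get("qualifier_value_prefix", "")
    return rule["qualifier_key"], prefix + value


if __name__ == "__main__":
    print(rename_attribute("mRNA", "ID", "foobar"))
```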
3. Coordinate Processing
- Joining: Features sharing a parent are merged using `join()` notation. This applies to `CDS`, `exon`, `mat_peptide`, `V_segment`, `C_region`, `D-loop`, and `misc_feature`.
- RNA/Exon Logic: The location of joined `exon` features is assigned to the parent RNA's location, and individual `exon` entries are discarded.
  - Note: `exon` features are not joined if their direct parent is a `gene`.
- Partialness: Adds partial indicators (`<` or `>`) to `CDS` locations if start or stop codons are missing. (See: Offset of the frame at translation initiation by codon_start).
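To make the location syntax concrete, here is a minimal sketch that builds a `join()` location string with optional partial indicators. The function `format_location` and its parameters are hypothetical. It marks partialness by coordinate side (left/right) rather than by start/stop codon, because on the minus strand a missing start codon affects the right-hand end.

```python
from typing import Iterable, Tuple


def format_location(intervals: Iterable[Tuple[int, int]], strand: str = "+",
                    partial_left: bool = False,
                    partial_right: bool = False) -> str:
    """Build an INSDC-style location string from 1-based inclusive intervals."""
    parts = sorted(intervals)
    last = len(parts) - 1
    texts = []
    for i, (start, end) in enumerate(parts):
        left = "<" if (partial_left and i == 0) else ""
        right = ">" if (partial_right and i == last) else ""
        texts.append(f"{left}{start}..{right}{end}")
    # A single interval needs no join() wrapper.
    loc = texts[0] if len(texts) == 1 else "join(" + ",".join(texts) + ")"
    return f"complement({loc})" if strand == "-" else loc


if __name__ == "__main__":
    print(format_location([(300, 400), (100, 200)], partial_left=True))
```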
4. DDBJ Compliance Logic (Product & Gene)
- Product Enforcement: To conform to DDBJ instructions, each `CDS` is restricted to a single `/product`:
  - Even if there are multiple general names for the same product, do not enter multiple names in 'product'. Do not use needless symbolic letters as delimiter for multiple names. If you would like to describe more than two names, please enter one of the most representative name in /product qualifier, and other(s) in /note qualifier.
  - If the name and function are not known, we recommend to describe as "hypothetical protein".
- Gene Consistency:
  - Ensures the `/gene` qualifier has a single value; additional values move to `/gene_synonym`. (Reference: Definition of Qualifier key: /gene).
  - Copies `/gene` and `/gene_synonym` qualifiers from parent `gene` features to all children (e.g., `mRNA`, `CDS`).
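The gene-consistency rules can be sketched as two small functions over a qualifier dictionary. Both `normalize_gene` and `inherit_gene` are hypothetical names, and the choice to leave a child's existing `/gene` values untouched is an assumption; the tool may resolve conflicts differently.

```python
from typing import Dict, List

Qualifiers = Dict[str, List[str]]


def normalize_gene(qualifiers: Qualifiers) -> Qualifiers:
    """Keep the first /gene value and move the rest to /gene_synonym."""
    out = {key: list(values) for key, values in qualifiers.items()}
    genes = out.get("gene", [])
    if len(genes) > 1:
        out["gene"] = genes[:1]
        out.setdefault("gene_synonym", []).extend(genes[1:])
    return out


def inherit_gene(parent: Qualifiers, child: Qualifiers) -> Qualifiers:
    """Copy /gene and /gene_synonym from a parent gene to a child feature."""
    out = {key: list(values) for key, values in child.items()}
    for key in ("gene", "gene_synonym"):
        # Assumption: values already present on the child are not overwritten.
        if key in parent and key not in out:
            out[key] = list(parent[key])
    return out


if __name__ == "__main__":
    gene = normalize_gene({"gene": ["abc1", "abcA"]})
    print(inherit_gene(gene, {"product": ["hypothetical protein"]}))
```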
5. Metadata & Filtering
- Metadata Injection: Inserts `source` information and global qualifiers from the metadata file. See "Metadata Configuration" in "Customization" below.
- Compliance Filtering: Removes features and qualifiers violating the DDBJ usage matrix.
  - Note: The `gene` feature is discarded by default in this process.
- Deduplication: Removes redundant qualifier values generated during processing.
6. Final Formatting
- Sorting: Lines are ordered by start position, feature priority (placing `source` and `TOPOLOGY` at the top), and end position.
- Validation Logs: Displays all discarded items via `stderr`:

      WARNING: [Discarded] feature -------> gene (count: 49911)
      WARNING: [Discarded] (Feature, Qualifier) = (mRNA, Parent) (count: 57304)
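The sorting rule described above maps naturally onto a tuple sort key. In this sketch the `(feature_key, start, end)` tuple layout, the `PRIORITY` values, and the `sort_key` function are all illustrative assumptions; only the ordering criteria come from the description.

```python
from typing import Dict, Tuple

Feature = Tuple[str, int, int]  # (feature_key, start, end), hypothetical layout

# Assumed priorities: lower sorts first; unlisted features share a default.
PRIORITY: Dict[str, int] = {"source": 0, "TOPOLOGY": 1}
DEFAULT_PRIORITY = 10


def sort_key(feature: Feature) -> Tuple[int, int, int]:
    """Order by start position, then feature priority, then end position."""
    key, start, end = feature
    return (start, PRIORITY.get(key, DEFAULT_PRIORITY), end)


if __name__ == "__main__":
    features = [("CDS", 1, 900), ("source", 1, 5000), ("mRNA", 1, 1000)]
    for feature in sorted(features, key=sort_key):
        print(feature)
```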
Customization
Metadata Configuration
Use a TOML file (e.g., metadata.toml) to provide information absent from GFF3/FASTA files, such as submitter details and common qualifiers.
- Example: See metadata_ddbj_example.toml.
- Default: If `--metadata` is omitted, the tool uses this default configuration.
Key Sections
- COMMON Entry: Define `SUBMITTER`, `REFERENCE`, and `COMMENT` blocks.
- Global Qualifiers (DDBJ-side injection): Use the `[COMMON.feature]` syntax to instruct the DDBJ system to insert qualifiers into every occurrence of a feature.

      [COMMON.assembly_gap]
      estimated_length = "unknown"
      gap_type = "within scaffold"
      linkage_evidence = "paired-ends"

- Local Injection (Tool-side injection): Use the `[feature]` syntax (without the `COMMON` prefix) to have `gff3-to-ddbj` explicitly insert these qualifiers into the generated `.ann` file.

      [assembly_gap]
      estimated_length = "<COMPUTE>"  # Automatically calculate gap size from "N" runs
      gap_type = "within scaffold"
      linkage_evidence = "paired-ends"

  Note: Currently, only `[source]` and `[assembly_gap]` are supported for local injection.
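One way to picture the `<COMPUTE>` placeholder is as a value filled in per gap from that gap's coordinates. The sketch below assumes 1-based inclusive coordinates and a string-valued result; the function name `resolve_gap_qualifiers` is hypothetical, and the tool's actual substitution logic may differ.

```python
from typing import Dict

COMPUTE = "<COMPUTE>"


def resolve_gap_qualifiers(template: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Replace the <COMPUTE> placeholder with the gap length for one gap.

    start and end are assumed to be 1-based inclusive coordinates.
    """
    length = end - start + 1
    return {key: (str(length) if value == COMPUTE else value)
            for key, value in template.items()}


if __name__ == "__main__":
    template = {
        "estimated_length": "<COMPUTE>",
        "gap_type": "within scaffold",
        "linkage_evidence": "paired-ends",
    }
    print(resolve_gap_qualifiers(template, 5, 8))
```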
[Advanced] Feature and Qualifier Renaming
GFF3 and DDBJ formats do not share a 1:1 nomenclature. GFF3 "types" (column 3) map to DDBJ "Features," while GFF3 "attributes" (column 9) map to DDBJ "Qualifiers."
gff3-to-ddbj uses a default translation table to handle these conversions. You can override these rules using --config_rename <FILE>.
Customization Examples:
- Renaming Types: Map a GFF3 type to a specific DDBJ feature key.

      [five_prime_UTR]
      feature_key = "5'UTR"

- Renaming Attributes: Map GFF3 attributes to DDBJ qualifiers. Use `__ANY__` to apply a rule across all feature types.

      [__ANY__.ID]
      qualifier_key = "note"
      qualifier_value_prefix = "ID:"  # optional

- Complex Translations: Map a GFF3 type to a DDBJ feature/qualifier pair (e.g., `snRNA` to `ncRNA` with a class).

      [snRNA]
      feature_key = "ncRNA"
      qualifier_key = "ncRNA_class"
      qualifier_value = "snRNA"

- Attribute-to-Feature Mapping: Convert specific attribute values into distinct DDBJ features (e.g., `RNA` type with `biotype=misc_RNA` attribute becomes a `misc_RNA` feature).

      [RNA.biotype.misc_RNA]
      feature_key = "misc_RNA"
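The type-level rules above could be applied roughly as follows. This is a sketch only: the dictionaries restate the example rules in Python form, `translate_type` is a hypothetical function, and giving attribute-value rules precedence over plain type rules is an assumption.

```python
from typing import Dict, List, Tuple

# Python restatement of the example rules (illustrative, not the tool's format).
TYPE_RULES: Dict[str, Dict[str, str]] = {
    "five_prime_UTR": {"feature_key": "5'UTR"},
    "snRNA": {"feature_key": "ncRNA", "qualifier_key": "ncRNA_class",
              "qualifier_value": "snRNA"},
}
ATTRIBUTE_VALUE_RULES: Dict[Tuple[str, str, str], Dict[str, str]] = {
    ("RNA", "biotype", "misc_RNA"): {"feature_key": "misc_RNA"},
}


def translate_type(gff3_type: str, attributes: Dict[str, List[str]]
                   ) -> Tuple[str, Dict[str, List[str]]]:
    """Return (DDBJ feature key, extra qualifiers) for a GFF3 record."""
    # Assumption: attribute-value rules are checked before plain type rules.
    for (rule_type, attr_key, attr_value), rule in ATTRIBUTE_VALUE_RULES.items():
        if rule_type == gff3_type and attr_value in attributes.get(attr_key, []):
            return rule["feature_key"], {}
    rule = TYPE_RULES.get(gff3_type)
    if rule is None:
        return gff3_type, {}  # no rule: keep the type as-is
    extra: Dict[str, List[str]] = {}
    if "qualifier_key" in rule:
        extra[rule["qualifier_key"]] = [rule["qualifier_value"]]
    return rule["feature_key"], extra


if __name__ == "__main__":
    print(translate_type("snRNA", {}))
```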
[Advanced] Feature and Qualifier Filtering
To comply with the DDBJ usage matrix, output is filtered by a default configuration. Only features and qualifiers explicitly allowed in this TOML file will appear in the final output.
To use a custom filter, provide a TOML file via --config_filter <FILE> using the following structure:
# Only these qualifiers will be kept for the CDS feature
CDS = ["EC_number", "inference", "locus_tag", "note", "product"]
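The allow-list behavior can be sketched with a plain dictionary that has the same shape as the filter file. The `ALLOWED` mapping and the `filter_feature` function are hypothetical; they show the "keep only what is listed" idea rather than the tool's code.

```python
from typing import Dict, List, Optional

# Same shape as the filter file: feature key -> allowed qualifier keys.
ALLOWED: Dict[str, List[str]] = {
    "CDS": ["EC_number", "inference", "locus_tag", "note", "product"],
}


def filter_feature(feature_key: str, qualifiers: Dict[str, List[str]],
                   allowed: Dict[str, List[str]] = ALLOWED
                   ) -> Optional[Dict[str, List[str]]]:
    """Return the permitted qualifiers, or None if the feature is not allowed."""
    if feature_key not in allowed:
        return None  # the whole feature is discarded
    keep = set(allowed[feature_key])
    return {key: values for key, values in qualifiers.items() if key in keep}


if __name__ == "__main__":
    print(filter_feature("CDS", {"product": ["kinase"], "Parent": ["mRNA1"]}))
    print(filter_feature("gene", {"locus_tag": ["X_1"]}))
```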
Troubleshooting
Validate GFF3
It is good practice to validate your GFF3 file before conversion. The GFF3 online validator is useful, although it limits the file size to 50 MB.
Split FASTA from GFF3 (if needed)
GFF3-to-DDBJ does not work when the GFF3 file embeds sequence data under a ##FASTA directive. The bundled tool split-fasta reads such a GFF3 file and writes two files: a GFF3 file without the FASTA section, and a FASTA file.
split-fasta path/to/myfile.gff3 --suffix "_splitted"
This creates two files, myfile_splitted.gff3 and myfile_splitted.fa.
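Conceptually, the split happens at the `##FASTA` directive line. The sketch below operates on text in memory; `split_gff3_text` is a hypothetical function and does not reproduce the file naming or options of `split-fasta`.

```python
from typing import Tuple


def split_gff3_text(text: str) -> Tuple[str, str]:
    """Split GFF3 text into (annotation part, FASTA part) at ##FASTA.

    If no ##FASTA directive is present, the FASTA part is empty.
    """
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.strip() == "##FASTA":
            # The directive line itself belongs to neither output.
            return "".join(lines[:index]), "".join(lines[index + 1:])
    return text, ""


if __name__ == "__main__":
    example = (
        "##gff-version 3\n"
        "chr1\t.\tgene\t1\t10\t.\t+\t.\tID=g1\n"
        "##FASTA\n"
        ">chr1\n"
        "ACGTACGTAC\n"
    )
    gff3_part, fasta_part = split_gff3_text(example)
    print(fasta_part, end="")
```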
Normalize entry names (if needed)
Characters such as =|>" [] are not allowed in the first column ("Entry") of the DDBJ annotation. The bundled program normalize-entry-names renames such entries. For example, it converts an ID like ERS324955|SC|contig000013 into ERS324955:SC:contig000013.
normalize-entry-names myannotation_output.txt
This command creates the file myannotation_output_renamed.txt if invalid characters are found. Otherwise, it produces no output.
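The renaming can be pictured as a character substitution on the entry name. In the sketch below, replacing every disallowed character with `:` is an assumption generalized from the single `|` example above; `normalize_entry_name` is a hypothetical function, and the real program may treat other characters differently.

```python
import re

# Characters listed above as invalid in the Entry column: = | > " space [ ]
INVALID_CHARS = re.compile(r'[=|>" \[\]]')


def normalize_entry_name(name: str, replacement: str = ":") -> str:
    """Replace each disallowed character with the replacement string."""
    return INVALID_CHARS.sub(replacement, name)


if __name__ == "__main__":
    print(normalize_entry_name("ERS324955|SC|contig000013"))
```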
Known Issues
Biological & Sequence Logic
- Trans-splicing: The tool does not currently support coordinate correction or the `join()` syntax for features containing the `/trans_splicing` qualifier.
- Translation Exceptions: Coordinate handling for `/transl_except` at start or stop codons is not yet implemented.
- Missing Qualifiers: The tool does not automatically generate a `/translation` qualifier when an `/exception` qualifier is present, which may lead to DDBJ validation errors.
- Inter-base Coordinates: "Between-position" locations (e.g., `123^124`) are not currently supported and may be processed incorrectly.
Performance
- Execution Speed: To ensure maximum accuracy, the tool currently utilizes a single-process architecture. Expect longer runtimes on large genomic datasets.
Acknowledgments
The design of GFF3-to-DDBJ is inspired by EMBLmyGFF3, a versatile tool used for converting GFF3 data into the EMBL annotation format.