Create a DDBJ annotation file from GFF3 and FASTA files
GFF3-to-DDBJ
The Japanese version is here.
What is this?
GFF3-to-DDBJ creates DDBJ's annotation file from GFF3 and FASTA files. It also works from FASTA alone.
Initial setup
[NOT available yet] Install via bioconda
# Create a conda environment named "ddbj", and install relevant packages from bioconda channel
$ conda create -n ddbj -c bioconda gff3toddbj
# Activate the environment "ddbj"
$ conda activate ddbj
Install from the source
# Download
$ wget https://github.com/yamaton/gff3_to_ddbj/archive/refs/heads/main.zip
# Extract, rename, and change directory
$ unzip main.zip && mv gff3toddbj-main gff3toddbj && cd gff3toddbj
# Create a conda environment named "ddbj"
$ conda create -n ddbj
# Activate the environment "ddbj"
$ conda activate ddbj
# Install dependencies to "ddbj"
$ conda install -c bioconda -c conda-forge biopython bcbio-gff
# Install gff3-to-ddbj and extra tools
$ python setup.py install
Create DDBJ annotation from GFF3 and FASTA
3. Run gff3-to-ddbj
Let's run the main program to get an idea of how it works. Here are the options.
- --gff3 <FILE>: takes a GFF3 file
- --fasta <FILE>: takes a FASTA file
- --config <FILE>: takes the configuration file in TOML
- --locus_tag_prefix <STRING>: takes the locus tag prefix obtained from BioSample. You can skip this for now.
- --transl_table <INT>: choose the appropriate one from The Genetic Codes. The default value is 1 ("standard").
- --output <FILE>: sets the path of the annotation output.
# Only --fasta is required. Without the other options:
#   --gff3, --config    -> produces the minimum
#   --locus_tag_prefix  -> set to "LOCUSTAGPREFIX_"
#   --transl_table      -> set to 1
#   --output            -> standard output
gff3-to-ddbj \
  --gff3 myfile.gff3 \
  --fasta myfile.fa \
  --config config.toml \
  --locus_tag_prefix MYOWNPREFIX_ \
  --transl_table 1 \
  --output myawesome_output.ann
Customize the behavior
Edit config.toml
To customize the output, edit a configuration file in TOML, say config.toml. Take a look at the sample TOML file with COMMON, or the one without COMMON. The configuration contains the following items. They are all optional, and GFF3-to-DDBJ works even without config.toml.
- Basic features in COMMON
- "meta-description" in COMMON
  - DDBJ annotation supports features under COMMON that are inserted into each of the repeatedly occurring features in the resulting flat file.
  - Here is an example.

    [COMMON.assembly_gap]
    estimated_length = "unknown"
    gap_type = "within scaffold"
    linkage_evidence = "paired-ends"

- Feature-Qualifier information inserted into each occurrence by GFF3-to-DDBJ
  - This serves effectively the same purpose as the "meta-description" item above, but the repeated insertions are done by GFF3-to-DDBJ and appear in the annotation output. This configuration is mutually exclusive with the "meta-description" configuration. I'm keeping both simply because I'm still undecided.
  - Here is an example. The only difference from the previous one is [assembly_gap] as opposed to [COMMON.assembly_gap].

    [assembly_gap]
    estimated_length = "unknown"   # Set it to "<COMPUTE>" to count the number of N's
    gap_type = "within scaffold"
    linkage_evidence = "paired-ends"
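To make the [assembly_gap] behavior concrete, here is a minimal Python sketch of how qualifiers from the configuration could be copied into every assembly_gap found in a sequence, with "<COMPUTE>" replaced by the number of N's. This is not GFF3-to-DDBJ's actual code: the names CONFIG, find_gaps, and gap_features, and the dictionary layout of a feature, are assumptions made for illustration.

```python
import re

# Hypothetical in-memory form of the [assembly_gap] table shown above.
CONFIG = {
    "assembly_gap": {
        "estimated_length": "<COMPUTE>",
        "gap_type": "within scaffold",
        "linkage_evidence": "paired-ends",
    }
}


def find_gaps(sequence):
    """Return 1-based inclusive (start, end) positions of each run of N's."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def gap_features(sequence, config):
    """Build one assembly_gap feature per run of N's, copying qualifiers from config."""
    features = []
    for start, end in find_gaps(sequence):
        qualifiers = dict(config.get("assembly_gap", {}))
        # Assumption: "<COMPUTE>" is replaced by the length of the run of N's.
        if qualifiers.get("estimated_length") == "<COMPUTE>":
            qualifiers["estimated_length"] = str(end - start + 1)
        features.append(
            {
                "feature": "assembly_gap",
                "location": f"{start}..{end}",
                "qualifiers": qualifiers,
            }
        )
    return features


features = gap_features("ACGTNNNNNACGTNNAC", CONFIG)
for feature in features:
    print(feature["location"], feature["qualifiers"])
```

The same qualifiers are attached to every gap, which is the "inserted into each occurrence" behavior described above; only estimated_length differs per gap when "<COMPUTE>" is used.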
[advanced] Edit translation table
GFF3 and DDBJ annotation correspond roughly as follows:
- GFF3 column 3 --> DDBJ annotation column 2 as "Feature"
- GFF3 column 9 --> DDBJ annotation columns 4 and 5 as "Qualifier key" and "Qualifier value"
However, nomenclature in GFF3 often does not conform to the INSDC definitions. Furthermore, DDBJ lists the feature-qualifier pairs it accepts, and this list is stricter than INSDC's.
To satisfy these requirements, I have prepared translation tables for features and qualifiers, and GFF3-to-DDBJ uses them. For example, GFF3 may contain five_prime_UTR in column 3, but the output uses the translated name 5'UTR. You can edit the translation tables and feed them with
- --translate_features <file> for feature translation
- --translate_qualifiers <file> for qualifier translation
And here is an example call.
gff3-to-ddbj \
--gff3 myfile.gff3 \
--fasta myfile.fa \
--config config.toml \
--locus_tag_prefix MYOWNPREFIX_ \
--transl_table 1 \
--translate_features translate_features.toml \
--translate_qualifiers translate_qualifiers.toml \
--output myawesome_output.ann
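As a rough illustration of what a translation table does, the Python sketch below renames the feature in GFF3 column 3 and the qualifier keys in column 9 using plain dictionaries. It is not the tool's implementation or its table format: FEATURE_TABLE, QUALIFIER_TABLE, parse_attributes, and translate are hypothetical names, and only the five_prime_UTR to 5'UTR pair comes from the text above.

```python
# Hypothetical translation tables. Only five_prime_UTR -> 5'UTR is taken from
# the text; the qualifier entry is an assumed example.
FEATURE_TABLE = {"five_prime_UTR": "5'UTR"}
QUALIFIER_TABLE = {"Note": "note"}


def parse_attributes(column9):
    """Split a GFF3 column 9 string such as 'ID=x;Note=y' into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    return attributes


def translate(gff3_line):
    """Return (feature, qualifiers) with names mapped through the tables."""
    cols = gff3_line.rstrip("\n").split("\t")
    # Names missing from a table are passed through unchanged in this sketch.
    feature = FEATURE_TABLE.get(cols[2], cols[2])
    qualifiers = {
        QUALIFIER_TABLE.get(key, key): value
        for key, value in parse_attributes(cols[8]).items()
    }
    return feature, qualifiers


line = "chr1\t.\tfive_prime_UTR\t100\t200\t.\t+\t.\tID=utr1;Note=example"
print(translate(line))
```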
Troubleshooting
Validate GFF3
It is good practice to validate your GFF3 files. The GFF3 online validator is useful, though the file size is limited to 50 MB.
Split FASTA from GFF3 (if needed)
GFF3-to-DDBJ does not work when the GFF3 file contains FASTA information inside, under the ##FASTA directive. The attached tool split-fasta reads a GFF3 file and saves the GFF3 (without FASTA info) and the FASTA as separate files.
split-fasta path/to/myfile.gff3 --suffix "_modified"
This creates two files, myfile_modified.gff3 and myfile_modified.fa.
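The idea behind the split can be sketched in a few lines of Python: everything before the ##FASTA directive is GFF3, everything after it is FASTA. This is only an illustration of the technique, not the code of split-fasta; the function name split_gff3_fasta and the sample text are assumptions.

```python
def split_gff3_fasta(text):
    """Split GFF3 text at the first ##FASTA directive line.

    Returns (gff3_part, fasta_part); the directive line itself is dropped.
    """
    gff3_lines, fasta_lines = [], []
    target = gff3_lines
    for line in text.splitlines(keepends=True):
        if target is gff3_lines and line.strip() == "##FASTA":
            target = fasta_lines  # switch to collecting FASTA lines
            continue
        target.append(line)
    return "".join(gff3_lines), "".join(fasta_lines)


sample = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGT\n"
)
gff3_part, fasta_part = split_gff3_fasta(sample)
print(gff3_part)
print(fasta_part)
```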
Fix entry names (if needed)
Characters such as =|>" [] are not allowed in the 1st column (= "Entry") of the DDBJ annotation. So, you need to rename the 1st column (= "SeqID") of your GFF3 and the headers in your FASTA. The attached tool rename-ids might be useful. This program converts an ID like ERS324955|SC|contig000013 into ERS324955:SC:contig000013.
rename-ids \
--gff3=path/to/foo.gff3 \
--fasta=path/to/bar.fasta \
--suffix="_renamed_ids"
This command saves two files, foo_renamed_ids.gff3 and bar_renamed_ids.fasta, if invalid characters are found. Otherwise, you'll see no output.
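The renaming can be pictured as a single character substitution. The Python sketch below replaces each disallowed character with a colon; it is not the code of rename-ids. Only the | to : replacement is confirmed by the example above, so treating the other characters the same way is an assumption, as are the names INVALID and rename_id.

```python
import re

# Characters listed above as not allowed in the "Entry" column.
INVALID = re.compile(r'[=|>" \[\]]')


def rename_id(seqid, replacement=":"):
    """Replace every disallowed character in an ID with `replacement`.

    Assumption: all disallowed characters map to the same replacement;
    the text only shows '|' becoming ':'.
    """
    return INVALID.sub(replacement, seqid)


print(rename_id("ERS324955|SC|contig000013"))
```

An ID without any disallowed character is returned unchanged, which matches the case where the tool writes no output.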
Under the Hood
Here is the list of operations done by gff3-to-ddbj.
- Rename Feature / Qualifier keys using the translation tables
- Search for assembly_gaps in FASTA
- Add /transl_table to each CDS
- Insert information from the configuration file
- Merge CDSs having the same parent with join notation
- Merge mRNA and exon in GFF3 and create an mRNA feature with join notation
- Check start codon consistency (except for /codon_start=1 for now)
- Let CDS have a single /product value. Move the rest to /note.
  - This is to conform to the instruction on /product.
    - Even if there are multiple general names for the same product, do not enter multiple names in 'product'. Do not use needless symbolic letters as delimiter for multiple names. If you would like to describe more than two names, please enter one of the most representative name in /product qualifier, and other(s) in /note qualifier.
    - If the name and function are not known, we recommend to describe as "hypothetical protein".
- Remove duplicates in qualifier values
- Sort lines in annotation
- Filter out Feature-Qualifier pairs following the table.
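Two of the operations above, merging CDSs with join notation and keeping a single /product, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the row layout (parent, start, end, strand), the function names, and the choice to keep the first /product value are all hypothetical.

```python
def join_location(intervals, strand="+"):
    """Format (start, end) intervals as an INSDC-style location string."""
    parts = [f"{start}..{end}" for start, end in sorted(intervals)]
    location = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    return f"complement({location})" if strand == "-" else location


def merge_cds(rows):
    """Group CDS rows (parent, start, end, strand) by parent and join them."""
    grouped = {}
    for parent, start, end, strand in rows:
        entry = grouped.setdefault(parent, {"strand": strand, "intervals": []})
        entry["intervals"].append((start, end))
    return {
        parent: join_location(entry["intervals"], entry["strand"])
        for parent, entry in grouped.items()
    }


def single_product(qualifiers):
    """Keep one /product value and move the rest to /note.

    Assumption: the first value is kept as the representative name.
    """
    products = qualifiers.get("product", [])
    result = dict(qualifiers)
    if len(products) > 1:
        result["product"] = products[:1]
        result["note"] = list(qualifiers.get("note", [])) + products[1:]
    return result


rows = [
    ("mRNA1", 100, 200, "+"),
    ("mRNA1", 300, 400, "+"),
    ("mRNA2", 50, 80, "-"),
]
print(merge_cds(rows))
print(single_product({"product": ["kinase A", "PKA"]}))
```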