Create a DDBJ annotation file from GFF3 and FASTA files

GFF3-to-DDBJ

A Japanese version is available here.

[TOC]

What is this?
Initial setup
Create DDBJ annotation from GFF3 and FASTA
- Run gff3-to-ddbj
Under the Hood
Customize the behavior
Troubleshooting
Credit

Table of contents generated with markdown-toc

What is this?

GFF3-to-DDBJ creates the annotation file for submission to DDBJ by taking GFF3 and FASTA files as input. It also works with FASTA alone.

Analogous programs are GAG for submissions to NCBI, and EMBLmyGFF3 for submissions to EMBL.

Initial setup

Install with bioconda

# Create a conda environment named "ddbj", and install relevant packages from bioconda channel
$ conda create -n ddbj -c bioconda -c conda-forge gff3toddbj

# Activate the environment "ddbj"
$ conda activate ddbj

Install with pip

# Create a conda environment named "ddbj" and install pip
$ conda create -n ddbj pip

# Activate the environment "ddbj"
$ conda activate ddbj

# Install from pip
$ pip install gff3toddbj

Install from the source

# Download
$ wget https://github.com/yamaton/gff3_to_ddbj/archive/refs/heads/main.zip

# Extract, rename, and change directory
$ unzip main.zip && mv gff3toddbj-main gff3toddbj && cd gff3toddbj

# Create a conda environment named "ddbj"
$ conda create -n ddbj

# Activate the environment "ddbj"
$ conda activate ddbj

# Install dependencies to "ddbj"
$ conda install -c bioconda -c conda-forge biopython bcbio-gff toml setuptools

# Install gff3-to-ddbj and extra tools
$ python setup.py install

Create DDBJ annotation from GFF3 and FASTA

Run `gff3-to-ddbj`

Let's run the main program to get some ideas.

# --gff3:             bare-minimum output if omitted
# --fasta:            <<REQUIRED>>
# --metadata:         example metadata used if omitted
# --locus_tag_prefix: default is "LOCUSTAGPREFIX_"
# --transl_table:     default is 1
# --output:           standard output if omitted
gff3-to-ddbj \
  --gff3 myfile.gff3 \
  --fasta myfile.fa \
  --metadata mymetadata.toml \
  --locus_tag_prefix MYOWNPREFIX_ \
  --transl_table 1 \
  --output myawesome_output.ann

Here are the options:

--gff3 <FILE> takes a GFF3 file.
--fasta <FILE> takes a FASTA file (required).
--metadata <FILE> takes the metadata file in TOML.
--locus_tag_prefix <STRING> takes the locus tag prefix obtained from BioSample. You can skip this for now.
--transl_table <INT> selects the genetic code; choose the appropriate one from The Genetic Codes. The default value is 1 ("standard").
--output <FILE> sets the path of the annotation output.
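
When the conversion is driven from a script, the same invocation can be assembled programmatically. The following is only a minimal sketch: it assumes `gff3-to-ddbj` is on the `PATH`, and the helper name `build_command` is hypothetical and not part of the package. Only the flags documented above are used.

```python
import shlex
from typing import List, Optional


def build_command(
    fasta: str,
    gff3: Optional[str] = None,
    metadata: Optional[str] = None,
    locus_tag_prefix: Optional[str] = None,
    transl_table: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble a gff3-to-ddbj argument list (hypothetical helper).

    Only --fasta is required; every other flag is added only when given,
    so the program falls back to its documented defaults otherwise.
    """
    cmd = ["gff3-to-ddbj", "--fasta", fasta]
    optional = [
        ("--gff3", gff3),
        ("--metadata", metadata),
        ("--locus_tag_prefix", locus_tag_prefix),
        ("--transl_table", transl_table),
        ("--output", output),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command("myfile.fa", gff3="myfile.gff3", transl_table=1)
print(shlex.join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with file names that contain spaces.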

Under the Hood

Here is the list of operations gff3-to-ddbj will do:

Store FASTA sequences to SQLite database to save memory use
- The database is deleted after the operation.
Rename Feature / Qualifiers keys using the translation tables
Search for assembly_gaps in FASTA
Add /transl_table to each CDS
Insert source information from the metadata file
Merge CDSs having the same parent with join notation
Merge mRNA and exon in GFF3 and create mRNA feature with join notation
Modify locations with inequality signs (< and >) if start/stop codon is absent.
- See Offset of the frame at translation initiation by codon_start
Let each CDS have a single /product value: set it to "hypothetical protein" if absent, and move the rest of the existing values to /note.
- This is to conform to the instruction on /product.
  - Even if there are multiple general names for the same product, do not enter multiple names in 'product'. Do not use needless symbolic letters as delimiter for multiple names. If you would like to describe more than two names, please enter one of the most representative name in /product qualifier, and other(s) in /note qualifier.
  - If the name and function are not known, we recommend to describe as "hypothetical protein".
Remove duplicates in qualifier values
Sort lines in annotation
Filter out Feature-Qualifier pairs following the table.
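
Two of the operations listed above, locating assembly gaps and writing a multi-part location with join notation, can be illustrated with a short sketch. This is not the code GFF3-to-DDBJ actually runs: the function names `find_gaps` and `join_location`, the `min_len` threshold, and the treatment of lowercase `n` as a gap are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def find_gaps(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of each run of N/n.

    min_len is a hypothetical threshold; shorter runs are ignored.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq)]


def join_location(parts: List[Tuple[int, int]], complement: bool = False) -> str:
    """Render intervals as a location string, using join() for 2+ parts."""
    spans = ",".join(f"{start}..{end}" for start, end in sorted(parts))
    loc = f"join({spans})" if len(parts) > 1 else spans
    return f"complement({loc})" if complement else loc


print(find_gaps("ACGTNNNNACGTNNAC"))          # [(5, 8), (13, 14)]
print(join_location([(201, 300), (1, 100)]))  # join(1..100,201..300)
```

The intervals are sorted before joining so that CDS or exon records listed out of order in the GFF3 still produce an ascending location.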

Customize the behavior

Metadata file

To enter information missing in GFF3 or FASTA, such as submitter names and certain qualifier values, you need to feed a metadata file in TOML, say mymetadata.toml. Take a look at an example matching the example annotation on the DDBJ page.

The file accommodates the following items, all of which are optional. That is, GFF3-to-DDBJ works even with an empty file.

Basic features in the COMMON entry
- ... such as SUBMITTER, REFERENCE, and COMMENT.
"meta-description" in the COMMON entry
- DDBJ annotation supports "meta" values: a feature placed under COMMON is inserted into each occurrence in the resulting flat file produced by DDBJ. Here is an example that puts an assembly_gap feature under the COMMON entry:
```
[COMMON.assembly_gap]
estimated_length = "unknown"
gap_type = "within scaffold"
linkage_evidence = "paired-ends"
```
Feature-qualifier items inserted to each occurrence
- Here is an example. The only difference from the previous case is the header: [assembly_gap] as opposed to [COMMON.assembly_gap].
```
[assembly_gap]
estimated_length = "unknown"   # Set it to "<COMPUTE>" to count the number of N's
gap_type = "within scaffold"
linkage_evidence = "paired-ends"
```
- While this should work effectively the same as the "meta-description" item above, use this notation if you want the values written repeatedly, once per occurrence, in the annotation file produced by GFF3-to-DDBJ.
- Currently only [source] and [assembly_gap] are supported.
- If both [COMMON.assembly_gap] and [assembly_gap] exist in the metadata file, gff3-to-ddbj takes the one with COMMON.
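
To make the two notations concrete, here is a minimal sketch of how a metadata file containing both could be resolved, with the COMMON entry winning. It illustrates the rule only and is not the program's implementation; the function name `qualifiers_for` and the labels it returns are hypothetical. It uses `tomllib` from the standard library (Python 3.11+), falling back to the third-party `tomli` package.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters: pip install tomli
    import tomli as tomllib

METADATA = """
[COMMON.assembly_gap]
estimated_length = "unknown"
gap_type = "within scaffold"
linkage_evidence = "paired-ends"

[assembly_gap]
estimated_length = "<COMPUTE>"
gap_type = "within scaffold"
linkage_evidence = "paired-ends"
"""


def qualifiers_for(meta: dict, feature: str) -> tuple:
    """Return (where, qualifiers) for a feature, preferring the COMMON entry."""
    common = meta.get("COMMON", {})
    if feature in common:
        return "COMMON", common[feature]
    if feature in meta:
        return "each occurrence", meta[feature]
    return "absent", {}


meta = tomllib.loads(METADATA)
where, quals = qualifiers_for(meta, "assembly_gap")
print(where, quals["estimated_length"])  # COMMON unknown
```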

For more examples, see WGS in COMMON and WGS provided by DDBJ as annotation examples, and corresponding metadata files metadata_WGS_COMMON.toml and metadata_WGS.toml in this repository.

[Advanced] Feature/Qualifier rename setting

GFF3 and DDBJ annotation roughly correspond as follows:

GFF3 column 3 "type" → DDBJ annotation column 2 as "Feature"
GFF3 column 9 "attribute" → DDBJ annotation columns 4 and 5 as "Qualifier key" and "Qualifier value"

However, the nomenclature in GFF3 often does not conform to the annotations defined by INSDC. Furthermore, DDBJ lists the feature-qualifier pairs it accepts, which are a subset of the INSDC definitions.

To meet these conventions and requirements, GFF3-to-DDBJ comes with a TOML file to rename feature keys and qualifier keys/values.

Here is a way to customize the renaming schema.

Rename types/feature keys

The default setting renames five_prime_UTR "type" in GFF3 into 5'UTR "feature key" in the annotation. This transformation is expressed in TOML as follows:

[five_prime_UTR]
feature_key = "5'UTR"

Rename attributes/qualifier keys

This is about renaming attributes under arbitrary types. By default, an ID=foobar "attribute" in GFF3 becomes a /note="ID:foobar" qualifier in the annotation. (Here I follow the convention of writing a leading slash, as in /note, to denote a qualifier. The DDBJ annotation itself does NOT include the slash, so no slash is used in any of the TOML files.)

Here is the TOML defining the transformation. __ANY__ is the special name representing arbitrary types. ID is the original attribute key. note is the name of corresponding qualifier key. ID: is attached as the prefix of the qualifier value.

[__ANY__]  # This line is required for structural reasons
[__ANY__.ID]
qualifier_key = "note"
qualifier_value_prefix = "ID:"  # optional
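
The effect of this rule on a single attribute can be sketched as follows. The dict mirrors what a TOML parser would return for the snippet above; the function name `rename_attribute`, the order in which a type-specific rule is tried before `__ANY__`, and the pass-through of attributes without a rule are assumptions for illustration, not documented behavior.

```python
from typing import Dict, Tuple

# The TOML above, as the nested dict a TOML parser would produce.
RENAME = {
    "__ANY__": {
        "ID": {"qualifier_key": "note", "qualifier_value_prefix": "ID:"},
    },
}


def rename_attribute(rules: Dict, gff3_type: str, key: str, value: str) -> Tuple[str, str]:
    """Translate one GFF3 attribute into a (qualifier_key, qualifier_value) pair."""
    # Assumed lookup order: a rule for this specific type, then __ANY__.
    for scope in (gff3_type, "__ANY__"):
        rule = rules.get(scope, {}).get(key)
        if isinstance(rule, dict) and "qualifier_key" in rule:
            prefix = rule.get("qualifier_value_prefix", "")  # optional in TOML
            return rule["qualifier_key"], prefix + value
    return key, value  # no matching rule: keep as-is (assumption)


print(rename_attribute(RENAME, "gene", "ID", "foobar"))  # ('note', 'ID:foobar')
```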

Translate types to features with qualifiers

Sometimes we want to replace certain types with features WITH qualifiers. For example, snRNA is an invalid feature in INSDC/DDBJ, hence we replace it with an ncRNA feature carrying the /ncRNA_class="snRNA" qualifier. Such a transformation is written in TOML as follows.

[snRNA]
feature_key = "ncRNA"
qualifier_key = "ncRNA_class"
qualifier_value = "snRNA"
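
A small sketch of how such a rule could be applied to a GFF3 type. The function name `translate_type` is hypothetical, and the fallback of leaving unknown types unchanged is an assumption; the rule contents are copied from the TOML examples in this document.

```python
from typing import Dict, Tuple

RULES = {
    "five_prime_UTR": {"feature_key": "5'UTR"},
    "snRNA": {
        "feature_key": "ncRNA",
        "qualifier_key": "ncRNA_class",
        "qualifier_value": "snRNA",
    },
}


def translate_type(rules: Dict, gff3_type: str) -> Tuple[str, Dict[str, str]]:
    """Return (feature_key, extra_qualifiers) for a GFF3 type."""
    rule = rules.get(gff3_type, {})
    feature = rule.get("feature_key", gff3_type)  # unchanged if no rule (assumption)
    extra = {}
    if "qualifier_key" in rule:
        # The rule contributes one extra qualifier to the renamed feature.
        extra[rule["qualifier_key"]] = rule["qualifier_value"]
    return feature, extra


print(translate_type(RULES, "snRNA"))           # ('ncRNA', {'ncRNA_class': 'snRNA'})
print(translate_type(RULES, "five_prime_UTR"))  # ("5'UTR", {})
```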

Translate (type, attribute) items to features

Example: some annotation programs produce a GFF3 line containing RNA as the type and biotype=misc_RNA as one of the attributes. Such a line should be translated to a misc_RNA feature in the annotation.

[RNA]    # Required though redundant
[RNA.biotype]
attribute_value = "misc_RNA"
feature_key = "misc_RNA"
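
The matching logic implied by this rule can be sketched as below: the feature key changes only when the named attribute carries the listed value. The function name `feature_from_attributes` is hypothetical, and what happens to the matched attribute afterwards (kept or dropped) is not specified here, so the sketch leaves it alone.

```python
from typing import Dict

# The TOML above, as the nested dict a TOML parser would produce.
RULES = {
    "RNA": {
        "biotype": {"attribute_value": "misc_RNA", "feature_key": "misc_RNA"},
    },
}


def feature_from_attributes(rules: Dict, gff3_type: str, attributes: Dict[str, str]) -> str:
    """Pick the feature key for a GFF3 line from its type and attributes."""
    for key, rule in rules.get(gff3_type, {}).items():
        if not isinstance(rule, dict) or "attribute_value" not in rule:
            continue  # not a (type, attribute) rule
        if attributes.get(key) == rule["attribute_value"]:
            return rule["feature_key"]
    return gff3_type  # no rule matched: keep the original type (assumption)


print(feature_from_attributes(RULES, "RNA", {"ID": "r1", "biotype": "misc_RNA"}))  # misc_RNA
```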

Run with a custom file

See translate_features_qualifiers.toml for the default behavior. To feed a custom translation table, use the CLI option:

--rename_setting <FILE>

And here is an example call:

# Set your customized file with --rename_setting
gff3-to-ddbj \
  --gff3 myfile.gff3 \
  --fasta myfile.fa \
  --metadata mymetadata.toml \
  --locus_tag_prefix MYOWNPREFIX_ \
  --transl_table 1 \
  --rename_setting my_translate_features_qualifiers.toml \
  --output myawesome_output.ann

[Advanced] Feature/Qualifier filter setting

DDBJ specifies a recommended Feature/Qualifier usage matrix. To conform to it, features and qualifiers appearing in the annotation output are filtered by default, using a filter file in TOML structured like this:

CDS = [
"EC_number",
"inference",
"locus_tag",
"note",
"product",
]

exon = [
"gene",
"locus_tag",
"note",
]

The left-hand side of the equal sign = represents an allowed feature key, and the right-hand side is a list of allowed qualifier keys. In this example, only CDS and exon features will show up in the annotation, and qualifiers are limited to the listed items. To customize this filtering function, edit the TOML file first and pass the file with the CLI option:

--filter_setting <FILE>
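
The filtering rule described above amounts to an allow-list lookup. A minimal sketch, assuming features are represented as `(feature_key, qualifiers)` pairs; that representation and the function name `filter_features` are illustrative choices, not the program's internals.

```python
from typing import Dict, List, Tuple

# The example filter table above, as a TOML parser would load it.
ALLOWED = {
    "CDS": ["EC_number", "inference", "locus_tag", "note", "product"],
    "exon": ["gene", "locus_tag", "note"],
}

Feature = Tuple[str, Dict[str, str]]


def filter_features(features: List[Feature], allowed: Dict[str, List[str]]) -> List[Feature]:
    """Keep only allowed feature keys, and only their allowed qualifiers."""
    kept = []
    for key, qualifiers in features:
        if key not in allowed:
            continue  # the whole feature is dropped
        permitted = set(allowed[key])
        kept.append((key, {q: v for q, v in qualifiers.items() if q in permitted}))
    return kept


features = [
    ("CDS", {"product": "hypothetical protein", "ID": "cds1"}),
    ("exon", {"gene": "abc", "product": "x"}),
    ("intron", {"note": "dropped entirely"}),
]
print(filter_features(features, ALLOWED))
# [('CDS', {'product': 'hypothetical protein'}), ('exon', {'gene': 'abc'})]
```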

Troubleshooting

Validate GFF3

It is good practice to validate your GFF3 files. The GFF3 online validator is useful, though the file size is limited to 50MB.

Split FASTA from GFF3 (if needed)

GFF3-to-DDBJ does not work when the GFF3 file embeds FASTA information with the ##FASTA directive. The bundled tool split-fasta reads such a GFF3 file and saves a GFF3 file (without FASTA info) and a FASTA file.

split-fasta path/to/myfile.gff3 --suffix "_splitted"

This creates two files, myfile_splitted.gff3 and myfile_splitted.fa.
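
For readers curious what the split involves, the idea can be sketched in a few lines: everything before the `##FASTA` directive is annotation, and everything after it is sequence. This is an illustration of the technique, not the source of split-fasta; the function name `split_gff3` is hypothetical.

```python
from typing import Tuple


def split_gff3(text: str) -> Tuple[str, str]:
    """Split GFF3 text into (annotation, fasta) at the ##FASTA directive."""
    gff_lines, fasta_lines, in_fasta = [], [], False
    for line in text.splitlines(keepends=True):
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True  # the directive line itself is discarded
            continue
        (fasta_lines if in_fasta else gff_lines).append(line)
    return "".join(gff_lines), "".join(fasta_lines)


sample = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGTACGTA\n"
)
gff, fasta = split_gff3(sample)
print(fasta, end="")
```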

Normalize entry names (if needed)

Characters such as =|>" [] are not allowed in the 1st column (= "Entry") of the DDBJ annotation. The bundled program normalize-entry-names renames such entries. For example, it converts an ID like ERS324955|SC|contig000013 into ERS324955:SC:contig000013.

normalize-entry-names myannotation_output.txt

This command creates the file myannotation_output_renamed.txt if invalid characters are found. Otherwise, you'll see no output.
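
The renaming itself can be pictured as a character substitution. In the sketch below, the set of invalid characters and the `|` to `:` example come from the text above; replacing every other invalid character with `:` as well is an assumption, and `normalize_entry` is a hypothetical name rather than the program's API.

```python
import re

# Characters not allowed in the "Entry" column: = | > " space [ ]
INVALID = re.compile(r'[=|>" \[\]]')


def normalize_entry(name: str, replacement: str = ":") -> str:
    """Replace each invalid character in an entry name with `replacement`."""
    return INVALID.sub(replacement, name)


print(normalize_entry("ERS324955|SC|contig000013"))  # ERS324955:SC:contig000013
```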

Credit

GFF3-to-DDBJ's design is deeply indebted to EMBLmyGFF3, a versatile conversion tool for the EMBL annotation format.

Release history

0.4.0: Oct 22, 2021
0.3.0: Oct 12, 2021
0.2.4: Oct 4, 2021
0.2.3: Sep 30, 2021
0.2.1 (this version): Sep 28, 2021
0.2.0: Sep 28, 2021
0.1.1: Sep 14, 2021
