Create a DDBJ annotation file from GFF3 and FASTA files

GFF3-to-DDBJ

Japanese version is here

What is this?

GFF3-to-DDBJ creates DDBJ's annotation file from GFF3 and FASTA files. It also works from FASTA alone.

Initial setup

[NOT available yet] Install via bioconda

# Create a conda environment named "ddbj", and install relevant packages from bioconda channel
$ conda create -n ddbj -c bioconda gff3toddbj

# Activate the environment "ddbj"
$ conda activate ddbj

Install from the source

# Download
$ wget https://github.com/yamaton/gff3_to_ddbj/archive/refs/heads/main.zip

# Extract, rename, and change directory
$ unzip main.zip && mv gff3toddbj-main gff3toddbj && cd gff3toddbj

# Create a conda environment named "ddbj"
$ conda create -n ddbj

# Activate the environment "ddbj"
$ conda activate ddbj

# Install dependencies to "ddbj"
$ conda install -c bioconda -c conda-forge biopython bcbio-gff

# Install gff3-to-ddbj and extra tools
$ python setup.py install

Create DDBJ annotation from GFF3 and FASTA

Run gff3-to-ddbj

Let's run the main program to get a feel for it. Here are the options.

  • --gff3 <FILE> takes GFF3 file

  • --fasta <FILE> takes FASTA file

  • --config <FILE> takes the configuration file in TOML

  • --locus_tag_prefix <STRING> takes the prefix of locus tag obtained from BioSample. You can skip this for now.

  • --transl_table <INT> takes a genetic code ID. Choose the appropriate one from The Genetic Codes. The default value is 1 ("standard").

  • --output <FILE> sets the path of the annotation output.

# Only --fasta is required. When a line is omitted:
#   --gff3, --config    -> produces the minimum output
#   --locus_tag_prefix  -> set to "LOCUSTAGPREFIX_"
#   --transl_table      -> set to 1
#   --output            -> written to standard output
gff3-to-ddbj \
  --gff3 myfile.gff3 \
  --fasta myfile.fa \
  --config config.toml \
  --locus_tag_prefix MYOWNPREFIX_ \
  --transl_table 1 \
  --output myawesome_output.ann

Customize the behavior

Edit config.toml

To customize the output, edit a configuration file in TOML, say config.toml. Take a look at the sample TOML file with COMMON, or the one without COMMON. The configuration contains the following items. They are all optional, and GFF3-to-DDBJ works even without config.toml.

  • Basic features in COMMON

  • "meta-description" in COMMON

    • DDBJ annotation supports features under COMMON, which are inserted into each of the repeatedly occurring features in the resulting flat file.

    • Here is an example.

      [COMMON.assembly_gap]
      estimated_length = "unknown"
      gap_type = "within scaffold"
      linkage_evidence = "paired-ends"
      
  • Feature-Qualifier information inserted into each occurrence by GFF3-to-DDBJ

    • This serves effectively the same purpose as the "meta-description" item above, but the repeated insertions are done by GFF3-to-DDBJ and appear in the annotation output. This configuration is mutually exclusive with the "meta-description" configuration. I'm keeping both simply because I haven't decided between them yet.

    • Here is an example. The only difference from the previous one is [assembly_gap] as opposed to [COMMON.assembly_gap].

      [assembly_gap]
      estimated_length = "unknown"   # Set to "<COMPUTE>" to count the number of N's
      gap_type = "within scaffold"
      linkage_evidence = "paired-ends"
      

[advanced] Edit translation table

GFF3 and DDBJ annotation have rough correspondence as follows:

  1. GFF3 column 3 --> DDBJ annotation column 2 as "Feature"
  2. GFF3 column 9 --> DDBJ annotation column 4 and 5 as "Qualifier key", and "Qualifier value"

but the nomenclature in GFF3 often does not conform to the INSDC definitions. Furthermore, DDBJ lists the feature-qualifier pairs it accepts, which is stricter than INSDC.

To satisfy these requirements, I have prepared translation tables for features and qualifiers, and GFF3-to-DDBJ uses them. For example, GFF3 may contain five_prime_UTR in column 3, but 5'UTR is the translated name in the output. You can edit the translation tables and feed them with

  • --translate_features <file> for feature translation
  • --translate_qualifiers <file> for qualifier translation

And here is an example call.

gff3-to-ddbj \
  --gff3 myfile.gff3 \
  --fasta myfile.fa \
  --config config.toml \
  --locus_tag_prefix MYOWNPREFIX_ \
  --transl_table 1 \
  --translate_features translate_features.toml \
  --translate_qualifiers translate_qualifiers.toml \
  --output myawesome_output.ann
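
The column correspondence and the renaming described in this section can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the in-memory tables stand in for the TOML translation files (whose schema is not shown here), the Note-to-note entry is an assumed example, and to_ddbj_parts is a hypothetical name.

```python
from urllib.parse import unquote

# Stand-ins for the translation tables. The five_prime_UTR entry comes
# from the text; the Note entry is an illustrative assumption.
FEATURE_TABLE = {"five_prime_UTR": "5'UTR"}
QUALIFIER_TABLE = {"Note": "note"}


def to_ddbj_parts(gff3_line):
    """Map one GFF3 record to (feature, [(qualifier key, qualifier value), ...])."""
    cols = gff3_line.rstrip("\n").split("\t")
    # GFF3 column 3 (type) becomes the DDBJ "Feature".
    feature = FEATURE_TABLE.get(cols[2], cols[2])
    # GFF3 column 9 (attributes) becomes qualifier key/value pairs.
    qualifiers = []
    for item in cols[8].split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        qualifiers.append((QUALIFIER_TABLE.get(key, key), unquote(value)))
    return feature, qualifiers


line = "chr1\tsrc\tfive_prime_UTR\t100\t200\t.\t+\t.\tID=utr1;Note=example%20note"
print(to_ddbj_parts(line))
```

Names missing from a table pass through unchanged in this sketch; the real tool may instead drop or rewrite them.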

Troubleshooting

Validate GFF3

It is good practice to validate your GFF3 files. The GFF3 online validator is useful, though the file size is limited to 50 MB.

Split FASTA from GFF3 (if needed)

GFF3-to-DDBJ does not work when the GFF3 file embeds FASTA sequences after a ##FASTA directive. The attached tool split-fasta reads such a GFF3 file and saves the GFF3 part (without the FASTA data) and the FASTA part separately.

split-fasta path/to/myfile.gff3 --suffix "_modified"

This creates two files, myfile_modified.gff3 and myfile_modified.fa.
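
For readers curious about what such a split involves, here is a minimal sketch that works on text in memory. It is not the code of split-fasta; the function name split_gff3_and_fasta is hypothetical, and it assumes the ##FASTA directive sits alone on its own line, as the GFF3 specification describes.

```python
def split_gff3_and_fasta(text):
    """Split GFF3 text at the ##FASTA directive into (gff3_part, fasta_part)."""
    gff3_lines, fasta_lines = [], []
    in_fasta = False
    for line in text.splitlines():
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True  # everything after the directive is sequence data
            continue
        (fasta_lines if in_fasta else gff3_lines).append(line)
    return "\n".join(gff3_lines), "\n".join(fasta_lines)


sample = (
    "##gff-version 3\n"
    "contig1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">contig1\n"
    "ACGTACGT\n"
)
gff3_part, fasta_part = split_gff3_and_fasta(sample)
print(gff3_part)
print("-----")
print(fasta_part)
```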

Fix entry names (if needed)

Characters such as =|>" [] are not allowed in the 1st column (= "Entry") of the DDBJ annotation. So, you need to rename the 1st column (= "SeqID") of your GFF3 and the headers in your FASTA. The attached tool rename-ids might be useful. This program converts an ID like ERS324955|SC|contig000013 into ERS324955:SC:contig000013.

rename-ids \
  --gff3=path/to/foo.gff3 \
  --fasta=path/to/bar.fasta \
  --suffix="_renamed_ids"

This command saves two files, foo_renamed_ids.gff3 and bar_renamed_ids.fasta if the invalid letters are found. Otherwise, you'll see no output.
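
The renaming itself is a simple character substitution. The sketch below reproduces the example conversion given above. It assumes every disallowed character is replaced by a colon, which the text confirms only for "|"; rename_id and INVALID_CHARS are hypothetical names, not part of rename-ids.

```python
import re

# Characters listed above as not allowed in an entry name.
INVALID_CHARS = re.compile(r'[=|>" \[\]]')


def rename_id(seqid, replacement=":"):
    """Replace each disallowed character in a sequence ID.

    Assumption: all disallowed characters map to the same replacement.
    """
    return INVALID_CHARS.sub(replacement, seqid)


print(rename_id("ERS324955|SC|contig000013"))
print(rename_id("contig000013"))  # already valid, so unchanged
```

The same function would need to be applied to column 1 of every GFF3 record and to every FASTA header so that the two files stay consistent.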

Under the Hood

Here is the list of operations done by gff3-to-ddbj.

  • Rename Feature / Qualifiers keys using the translation tables

  • Search for assembly_gaps in FASTA

  • Add /transl_table to each CDS

  • Insert information from the configuration file

  • Merge CDSs having the same parent with join notation

  • Merge mRNA and exon in GFF3 and create mRNA feature with join notation

  • Check start codon consistency. (Except for /codon_start=1 for now)

  • Let CDS have a single /product value. Move the rest to /note.

    • This is to conform to the instruction on /product.

      • Even if there are multiple general names for the same product, do not enter multiple names in 'product'. Do not use needless symbolic letters as delimiter for multiple names. If you would like to describe more than two names, please enter one of the most representative name in /product qualifier, and other(s) in /note qualifier.

      • If the name and function are not known, we recommend to describe as "hypothetical protein".

  • Remove duplicates in qualifier values

  • Sort lines in annotation

  • Filter out Feature-Qualifier pairs following the table.
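
Two of the operations above, the assembly_gap search and the join notation, are easy to sketch. The code below is an illustration under stated assumptions rather than the tool's implementation: it treats every run of N (of any length) as a gap, takes GFF3's 1-based inclusive coordinates as-is, and uses the hypothetical names find_gaps and join_location.

```python
import re


def find_gaps(sequence):
    """Return 1-based inclusive (start, end) for each run of N or n."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def join_location(intervals, strand="+"):
    """Format 1-based inclusive intervals as an INSDC-style location string."""
    parts = [f"{start}..{end}" for start, end in sorted(intervals)]
    location = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    return f"complement({location})" if strand == "-" else location


print(find_gaps("ACGTNNNNACGNN"))
print(join_location([(200, 300), (1, 100)]))
print(join_location([(200, 300), (1, 100)], strand="-"))
```

With estimated_length set to "<COMPUTE>", the length of each gap would be end - start + 1 for the pairs returned by find_gaps.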
