Convert GenBank format files to a swath of other formats

genbank_to

Edwards Lab. License: MIT.

A straightforward application to convert NCBI GenBank format files to a swath of other formats. Hopefully we have the format you need. If not, either post an issue using our template or, if you have already got it working, post a PR so we can add it and add you to the project.

You might also be interested in deprekate's package called genbank, which includes several of the features here and which you can import into your Python projects.

Documentation

For comprehensive documentation, including installation instructions, usage examples, and API reference, please visit:

https://genbank-to.readthedocs.io

What it does

Read an NCBI GenBank format file (like our test data) and convert it to one of many different formats.

Input formats

At the moment we only support NCBI GenBank format. If you want us to read other common formats, let us know and we'll add them.

Output formats

Here are the output formats you can request. You can request as many of these at once as you like!

These outputs assume you provide, for example, a genome file that contains ORFs, proteins, and genomes.

Nucleotide output

  • -n or --nucleotide outputs the whole DNA sequence (e.g. the genome)
  • -o or --orfs outputs the DNA sequence of the open reading frames

Protein output

  • -a or --aminoacids outputs the protein sequence for each of the open reading frames
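The nucleotide and protein outputs above are described as fasta files in the examples later in this document. As a quick sanity check on what was written, a minimal FASTA reader can be sketched with only the standard library. The helper name `read_fasta` is hypothetical and not part of genbank_to, and the sketch assumes plain, uncompressed FASTA text.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}.

    Hypothetical helper for illustration; not part of genbank_to.
    """
    sequences: Dict[str, str] = {}
    current_id: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so join them
            sequences[current_id] += line
    return sequences


# Made-up records, only to show the expected shape of the input
example = """>orf1 hypothetical protein
MKT
LLV
>orf2 coat protein
MAS
"""
proteins = read_fasta(example.splitlines())
print(len(proteins), "sequences")
```

To inspect a real output file, pass an open file handle instead of `example.splitlines()`.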

Complex formats

  • -p or --ptt NCBI ptt protein table. This is a somewhat deprecated NCBI format from their genomes downloads
  • -f or --functions outputs tab separated data of protein ID and protein function (also called the product)
  • --gff3 outputs GFF3 format
  • --amr outputs a GFF file, an amino acid fasta file, and a nucleotide fasta file as required by AMRFinderPlus. Note that this option checks the output for validity problems that often crash AMRFinderPlus
  • --phage_finder outputs a unique format required by phage_finder
  • --bakta-json outputs JSON format genome files similar to those created by Bakta.
    • This option also allows you to specify additional information which can be recorded in the JSON output, including:
      • --gram [should be + or -] whether the strain is Gram +ve or Gram -ve. Note that if not provided, we infer it for some strains from our list of bacteria
      • --translation-table if you didn't use 11
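The -f/--functions output is described above as tab separated protein ID and protein function. A minimal sketch of loading such a file into a dictionary might look like this. It assumes two columns per line (ID, then product) and no header row, which the description above does not confirm, and `read_functions` is a hypothetical name rather than part of genbank_to.

```python
import csv
import io
from typing import Dict, TextIO


def read_functions(handle: TextIO) -> Dict[str, str]:
    """Map protein ID to product from tab-separated text.

    Assumption: column 1 is the protein ID, column 2 is the product,
    and there is no header row. Not part of genbank_to.
    """
    functions: Dict[str, str] = {}
    # QUOTE_NONE keeps quote characters in product names as literal text
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        functions[row[0]] = row[1]
    return functions


# Made-up rows, only to show the assumed layout
sample = io.StringIO("prot_1\tcoat protein\nprot_2\thypothetical protein\n")
functions = read_functions(sample)
print(functions["prot_1"])
```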

Output options

  • --pseudo normally we skip pseudogenes (e.g. when creating amino acid fasta files). This option tries to include pseudogenes, but Biopython often complains and ignores them!
  • -i or --seqid only output this sequence, or these sequences if you specify more than one -i/--seqid
  • -z or --zip compress some of the outputs
  • --log write logs to a different file

Separate multi-GenBank files

If your GenBank file contains multiple sequence records (separated with //), you can provide the --separate flag. This will write each entry into its own file. This is compatible with -n/--nucleotide, -o/--orfs, and -a/--aminoacids. However, if you provide the --separate flag on its own, it will write each entry in your multi-GenBank file to its own GenBank file.
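To illustrate what separating on // means, here is a minimal sketch that splits multi-record GenBank text into one string per record. It is not how genbank_to implements --separate; `split_genbank_records` is a hypothetical helper, and it only relies on each record ending with a line containing just //.

```python
from typing import List


def split_genbank_records(text: str) -> List[str]:
    """Split multi-record GenBank text into one string per record.

    Hypothetical helper for illustration; not part of genbank_to.
    """
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if not current and not line.strip():
            continue  # ignore blank lines between records
        current.append(line)
        if line.strip() == "//":
            # A line holding only // terminates the current record
            records.append("\n".join(current) + "\n")
            current = []
    if current:
        # Keep a trailing record that lacks its // terminator
        records.append("\n".join(current) + "\n")
    return records


# Heavily truncated, made-up records, only to show the splitting
sample = "LOCUS       A\nORIGIN\n//\n\nLOCUS       B\nORIGIN\n//\n"
records = split_genbank_records(sample)
print(len(records), "records")
```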

Examples

All of these examples use our test data.

  1. Extract a fasta of the genome:
genbank_to -g test/NC_001417.gbk -n test/NC_001417.fna
  2. Extract the DNA sequences of the ORFs to a single file:
genbank_to -g test/NC_001417.gbk -o test/NC_001417.orfs
  3. Extract the protein (amino acid) sequences of the ORFs to a file:
genbank_to -g test/NC_001417.gbk -a test/NC_001417.faa
  4. Extract Bakta format JSON:
genbank_to -g test/NC_001417.gbk --bakta-json test/NC_001417.json
  5. Do all of these at once:
genbank_to -g test/NC_001417.gbk -n test/NC_001417.fna -o test/NC_001417.orfs -a test/NC_001417.faa --bakta-json test/NC_001417.json
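If you drive genbank_to from a Python script, the same command lines can be assembled as an argument list. The sketch below uses only the flags shown in the examples above. `build_command` is a hypothetical helper, not part of genbank_to, and it only builds and prints the command rather than running it.

```python
import shlex
from typing import Dict, List


def build_command(genbank: str, outputs: Dict[str, str]) -> List[str]:
    """Build a genbank_to argument list from {flag: output_path}.

    Hypothetical helper for illustration; not part of genbank_to.
    """
    command = ["genbank_to", "-g", genbank]
    for flag, path in outputs.items():
        command.extend([flag, path])
    return command


command = build_command(
    "test/NC_001417.gbk",
    {"-n": "test/NC_001417.fna", "-a": "test/NC_001417.faa"},
)
# Show the equivalent shell command
print(shlex.join(command))
# To actually run it (requires genbank_to to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with unusual file names.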

Installation

You can install genbank_to in three different ways:

  1. Using conda

This is the easiest and recommended method.

mamba create -n genbank_to genbank_to
conda activate genbank_to
genbank_to --help

  2. Using pip

I recommend putting this into a virtual environment:

virtualenv venv
source venv/bin/activate
pip install genbank_to
genbank_to --help

  3. Directly from this repository

(Not really recommended as things might break)

git clone https://github.com/linsalrob/genbank_to.git
cd genbank_to
virtualenv venv
source venv/bin/activate
pip install .
genbank_to --help
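Besides running genbank_to --help, you can check from Python whether the package is visible in the current environment. This is a minimal sketch using the standard library. `installed_version` is a hypothetical helper, and it assumes the distribution is registered under the name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "genbank_to") -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "genbank_to is not installed in this environment")
```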

More Information

For detailed documentation, including:

  • Comprehensive usage examples
  • Complete API reference for the Python library
  • All output format specifications
  • Contributing guidelines
  • And much more!

Please visit our documentation at https://genbank-to.readthedocs.io
