Convert GenBank format files to a swath of other formats
genbank_to
A straightforward application to convert NCBI GenBank format files to a swath of other formats. Hopefully we have the format you need. If not, either post an issue using our template or, if you have already got it working, post a PR so we can add it and add you to the project.
You might also be interested in deprekate's package called genbank, which includes
several of the features here, and you can import genbank into your Python projects.
Documentation
For comprehensive documentation, including installation instructions, usage examples, and API reference, please visit:
https://genbank-to.readthedocs.io
What it does
Read an NCBI GenBank format file (like our test data) and convert it to one of many different formats.
Input formats
At the moment we only support NCBI GenBank format. If you want us to read other common formats, let us know and we'll add them.
Output formats
Here are the output formats you can request. You can request as many of these at once as you like!
These outputs assume that you provide a file (for example, a genome file) that contains ORFs, proteins, and genomes.
Nucleotide output
- -n or --nucleotide outputs the whole DNA sequence (e.g. the genome)
- -o or --orfs outputs the DNA sequence of the open reading frames
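To make the nucleotide output concrete, here is a minimal sketch of pulling the whole DNA sequence out of a GenBank record and writing it as fasta. It uses only the Python standard library and is not how genbank_to itself is implemented; the function name genbank_origin_to_fasta and the tiny DEMO1 record are made up for illustration.

```python
def genbank_origin_to_fasta(genbank_text: str) -> str:
    """Return one fasta entry per GenBank record found in genbank_text."""
    entries = []
    seq_id, in_origin, chunks = None, False, []
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            # The second field of the LOCUS line is the record name.
            seq_id = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # "//" terminates a record: emit what we collected and reset.
            if seq_id is not None:
                entries.append(f">{seq_id}\n{''.join(chunks).upper()}\n")
            seq_id, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines look like "       61 acgtacgtac ..."; drop the offset.
            chunks.extend(line.split()[1:])
    return "".join(entries)


# Hypothetical, hand-written record used only for this illustration.
EXAMPLE = "\n".join([
    "LOCUS       DEMO1                 12 bp    DNA     linear   UNK",
    "ORIGIN",
    "        1 atggctaaat aa",
    "//",
    "",
])

print(genbank_origin_to_fasta(EXAMPLE), end="")
```

For real files, the -n/--nucleotide option described above is the better choice, since it handles the parts of the format that this sketch ignores.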
Protein output
- -a or --aminoacids outputs the protein sequence for each of the open reading frames
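As a rough illustration of what turning an ORF into a protein sequence involves, the sketch below translates a DNA string codon by codon. It is an assumption-laden example rather than genbank_to's code: translate_orf is a hypothetical helper, it only covers the amino acid assignments shared by NCBI translation tables 1 and 11, and it ignores alternative start codons.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
# NCBI translation tables 1 and 11 share these assignments; they differ
# only in which codons may act as start codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(dna: str) -> str:
    """Translate an ORF, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_orf("atggctaaataa"))
```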
Complex formats
- -p or --ptt NCBI ptt protein table. This is a somewhat deprecated NCBI format from their genomes downloads
- -f or --functions outputs tab separated data of protein ID and protein function (also called the product)
- --gff3 outputs GFF3 format
- --amr outputs a GFF file, an amino acid fasta file, and a nucleotide fasta file as required by AMR Finder Plus. Note that this output checks for the validity problems that often crash AMRFinderPlus
- --phage_finder outputs a unique format required by phage_finder
- --bakta-json outputs JSON format genome files similar to those created by Bakta.
  - This option also allows you to specify additional information which can be recorded in the JSON output, including:
    - --gram [should be + or -] whether the strain is Gram +ve or Gram -ve. Note that if not provided we try to work it out from our list of Bacteria
    - --translation-table if you didn't use 11
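The functions output is the simplest of these to picture: one line per protein, with the protein ID and the product separated by a tab. The sketch below shows one way such a table could be derived from the feature table of a GenBank record using only the standard library. It is an illustration under stated assumptions, not the code genbank_to runs: protein_functions is a hypothetical helper, the DEMO_P1 record is invented, and a CDS without a /protein_id or /product qualifier gets an empty field.

```python
import re


def protein_functions(genbank_text: str) -> list[tuple[str, str]]:
    """Return (protein ID, product) pairs for each CDS feature."""
    rows = []
    # Feature keys start after exactly five spaces; qualifier lines are
    # indented further, so this splits the text into one block per feature.
    for block in re.split(r"\n {5}(?=\S)", genbank_text):
        if not block.startswith("CDS"):
            continue
        protein_id = re.search(r'/protein_id="([^"]*)"', block)
        product = re.search(r'/product="([^"]*)"', block)
        rows.append((
            protein_id.group(1) if protein_id else "",
            # Long products wrap across lines; collapse the whitespace.
            " ".join(product.group(1).split()) if product else "",
        ))
    return rows


# Hypothetical, hand-written feature table used only for this illustration.
EXAMPLE = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     source          1..12",
    "     CDS             1..12",
    '                     /product="demo',
    '                     protein"',
    '                     /protein_id="DEMO_P1"',
    "ORIGIN",
])

for protein_id, function in protein_functions(EXAMPLE):
    print(f"{protein_id}\t{function}")
```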
Output options
- --pseudo normally we skip pseudogenes (e.g. in creating amino acid fasta files). This will try and include pseudogenes, but often biopython complains and ignores them!
- -i or --seqid only output this sequence, or these sequences if you specify more than one -i/--seqid
- -z or --zip compress some of the outputs
- --log write logs to a different file
Separate multi-GenBank files
If your GenBank file contains multiple sequence records (separated with //), you can provide the --separate flag.
This will write each entry into its own file. This is compatible with -n/--nucleotide, -o/--orfs, and
-a/--aminoacids. However, if you provide the --separate flag on its own, it will write each entry in your
multi-GenBank file to its own GenBank file.
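The idea behind separating records can be sketched in a few lines of standard-library Python: read the file line by line and close off a record each time a // terminator appears. This is only an illustration of the concept; split_genbank_records is a hypothetical helper and the two DEMO records are invented, and the file naming that --separate uses is not shown here.

```python
def split_genbank_records(genbank_text: str) -> list[str]:
    """Split multi-record GenBank text into one string per record."""
    records, current = [], []
    for line in genbank_text.splitlines(keepends=True):
        current.append(line)
        if line.startswith("//"):
            # "//" on its own line marks the end of a record.
            records.append("".join(current))
            current = []
    # Keep any trailing text that lacks a terminator, unless it is blank.
    leftover = "".join(current)
    if leftover.strip():
        records.append(leftover)
    return records


# Hypothetical, hand-written records used only for this illustration.
EXAMPLE = (
    "LOCUS       DEMO1\nORIGIN\n        1 acgt\n//\n"
    "LOCUS       DEMO2\nORIGIN\n        1 ttga\n//\n"
)

for record in split_genbank_records(EXAMPLE):
    print(record.splitlines()[0])
```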
Examples
All of these examples use our test data
- Extract a fasta of the genome:
genbank_to -g test/NC_001417.gbk -n test/NC_001417.fna
- Extract the DNA sequences of the ORFs to a single file
genbank_to -g test/NC_001417.gbk -o test/NC_001417.orfs
- Extract the protein (amino acid) sequences of the ORFs to a file
genbank_to -g test/NC_001417.gbk -a test/NC_001417.faa
- Extract Bakta format JSON
genbank_to -g test/NC_001417.gbk --bakta-json test/NC_001417.json
- Do all of these at once
genbank_to -g test/NC_001417.gbk -n test/NC_001417.fna -o test/NC_001417.orfs -a test/NC_001417.faa --bakta-json test/NC_001417.json
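If you want to drive these commands from a Python script, one option is to assemble the argument list and hand it to subprocess. The sketch below uses only the flags shown in the examples above (-g, -n, -o, -a and --bakta-json); build_command and run_genbank_to are hypothetical helper names, not part of genbank_to.

```python
import shlex
import shutil
import subprocess


def build_command(genbank, nucleotide=None, orfs=None,
                  aminoacids=None, bakta_json=None):
    """Build a genbank_to argument list, adding only the requested outputs."""
    cmd = ["genbank_to", "-g", genbank]
    options = (
        ("-n", nucleotide),
        ("-o", orfs),
        ("-a", aminoacids),
        ("--bakta-json", bakta_json),
    )
    for flag, value in options:
        if value is not None:
            cmd += [flag, value]
    return cmd


def run_genbank_to(cmd):
    """Run the command, failing early if genbank_to is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("genbank_to is not installed or not on PATH")
    subprocess.run(cmd, check=True)


cmd = build_command(
    "test/NC_001417.gbk",
    nucleotide="test/NC_001417.fna",
    aminoacids="test/NC_001417.faa",
)
# Print the command rather than running it, so this works without an install.
print(shlex.join(cmd))
```

Passing a list to subprocess.run avoids shell quoting problems with unusual file names; call run_genbank_to(cmd) once genbank_to is installed.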
Installation
You can install genbank_to in three different ways:
- Using conda
This is the easiest and recommended method.
mamba create -n genbank_to genbank_to
conda activate genbank_to
genbank_to --help
- Using pip
I recommend putting this into a virtual environment:
virtualenv venv
source venv/bin/activate
pip install genbank_to
genbank_to --help
- Directly from this repository
(Not really recommended as things might break)
git clone https://github.com/linsalrob/genbank_to.git
cd genbank_to
virtualenv venv
source venv/bin/activate
pip install .
genbank_to --help
More Information
For detailed documentation, including:
- Comprehensive usage examples
- Complete API reference for the Python library
- All output format specifications
- Contributing guidelines
- And much more!
Please visit our documentation at https://genbank-to.readthedocs.io