Goodbye, GenBank converts Biopython SeqFeature sequence annotations from NCBI GenBank records to a common, simplified format. GenBank feature annotations have a feature key and reasonably well-defined qualifiers, but non-standard and discontinued feature types and qualifiers are in common use, and the feature key is often an invented name rather than a valid GenBank feature key. Even a valid GenBank feature key is often incomplete and of little use without additional details in the qualifiers.

This package converts most feature keys to appropriate Sequence Ontology terms used by GFF3 and SBOL. Non-standard qualifiers are repaired or removed.

Goodbye, GenBank is intended for those who wish to clean up their GenBank files and then transition to a different format. The philosophy of this project is to salvage what is salvageable and to discard what is not. GenBank feature types are translated to Sequence Ontology terms; qualifiers are converted into a reduced set that contains only the parts that are not broken. Qualifiers are also converted to their correct type: int for integers, list only for qualifiers that can appear multiple times, and bool for flags.
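The type conversion described above can be illustrated with a small sketch. This is not gbgb's actual code or API: the function name normalize_qualifiers and the four qualifier sets are hypothetical, and the input is assumed to use the Biopython convention of mapping each qualifier name to a list of strings.

```python
# Illustrative sketch only; not gbgb's implementation.
# The sets below are hypothetical examples of how qualifiers might be classified.
INT_QUALIFIERS = {"codon_start", "transl_table"}   # single integer value
FLAG_QUALIFIERS = {"pseudo"}                       # presence means True
REPEATABLE_QUALIFIERS = {"db_xref"}                # may appear several times
STRING_QUALIFIERS = {"note", "product"}            # single string value


def normalize_qualifiers(qualifiers):
    """Convert {name: [str, ...]} qualifiers to typed values.

    Qualifiers that are not in any of the known sets are discarded.
    """
    result = {}
    for name, values in qualifiers.items():
        if name in FLAG_QUALIFIERS:
            result[name] = True
        elif name in REPEATABLE_QUALIFIERS:
            result[name] = list(values)
        elif name in INT_QUALIFIERS:
            result[name] = int(values[0])
        elif name in STRING_QUALIFIERS:
            result[name] = values[0]
    return result
```

In this sketch only repeatable qualifiers stay as lists; everything else collapses to a scalar, and unknown qualifiers are dropped rather than passed through.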

Moreover, different options are available to configure what is kept and what is thrown away.

Installation

pip install gbgb

Example

>>> feature
SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='-10_signal')
>>> feature.qualifiers
{'ApEinfo_fwdcolor': ['pink'],
 'ApEinfo_graphicformat': ['arrow_data {{0 1 2 0 0 -1} {} 0} width 5 offset 0'],
 'ApEinfo_revcolor': ['pink'],
 'label': ['RNAII Promoter (-10 signal)']}
>>>
>>> from gbgb import convert_feature
>>> feature = convert_feature(feature)
>>> feature
SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='minus_10_signal')
>>> feature.qualifiers
{'note': 'RNAII Promoter (-10 signal)'}
>>>
>>> from gbgb import genbank_feature_key
>>> genbank_feature_key('minus_10_signal')
'regulatory'

Design considerations

For the most part, Goodbye, GenBank attempts to be idempotent, i.e. features and their types/keys and qualifiers can be safely transformed any number of times with the same settings. The apparent mismatch between the conversion to Sequence Ontology feature terms and valid/fixed GenBank qualifiers is to simplify downstream processing. It is up to the users which qualifiers they wish to keep, but at least the choices they are given are reasonable.
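As a rough illustration of what idempotence means here, the sketch below uses a toy key mapping rather than gbgb itself. The names KEY_TO_SO, to_so_term and is_idempotent are hypothetical and exist only for this example.

```python
# Toy stand-in for a feature-key conversion; not gbgb's implementation.
KEY_TO_SO = {
    "-10_signal": "minus_10_signal",
    "-35_signal": "minus_35_signal",
}


def to_so_term(key):
    """Map a GenBank-style key to a Sequence Ontology term.

    Keys without an entry (including keys that are already SO terms)
    pass through unchanged, which makes repeated application safe.
    """
    return KEY_TO_SO.get(key, key)


def is_idempotent(func, samples):
    """Check that applying func twice gives the same result as applying it once."""
    return all(func(func(s)) == func(s) for s in samples)
```

A check such as is_idempotent(to_so_term, ["-10_signal", "minus_10_signal", "CDS"]) expresses the property: converting an already converted feature changes nothing.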

Contributing

If you have any questions or suggestions or if you have found a unique new specimen of GenBank files that you would like to convert, please open an issue.

Issues

  • SO Term: “regulatory” feature type with /regulatory_class="enhancer_blocking_element"

    There is apparently no matching Sequence Ontology term. An enhancer blocking element behaves like an insulator, but is not an insulator. It is a transcriptional cis regulatory region, but that description is too broad.

  • SO Term: “misc_structure” feature type

    GenBank uses this feature type for secondary and tertiary nucleotide structures. There appears to be no matching Sequence Ontology term.

  • SO Term: “assembly_gap” feature type

    GenBank has both “gap” and “assembly_gap” feature types, which appear to have slightly different meanings. However, SO only has a “gap” term, which refers to assembly gaps.

  • GFF3 export

    There does not appear to be a good GFF3 exporter available, so this project could provide one. Skeleton code is in gbgb.export.gff3.

  • Reduction of SO terms

    Allow users to specify a set of Sequence Ontology terms (inheriting from “sequence_feature”). Feature types will be reduced to the nearest Ontology term. This is to simplify downstream analysis.

  • /pseudo qualifier without /pseudogene=""

    There is no matching Sequence Ontology term for this. Several GenBank files contain /pseudo without /pseudogene="" to mean pseudogene.

  • Mandatory qualifiers

    These should be filled in using a reasonable guess or errors should be thrown when trying to convert a feature without its mandatory qualifiers.
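The "Reduction of SO terms" item above could be sketched as a walk up the ontology until an allowed term is reached. The code below is an assumption about how this might work, not existing gbgb code: the PARENT table is a hypothetical and heavily simplified hierarchy (the real Sequence Ontology is a much larger graph in which a term can have several parents), and reduce_term is an illustrative name.

```python
# Hypothetical, simplified is_a hierarchy used only for illustration.
PARENT = {
    "minus_10_signal": "promoter",
    "promoter": "regulatory_region",
    "regulatory_region": "sequence_feature",
    "CDS": "sequence_feature",
}

ROOT = "sequence_feature"


def reduce_term(term, allowed):
    """Return the nearest ancestor of term (or term itself) that is in allowed.

    Falls back to the root term when no allowed ancestor is found.
    """
    current = term
    seen = set()  # guards against cycles in a malformed table
    while current is not None and current not in seen:
        if current in allowed:
            return current
        seen.add(current)
        current = PARENT.get(current)
    return ROOT
```

With the toy table, reduce_term("minus_10_signal", {"promoter", "CDS"}) stops at "promoter", while a term with no allowed ancestor is reduced to "sequence_feature".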

Release History

0.1.0 (this version)

0.0.0
Download Files

gbgb-0.1.0.tar.gz (11.3 kB), source distribution, uploaded Apr 27, 2016
