A miniformat for expressing arrangements of sequence features

Fardes: Features Arrangement Description Miniformat

The miniformat described here expresses the relative arrangement of named sequence features on one or multiple molecules: their order, the length of the interval between them, the possible presence of further features between them, their strand, and whether they lie on the same or on different molecules.

It was developed for an application in which the expected contents of prokaryotic genomes are expressed as a set of rules, some of which concern the relative arrangement of features.

Specification

The miniformat is described in the Markdown document SPECIFICATION.md in this repository.

Examples

Here are examples of how the format can be used to express different arrangements:

A,B,C,D: this is a list without any interval specifications, thus the features (whose IDs are given) will just follow each other without any other relevant feature in between.

A,1,B,C: in this case, between A and B, there is exactly one further feature.

A,1(gene),C: in this case, between A and C, there is exactly one feature, which is of type gene.

A,3(rRNA;tRNA),C: in this case, between A and C, there are 3 features, of type rRNA or tRNA.

A,8:10,B: in this case, between A and B, there are 8 to 10 other features.

A,1:*,B or A,>=1,B: these are two equivalent ways to express the fact that between A and B there is at least one other feature.

A,<10,B: between A and B there are fewer than 10 features (at most 9).

A,0[1000:3000],B: there are no features between A and B, but there are between 1000 and 3000 bases.

A,0:1[1kb:3kb],B or A,<2[1kb:3kb],B: there are between 1000 and 3000 bases between A and B, and at most one feature in this interval.

A,[>30kbp],B or A,>=0[>30kbp],B: there are more than 30000 bases between A and B, including any number of features.

A,><,B,<>,C: A and B are close to each other (and thus also on the same molecule) and distant from C (which can be on the same molecule or another)

A,><,B,<.>,C: A and B are close to each other and distant from C, but all three are on the same molecule

A,><,B,<|>,C: A and B are close to each other on the same molecule, while C is on another molecule

A,&,^B: A and B overlap each other, on different strands

A,B,>C,^D: the order of the features is A, B, C and D, with no other feature between them; C and D are on opposite strands, while A and B can be on any strand.

A,B,=C: the feature C is on the same strand as A, but B can be on the same or on the opposite strand.

A,><,=B,><,=C: all three features are on the same strand and close to each other, with no features in between.
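
The feature-count part of an interval in the examples above (1, 8:10, 1:*, >=1, <10) can be read as an inclusive (min, max) range. The following is a minimal sketch of that reading, not the code used by the fardes package; the function name parse_count is hypothetical, and only the count forms that appear in the examples are handled.

```python
import re
from typing import Optional, Tuple


def parse_count(spec: str) -> Tuple[int, Optional[int]]:
    """Return (min, max) for a feature-count spec; max=None means unbounded.

    Hypothetical helper for illustration; it is not part of fardes.
    """
    if spec == "":
        # No count given, as in "A,[>30kbp],B": any number of features.
        return 0, None
    m = re.fullmatch(r"(\d+):(\d+|\*)", spec)
    if m:
        # Range form, e.g. "8:10", or "1:*" with an open upper bound.
        upper = None if m.group(2) == "*" else int(m.group(2))
        return int(m.group(1)), upper
    m = re.fullmatch(r"(>=|>|<)(\d+)", spec)
    if m:
        op, n = m.group(1), int(m.group(2))
        if op == ">=":
            return n, None
        if op == ">":
            return n + 1, None
        return 0, n - 1  # "<10" means at most 9
    if re.fullmatch(r"\d+", spec):
        # A bare number is an exact count, e.g. "1" means exactly one.
        return int(spec), int(spec)
    raise ValueError(f"unsupported count spec: {spec!r}")


for example in ["1", "8:10", "1:*", ">=1", "<10"]:
    print(example, "->", parse_count(example))
```

Representing an open bound as None mirrors the 'max': None entries in the parser output shown in the usage example of this README.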

Implementation as a Python package

The miniformat has been implemented as a TextFormats specification (fardes.tf.yaml).

This has been included in a Python module, fardes, which additionally includes cross-checks that cannot be expressed in TextFormats and normalizes the elements while parsing a string (e.g. by including implicit values and applying multipliers). The module can be installed using pip install fardes.
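
To illustrate what "applying multipliers" and "including implicit values" can mean for the length part of an interval, here is a minimal sketch. It is not the code of the fardes module: the names to_bases and parse_length are hypothetical, and the accepted suffixes (kb, kbp, Mb) are only those seen in this README; the authoritative list is in SPECIFICATION.md.

```python
import re

# Multipliers for the unit prefixes seen in this README (k, M).
# Assumption: a missing prefix means plain bases.
MULTIPLIERS = {"": 1, "k": 1_000, "M": 1_000_000}


def to_bases(value: str) -> int:
    """Convert a length such as '3kb', '30kbp', '2Mb' or '1000' to bases."""
    # Assumption: the trailing 'b' or 'bp' is optional and carries no meaning.
    m = re.fullmatch(r"(\d+)([kM]?)(?:bp|b)?", value)
    if m is None:
        raise ValueError(f"unsupported length: {value!r}")
    return int(m.group(1)) * MULTIPLIERS[m.group(2)]


def parse_length(spec: str) -> dict:
    """Normalize the text inside [...] to a dict with explicit bounds."""
    if spec.startswith("~"):
        # Approximate length, e.g. "~3kb".
        return {"approx": to_bases(spec[1:])}
    if spec.startswith(">"):
        # Strictly greater than: the minimum is one base more.
        return {"min": to_bases(spec[1:]) + 1, "max": None}
    if ":" in spec:
        low, high = spec.split(":", 1)
        upper = None if high == "*" else to_bases(high)
        return {"min": to_bases(low), "max": upper}
    # A single value is an exact length: min and max are made explicit.
    n = to_bases(spec)
    return {"min": n, "max": n}


for example in ["1kb:3kb", "2", "3:*", ">2Mb", "~3kb"]:
    print(example, "->", parse_length(example))
```

The sketch only covers the length forms used in the usage example of this README; other operators would need additional branches.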

Example usage of the Python parser

Here is an example of usage of the module:

import fardes
elements = fardes.parse("A,1:10[1kb:3kb],>B,1(rRNA;tRNA),>C,1[2],>D,=E,[3:*],F,1:*[>2Mb],G,<>,H,>0,I,<4,J,[~3kb],K,<|>,L,><,M,&,N")

will result in the following:

[{'type': 'unit', 'unit': 'A', 'prefix': ''},
 {'type': 'interval', 'length': {'min': 1000, 'max': 3000}, 'n_features': {'min': 1, 'max': 10}},
 {'type': 'unit', 'unit': 'B', 'prefix': '>'},
 {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 1, 'max': 1, 'type_spec': {'types': ['rRNA', 'tRNA']}}},
 {'type': 'unit', 'unit': 'C', 'prefix': '>'},
 {'type': 'interval', 'length': {'min': 2, 'max': 2}, 'n_features': {'min': 1, 'max': 1}},
 {'type': 'unit', 'unit': 'D', 'prefix': '>'},
 {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 0, 'max': 0}},
 {'type': 'unit', 'unit': 'E', 'prefix': '='},
 {'type': 'interval', 'length': {'min': 3, 'max': None}, 'n_features': {'min': 0, 'max': None}},
 {'type': 'unit', 'unit': 'F', 'prefix': ''},
 {'type': 'interval', 'length': {'min': 2000001, 'max': None}, 'n_features': {'min': 1, 'max': None}},
 {'type': 'unit', 'unit': 'G', 'prefix': ''},
 {'type': 'interval', 'special': 'distant'},
 {'type': 'unit', 'unit': 'H', 'prefix': ''},
 {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 1, 'max': None}},
 {'type': 'unit', 'unit': 'I', 'prefix': ''},
 {'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 0, 'max': 3}},
 {'type': 'unit', 'unit': 'J', 'prefix': ''},
 {'type': 'interval', 'length': {'approx': 3000}, 'n_features': {'min': 0, 'max': None}},
 {'type': 'unit', 'unit': 'K', 'prefix': ''},
 {'type': 'interval', 'special': 'other_molecule'},
 {'type': 'unit', 'unit': 'L', 'prefix': ''},
 {'type': 'interval', 'special': 'near'},
 {'type': 'unit', 'unit': 'M', 'prefix': ''},
 {'type': 'interval', 'special': 'overlap'},
 {'type': 'unit', 'unit': 'N', 'prefix': ''}]
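
The returned list alternates between 'unit' and 'interval' elements, so it can be processed with a simple loop. The following sketch assumes only the dictionary keys visible in the output above; the functions unit_names and summarize are hypothetical helpers, and the sample list is a shortened, hand-written list in the same shape, so that the snippet runs without the package installed.

```python
from typing import Any, Dict, List

Element = Dict[str, Any]

# Hand-written sample in the shape of the parser output shown above.
sample: List[Element] = [
    {'type': 'unit', 'unit': 'A', 'prefix': ''},
    {'type': 'interval', 'length': {'min': 1000, 'max': 3000},
     'n_features': {'min': 1, 'max': 10}},
    {'type': 'unit', 'unit': 'B', 'prefix': '>'},
    {'type': 'interval', 'special': 'near'},
    {'type': 'unit', 'unit': 'M', 'prefix': ''},
]


def unit_names(elements: List[Element]) -> List[str]:
    """Return the feature IDs in the order in which they appear."""
    return [el['unit'] for el in elements if el['type'] == 'unit']


def summarize(elements: List[Element]) -> List[str]:
    """Return one short text line per element."""
    lines = []
    for el in elements:
        if el['type'] == 'unit':
            # The prefix is the strand marker, or '' if none was given.
            lines.append(f"{el['prefix']}{el['unit']}")
        elif 'special' in el:
            # Special intervals have no length or feature count.
            lines.append(f"  {el['special']}")
        else:
            n = el['n_features']
            upper = '*' if n['max'] is None else n['max']
            lines.append(f"  {n['min']}..{upper} features")
    return lines


print(unit_names(sample))
print("\n".join(summarize(sample)))
```

With the package installed, the result of fardes.parse could be passed to these helpers in place of the sample list.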

Acknowledgements

This specification has been created in the context of the DFG project GO 3192/1-1 “Automated characterization of microbial genomes and metagenomes by collection and verification of association rules”. The funders had no role in study design, data collection and analysis.

Name

The name Fardes is an acronym for "feature arrangement description". After naming the project, I noticed that, according to Wiktionary (https://en.wiktionary.org/wiki/farde), in Belgian French a "farde" (plural: fardes) is a file, in the sense of stationery used to keep documents together. This fits the purpose of the format well.
