A miniformat for expressing arrangements of sequence features
Fardes: Features Arrangement Description Miniformat
The miniformat described here expresses the relative arrangement of named sequence features on one or more molecules: their order, the length of the interval between them, the possible presence of further features between them, their strand, and whether they lie on the same or on different molecules.
It was developed for an application in which the expected contents of prokaryotic genomes are expressed as a set of rules, some of which concern the relative arrangement of features.
Specification
The miniformat is described in the Markdown document SPECIFICATION.md in this repository.
Examples
Here are examples of how the format can be used to express different arrangements:
A,B,C,D
: this is a list without any interval specifications, thus the
features (whose IDs are given) will just follow each other without any
other relevant feature in between.
A,1,B,C
: in this case, between A and B, there is exactly one further feature.
A,1(gene),C
: in this case, between A and C, there is exactly one feature,
which is of type gene.
A,3(rRNA;tRNA),C
: in this case, between A and C, there are 3 features,
of type rRNA or tRNA.
A,8:10,B
: in this case, between A and B, there are 8 to 10 other features.
A,1:*,B
or A,>=1,B
: these are two equivalent ways to express the fact
that between A and B there is at least one other feature.
A,<10,B
: between A and B there are fewer than 10 features (max 9).
A,0[1000:3000],B
: there are no features between A and B,
but there are between 1000 and 3000 bases.
A,0:1[1kb:3kb],B
or A,<1[1kb:3kb],B
: there are between
1000 and 3000 bases and possibly one feature in this interval.
A,[>30kbp],B
or A,>=0[>30kbp],B
: there are more than 30000 bases between
A and B, including any number of features.
A,><,B,<>,C
: A and B are close to each other (and thus also on the same
molecule) and distant from C (which can be on the same molecule or another)
A,><,B,<.>,C
: A and B are close to each other and distant from C,
but all three are on the same molecule
A,><,B,<|>,C
: A and B are close to each other on the same molecule,
while C is on another molecule
A,&,^B
: A and B overlap each other, on different strands
A,B,>C,^D
: the order of the features is A, B, C and D with no other
feature in between them; C and D are on opposite strands, while
A and B can be on any strand.
A,B,=C
: the feature C is on the same strand as A, but B can be on the
same or on the opposite strand.
A,><,=B,><,=C
: all three features are on the same strand and close
to each other, with no features in between.
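The count and length notations used in the examples above can be read as ranges with a minimum and a maximum. The following is a minimal sketch of that reading, not the code of the fardes package: the helper names `to_range` and `_number` are hypothetical, only the unit suffixes seen in this document (kb, kbp, Mb) are handled, treating them case-insensitively is an assumption, and the approximate form (`~3kb`) is left out. The authoritative grammar is the one in SPECIFICATION.md.

```python
import re

# Unit multipliers that appear in this document (assumed case-insensitive).
_UNITS = {"": 1, "kb": 1_000, "kbp": 1_000, "mb": 1_000_000}


def _number(text):
    """Convert '3', '1kb' or '2Mb' to an integer count of features or bases."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text)
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"cannot interpret {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def to_range(spec):
    """Return (min, max) for a count or length spec; max is None if unbounded."""
    if spec.startswith(">="):
        return _number(spec[2:]), None
    if spec.startswith(">"):
        # Strictly greater than: the minimum is one above the given value.
        return _number(spec[1:]) + 1, None
    if spec.startswith("<"):
        # Strictly less than: the maximum is one below the given value.
        return 0, _number(spec[1:]) - 1
    if ":" in spec:
        low, high = spec.split(":", 1)
        return _number(low), (None if high == "*" else _number(high))
    value = _number(spec)
    return value, value


for spec in ["8:10", "1:*", ">=1", "<10", "1kb:3kb", ">30kbp"]:
    print(spec, to_range(spec))
```

Under this reading, `1:*` and `>=1` give the same range, which matches the equivalence stated in the examples.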
Implementation as a Python package
The miniformat has been implemented as a TextFormats specification (fardes.tf.yaml).
This has been included in a Python module, fardes, which additionally performs
cross-checks that cannot be expressed in TextFormats and normalizes the elements
while parsing a string
(e.g. by including implicit values and applying multipliers).
The module can be installed using pip install fardes.
Example usage of the Python parser
Here is an example of usage of the module:
import fardes
elements = fardes.parse("A,1:10[1kb:3kb],>B,1(rRNA;tRNA),>C,1[2],>D,=E,[3:*],F,1:*[>2Mb],G,<>,H,>0,I,<4,J,[~3kb],K,<|>,L,><,M,&,N")
This results in the following:
[{'type': 'unit', 'unit': 'A', 'prefix': ''},
{'type': 'interval', 'length': {'min': 1000, 'max': 3000}, 'n_features': {'min': 1, 'max': 10}},
{'type': 'unit', 'unit': 'B', 'prefix': '>'},
{'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 1, 'max': 1, 'type_spec': {'types': ['rRNA', 'tRNA']}}},
{'type': 'unit', 'unit': 'C', 'prefix': '>'},
{'type': 'interval', 'length': {'min': 2, 'max': 2}, 'n_features': {'min': 1, 'max': 1}},
{'type': 'unit', 'unit': 'D', 'prefix': '>'},
{'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 0, 'max': 0}},
{'type': 'unit', 'unit': 'E', 'prefix': '='},
{'type': 'interval', 'length': {'min': 3, 'max': None}, 'n_features': {'min': 0, 'max': None}},
{'type': 'unit', 'unit': 'F', 'prefix': ''},
{'type': 'interval', 'length': {'min': 2000001, 'max': None}, 'n_features': {'min': 1, 'max': None}},
{'type': 'unit', 'unit': 'G', 'prefix': ''},
{'type': 'interval', 'special': 'distant'},
{'type': 'unit', 'unit': 'H', 'prefix': ''},
{'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 1, 'max': None}},
{'type': 'unit', 'unit': 'I', 'prefix': ''},
{'type': 'interval', 'length': {'min': 0, 'max': None}, 'n_features': {'min': 0, 'max': 3}},
{'type': 'unit', 'unit': 'J', 'prefix': ''},
{'type': 'interval', 'length': {'approx': 3000}, 'n_features': {'min': 0, 'max': None}},
{'type': 'unit', 'unit': 'K', 'prefix': ''},
{'type': 'interval', 'special': 'other_molecule'},
{'type': 'unit', 'unit': 'L', 'prefix': ''},
{'type': 'interval', 'special': 'near'},
{'type': 'unit', 'unit': 'M', 'prefix': ''},
{'type': 'interval', 'special': 'overlap'},
{'type': 'unit', 'unit': 'N', 'prefix': ''}]
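The parsed result is a plain list of dictionaries, so it can be processed with ordinary Python code. The following sketch walks such a list and produces one short text line per element. The function `describe` is a hypothetical helper and not part of the fardes module, and the sample data is a hand-copied excerpt of the structure shown above, so the snippet runs without the package installed.

```python
def describe(elements):
    """Turn parsed elements into one short text line per element."""
    lines = []
    for element in elements:
        if element["type"] == "unit":
            prefix = element["prefix"]
            suffix = f" (prefix {prefix})" if prefix else ""
            lines.append(f"feature {element['unit']}{suffix}")
        elif "special" in element:
            # Special intervals carry only a keyword such as 'near' or 'distant'.
            lines.append(f"  {element['special']} interval")
        else:
            count = element["n_features"]
            upper = "any number" if count["max"] is None else count["max"]
            lines.append(f"  {count['min']} to {upper} features in between")
    return lines


# Excerpt copied by hand from the output shown above.
sample = [
    {"type": "unit", "unit": "A", "prefix": ""},
    {"type": "interval", "length": {"min": 1000, "max": 3000},
     "n_features": {"min": 1, "max": 10}},
    {"type": "unit", "unit": "B", "prefix": ">"},
    {"type": "interval", "special": "near"},
    {"type": "unit", "unit": "M", "prefix": ""},
]

print("\n".join(describe(sample)))
```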
Acknowledgements
This specification has been created in the context of the DFG project GO 3192/1-1 “Automated characterization of microbial genomes and metagenomes by collection and verification of association rules”. The funders had no role in study design, data collection and analysis.
Name
The name Fardes is an acronym for "feature arrangement description". After naming the project, I noticed that, according to Wiktionary (https://en.wiktionary.org/wiki/farde), in Belgian French a "farde" (plural: fardes) is a file, in the sense of stationery used to keep documents together. This fits the purpose of the format well.