
pyfastg: a minimal Python library for parsing SPAdes FASTG files


The FASTG file format

FASTG is a format for describing sequencing assembly graphs. It is geared toward accurately representing the ambiguity resulting from sequencing limitations, ploidy, or other factors that complicate representation of a sequence as a simple string.

The latest specification for the FASTG format is version 1.00, as of writing; this specification is located here. Whenever the rest of this documentation mentions "the FASTG spec," this is in reference to this version of the specification.

pyfastg parses graphs that follow a subset of the FASTG spec: in particular, pyfastg is designed to work with files output by the SPAdes family of assemblers.

The pyfastg library

pyfastg is a Python library that contains parse_fastg(), a function that takes as input a path to a SPAdes FASTG file. parse_fastg() reads the specified FASTG file and returns a NetworkX DiGraph object representing the structure of the assembly graph. From here, the graph can be analyzed, visualized, etc. as needed.

pyfastg is very much in its infancy, so it may be most useful as a starting point. Pull requests are welcome!

Note about the graph topology

The FASTG spec contains the following sentence (in section 6, page 7):

Note also that strictly speaking, [the structure described in a FASTG file] is not a graph at all, as we have not specified a notion of vertex. However in many cases one can without ambiguity define vertices and thereby associate a bona fide digraph, and we do so frequently in this document to illustrate concepts.

We take this approach in pyfastg. "Edges" in the FASTG file will be represented as nodes in the NetworkX graph, and "adjacencies" between edges in the FASTG file will be represented as edges in the NetworkX graph. As far as we're aware, this is usually how these files are visualized.
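
As a minimal sketch of this mapping (using plain Python containers rather than pyfastg itself; the node names and adjacencies below are made up for illustration), each FASTG "edge" becomes a node and each FASTG "adjacency" becomes a directed edge:

```python
# Hypothetical FASTG content, written by hand: each key is a FASTG "edge"
# (a sequence), and each value lists that edge's outgoing "adjacencies".
fastg_adjacencies = {
    "1+": ["2+", "3-"],
    "2+": [],
    "3-": ["1+"],
}

# FASTG edges become graph nodes...
nodes = sorted(fastg_adjacencies)

# ...and FASTG adjacencies become directed (source, target) graph edges.
edges = [(src, dst) for src in nodes for dst in fastg_adjacencies[src]]

print(nodes)  # ['1+', '2+', '3-']
print(edges)  # [('1+', '2+'), ('1+', '3-'), ('3-', '1+')]
```

Lists like these could be passed to a NetworkX DiGraph via add_nodes_from() and add_edges_from() to get the same kind of structure.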

Installation

pyfastg can be installed using pip:

pip install pyfastg

pyfastg's only dependency (which should be installed automatically with the above command) is NetworkX ≥ 2.

As of writing, pyfastg supports all Python versions ≥ 3.6. pyfastg might be able to work with earlier versions of Python, but we do not explicitly test against these.

Quick Example

The second line (which points to one of pyfastg's test assembly graphs) assumes that you're located in the root directory of the pyfastg repo.

>>> import pyfastg
>>> g = pyfastg.parse_fastg("pyfastg/tests/input/assembly_graph.fastg")
>>> # g is now a NetworkX DiGraph! We can do whatever we want with this object.
>>>
>>> # Example: List the sequences in this graph (these are "edges" in the FASTG
>>> # file, but are represented as nodes in g)
>>> g.nodes()
NodeView(('1+', '29-', '1-', '6-', '2+', '26+', '27+', '2-', '3+', '4+', '6+', '7+', '3-', '33-', '9-', '4-', '5+', '5-', '28+', '7-', '8+', '28-', '9+', '8-', '12-', '10+', '12+', '10-', '24-', '32-', '11+', '30-', '11-', '27-', '19-', '13+', '25+', '31-', '13-', '14+', '14-', '26-', '15+', '15-', '23-', '16+', '16-', '17+', '17-', '19+', '18+', '33+', '18-', '20+', '20-', '22+', '21+', '21-', '22-', '23+', '24+', '25-', '29+', '30+', '31+', '32+'))
>>>
>>> # Example: Get details for a single sequence (length, coverage, GC-content)
>>> g.nodes["15+"]
{'length': 193, 'cov': 6.93966, 'gc': 0.5492227979274611}
>>>
>>> # Example: Get information about the graph's connectivity
>>> import networkx as nx
>>> components = list(nx.weakly_connected_components(g))
>>> for c in components:
...     print(len(c), "nodes")
...     print(c)
...
33 nodes
{'8-', '17-', '15+', '30+', '16+', '26-', '25+', '19+', '7+', '23+', '14-', '18-', '10-', '29-', '20-', '27-', '11-', '5-', '3+', '2-', '12-', '13+', '31-', '6+', '1+', '21-', '24-', '32-', '22+', '28+', '4+', '33-', '9-'}
33 nodes
{'26+', '29+', '18+', '3-', '2+', '8+', '15-', '24+', '9+', '17+', '27+', '28-', '11+', '6-', '20+', '14+', '19-', '13-', '4-', '21+', '5+', '31+', '22-', '12+', '25-', '30-', '10+', '1-', '7-', '32+', '23-', '33+', '16-'}

Details about the required input file format (tl;dr: SPAdes-dialect FASTG files only)

Currently, pyfastg is hardcoded to parse FASTG files created by the SPAdes assembler. Other valid FASTG files that don't follow the pattern used by SPAdes for edge names are not supported.

Edge names

In particular, each edge in the file must have a name formatted like:

EDGE_1_length_9909_cov_6.94721

The edge ID (here, 1) can contain the characters a-z, A-Z, and 0-9.

The edge length (here, 9909) can contain the characters 0-9.

The edge coverage (here, 6.94721) can contain the characters 0-9 and . (a period).

An edge name can optionally end with a ' character, indicating that this edge is a reverse complement. We will refer to whether or not an edge name ends with ' as its orientation: an edge that does not end with a ' has a + orientation, and an edge name that ends with a ' has a - orientation.
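
The naming rules above can be sketched as a regular expression. This is an illustration only: the pattern and the parse_edge_name() helper are hypothetical and are not taken from pyfastg's source, which may implement these checks differently.

```python
import re

# Hypothetical pattern mirroring the rules above: an alphanumeric ID, a
# numeric length, a coverage made of digits and periods, and an optional
# trailing ' marking a reverse complement.
EDGE_NAME = re.compile(
    r"EDGE_(?P<id>[a-zA-Z0-9]+)"
    r"_length_(?P<length>[0-9]+)"
    r"_cov_(?P<cov>[0-9.]+)"
    r"(?P<rc>')?"
)


def parse_edge_name(name):
    """Split an edge name into (ID plus orientation, length, coverage)."""
    match = EDGE_NAME.fullmatch(name)
    if match is None:
        raise ValueError("Not a SPAdes-style edge name: {!r}".format(name))
    orientation = "-" if match.group("rc") else "+"
    return (
        match.group("id") + orientation,
        int(match.group("length")),
        float(match.group("cov")),
    )


print(parse_edge_name("EDGE_1_length_9909_cov_6.94721"))  # ('1+', 9909, 6.94721)
print(parse_edge_name("EDGE_3_length_6_cov_2.5'"))        # ('3-', 6, 2.5)
```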

Edge names in a FASTG file should be consistent, with respect to their ID and orientation. If, in a single FASTG file, pyfastg sees a reference to an edge named EDGE_1_length_9909_cov_6.94721 and also a reference to an edge named EDGE_1_length_8109_cov_6.94721 (with the same ID [1] and orientation [+], but a different length and/or coverage) then it will throw an error.
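
One way to picture this consistency check is a dictionary keyed by ID and orientation. The check_consistent() helper below is a hypothetical sketch under that assumption, not pyfastg's actual implementation or error message.

```python
def check_consistent(seen, node_name, length, cov):
    """Remember an edge's (length, cov); raise if a later reference disagrees.

    `seen` maps an ID-plus-orientation string such as "1+" to the
    (length, cov) pair from the first reference to that edge.
    """
    first = seen.setdefault(node_name, (length, cov))
    if first != (length, cov):
        raise ValueError(
            "Edge {} was first seen with (length, cov) = {}, but is now "
            "referenced with {}".format(node_name, first, (length, cov))
        )


seen = {}
check_consistent(seen, "1+", 9909, 6.94721)  # first reference: recorded
check_consistent(seen, "1+", 9909, 6.94721)  # identical reference: fine
try:
    check_consistent(seen, "1+", 8109, 6.94721)  # different length
except ValueError as err:
    print("Inconsistent edge:", err)
```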

Edge declaration lines

Here, we refer to each line starting with > as an edge declaration. An edge's sequence is described in the line(s) following its edge declaration (until the next edge declaration). The declaration line itself may also list the outgoing adjacencies from this edge to other edges, after a : character. For example, the line

>EDGE_1_length_5_cov_10:EDGE_2_length_3_cov_1,EDGE_3_length_6_cov_2.5',EDGE_4_length_8_cov_5.1;

indicates that the edge EDGE_1_length_5_cov_10 has three outgoing adjacencies: to the edges EDGE_2_length_3_cov_1, EDGE_3_length_6_cov_2.5', and EDGE_4_length_8_cov_5.1. This line would thus result in three "edges" being created in the NetworkX graph produced by pyfastg: (1+ → 2+), (1+ → 3-), and (1+ → 4+).

Each edge declaration must end with a ; character (after removing trailing whitespace). Section 15 of the FASTG spec mentions that having a newline after the semicolon isn't required, but we require it here for the sake of simplicity.
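
A rough sketch of how a declaration line could be split into its source edge and adjacencies follows. The parse_declaration() helper is hypothetical and only mirrors the rules described in this section; pyfastg's own parsing code may differ.

```python
def parse_declaration(line):
    """Return (source_edge_name, [adjacent_edge_names]) for one declaration."""
    line = line.rstrip()  # trailing whitespace is removed before checking for ;
    if not line.startswith(">"):
        raise ValueError("An edge declaration must start with '>'")
    if not line.endswith(";"):
        raise ValueError("An edge declaration must end with ';'")
    body = line[1:-1]  # drop the leading > and the trailing ;
    source, colon, targets = body.partition(":")
    adjacencies = targets.split(",") if colon else []
    return source, adjacencies


decl = (
    ">EDGE_1_length_5_cov_10:EDGE_2_length_3_cov_1,"
    "EDGE_3_length_6_cov_2.5',EDGE_4_length_8_cov_5.1;"
)
source, adjacencies = parse_declaration(decl)
print(source)       # EDGE_1_length_5_cov_10
print(adjacencies)  # ['EDGE_2_length_3_cov_1', "EDGE_3_length_6_cov_2.5'", 'EDGE_4_length_8_cov_5.1']
```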

Edge sequences

We assume that each sequence (the line(s) between edge declarations) consists only of the characters A, C, G, T, or U. So, more complex types of strings (e.g. the "stuffed gaps" described in the FASTG spec) are not allowed in an edge's sequence.

Additionally, lowercase characters or degenerate nucleotides are not allowed; this matches section 15 of the FASTG spec. The FASTG spec doesn't explicitly allow for uracil (U), but we allow it anyway in order to support RNA sequences. (U and T are allowed to be contained in the same sequence, in the unlikely case that this is needed.)

Leading and trailing whitespace in sequence lines will be ignored, so something like

    ATC

 G     

is technically valid, and describes the sequence ATCG. However, a line like ATC G is not valid, since the inner space would be considered part of the sequence.
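
The whitespace and alphabet rules for sequences can be sketched as below. The join_sequence_lines() helper is a hypothetical illustration of these rules, not a function provided by pyfastg.

```python
ALLOWED_CHARS = set("ACGTU")


def join_sequence_lines(lines):
    """Strip each line, join the pieces, and reject any disallowed characters."""
    seq = "".join(line.strip() for line in lines)
    bad = set(seq) - ALLOWED_CHARS
    if bad:
        raise ValueError("Sequence contains disallowed characters: {}".format(sorted(bad)))
    return seq


# Leading/trailing whitespace (and blank lines) are ignored.
print(join_sequence_lines(["    ATC", "", " G     "]))  # ATCG

# An inner space survives stripping, so it is rejected.
try:
    join_sequence_lines(["ATC G"])
except ValueError as err:
    print(err)
```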

Details about the output NetworkX graph

Node names and attributes

Nodes in the returned DiGraph (corresponding to edges in the FASTG file) will contain three attribute fields:

  1. length: the length of the sequence (represented as a python int)
  2. cov: the coverage of the sequence (represented as a python float)
  3. gc: the GC-content (in the range [0, 1]) of the sequence (represented as a python float)
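
For reference, GC-content in the range [0, 1] is the fraction of a sequence's characters that are G or C. The gc_content() function below is a minimal sketch of that definition; it is a hypothetical helper, and pyfastg's own computation may handle edge cases differently.

```python
def gc_content(seq):
    """Return the fraction of characters in seq that are G or C."""
    if not seq:
        raise ValueError("Cannot compute the GC-content of an empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("ATCG"))  # 0.5
print(gc_content("GGCC"))  # 1.0
```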

Each node's name is a python str created by concatenating edge IDs and orientations. For example, EDGE_1_length_9909_cov_6.94721 will correspond to a node named 1+. This naming scheme is analogous to that used by Bandage.

About reverse complements

pyfastg only creates nodes based on the edges explicitly described in the FASTG file. If a file only describes edges EDGE_1_length_5_cov_10, EDGE_2_length_6_cov_10', and EDGE_3_length_7_cov_15, then pyfastg will only create nodes 1+, 2-, and 3+, and not the reverse complement nodes 1-, 2+, 3-, etc.

Similarly, if a file contains an adjacency from edge EDGE_1_length_5_cov_10 to EDGE_2_length_6_cov_10', then this adjacency will only be represented as a single edge (1+ → 2-) in pyfastg's output graph. The implied reverse-complement of this edge (2+ → 1-) will not be created unless the file explicitly contains an adjacency from EDGE_2_length_6_cov_10 to EDGE_1_length_5_cov_10'.
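
If the implied reverse-complement edges are needed, they can be derived from node names alone. The flip() and implied_reverse_edge() helpers below are hypothetical (pyfastg does not provide them); they assume node names end in + or -, as described above.

```python
def flip(node_name):
    """Return the reverse-complement node name, e.g. '1+' becomes '1-'."""
    if node_name.endswith("+"):
        return node_name[:-1] + "-"
    if node_name.endswith("-"):
        return node_name[:-1] + "+"
    raise ValueError("Node name must end with + or -: {!r}".format(node_name))


def implied_reverse_edge(src, dst):
    """Return the reverse-complement of the directed edge (src, dst)."""
    # Reverse the direction and flip both orientations.
    return flip(dst), flip(src)


print(implied_reverse_edge("1+", "2-"))  # ('2+', '1-')
```

Applying this to every edge of a graph and adding any missing results would make the graph symmetric with respect to reverse complements, assuming the corresponding nodes are also added.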

Information for pyfastg developers

Installation

If you're interested in developing the code, you will probably want to fork this repository and then clone your fork. Once you do this, cd into the root of the repository and run

pip install -e .[dev]

to install pyfastg in "editable mode." Thanks to the [dev] extra, this will also install pyfastg's development dependencies (see the extras_require line in setup.py for details).

Testing, linting, and formatting the code

All of these commands are covered in pyfastg's Makefile.

  • Run tests: make test
  • Lint and style-check the code: make stylecheck
  • Automatically style the code: make style

Changelog

See pyfastg's CHANGELOG.md file for information on the changes included with new pyfastg releases.

License

pyfastg is licensed under the MIT License. Please see pyfastg's LICENSE file for details.
