A Python reader (and eventually writer) for ms files

msssPy

An ms/msms file reader for Python.

Pronounced “Mississippi”

This reader improves on basic ms file readers by keeping a cache of indices for each file it reads, which significantly speeds up random access to individual samples within a multiple-replicate ms file.

This can be especially useful for machine learning tasks, during which ms files need to be randomly accessed multiple times. Files already seen by the same process are read much more quickly than on their first access within that process. A future version will also add cache persistence.
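To illustrate why an index cache helps, here is a minimal sketch of the general technique, not mssspy's actual implementation. It assumes the usual ms layout in which each replicate starts with a line beginning with "//" and ends at a blank line or the end of the file. The function names build_offset_index and read_replicate, and the DEMO data, are hypothetical and exist only for this example.

```python
import io

# Hypothetical two-replicate ms output, used only for this sketch.
DEMO = b"""ms 2 2 -t 1.0
1 2 3

//
segsites: 2
positions: 0.1 0.5
01
10

//
segsites: 1
positions: 0.7
1
0
"""


def build_offset_index(fileobj):
    """Scan a binary file object once and record where each '//' line starts."""
    offsets = []
    fileobj.seek(0)
    while True:
        pos = fileobj.tell()
        line = fileobj.readline()
        if not line:  # end of file
            break
        if line.startswith(b"//"):
            offsets.append(pos)
    return offsets


def read_replicate(fileobj, offsets, i):
    """Jump straight to replicate i and return its raw text lines."""
    fileobj.seek(offsets[i])
    fileobj.readline()  # skip the '//' marker itself
    lines = []
    for raw in iter(fileobj.readline, b""):
        raw = raw.strip()
        if not raw or raw.startswith(b"//"):
            break  # a blank line or the next marker ends this replicate
        lines.append(raw.decode("ascii"))
    return lines


f = io.BytesIO(DEMO)
offsets = build_offset_index(f)          # one full scan, done once
second = read_replicate(f, offsets, 1)   # later lookups are a seek plus a short read
```

The first call pays for a full scan, but every later lookup in the same process only seeks to a stored offset, which matches the behaviour described above for files that have already been seen.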

Additionally, mssspy adds the ability to plug in different "reader" implementations that use different parsing algorithms. Two built-in readers are currently included: the "slow" reader, which is more fault-tolerant and provides better error reporting, and the "faster" reader, which assumes correctly formatted ms files and gives up more careful validation in exchange for speed.
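The trade-off between the two kinds of reader can be sketched as follows. This is an illustration under stated assumptions, not mssspy's plug-in API: the names parse_strict, parse_fast, READERS and parse are hypothetical, the input is the list of text lines for one replicate, and the sketch does not handle replicates with zero segregating sites.

```python
def parse_fast(lines):
    """Trust the input: no validation, so malformed data passes through silently."""
    positions = [float(x) for x in lines[1].split()[1:]]
    haplotypes = [[int(c) for c in row] for row in lines[2:]]
    return haplotypes, positions


def parse_strict(lines):
    """Validate each part of the replicate and report what went wrong."""
    if len(lines) < 2:
        raise ValueError("replicate block is too short")
    head = lines[0].split()
    if len(head) != 2 or head[0] != "segsites:" or not head[1].isdigit():
        raise ValueError(f"bad segsites line: {lines[0]!r}")
    segsites = int(head[1])
    fields = lines[1].split()
    if not fields or fields[0] != "positions:":
        raise ValueError(f"bad positions line: {lines[1]!r}")
    try:
        positions = [float(x) for x in fields[1:]]
    except ValueError as exc:
        raise ValueError(f"non-numeric position in: {lines[1]!r}") from exc
    if len(positions) != segsites:
        raise ValueError(f"expected {segsites} positions, got {len(positions)}")
    haplotypes = []
    for n, row in enumerate(lines[2:]):
        if len(row) != segsites or set(row) - {"0", "1"}:
            raise ValueError(f"haplotype {n} is malformed: {row!r}")
        haplotypes.append([int(c) for c in row])
    return haplotypes, positions


# Hypothetical registry: the reader is chosen by name at call time.
READERS = {"slow": parse_strict, "faster": parse_fast}


def parse(lines, reader="slow"):
    return READERS[reader](lines)


good = ["segsites: 2", "positions: 0.1 0.5", "01", "10"]
truncated = ["segsites: 2", "positions: 0.1 0.5", "01", "1"]

same = parse(good, "slow") == parse(good, "faster")  # both agree on valid input
ragged = parse(truncated, "faster")                  # accepted without complaint
```

On the truncated input the strict parser raises a ValueError that names the offending haplotype, while the fast parser returns a ragged result without any warning.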

Basic Usage

To read an ms file, the main high-level interface is the MSFile class. Simply open a file like:

>>> from mssspy import MSFile
>>> msf = MSFile('path/to/simulations.ms')

You can then access the individual replicates in the file, or "samples", using index notation:

>>> msf[0]
Sample(haplotypes=array([[0, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]], dtype=uint8), positions=array([0.283, 0.55 , 0.589, 0.715, 0.988]))

Index notation is required even if the file contains only one sample, which is then accessed as msf[0].

If you intend to read multiple samples from the same file while it's open, it is also more efficient to use MSFile in a with statement, e.g.:

>>> with MSFile('path/to/simulations.ms') as msf:
...     all_samples = list(msf)

Note: There is currently no way to get the length of the file in samples; for example, len(msf) does not work. This is because it would require scanning through the entire file to count the samples, which would be inefficient. However, this capability will be added in a future release.

In the meantime, you can still iterate over the MSFile. Iteration tries each index in turn, starting from 0, until an IndexError is raised, which is why list(msf) works.
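This relies on Python's fallback iteration protocol: an object that defines __getitem__ but not __iter__ is iterated by calling it with 0, 1, 2, and so on until IndexError is raised. The sketch below shows that behaviour with a hypothetical stand-in class, IndexedSamples, rather than MSFile itself, and a hypothetical helper, count_samples, that counts items by iterating.

```python
class IndexedSamples:
    """Hypothetical container with __getitem__ only (no __len__, no __iter__)."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __getitem__(self, index):
        # The underlying list raises IndexError past the end, which is
        # the signal Python uses to stop the fallback iteration.
        return self._samples[index]


def count_samples(container):
    """Count items by iterating, since len() is not available."""
    return sum(1 for _ in container)


samples = IndexedSamples(["rep0", "rep1", "rep2"])
everything = list(samples)          # works through the __getitem__ fallback
how_many = count_samples(samples)   # reads every item once

try:
    len(samples)
    has_len = True
except TypeError:
    has_len = False                 # no __len__ is defined
```

Counting this way touches every item, so on a large file it costs the same full scan that len(msf) currently avoids.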

And that's basically it!

Advanced Usage

TODO

TODO List for Future Releases

  • Add "fast" reader written in C(ython) and compare its performance to the existing "faster" reader.

  • More thorough parsing (e.g. support for time: and tree data parsing).

  • Support for writing.

  • More thorough documentation including API documentation.
