A Python reader (and eventually writer) for ms files
msms file reader for Python.
This reader is enhanced over basic ms file readers in that it keeps a cache of indices for each file it reads, which significantly speeds up random access to individual samples within a multiple-replicate ms file.
This can be especially useful for machine learning tasks, during which ms files need to be randomly accessed multiple times. Files already seen (by the same process) are read much more quickly than the first time they are accessed within that process. A future version will also add cache persistence.
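To illustrate why an index cache helps random access, here is a minimal sketch of the general technique, not mssspy's actual implementation. It assumes the standard ms layout in which every replicate starts with a line beginning with "//" and ends at a blank line; the helper names build_offset_index and read_block are hypothetical.

```python
import io


def build_offset_index(fileobj):
    """Scan once and record the byte offset of every '//' marker line."""
    offsets = []
    fileobj.seek(0)
    while True:
        pos = fileobj.tell()
        line = fileobj.readline()
        if not line:  # end of file
            break
        if line.startswith(b"//"):
            offsets.append(pos)
    return offsets


def read_block(fileobj, offsets, i):
    """Seek straight to replicate i and return its raw text lines."""
    fileobj.seek(offsets[i])
    fileobj.readline()  # skip the '//' marker itself
    lines = []
    for line in fileobj:
        line = line.strip()
        if not line or line.startswith(b"//"):
            break  # blank line or next marker ends the replicate
        lines.append(line.decode("ascii"))
    return lines


# Tiny in-memory stand-in for a two-replicate ms file (hypothetical data).
DATA = (
    b"ms 2 2 -t 5\n"
    b"1 2 3\n"
    b"\n"
    b"//\n"
    b"segsites: 2\n"
    b"positions: 0.25 0.75\n"
    b"01\n"
    b"10\n"
    b"\n"
    b"//\n"
    b"segsites: 1\n"
    b"positions: 0.5\n"
    b"1\n"
    b"0\n"
)

fh = io.BytesIO(DATA)
index = build_offset_index(fh)    # one full scan, done once
second = read_block(fh, index, 1)  # later lookups jump directly
```

Once the offsets are known, fetching any replicate costs one seek plus reading that replicate's lines, instead of re-reading everything before it.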
mssspy adds the ability to plug in different "reader"
implementations that use different parsing algorithms. Currently two
built-in readers are included: the "slow" reader, which is more
fault-tolerant and provides better error reporting, and a "faster" reader,
which assumes correctly formatted ms files in exchange for less careful
checking of the input.
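The trade-off between a trusting parser and a checking one can be sketched as follows. This is only an illustration under stated assumptions, not the code of mssspy's built-in readers: the function names parse_fast and parse_checked are hypothetical, the input is the list of text lines of one replicate, haplotypes are assumed to contain only 0 and 1, and the replicate is assumed to have at least one segregating site.

```python
def parse_fast(lines):
    """Trust the input completely: no validation, minimal work."""
    positions = [float(x) for x in lines[1].split()[1:]]
    haplotypes = [[int(c) for c in row] for row in lines[2:]]
    return haplotypes, positions


def parse_checked(lines):
    """Validate every line and raise a descriptive ValueError on problems."""
    if len(lines) < 2 or not lines[0].startswith("segsites:"):
        raise ValueError("expected a 'segsites:' line, got %r" % (lines[:1],))
    try:
        segsites = int(lines[0].split(":", 1)[1])
    except ValueError:
        raise ValueError("bad segsites count in %r" % lines[0]) from None
    if not lines[1].startswith("positions:"):
        raise ValueError("expected a 'positions:' line, got %r" % lines[1])
    positions = [float(x) for x in lines[1].split()[1:]]
    if len(positions) != segsites:
        raise ValueError(
            "segsites is %d but %d positions given" % (segsites, len(positions))
        )
    haplotypes = []
    for n, row in enumerate(lines[2:], start=1):
        # Each haplotype must have one 0/1 character per segregating site.
        if len(row) != segsites or set(row) - {"0", "1"}:
            raise ValueError("haplotype %d is malformed: %r" % (n, row))
        haplotypes.append([int(c) for c in row])
    return haplotypes, positions


good = ["segsites: 2", "positions: 0.25 0.75", "01", "10"]
bad = ["segsites: 2", "positions: 0.25 0.75", "01", "1"]  # short last row

fast_result = parse_fast(good)
checked_result = parse_checked(good)
silently_wrong = parse_fast(bad)  # no error, but the rows are ragged
```

On well-formed input both functions return the same result; on the malformed input the trusting parser quietly returns ragged rows, while the checking parser raises an error that names the offending haplotype.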
To read an ms file, the main high-level interface is the MSFile class.
Simply open a file like:
>>> from mssspy import MSFile
>>> msf = MSFile('path/to/simulations.ms')
You can then access the individual replicates in the file, or "samples", using index notation:
>>> msf[0]
Sample(haplotypes=array([[0, 1, 1, 0, 0], [1, 0, 0, 1, 1]], dtype=uint8), positions=array([0.283, 0.55 , 0.589, 0.715, 0.988]))
This is the case even if there is only one sample in the file, which is then accessed at index 0.
If you intend to read multiple samples from the same file while it's open,
it is also more efficient to use MSFile in a with statement, e.g.:
>>> with MSFile('path/to/simulations.ms') as msf:
...     all_samples = list(msf)
Note: There is currently no way to get the length of the file in
samples; that is, len(msf) does not work. This is because it would require
scanning through the entire file to count the number of samples, which would
be inefficient. However, this capability will be added in a future release.
In the meantime, you can still iterate over the MSFile, which will try each
possible index starting from 0 until an IndexError is raised. In other
words, that's why list(msf) works in the example above.
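The behavior described in this note matches Python's legacy sequence iteration protocol: an object that defines __getitem__ but neither __iter__ nor __len__ can still be iterated, because Python calls it with 0, 1, 2, ... until IndexError is raised. The toy class below (a hypothetical stand-in, not MSFile itself) shows the mechanism in isolation.

```python
class Replicates:
    """Toy container that defines __getitem__ only (no __len__, no __iter__)."""

    def __init__(self, items):
        self._items = items

    def __getitem__(self, index):
        # Indexing past the end raises IndexError, which ends iteration.
        return self._items[index]


reps = Replicates(["sample 0", "sample 1", "sample 2"])

# list() works by asking for index 0, 1, 2, ... until IndexError.
all_items = list(reps)

# len() is unavailable because __len__ is not defined.
try:
    len(reps)
    has_len = True
except TypeError:
    has_len = False

# Counting is still possible, at the cost of visiting every item once.
count = sum(1 for _ in reps)
```

Counting by iteration gives the number of samples when it is really needed, but it visits every sample, which is the full scan the note describes as inefficient.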
And that's basically it!
TODO List for Future Releases
Add "fast" reader written in C(ython) and compare its performance to the existing "faster" reader.
More thorough parsing (e.g. support for time: and tree data parsing).
Support for writing.
More thorough documentation including API documentation.