Data processing for Dogma

Dogma Data

Dogma Data is a Python package for fast parsing of FASTA files, aimed at high-performance computing workloads. It parses in parallel across all available system threads and can export the parsed data to the HDF5 file format for storage and later access.
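
To make the parsing step concrete, here is a small pure-Python sketch of one way FASTA text can be turned into a flat token list plus per-record offsets. It is not Dogma Data's implementation and makes no claim about the library's internal layout. The function name `tokenize_fasta`, the `default` fallback token, the lowercasing of input, and the offsets convention are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def tokenize_fasta(
    lines: Iterable[str], vocab: Dict[str, int], default: int
) -> Tuple[List[int], List[int], List[str]]:
    """Return (tokens, offsets, headers) for FASTA text given as lines.

    tokens  : residues of all records, concatenated into one list
    offsets : record i occupies tokens[offsets[i]:offsets[i + 1]]
    headers : header text of each record, without the leading '>'
    """
    tokens: List[int] = []
    offsets: List[int] = [0]
    headers: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if headers:
                offsets.append(len(tokens))  # close the previous record
            headers.append(line[1:])
        else:
            # Assumption: input is lowercased, and characters missing from
            # vocab fall back to the default token.
            tokens.extend(vocab.get(ch, default) for ch in line.lower())
    if headers:
        offsets.append(len(tokens))  # close the last record
    return tokens, offsets, headers


# Hypothetical two-record input; the header contents are made up.
example = [">seq1 taxon=9606", "ACGT", "NN", ">seq2 taxon=10090", "GGA"]
vocab = {"a": 0, "g": 1, "c": 2, "t": 3}
tokens, offsets, headers = tokenize_fasta(example, vocab, default=vocab["a"])
```

With this layout the number of records is `len(offsets) - 1`, and each record can be sliced out of `tokens` without copying the whole array.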

Installation

To install Dogma Data, you can use pip:

pip install dogma-data

Usage

import dogma_data

# Map each sequence character to an integer token.
vocab = {
    'a': 0,
    'g': 1,
    'c': 2,
    't': 3,
    # ... (remaining entries elided)
}

mapping = dogma_data.FastaMapping(vocab, vocab['a'])
(tokens, sequences, (taxons)) = dogma_data.parse_fasta('input_path.fa', mapping, dogma_data.HeaderType.TaxonId)

header_info = {"taxons": taxons}

dogma_data.export_hdf5(
    'output_path.h5',
    dogma_data.Splitter(
        train_prop=0.95,
        val_prop=0.025,
        test_prop=0.025,
        length=len(sequences) - 1,
    ),
    tokens,
    sequences,
    header_info,
)
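
The three proportions passed to `Splitter` above sum to 1.0. As a rough illustration of what such a split means for a given number of records, the sketch below partitions shuffled record indices by proportion. The helper `split_indices`, its rounding, and its seeding are hypothetical and are not taken from Dogma Data; the library may assign records to splits differently.

```python
import random
from typing import Dict, List


def split_indices(
    length: int, train_prop: float, val_prop: float, test_prop: float, seed: int = 0
) -> Dict[str, List[int]]:
    """Shuffle range(length) and cut it into train/val/test by proportion."""
    total = train_prop + val_prop + test_prop
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"proportions must sum to 1, got {total}")
    indices = list(range(length))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_train = round(length * train_prop)
    n_val = round(length * val_prop)
    # Train and val sizes are rounded to the nearest integer;
    # whatever remains goes to the test split.
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


splits = split_indices(1000, train_prop=0.95, val_prop=0.025, test_prop=0.025)
```

Every index lands in exactly one split, so the three lists together cover all records once.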

Requirements

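  • Python 3.10, 3.11, or 3.12 (release 0.2.18 ships prebuilt wheels for these CPython versions on Linux x86-64 only)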

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any questions, feel free to reach out:

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

dogma_data-0.2.18-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (493.2 kB)

CPython 3.12, manylinux: glibc 2.17+, x86-64

dogma_data-0.2.18-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (493.7 kB)

CPython 3.11, manylinux: glibc 2.17+, x86-64

dogma_data-0.2.18-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (493.8 kB)

CPython 3.10, manylinux: glibc 2.17+, x86-64
