A tool for storing and analyzing manuscript-scale computational chemistry data

reptar

Motivation

The computational chemistry community often fails to openly provide the raw and/or processed data used to draw its scientific conclusions.

For large projects, frameworks such as QCArchive, Materials Project, Pitt Quantum Repository, ioChem-BD, and many others provide great storage solutions. These frameworks are not practical, however, for fluid data pipelines and small-scale projects such as a single manuscript.

Alternatively, you could use individual files in formats such as JSON, XML, YAML, npz, etc. These are great options for customizable data storage with their own advantages and disadvantages. However, you often must choose between (1) a standardized parser that might not support your workflow or (2) writing your own.

Reptar is designed for easy data storage and analysis in individual projects. Customizable parsers provide a simple way to extract new data without submitting issues and pull requests (although contributions are highly encouraged). While files are the heart of reptar, it strives to be file-type agnostic by providing the same interface for all supported file types. The result is a user-specified file streamlined for analysis in Python and for archival on platforms such as GitHub and Zenodo.

Installation

You can install reptar from PyPI with pip install reptar. Alternatively, the latest development version can be installed from TestPyPI or directly from the GitHub repository:

git clone https://github.com/aalexmmaldonado/reptar
cd reptar
pip install .

File types

Reptar supports four file types with a single interface: exdir, zarr, JSON, and npz. JSON is a text format suited to key-value pairs whose values have few dimensions (i.e., no large arrays). NumPy's npz format is useful for arrays; however, no nesting is possible, and loaded data often requires postprocessing for 0D arrays (e.g., np.array('data')).
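The 0D-array caveat for npz can be seen with NumPy alone. The following is a minimal sketch that does not use reptar; the names theory and energies are made up for illustration.

```python
import io

import numpy as np

# Write an npz archive to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
np.savez(buffer, theory=np.array("data"), energies=np.array([-76.4, -76.3]))
buffer.seek(0)

with np.load(buffer) as npz:
    raw_theory = npz["theory"]  # comes back as a 0D array, not a str
    energies = npz["energies"]  # regular 1D array, usable as-is

# Postprocessing step: unwrap the 0D array into a plain Python object.
theory = raw_theory.item()

print(raw_theory.shape, type(theory).__name__)
```

The scalar string is returned wrapped in an array of shape (), so it has to be unwrapped with .item() before it behaves like an ordinary Python string.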

Exdir is a simple yet powerful open file format that mimics the HDF5 format, with metadata and data stored in directories as YAML and npy files instead of in a single binary file. For more detailed information, see the Front. Neuroinform. article about exdir. Zarr is a similar hierarchical data format for chunked and compressed NumPy-like arrays with JSON attributes. Both of these file types provide several advantages, such as mixing human-readable and binary files, being easier to track with version control, and loading only the requested portions of arrays into memory.
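The benefit of loading only requested portions of an array can be illustrated with a single npy file and NumPy's memory mapping. This sketch shows the general idea only; it is not exdir's or zarr's API, and the file name geometry.npy is hypothetical.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Store a (10, 3) array in its own npy file, as a directory-based
    # format might do for each dataset.
    npy_path = os.path.join(tmp_dir, "geometry.npy")
    np.save(npy_path, np.arange(30, dtype=np.float64).reshape(10, 3))

    # mmap_mode="r" maps the file instead of reading it all into memory.
    lazy = np.load(npy_path, mmap_mode="r")
    is_memmap = isinstance(lazy, np.memmap)

    # Only the requested rows are copied into a regular in-memory array.
    first_two = np.array(lazy[:2])

    # Release the memory map before the temporary directory is removed.
    del lazy

print(first_two)
```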

Key-value pairs

All data is stored under a key-value pair within the reptar framework. The key tells reptar where the data is stored and is conceptually related to standard file paths (without file extensions). Nested data is specified by separating the nested keys with a /. For example, energy_pot, md_run/geometry, and entity_ids are all valid keys. Note that gradients and /gradients would translate to the same value (/ specifies the "root" of the file).
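The key convention can be mimicked with nested dictionaries. The sketch below is a hypothetical illustration of path-like keys, not reptar's implementation; the helpers split_key, put, and get are invented for this example.

```python
def split_key(key):
    """Split a path-like key into its parts, ignoring a leading '/'."""
    return [part for part in key.split("/") if part]


def put(store, key, data):
    """Store data under a nested key, creating parent groups as needed."""
    *groups, name = split_key(key)
    node = store
    for group in groups:
        node = node.setdefault(group, {})
    node[name] = data


def get(store, key):
    """Retrieve data for a key; '/' alone returns the root itself."""
    node = store
    for part in split_key(key):
        node = node[part]
    return node


store = {}
put(store, "energy_pot", -76.4)
put(store, "md_run/geometry", [[0.0, 0.0, 0.0]])
put(store, "/gradients", [0.1, 0.2, 0.3])

# Keys with and without the leading '/' resolve to the same value.
same_value = get(store, "gradients") == get(store, "/gradients")
print(same_value, sorted(get(store, "/")))
```

Each / in a key descends one level, which is why md_run/geometry ends up inside an md_run group while energy_pot sits at the root.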

Workflow

Storing data

We refer to a "reptar file" as any file that can be used with the reptar.File class. Creating a reptar file starts with a set of data files generated from some calculation. Paths to these data files are passed into reptar.Creator.from_calc, which extracts information using a reptar.parser class. The information parsed from these files, parsed_info, is then used to populate a reptar.File object.

Data can also be manually added by using File.put(key, data) where key is a string specifying where to store the data.

Accessing data

Data can be added or retrieved using the same interface regardless of the underlying file format (e.g., exdir, JSON, and npz). All that is required is the key specifying where the data is stored; File.get(key) then retrieves it.

When working with JSON and npz files, File.save() must be explicitly called after any modification.
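One plausible reason for the explicit save step is that single-file formats are read into memory as a whole, so edits stay in memory until they are written back. That rationale is an assumption, and the JsonStore class below is a hypothetical stand-in built on the standard library, not reptar's File class.

```python
import json
import os
import tempfile


class JsonStore:
    """Hypothetical stand-in for a JSON-backed store (not reptar.File)."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.data = json.load(f)
        else:
            self.data = {}

    def put(self, key, value):
        # Only the in-memory dictionary changes here.
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def save(self):
        # Changes reach the disk only when save() is called.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")
    store = JsonStore(path)
    store.put("energy_pot", -76.4)

    on_disk_before_save = os.path.exists(path)  # nothing written yet
    store.save()
    reloaded = JsonStore(path).get("energy_pot")

print(on_disk_before_save, reloaded)
```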

Writing to other formats

Other packages often require data to be formatted in their own specific way. Reptar provides ways to extract data from reptar files using File.get(key) and passing it into the desired reptar.writer function. Reptar currently automates the creation of:

License

Distributed under the MIT License. See LICENSE for more information.
