High-volume key-value store and analytics, based on HDF5

Project description

ExeTera

Welcome to the ExeTera Readme! This page and the accompanying GitHub wiki show you how to use ExeTera to create reproducible analysis pipelines for large tabular datasets.

Please take a moment to read this page, and also take a look at the Wiki, which contains in-depth documentation on the concepts behind this software, usage examples, and developer resources such as the roadmap for future releases.

Current release and requirements

Requires Python 3.7+


Usage

ExeTera allows you to import data from CSV sources into HDF5, where it is stored in a columnar layout better suited to performing analytics. This is done through the exetera import command.

exetera import

exetera import \
  -s path/to/covid_schema.json \
  -i "patients:path/to/patient_data.csv, assessments:path/to/assessmentdata.csv, tests:path/to/covid_test_data.csv, diet:path/to/diet_study_data.csv" \
  -o /path/to/output_dataset_name.hdf5 \
  --include "patients:(id,country_code,blood_group), assessments:(id,patient_id,chest_pain)" \
  --exclude "tests:(country_code)"
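The --include and --exclude values in the command above follow a 'table:(field,field,...)' pattern, with entries separated by commas. As an illustration of that syntax only, the following sketch splits such a string into a mapping from table name to field list. The function name parse_field_filter is hypothetical and is not part of ExeTera; ExeTera's own parsing may differ in details such as whitespace handling.

```python
import re


def parse_field_filter(spec):
    """Split 'table:(f1,f2), other:(f3)' into {'table': ['f1', 'f2'], 'other': ['f3']}.

    Hypothetical helper for illustration; not ExeTera's own parser.
    """
    result = {}
    # Each entry is a table name, a colon, then a parenthesised field list.
    for table, fields in re.findall(r"(\w+)\s*:\s*\(([^)]*)\)", spec):
        result[table] = [f.strip() for f in fields.split(",") if f.strip()]
    return result


include = parse_field_filter(
    "patients:(id,country_code,blood_group), assessments:(id,patient_id,chest_pain)"
)
for table, fields in include.items():
    print(table, fields)
```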

Arguments

  • -s/--schema: The location and name of the schema file
  • -te/--territories: If set, this only imports the listed territories. If left unset, all territories are imported
  • -i/--inputs: A comma-separated list of 'name:file' pairs. Enclose the whole list in quotes if it contains any whitespace. See the example above.
  • -o/--output_hdf5: The path and name of the HDF5 dataset to be written
  • -ts/--timestamp: An override for the timestamp to be written (defaults to datetime.now(timezone.utc))
  • -w/--overwrite: If set, overwrite any existing dataset with the same name; appends to existing dataset otherwise
  • -n/--include: If set, filters out all fields apart from those in the list.
  • -x/--exclude: If set, filters out the fields in this list.

Expect the import to take an hour or more on very large datasets.
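When the import is driven from a script, the quoting of the -i, --include and --exclude values is easy to get wrong. The sketch below assembles the argument list from Python dictionaries, using only the flags documented above. The helper names (format_inputs, format_fields, build_import_command) are hypothetical and not part of ExeTera, and the sketch only builds and prints the command; it does not run it.

```python
import shlex


def format_inputs(inputs):
    """Join {'name': 'file'} pairs into the 'name:file, name:file' form used by -i."""
    return ", ".join(f"{name}:{path}" for name, path in inputs.items())


def format_fields(fields):
    """Join {'table': ['f1', 'f2']} into the 'table:(f1,f2), ...' form used by --include/--exclude."""
    return ", ".join(f"{table}:({','.join(names)})" for table, names in fields.items())


def build_import_command(schema, inputs, output, include=None, exclude=None):
    """Return the exetera import command as an argument list (hypothetical helper)."""
    cmd = ["exetera", "import", "-s", schema, "-i", format_inputs(inputs), "-o", output]
    if include:
        cmd += ["--include", format_fields(include)]
    if exclude:
        cmd += ["--exclude", format_fields(exclude)]
    return cmd


cmd = build_import_command(
    schema="path/to/covid_schema.json",
    inputs={
        "patients": "path/to/patient_data.csv",
        "tests": "path/to/covid_test_data.csv",
    },
    output="/path/to/output_dataset_name.hdf5",
    include={"patients": ["id", "country_code", "blood_group"]},
    exclude={"tests": ["country_code"]},
)

# Print a shell-safe rendering; arguments containing spaces or parentheses get quoted.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the list form to subprocess.run avoids shell quoting entirely, since each value reaches the program as a single argument.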

How do I work on the resulting dataset?

This is done through the ExeTera Python API.

from exetera.core.session import Session

with Session() as s:
    src = s.open_dataset('/path/to/my/source/dataset', 'r', 'src')
    dest = s.open_dataset('/path/to/my/result/dataset', 'w', 'dest')

    # code...

See the wiki for detailed examples of how to interact with the HDF5 datastore.

Changes

The ChangeLog can now be found on the ExeTera wiki.
