High-volume key-value store and analytics, based on hdf5


ExeTera

Welcome to the ExeTera README! This page and the accompanying GitHub wiki show you how to use ExeTera to create reproducible analysis pipelines for large tabular datasets.

Current release and requirements

Current release version: v0.4.0

Requires Python 3.7+

Usage

ExeTera allows you to import data from CSV sources into HDF5, a columnar data format better suited to performing analytics. This is done through the exetera import command.

`exetera import`

exetera import \
-s path/to/covid_schema.json \
-i "patients:path/to/patient_data.csv, assessments:path/to/assessmentdata.csv, tests:path/to/covid_test_data.csv, diet:path/to/diet_study_data.csv" \
-o /path/to/output_dataset_name.hdf5

Arguments

-s/--schema: The location and name of the schema file
-te/--territories: If set, this only imports the listed territories. If left unset, all territories are imported
-i/--inputs: A comma-separated list of 'name:file' pairs. Enclose the list in quotes if it contains any whitespace. See the example above.
-o/--output_hdf5: The path and name to where the resulting hdf5 dataset should be written
-ts/--timestamp: An override for the timestamp to be written (defaults to datetime.now(timezone.utc))
-w/--overwrite: If set, overwrite any existing dataset with the same name; appends to existing dataset otherwise

Expect this script to take about an hour or more to execute on very large datasets.
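
When the import is driven from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the flags documented above; build_import_command is a hypothetical helper and is not part of ExeTera.

```python
import shlex


def build_import_command(schema, inputs, output, overwrite=False):
    """Return the argument list for an `exetera import` call.

    `inputs` maps a table name to a CSV path, e.g. {"patients": "p.csv"}.
    This helper is illustrative only; it is not provided by ExeTera.
    """
    # -i expects a comma-separated list of name:file pairs.
    input_arg = ",".join(f"{name}:{path}" for name, path in inputs.items())
    argv = ["exetera", "import", "-s", schema, "-i", input_arg, "-o", output]
    if overwrite:
        # Without -w the import appends to an existing dataset.
        argv.append("-w")
    return argv


argv = build_import_command(
    "path/to/covid_schema.json",
    {
        "patients": "path/to/patient_data.csv",
        "tests": "path/to/covid_test_data.csv",
    },
    "/path/to/output_dataset_name.hdf5",
    overwrite=True,
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) once exetera is installed and on the PATH.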

How do I work on the resulting dataset?

This is done through the Python exetera API.

from exetera.core.session import Session

with Session() as s:
    src = s.open_dataset('/path/to/my/source/dataset', 'r', 'src')
    dest = s.open_dataset('/path/to/my/result/dataset', 'w', 'dest')

    # code...

See the wiki for detailed examples of how to interact with the hdf5 datastore.

Changes

v0.3.2 -> v0.4

Separation of all covid-specific functionality out to https://github.com/KCL-BMEIS/ExeTeraCovid.git
Removal of legacy csv pipeline code
Renaming of some of the ordered_merge_* functionality parameters for clarity
Addition of open/close/list/get_dataset functionality to Session
Made Session 'withable'
Improved performance of Session.get_spans
Bug fixes for Session API
- apply_spans / aggregation issues
Bug fixes for Field API
- provided __bool__ so that if field: works as expected
- provided single element read for IndexedStringField

v0.3.1 -> v0.3.2

Fixing issues with use of test_type_from_mechanism_v1
Adding ability to optionally import lsoa-based fields through add_imd script
Import now appends by default; to overwrite an existing dataset use -w / --overwrite
Moved schema files to resources
Adding separate lsoa schema for import

v0.3.0 -> v0.3.1

Major performance improvement to Session.get_spans
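
For readers unfamiliar with spans, the idea can be sketched in plain Python. This assumes that a span marks the boundaries of a run of equal values in a sorted field; spans_of is a hypothetical helper written for illustration and is not the implementation behind Session.get_spans.

```python
def spans_of(sorted_keys):
    """Return boundaries b such that each run of equal keys is
    sorted_keys[b[i]:b[i + 1]]. Assumes equal keys are adjacent."""
    if len(sorted_keys) == 0:
        return []
    boundaries = [0]
    for i in range(1, len(sorted_keys)):
        # A new run starts wherever the key differs from its predecessor.
        if sorted_keys[i] != sorted_keys[i - 1]:
            boundaries.append(i)
    boundaries.append(len(sorted_keys))
    return boundaries


patient_ids = [1, 1, 2, 3, 3, 3]
print(spans_of(patient_ids))  # [0, 2, 3, 6]
```

With the boundaries in hand, per-key aggregation only needs to look at each slice once.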

v0.2.7 -> v0.3.0

Renaming of hystore to ExeTera, the project's new name!
Renaming of the hystorex command to exetera
Removal of scripts that now belong in https://github.com/KCL-BMEIS/ExeTeraCovid.git
Addition of snapshot journaling and extremely large sort functionality
Removal of the legacy csv script functionality

v0.2.7 -> v0.2.7.3

Fix to covid_schema.json for numeric diet fields marked 'float' instead of 'float32'
Addition of --daily flag to enable / disable generation of daily assessments
Addition of

v0.2.6 -> v0.2.7

Addition of diet questionnaire schema
Reworking of arguments for hystorex import to support arbitrary numbers and names of csvs
Provision of highly-scalable merge functionality through ordered merge functions
- Fix for filtering of indexed string fields
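
The scalability of an ordered merge comes from both key columns already being sorted, so a single forward pass is enough. The sketch below illustrates that idea under stated assumptions: both inputs are sorted ascending and the right-hand keys are unique. ordered_left_map is a hypothetical name and does not reflect the signatures of ExeTera's ordered_merge_* functions.

```python
def ordered_left_map(left_keys, right_keys):
    """For each left key, return the index of the matching right key,
    or None when there is no match. Both inputs must be sorted ascending
    and right_keys must not contain duplicates."""
    result = []
    j = 0
    for key in left_keys:
        # Advance the right cursor; it never moves backwards.
        while j < len(right_keys) and right_keys[j] < key:
            j += 1
        if j < len(right_keys) and right_keys[j] == key:
            result.append(j)
        else:
            result.append(None)
    return result


assessment_patient_ids = [1, 1, 2, 4, 4]
patient_ids = [1, 2, 3, 4]
print(ordered_left_map(assessment_patient_ids, patient_ids))  # [0, 0, 1, 3, 3]
```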

v0.2.5 -> v0.2.6

Moving from DataSet to Session class offering cleaner syntax
Moving from Readers/Writers to Fields for cleaner syntax
Introduction of schema for import command
Consolidating commands
- h5import -> hystorex import
- h5process -> hystorex process

v0.2.3 -> v0.2.5

Please note: there was no version v0.2.4, due to a numbering error when updating the version number
Simplifications to the API

v0.2.2 -> v0.2.3

Data schema updated for 1.5.1

v0.2.1 -> v0.2.2

Fix: Split functionality had not been moved to bin/csvsplit as documented
Fix: Missing license headers added

v0.2.0 -> v0.2.1 - tag

Refactor: Created the DataStore class and moved processor api methods onto it as member functions
Refactor: Simplified the creation of Writers. This can now be done through get_writer on a DataStore instance
Fix: Writes to an hdf5 store can no longer be interrupted by interrupts, resulting in more stable hdf5 files
Fix: Fixed critical bug in process method that resulted in exceptions when running on fields with a length that isn't an exact multiple of the chunksize
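
The chunksize bug above is a common pitfall in chunked processing: the final chunk is shorter whenever the field length is not an exact multiple of the chunk size. A minimal sketch of chunk iteration that handles the partial last chunk follows; chunk_ranges is a hypothetical helper, not the code used by the process method.

```python
def chunk_ranges(length, chunksize):
    """Yield (start, end) pairs that cover range(length) in order.
    The last pair is shorter when length is not a multiple of chunksize."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, length, chunksize):
        # Clamp the end so the final chunk never runs past the data.
        yield start, min(start + chunksize, length)


print(list(chunk_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```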

v0.1.9 -> v0.2.0

Added hdf5 import and process functionality

v0.1.8 -> v0.1.9

Feature: provision of the split.py script to split the dataset up into subsets of patients and their associated assessments
Fix: added treatments and other_symptoms to cleaned assessment file. These fields are concatenated during the merge step using csv-style delimiters and escapes
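
To illustrate what csv-style delimiters and escapes mean for concatenated free-text values, here is a minimal sketch using Python's csv module with its default dialect. The exact dialect used by the merge step is not stated here, so treat the quoting rules and the helper names join_values and split_values as assumptions.

```python
import csv
import io


def join_values(values):
    """Concatenate values into one string, quoting any value that
    contains a comma or a double quote (default csv dialect)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


def split_values(text):
    """Recover the original values from a string made by join_values."""
    return next(csv.reader([text]))


treatments = ["paracetamol", "rest, fluids", 'said "none"']
joined = join_values(treatments)
print(joined)  # paracetamol,"rest, fluids","said ""none"""
print(split_values(joined) == treatments)  # True
```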

v0.1.7 -> v0.1.8

Fix: had_covid_test was not being patched up along with tested_covid_positive
Breaking change: output fields renamed
- Fixed up had_covid_test is output as had_covid_test_clean
- Fixed up tested_covid_positive is output as tested_covid_positive_clean
- had_covid_test and tested_covid_positive contain the un-fixed-up data (although rows may still be modified as a result of quantising assessments by day)

v0.1.6 -> v0.1.7

Fix: height_clean contained weight data and weight_clean contained height data. This had been the case since they were introduced in v0.1.5

v0.1.5 -> v0.1.6

Performance: reduced memory usage
Addition: provision of -ps flag for setting parsing schema

v0.1.4 -> v0.1.5

Fix: health_status was not being accumulated during the assessment compression phase of cleanup

v0.1.3 -> v0.1.4

Fix: added missing value rarely_left_the_house_but_visit_lots to level_of_isolation
Fix: added missing fields weight_clean, height_clean and bmi_clean

v0.1.2 -> v0.1.3

Fix: -po and -ao options now properly export patient and assessment csvs respectively

v0.1.1 -> v0.1.2

Fix: day no longer overwriting tested_covid_positive on assessment export
Fix: tested_covid_positive output as a label instead of a number

v0.1 -> v0.1.1

Change: Converted 'NA' to '' for csv export


