High-volume key-value store and analytics, based on hdf5
ExeTera
Welcome to the ExeTera Readme! This page and the accompanying GitHub wiki show you how to use ExeTera to create reproducible analysis pipelines for large tabular datasets.
Current release and requirements
Current release version: v0.4.0
Requires Python 3.7+
Usage
ExeTera allows you to import data from CSV sources into HDF5, a columnar data
format more suited to performing analytics. This is done through exetera import.
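To illustrate why a columnar layout suits analytics, here is a minimal sketch in plain Python. It is not how ExeTera or HDF5 store data internally. The `rows_to_columns` helper and the sample data are hypothetical and exist only for this illustration. The idea is that once each column is held as its own sequence, an aggregate over one column never has to touch the others.

```python
import csv
import io


def rows_to_columns(csv_text):
    """Read CSV text and return a dict mapping each column name to a list of its values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Hypothetical sample data, for illustration only.
sample = "id,age,country\n1,34,GB\n2,51,US\n3,27,SE\n"
columns = rows_to_columns(sample)

# An aggregate over 'age' only reads the 'age' column.
ages = [int(a) for a in columns["age"]]
mean_age = sum(ages) / len(ages)
print(mean_age)
```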
exetera import
exetera import \
    -s path/to/covid_schema.json \
    -i "patients:path/to/patient_data.csv, assessments:path/to/assessmentdata.csv, tests:path/to/covid_test_data.csv, diet:path/to/diet_study_data.csv" \
    -o /path/to/output_dataset_name.hdf5
Arguments
- -s/--schema: The location and name of the schema file
- -te/--territories: If set, this only imports the listed territories. If left unset, all territories are imported
- -i/--inputs: A comma separated list of 'name:file' pairs. This should be put in quotes if it contains any whitespace. See the example above.
- -o/--output_hdf5: The path and name to where the resulting hdf5 dataset should be written
- -ts/--timestamp: An override for the timestamp to be written (defaults to datetime.now(timezone.utc))
- -w/--overwrite: If set, overwrite any existing dataset with the same name; appends to existing dataset otherwise
Expect this script to take about an hour or more to execute on very large datasets.
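If you drive the import from a script, the command line can be assembled programmatically. The sketch below is one possible way to do that using only the flags documented above. The `build_import_command` helper is hypothetical and is not part of ExeTera. It joins the 'name:file' pairs with a bare comma so that the `-i` value contains no whitespace.

```python
import shlex


def build_import_command(schema, inputs, output, overwrite=False):
    """Build the argument list for `exetera import` (hypothetical helper, not part of ExeTera)."""
    # -i expects a comma separated list of name:file pairs.
    input_arg = ",".join(f"{name}:{path}" for name, path in inputs.items())
    argv = ["exetera", "import", "-s", schema, "-i", input_arg, "-o", output]
    if overwrite:
        # -w overwrites an existing dataset; without it the import appends.
        argv.append("-w")
    return argv


argv = build_import_command(
    "path/to/covid_schema.json",
    {
        "patients": "path/to/patient_data.csv",
        "tests": "path/to/covid_test_data.csv",
    },
    "/path/to/output_dataset_name.hdf5",
)

# Print a shell-quoted form of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the `exetera` command is installed.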
How do I work on the resulting dataset?
This is done through the Python exetera API.
from exetera.core.session import Session

with Session() as s:
    src = s.open_dataset('/path/to/my/source/dataset', 'r', 'src')
    dest = s.open_dataset('/path/to/my/result/dataset', 'w', 'dest')
    # code...
See the wiki for detailed examples of how to interact with the hdf5 datastore.
Changes
v0.3.2 -> v0.4
- Separation of all covid-specific functionality out to https://github.com/KCL-BMEIS/ExeTeraCovid.git
- Removal of legacy csv pipeline code
- Renaming of some of the ordered_merge_* functionality parameters for clarity
- Addition of open/close/list/get_dataset functionality to Session
- Made Session 'withable'
- Improved performance of Session.get_spans
- Bug fixes for Session API
  - apply_spans / aggregation issues
- Bug fixes for Field API
  - provided __bool__ so that if field: works as expected
  - provided single element read for IndexedStringField
v0.3.1 -> v0.3.2
- Fixing issues with use of test_type_from_mechanism_v1
- Adding ability to optionally import lsoa-based fields through add_imd script
- Import now appends by default; to overwrite an existing dataset use -w/--overwrite
- Moved schema files to resources
- Adding separate lsoa schema for import
v0.3.0 -> v0.3.1
- Major performance improvement to Session.get_spans
v0.2.7 -> v0.3.0
- Renaming of hystore to ExeTera, the project's new name!
- Renaming of the hystorex command to exetera
- Removal of scripts that now belong in https://github.com/KCL-BMEIS/ExeTeraCovid.git
- Addition of snapshot journaling and extremely large sort functionality
- Removal of the legacy csv script functionality
v0.2.7 -> v0.2.7.3
- Fix to covid_schema.json for numeric diet fields marked 'float' instead of 'float32'
- Addition of --daily flag to enable / disable generation of daily assessments
v0.2.6 -> v0.2.7
- Addition of diet questionnaire schema
- Reworking of arguments for hystorex import to support arbitrary numbers and names of csvs
- Provision of highly-scalable merge functionality through ordered merge functions
- Fix for filtering of indexed string fields
v0.2.5 -> v0.2.6
- Moving from DataSet to Session class offering cleaner syntax
- Moving from Readers/Writers to Fields for cleaner syntax
- Introduction of schema for import command
- Consolidating commands
- h5import -> hystorex import
- h5process -> hystorex process
v0.2.3 -> v0.2.5
- Please note: there was no version v0.2.4, due to a numbering error when updating the version number
- Simplifications to the API
v0.2.2 -> v0.2.3
- Data schema updated for 1.5.1
v0.2.1 -> v0.2.2
- Fix: Split functionality had not been moved to bin/csvsplit as documented
- Fix: Missing license headers added
v0.2.0 -> v0.2.1 - tag
- Refactor: Created the DataStore class and moved processor api methods onto it as member functions
- Refactor: Simplified the creation of Writers. This can now be done through get_writer on a DataStore instance
- Fix: Writes to a hdf5 store can no longer be interrupted by interrupts, resulting in more stable hdf5 files
- Fix: Fixed critical bug in process method that resulted in exceptions when running on fields with a length that isn't an exact multiple of the chunksize
v0.1.9 -> v0.2.0
- Added hdf5 import and process functionality
v0.1.8 -> v0.1.9
- Feature: provision of the split.py script to split the dataset up into subsets of patients and their associated assessments
- Fix: added treatments and other_symptoms to cleaned assessment file. These fields are concatenated during the merge step using csv-style delimiters and escapes
v0.1.7 -> v0.1.8
- Fix: had_covid_test was not being patched up along with tested_covid_positive
- Breaking change: output fields renamed
  - Fixed up had_covid_test is output as had_covid_test_clean
  - Fixed up tested_covid_positive is output as tested_covid_positive_clean
  - had_covid_test and tested_covid_positive contain the un-fixed-up data (although rows may still be modified as a result of quantising assessments by day)
v0.1.6 -> v0.1.7
- Fix: height_clean contains weight data and weight_clean contains height data. This has been the case since they were introduced in v0.1.5
v0.1.5 -> v0.1.6
- Performance: reduced memory usage
- Addition: provision of -ps flag for setting parsing schema
v0.1.4 -> v0.1.5
- Fix: health_status was not being accumulated during the assessment compression phase of cleanup
v0.1.3 -> v0.1.4
- Fix: added missing value rarely_left_the_house_but_visit_lots to level_of_isolation
- Fix: added missing fields weight_clean, height_clean and bmi_clean
v0.1.2 -> v0.1.3
- Fix: -po and -ao options now properly export patient and assessment csvs respectively
v0.1.1 -> v0.1.2
- Fix: day no longer overwriting tested_covid_positive on assessment export
- Fix: tested_covid_positive output as a label instead of a number
v0.1 -> v0.1.1
- Change: Converted 'NA' to '' for csv export