Small library of common functionalities used in various projects in the ratschlab

ratschlab-common

Small library of common code used in various projects in the ratschlab.

Features

  • Writing parquet and HDF5 files with sensible defaults (see ratschlab_common.io.dataframe_formats).
  • Support for working with 'chunkfiles', i.e. splitting up a large dataset into smaller chunks which can be processed independently (see example notebook):
    • Repartitioning records (i.e. increasing or decreasing the number of chunkfiles) while keeping data that belongs together in the same file (e.g. all records associated with the same patient id)
    • Simple indexing for looking up which chunk holds the data belonging to, e.g., a given patient
  • bigmatrix: support for creating and reading large matrices stored in HDF5 with additional metadata on the axes in the form of data frames (see example notebook).
  • Small wrappers for spark and dask (see spark example).
  • Saving sparse pandas dataframes to HDF5 (see example notebook).
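
To illustrate the chunkfile idea from the list above, here is a minimal sketch in plain Python. It does not use the ratschlab-common API: the function names `chunk_for_key`, `repartition` and `build_index` are hypothetical, and chunks are modelled as in-memory lists of dicts rather than files. It only shows the principle of repartitioning records while keeping all records with the same key (here a patient id) in one chunk, and of building a simple key-to-chunk index.

```python
import zlib


def chunk_for_key(key, n_chunks):
    """Map a key (e.g. a patient id) to a chunk number in a stable way."""
    # crc32 is deterministic across runs, unlike the built-in hash() for strings
    return zlib.crc32(str(key).encode("utf-8")) % n_chunks


def repartition(chunks, key_field, n_chunks):
    """Redistribute records into n_chunks chunks, keeping equal keys together."""
    new_chunks = [[] for _ in range(n_chunks)]
    for chunk in chunks:
        for record in chunk:
            target = chunk_for_key(record[key_field], n_chunks)
            new_chunks[target].append(record)
    return new_chunks


def build_index(chunks, key_field):
    """Return a dict mapping each key to the number of the chunk holding it."""
    index = {}
    for chunk_no, chunk in enumerate(chunks):
        for record in chunk:
            index[record[key_field]] = chunk_no
    return index


# Two input chunks in which the records of patient 1 are spread over both
chunks = [
    [{"patient_id": 1, "value": 0.5}, {"patient_id": 2, "value": 1.5}],
    [{"patient_id": 1, "value": 0.7}, {"patient_id": 3, "value": 2.0}],
]
new_chunks = repartition(chunks, "patient_id", n_chunks=3)
index = build_index(new_chunks, "patient_id")

# All records of patient 1 can now be read from a single chunk
patient_1_records = [
    r for r in new_chunks[index[1]] if r["patient_id"] == 1
]
```

Assigning chunks by a hash of the key is only one possible design; the library itself may distribute keys differently, for example to balance chunk sizes.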

Tools

ratschlab-common also comes with some command line tools:

  • pq-tool: inspect parquet files on the command line
    • pq-tool head: first records
    • pq-tool tail: last records
    • pq-tool cat: all records
    • pq-tool schema: schema of a parquet file
  • export-db-to-files: Tool to dump (postgres) database tables into parquet files. Large tables can be partitioned on a key and dumped into separate file chunks. This allows for further processing to be easily done in parallel.
  • bigmatrix-repack: rechunking/packing bigmatrix hdf5 files
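
The idea behind export-db-to-files, dumping a table into separate file chunks partitioned on a key, can be sketched with the standard library alone. The following is not the tool's implementation: it uses an in-memory SQLite database as a stand-in for postgres, CSV as a stand-in for parquet, and a hypothetical helper `dump_table_in_chunks` that partitions on an integer key by taking it modulo the number of chunks.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def dump_table_in_chunks(conn, table, key, n_chunks, out_dir):
    """Dump `table` into n_chunks CSV files, partitioned on the integer column `key`."""
    out_dir = Path(out_dir)
    paths = []
    for chunk_no in range(n_chunks):
        # Table and column names cannot be bound as parameters;
        # they are assumed to come from a trusted source here.
        cursor = conn.execute(
            f"SELECT * FROM {table} WHERE {key} % ? = ? ORDER BY {key}",
            (n_chunks, chunk_no),
        )
        path = out_dir / f"{table}_chunk_{chunk_no}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            # Header row taken from the column names of the query result
            writer.writerow([col[0] for col in cursor.description])
            writer.writerows(cursor)
        paths.append(path)
    return paths


# Small example table (made-up data, for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (patient_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(1, 0.5), (2, 1.5), (1, 0.7), (3, 2.0)],
)
out_dir = tempfile.mkdtemp()
paths = dump_table_in_chunks(conn, "observations", "patient_id", 2, out_dir)
```

Because every row with a given key lands in exactly one file, each resulting file can be handed to a separate worker for further processing.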

Installation and Requirements

The library along with all the required dependencies can be installed with:

pip install ratschlab-common[complete]

Depending on whether you plan to use spark, dask, or neither, you can install ratschlab-common with one of the following commands:

pip install ratschlab-common
pip install ratschlab-common[spark]
pip install ratschlab-common[dask]

Note that if you plan on using spark, make sure you have Java 8 and either Python 3.6 or 3.7 installed (Python 3.8 is currently not supported by pyspark).
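
The Python version constraint in the note above can be checked before installing the spark extra. This is a small sketch, not part of ratschlab-common; the names `SPARK_SUPPORTED_PYTHONS` and `python_supports_spark` are made up for illustration, and the set of versions simply restates the note.

```python
import sys

# Python (major, minor) versions named in the note above as working with spark
SPARK_SUPPORTED_PYTHONS = {(3, 6), (3, 7)}


def python_supports_spark(version_info=sys.version_info):
    """Return True if the given Python version is one listed as supported."""
    return tuple(version_info[:2]) in SPARK_SUPPORTED_PYTHONS


if not python_supports_spark():
    print(
        "Python %d.%d is not listed as supported for the spark extra"
        % tuple(sys.version_info[:2])
    )
```

The check covers only the Python version; whether Java 8 is installed has to be verified separately.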
