Small library of common functionalities used in various projects in the ratschlab
ratschlab-common
Small library of common code used in various projects in the ratschlab.
Features
- Writing parquet and HDF5 files with sensible defaults (ratschlab_common.io.dataframe_formats).
- Support for working with 'chunkfiles', i.e. splitting up a large dataset into smaller chunks which can be processed independently (see example notebook):
  - Repartitioning records (i.e. increasing or decreasing the number of chunkfiles) while keeping data that belongs together in the same file (e.g. all data associated with the same patient id).
  - Simple indexing for looking up which chunk holds the data belonging to, e.g., a given patient.
- bigmatrix: support for creating and reading large matrices stored in HDF5, with additional metadata on the axes in the form of data frames (see example notebook).
- Small wrappers for spark and dask (see spark example).
- Saving sparse pandas dataframes to HDF5 (see example notebook).
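To make the chunkfile idea concrete, here is a minimal, standard-library-only sketch of repartitioning and indexing. It is not the library's implementation or API: the function names `repartition` and `build_index`, the record layout, and the hash-based chunk assignment are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def repartition(records, key, n_chunks):
    """Assign records to n_chunks chunks so that equal keys share a chunk.

    Hypothetical helper, not part of ratschlab-common.
    """
    chunks = defaultdict(list)
    for rec in records:
        # crc32 is stable across interpreter runs, unlike the built-in hash()
        chunk_id = zlib.crc32(str(rec[key]).encode("utf-8")) % n_chunks
        chunks[chunk_id].append(rec)
    return dict(chunks)


def build_index(chunks, key):
    """Map each key value to the id of the chunk holding its records."""
    index = {}
    for chunk_id, recs in chunks.items():
        for rec in recs:
            index[rec[key]] = chunk_id
    return index


records = [
    {"patient_id": 1, "value": 0.5},
    {"patient_id": 2, "value": 1.5},
    {"patient_id": 1, "value": 0.7},
    {"patient_id": 3, "value": 2.0},
]

chunks = repartition(records, "patient_id", n_chunks=2)
index = build_index(chunks, "patient_id")

# All records of patient 1 can be found in a single chunk.
patient_1 = [r for r in chunks[index[1]] if r["patient_id"] == 1]
```

Because the chunk is derived from the key alone, changing `n_chunks` increases or decreases the number of chunks without ever splitting one patient's records across files. In practice each chunk would be written to its own file and the index stored alongside.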
Tools
ratschlab-common also comes with some command line tools:
- pq-tool: inspect parquet files on the command line
  - pq-tool head: first records
  - pq-tool tail: last records
  - pq-tool cat: all records
  - pq-tool schema: schema of a parquet file
- export-db-to-files: tool to dump (postgres) database tables into parquet files. Large tables can be partitioned on a key and dumped into separate file chunks. This allows for further processing to be easily done in parallel.
- bigmatrix-repack: rechunking/packing bigmatrix hdf5 files
Installation and Requirements
The library along with all the required dependencies can be installed with:
pip install ratschlab-common[complete]
Depending on whether you plan to use spark, dask, or neither of them, you can install ratschlab-common with one of the following commands:
pip install ratschlab-common
pip install ratschlab-common[spark]
pip install ratschlab-common[dask]
Note that if you plan on using spark, make sure you have Java 8 and either python 3.6 or 3.7 installed (python 3.8 is currently not supported by pyspark).
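After installing, it can be useful to check which optional backends are actually importable and whether the interpreter matches the python versions named above. The following is a small sketch using only the standard library. The helper names are hypothetical, and it assumes the spark and dask extras provide the import names `pyspark` and `dask`.

```python
import importlib.util
import sys


def available_backends(candidates=("pyspark", "dask")):
    """Return the candidate packages that can be found for import."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


def python_ok_for_spark(version_info=sys.version_info):
    """True for python 3.6 or 3.7, the versions listed for spark use."""
    return tuple(version_info[:2]) in {(3, 6), (3, 7)}


print("backends found:", available_backends())
print("python suitable for spark:", python_ok_for_spark())
```

The check only looks at whether the packages can be located; it does not verify the Java installation that spark also needs.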