Systematic comparisons of multiple datasets

Project description

DataComp: A Python Framework for Systematic Dataset Comparisons

Current version on PyPI · Apache 2.0 License · Stable · Supported Python Versions · Development · Documentation Status

Description

DataComp is an open-source Python package for domain-independent, multimodal, longitudinal dataset comparisons. It serves as an investigative toolbox for assessing differences between multiple datasets at the feature level. DataComp helps data analysts identify which features differ significantly between datasets and which do not, and thereby find comparable dataset combinations.

Typical application scenarios are:

  • Identifying comparable datasets that can be used in machine learning approaches as training and independent test data

  • Evaluating if, how, and where simulated or synthetic datasets deviate from real-world data

  • Assessing (systematic) differences across multiple datasets (for example, multiple sampling sites)

  • Conducting multiple statistical comparisons

  • Creating comparative visualizations

[Figure: DataComp workflow (./docs/source/DataComp_workflow.png)]

The figure above depicts a typical DataComp workflow.

Main Features

DataComp supports:

  • Evaluating and visualizing the overlap in features across datasets

  • Parametric and nonparametric statistical hypothesis testing to compare feature value distributions

  • Creating comparative plots of feature value distributions

  • Normalizing time series data to baseline and statistically comparing the progression of features over time

  • Comparative visualization of feature progression over time

  • Hierarchical clustering of the entities in the datasets to evaluate whether dataset membership labels are evenly distributed across clusters or assigned to distinct clusters

  • Performing a MANOVA to assess the influence of features on dataset membership
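
To make a few of these ideas concrete, here is a minimal sketch in plain Python of three of the steps listed above: feature overlap, baseline normalization, and a nonparametric comparison of one feature between two datasets. It does not use or reproduce the DataComp API. The function names, the dict-of-lists data layout, the choice of subtraction for baseline normalization, and the permutation test on the difference in means are all assumptions made for illustration.

```python
import random


def feature_overlap(datasets):
    """Map each feature name to the sorted names of the datasets that contain it."""
    overlap = {}
    for dataset_name, features in datasets.items():
        for feature in features:
            overlap.setdefault(feature, []).append(dataset_name)
    return {feature: sorted(names) for feature, names in overlap.items()}


def normalize_to_baseline(values):
    """Express each time point as the change from the first (baseline) value.

    Assumption: baseline normalization by subtraction; other schemes
    (for example, dividing by the baseline) are equally plausible.
    """
    baseline = values[0]
    return [value - baseline for value in values]


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means.

    Returns an estimated p-value. This is one possible nonparametric test,
    chosen here only because it needs nothing beyond the standard library.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: two sampling sites, feature name -> observed values.
datasets = {
    "site_a": {"age": [61, 64, 70, 58, 66], "bmi": [24.1, 27.3, 22.8, 25.0, 26.4]},
    "site_b": {"age": [45, 49, 52, 47, 50]},
}

overlap = feature_overlap(datasets)
shared = [f for f, names in overlap.items() if len(names) == len(datasets)]
for feature in shared:
    p_value = permutation_test(datasets["site_a"][feature], datasets["site_b"][feature])
    print(feature, "estimated p-value:", round(p_value, 4))
```

Only features present in every dataset are compared here. When many features are tested this way, the resulting p-values would normally need a multiple-testing correction before being interpreted.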

Installation

pip install datacomp
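
To confirm that the installation succeeded, the installed version can be queried from Python. The sketch below relies only on the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is an illustrative assumption and not part of DataComp.

```python
from importlib import metadata


def installed_version(distribution="datacomp"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("datacomp is not installed in this environment")
else:
    print("datacomp version:", version)
```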

Documentation

The full package documentation can be found here.

Application examples

Example notebooks showcasing DataComp workflows and results on simulated data can be found at DataComp_Examples.
