Project description

Data pipeline

Data pipeline to ingest, conform, normalise, validate and export tabular data files (for now), given YAML schema(s) as the only source of truth. It should be easy to plug into modern orchestration tools such as Airflow and Dagster.
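As a sketch of what a schema as the "only source of truth" might look like once parsed, here is a hypothetical, hand-written example (the keys, aliases and type names below are assumptions, not the package's actual schema format):

```python
# Hypothetical parsed schema (as a plain dict, mirroring a YAML file).
# Keys and structure are illustrative assumptions only.
SCHEMA = {
    "columns": {
        "age": {"type": "int", "aliases": ["Age", "age_years"]},
        "dob": {"type": "date", "aliases": ["Date of Birth", "DOB"]},
        "country": {"type": "categorical", "values": ["PT", "ES", "FR"]},
    }
}
```

A real pipeline would load this from a `.yml` file (e.g. with PyYAML) and treat it as the single definition of accepted columns, aliases and data types.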

:warning: This is a work in progress

Steps

  • Read - currently CSV and XLSX
  • Conform - detect column headers, normalise them (according to schema rules and accepted aliases) and ignore irrelevant ones
  • Normalise - normalise column contents according to schema data types
    • define accepted data formats (int, float, str, date, year-month, categorical)
    • allow schemas to extend these formats
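The conform and normalise steps above can be sketched as follows; this is a minimal illustration, assuming a simple alias map and per-column type converters (all names here are hypothetical, not the package's API):

```python
from datetime import datetime

# Hypothetical alias map and type converters, as a schema might define them.
ALIASES = {"Age": "age", "age_years": "age", "Date of Birth": "dob"}
TYPES = {"age": int, "dob": lambda s: datetime.strptime(s, "%Y-%m-%d").date()}


def conform(header):
    """Map raw column names to canonical ones; None marks irrelevant columns."""
    return [ALIASES.get(col) for col in header]


def normalise(row, header):
    """Coerce each kept cell to the data type its schema column declares."""
    out = {}
    for name, value in zip(conform(header), row):
        if name is not None:  # drop columns the schema does not know about
            out[name] = TYPES[name](value)
    return out
```

For example, `normalise(["42", "2000-01-01", "x"], ["Age", "Date of Birth", "Notes"])` keeps only the two schema-known columns, with `age` as an `int` and `dob` as a `date`.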

TODO

  • Try this with Airflow - use the S3 operator to store data between steps
  • Try this with Dagster?
  • Try this with Pyodide?
  • Make it work for multiple datasets - many files, and many tables within each file
  • Allow other export formats - for now, data is transferred as CSV, but we might want a binary format such as Feather (which might require pandas)
  • Try external data sources (such as an S3 bucket) - one should be able to read from and write to them easily
  • Return data directly - there may be no need to store data between steps; in that case, it should be returned directly
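The CSV transfer mentioned in the TODO list might look like the following stdlib-only helper (a sketch; the function name and signature are assumptions, and a Feather-based variant would instead build a pandas `DataFrame` and call `to_feather`):

```python
import csv
import io


def export_csv(rows, fieldnames):
    """Serialise normalised rows to CSV text, the current transfer format."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Returning a string (or writing to any file-like object) keeps the step easy to wire into Airflow/Dagster operators that move data between tasks.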

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tabular_pipeline-0.2.3.tar.gz (5.5 kB view details)

Uploaded Source

Built Distribution

tabular_pipeline-0.2.3-py3-none-any.whl (7.3 kB view details)

Uploaded Python 3

File details

Details for the file tabular_pipeline-0.2.3.tar.gz.

File metadata

  • Download URL: tabular_pipeline-0.2.3.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.11.3 Darwin/22.3.0

File hashes

Hashes for tabular_pipeline-0.2.3.tar.gz
Algorithm Hash digest
SHA256 a1de545c68e8b29206bead7c532e61faf23457610a3a50b8237e16798b428caf
MD5 42b5e4125be2664d696fdc74ca8815ca
BLAKE2b-256 4f82d6d7a1f0dac46ee1f7a822c3a805f81d51c55f62fdee241414331504846e

See more details on using hashes here.

File details

Details for the file tabular_pipeline-0.2.3-py3-none-any.whl.

File metadata

File hashes

Hashes for tabular_pipeline-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 aa8f829ac18b61b62b35f9abe192315a3bb0e869d40f145a205db8cc99309819
MD5 f66bb3917d97a154bc0c88084acd228f
BLAKE2b-256 cfaa5150a85110e4aa7a221921be2f17f9dd6b10aecaa3b50a338fe41a823581

See more details on using hashes here.
