

Data pipeline

A data pipeline to ingest, conform, normalise, validate and export tabular data files (for now), driven by yml schema(s) as the only source of truth. It should be easy to plug it into modern orchestration tools such as Airflow and Dagster.
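As a rough illustration, a schema might look something like the sketch below. The exact keys (`aliases`, `type`, `format`, `categories`) are assumptions made for this example, not necessarily the package's actual schema format.

```yaml
# Hypothetical schema layout -- key names are illustrative only
table: orders
columns:
  - name: order_id                       # canonical column name
    aliases: ["Order ID", "id"]          # accepted header variants
    type: int
  - name: amount
    aliases: ["Amount (EUR)", "total"]
    type: float
  - name: order_date
    type: date
    format: "%Y-%m-%d"                   # accepted data format, extendable per schema
  - name: status
    type: categorical
    categories: ["open", "shipped", "cancelled"]
```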

:warning: This is a work in progress

Steps

  • Read - currently csv and xlsx
  • Conform - detect column headers, normalise them (according to schema rules/accepted aliases) and ignore irrelevant ones
  • Normalise - normalise column content according to schema data types (a rough end-to-end sketch of these steps follows this list)
    • define accepted data formats (int, float, str, date, year-month, categorical)
    • allow schemas to extend said formats
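The three steps compose naturally as plain functions passing a DataFrame along. The sketch below is only an assumption of how such a pipeline could look: `read_table`, `conform_headers` and `normalise_values` are hypothetical helpers (not this package's public API), the file paths are placeholders, and the schema keys follow the assumed layout above.

```python
from pathlib import Path

import pandas as pd
import yaml


def read_table(path: Path) -> pd.DataFrame:
    """Read a csv or xlsx file into a DataFrame."""
    if path.suffix == ".xlsx":
        return pd.read_excel(path)
    return pd.read_csv(path)


def conform_headers(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Map header aliases to canonical names and drop irrelevant columns."""
    rename = {}
    for col in schema["columns"]:
        for alias in [col["name"], *col.get("aliases", [])]:
            rename[alias.strip().lower()] = col["name"]
    df = df.rename(columns=lambda c: rename.get(str(c).strip().lower(), c))
    keep = [c["name"] for c in schema["columns"] if c["name"] in df.columns]
    return df[keep]


def normalise_values(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Coerce column content to the data types declared in the schema."""
    df = df.copy()
    for col in schema["columns"]:
        name = col["name"]
        if name not in df.columns:
            continue
        if col["type"] in ("int", "float"):
            df[name] = pd.to_numeric(df[name], errors="coerce")
        elif col["type"] == "date":
            df[name] = pd.to_datetime(df[name], format=col.get("format"), errors="coerce")
        elif col["type"] == "categorical":
            df[name] = df[name].astype("category")
        # year-month and schema-defined custom formats would slot in here
    return df


schema = yaml.safe_load(Path("schemas/orders.yml").read_text())
df = normalise_values(conform_headers(read_table(Path("orders.xlsx")), schema), schema)
```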

TODO

  • Try this with Airflow - use the S3 operator to store data between steps (see the DAG sketch after this list)
  • Try this with Dagster?
  • Try this with Pyodide?
  • Make it work for multiple datasets - many files and many tables within each file
  • Allow other export formats - for now, data is transferred as csv, but we might want to use a binary format such as feather (which might require pandas)
  • Try with external data sources (like an S3 bucket) - one should be able to read from and write to them easily
  • Return data directly - it might be the case that there's no need to store the data between steps; in that case it should be returned directly
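For the Airflow item, one possible wiring is a small DAG where each step writes its intermediate csv to S3 and the next step reads it back. The sketch below assumes Airflow 2.x with the Amazon provider installed and uses `S3Hook` rather than a dedicated operator; the task bodies, bucket and keys are placeholders, and the actual conform/normalise calls are elided.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-pipeline-bucket"  # placeholder bucket name


def _put(key: str, csv_text: str) -> None:
    # Store an intermediate csv in S3 between steps
    S3Hook(aws_conn_id="aws_default").load_string(
        csv_text, key=key, bucket_name=BUCKET, replace=True
    )


def _get(key: str) -> str:
    # Read an intermediate csv back from S3
    return S3Hook(aws_conn_id="aws_default").read_key(key, bucket_name=BUCKET)


def conform_task(**_):
    raw = _get("raw/orders.csv")
    # ... run the conform step on `raw` here ...
    _put("conformed/orders.csv", raw)


def normalise_task(**_):
    conformed = _get("conformed/orders.csv")
    # ... run the normalise step here ...
    _put("normalised/orders.csv", conformed)


with DAG("tabular_pipeline", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    conform = PythonOperator(task_id="conform", python_callable=conform_task)
    normalise = PythonOperator(task_id="normalise", python_callable=normalise_task)
    conform >> normalise
```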
