Data pipeline
A data pipeline to ingest, conform, normalise, validate and export tabular data files (for now), with YAML schema(s) as the only source of truth. It should be easy to plug into modern orchestration tools such as Airflow and Dagster.
:warning: This is a work in progress
Steps
- Read - currently CSV and XLSX
- Conform - detect column headers, normalise them (according to schema rules/accepted aliases) and ignore irrelevant ones
- Normalise - normalise column contents according to the schema data types
  - define accepted data formats (int, float, str, date, year-month, categorical)
  - allow schemas to extend said formats
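The conform and normalise steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the package's actual API: the schema layout (`columns`, `aliases`, `type`), the `PARSERS` mapping, the `conform` and `normalise` functions, and the sample column names are all assumptions, and the schema is written as the dict a YAML loader would produce.

```python
from datetime import datetime

# Hypothetical schema, shown as the dict a YAML loader would produce.
# The keys ("columns", "aliases", "type") are assumptions for illustration.
SCHEMA = {
    "columns": {
        "customer_id": {"type": "int", "aliases": ["id", "customer id"]},
        "amount": {"type": "float", "aliases": ["total", "amount (eur)"]},
        "signup_date": {"type": "date", "aliases": ["signup", "date joined"]},
    }
}

# One parser per accepted data type; a schema could extend this mapping.
PARSERS = {
    "int": int,
    "float": float,
    "str": str,
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d").date(),
}


def conform(rows, schema):
    """Rename recognised headers to their schema name and drop the rest."""
    lookup = {}
    for name, spec in schema["columns"].items():
        for alias in [name, *spec.get("aliases", [])]:
            lookup[alias.strip().lower()] = name
    conformed = []
    for row in rows:
        conformed.append({
            lookup[header.strip().lower()]: value
            for header, value in row.items()
            if header.strip().lower() in lookup
        })
    return conformed


def normalise(rows, schema):
    """Cast each value to the type its schema column declares."""
    types = {name: spec["type"] for name, spec in schema["columns"].items()}
    return [
        {name: PARSERS[types[name]](value.strip()) for name, value in row.items()}
        for row in rows
    ]


# A row as it might come out of the read step, with messy headers.
raw = [{"ID": "7", "Total": "12.50", "Signup": "2023-04-01", "Notes": "n/a"}]
clean = normalise(conform(raw, SCHEMA), SCHEMA)
print(clean)
```

Keeping the parsers in a plain mapping is one way to let a schema extend the accepted formats: a new type only needs a new entry, and the two step functions stay unchanged.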
TODO
- Try this with Airflow - use the S3 operator to store data between steps
- Try this with Dagster?
- Try this with Pyodide?
- Make it work for multiple datasets - many files and many tables within each file
- Allow other export formats - for now, data is transferred as CSV, but we might want to use a binary format such as Feather (which might require pandas)
- Try with external data sources (like an S3 bucket) - one should be able to read from and write to them easily
- Return data directly - there may be no need to store the data between steps, in which case it should be returned directly
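The last item could look roughly like the sketch below, in which a step either persists its rows as CSV or hands them straight to the caller. The `export_rows` function and its `target` parameter are hypothetical names chosen for illustration, not part of the package.

```python
import csv
import io


def export_rows(rows, target=None):
    """Write rows as CSV to `target`, or return them when no target is given."""
    if target is None:
        return rows  # nothing to persist between steps
    if rows:
        writer = csv.DictWriter(target, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return None


rows = [{"customer_id": 7, "amount": 12.5}]

# No target: the rows are returned directly to the next step.
in_memory = export_rows(rows)

# With a target: the rows are serialised as CSV.
buffer = io.StringIO()  # stands in for a local file or an S3 object body
export_rows(rows, buffer)
csv_text = buffer.getvalue()
```

Accepting any writable text stream as the target keeps the step independent of where the data ends up, whether that is a local file, an in-memory buffer or an upload to external storage.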