Simple data warehouse using S3

acme-dw

Problem

Some LLM-based definitions: A data warehouse is a centralized repository designed for storing, managing, and analyzing structured data from various sources, optimized for query performance and reporting. It typically uses a schema-based approach to organize data in tables and supports complex queries and analytics. In contrast, a data lake is a storage system that holds vast amounts of raw data, both unstructured and structured, in its native format until needed. It is designed for scalability and flexibility, allowing for the storage of diverse data types and enabling advanced analytics, machine learning, and big data processing.

S3 can easily serve as a data lake with little extra functionality. However, to use it as a data warehouse we need to add functionality that largely depends on the needs of a given domain.

Features

  • Provides read/write on schema-less pd.DataFrame/pl.DataFrame
  • Saves pd.DataFrame/pl.DataFrame using parquet format for fast read performance.
  • Standardizes metadata associated with each dataset
  • Supports parquet datasets (datasets spread over multiple parquet files).
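
To illustrate what standardized metadata makes possible, here is a minimal sketch of how metadata fields could be turned into a deterministic S3 object key. The class name DatasetKey, the to_key method, and the key layout are assumptions made for illustration only; they are not part of acme-dw, and the library's actual layout may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetKey:
    """Hypothetical metadata record; NOT the acme_dw DatasetMetadata class."""

    source: str
    name: str
    version: str
    process_id: str
    file_name: str
    file_type: str
    partitions: List[str] = field(default_factory=list)

    def to_key(self) -> str:
        # Assumed layout:
        # source/name/version/process_id/<partitions...>/file_name.file_type
        parts = [self.source, self.name, self.version, self.process_id, *self.partitions]
        for part in parts:
            # Reject empty components and embedded separators so that
            # every dataset maps to exactly one unambiguous key.
            if not part or "/" in part:
                raise ValueError(f"invalid key component: {part!r}")
        return "/".join(parts) + f"/{self.file_name}.{self.file_type}"


key = DatasetKey(
    source="yahoo_finance",
    name="price_history",
    version="v1",
    process_id="fetch_yahoo_data",
    partitions=["minute", "AAPL", "2025"],
    file_name="20250124",
    file_type="parquet",
).to_key()
print(key)
```

Because the key is derived only from the metadata, the same metadata object can be used both to write a dataset and to locate it again later.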

Dev environment

The project comes with a Python development environment. After checking out the repo, first make the setup script executable:

chmod +x create_env.sh

Then, to generate the environment (or update it to the latest version based on the state of uv.lock), run:

./create_env.sh

This will generate a new Python virtual environment under the .venv directory. You can activate it via:

source .venv/bin/activate

If you are using VSCode, select this environment with the Python: Select Interpreter command.
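
If you want to confirm that the environment is actually active, a quick check along these lines may help. It is a generic Python sketch, not something shipped with this project; the helper name in_virtualenv is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print("virtual env active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed should sit under the .venv directory.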

Example usage

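import pandas as pd
from acme_dw import DW, DatasetMetadata

dw = DW()

# Illustrative DataFrame with placeholder values; replace with your own data
df = pd.DataFrame({'close': [1.0, 2.0]})

# Write with DatasetMetadata object
metadata = DatasetMetadata(
    source='yahoo_finance',
    name='price_history',
    version='v1',
    process_id='fetch_yahoo_data',
    partitions=['minute', 'AAPL', '2025'],
    file_name='20250124',
    file_type='parquet'
)
dw.write_df(df, metadata)

# Read the same dataset back using the same metadata
df = dw.read_df(metadata)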
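
For unit tests that should not touch S3, one option is a small in-memory stand-in that mimics the write_df/read_df pair used above. The sketch below is an assumption-laden illustration: the InMemoryDW and Meta names are hypothetical, they are not provided by acme-dw, and the stand-in stores whatever object it is given rather than serializing to parquet.

```python
from typing import Any, Dict, NamedTuple, Tuple


class Meta(NamedTuple):
    """Hypothetical, hashable stand-in for DatasetMetadata."""

    source: str
    name: str
    version: str
    partitions: Tuple[str, ...]
    file_name: str


class InMemoryDW:
    """Hypothetical test double exposing write_df/read_df; not part of acme_dw."""

    def __init__(self) -> None:
        # Maps metadata to the stored object instead of writing to S3.
        self._store: Dict[Meta, Any] = {}

    def write_df(self, df: Any, metadata: Meta) -> None:
        self._store[metadata] = df

    def read_df(self, metadata: Meta) -> Any:
        if metadata not in self._store:
            raise KeyError(f"no dataset stored for {metadata}")
        return self._store[metadata]


fake_dw = InMemoryDW()
meta = Meta("yahoo_finance", "price_history", "v1", ("minute", "AAPL", "2025"), "20250124")
rows = [{"ticker": "AAPL", "close": 1.0}]  # placeholder values, not real prices

fake_dw.write_df(rows, meta)
assert fake_dw.read_df(meta) == rows
```

Code written against write_df and read_df can then be exercised with the stand-in in tests and with the real DW object in production, provided both are passed in rather than constructed inline.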

Project template

This project has been set up with acme-project-create, a Python code template library.

Required setup post use

  • Enable GitHub Pages to be published via GitHub Actions by going to Settings --> Pages --> Source

  • Create a release-pypi environment for GitHub Actions to enable uploads of the library to PyPI

  • Set up authentication to PyPI for the GitHub Action implemented in .github/workflows/release.yml via Trusted Publisher (see the uv publish documentation)

  • Once you create the Python environment for the first time, add the uv.lock file that is created in the project directory to source control, and update it each time the environment is rebuilt

  • To avoid duplicating documentation between docs/docs/index.md and README.md in the project root, make docs/docs/index.md a symlink to README.md. To do this, from the docs/docs directory run:

    ln -sf ../../README.md index.md
