easydw

Utility for simplifying the management of data warehouse dimensions and facts.

easydw is a Python library for building data warehouse loading pipelines with less boilerplate and more consistency.

It provides reusable abstractions for:

  • Connecting to warehouse engines (PostgreSQL and Oracle)
  • Loading dimensions with Slowly Changing Dimension strategies (Type 0, Type 1, Type 2)
  • Loading facts with key validation and upsert workflows
  • Working with Polars DataFrames while persisting through SQLAlchemy

The goal is simple: model dimensions and facts as Python classes, then run predictable and testable load logic.

Why easydw

Data warehouse projects often repeat the same patterns:

  • Select data from source or staging
  • Detect new records and changed records
  • Apply SCD behavior on dimensions
  • Upsert facts safely

easydw centralizes these patterns so project-specific code can focus on data rules and transformations instead of database plumbing.
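The new-versus-changed split at the heart of these patterns can be sketched library-agnostically (plain dicts stand in here for the Polars DataFrames easydw actually works on; `split_records` is an illustrative name, not an easydw function):

```python
def split_records(source_rows, warehouse_rows, key):
    """Split source rows into new records and changed records."""
    existing = {row[key]: row for row in warehouse_rows}
    new_rows, changed_rows = [], []
    for row in source_rows:
        current = existing.get(row[key])
        if current is None:
            new_rows.append(row)        # unknown key: insert
        elif current != row:
            changed_rows.append(row)    # known key, different attributes: update
    return new_rows, changed_rows

source = [
    {"customer_code": "C001", "country": "IT"},   # unchanged
    {"customer_code": "C002", "country": "FR"},   # changed (was ES)
    {"customer_code": "C003", "country": "DE"},   # new
]
warehouse = [
    {"customer_code": "C001", "country": "IT"},
    {"customer_code": "C002", "country": "ES"},
]

new_rows, changed_rows = split_records(source, warehouse, "customer_code")
```

What happens to each bucket then depends on the table type: dimensions apply an SCD strategy, facts are upserted.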

Core concepts

Database abstraction

Database is the core abstraction used by dimensions and facts.

Main operations:

  • connect(): initialize SQLAlchemy engine
  • select(table_name=..., query=..., params=...): return Polars DataFrame
  • insert(df, table_name): bulk insert rows from DataFrame
  • update(df, table_name, keys): batch update by key columns
  • upsert(df, table_name, keys): backend-specific upsert strategy
  • dispose(): close engine and clear metadata cache

Included adapters:

  • PostgresDatabase: native ON CONFLICT DO UPDATE upsert
  • OracleDatabase: update-then-insert upsert emulation
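To make the difference between the two strategies concrete, here is a minimal sketch of the statement shape behind a native PostgreSQL upsert. The builder function and placeholder style are illustrative, not the adapter's actual internals; Oracle has no `ON CONFLICT` clause, which is why its adapter falls back to an update-then-insert emulation:

```python
def on_conflict_sql(table, columns, keys):
    """Illustrative ON CONFLICT DO UPDATE statement builder."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f":{c}" for c in columns)
    # Non-key columns are overwritten from the incoming row on conflict.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c not in keys)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(keys)}) DO UPDATE SET {updates}"
    )

sql = on_conflict_sql(
    "dim_customer",
    ["customer_code", "customer_name"],
    ["customer_code"],
)
```

The emulated variant instead issues an UPDATE filtered on the key columns first, then inserts only the rows that matched nothing.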

Dimension strategies

easydw includes three SCD-oriented base classes:

  • DimensionType0: insert-only, no updates for existing keys
  • DimensionType1: overwrite changed records, no historical versions
  • DimensionType2: close old records and insert new active versions

You implement a concrete dimension by subclassing one of these and providing bind(df), used to attach dimension keys to fact datasets.
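The Type 2 behavior, which is the least obvious of the three, can be sketched library-agnostically. The column names `valid_from`/`valid_to` and the function `scd2_apply` are illustrative, and plain dicts stand in for DataFrame rows:

```python
from datetime import date

def scd2_apply(versions, incoming, key, today):
    """Close the active version when attributes change, insert a new one."""
    out = list(versions)
    active = {v[key]: v for v in out if v["valid_to"] is None}
    for row in incoming:
        current = active.get(row[key])
        if current is None:
            # Never-seen key: first active version.
            out.append({**row, "valid_from": today, "valid_to": None})
        elif any(current[k] != v for k, v in row.items()):
            current["valid_to"] = today  # close the old version
            out.append({**row, "valid_from": today, "valid_to": None})
    return out

history = [{"customer_code": "C001", "country": "IT",
            "valid_from": date(2026, 1, 1), "valid_to": None}]
history = scd2_apply(history, [{"customer_code": "C001", "country": "FR"}],
                     "customer_code", date(2026, 3, 1))
```

Type 1 would overwrite the single existing row instead; Type 0 would do nothing for a known key.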

Fact loading

Fact defines a standard workflow:

  1. Build rows in _prepare_records(from_timestamp, to_timestamp)
  2. Optionally bind dimension keys through _bind_dimensions(df)
  3. Validate uniqueness on declared keys
  4. Upsert into warehouse table
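The four steps above can be sketched as a skeleton class. The method names mirror the workflow as described, but this concrete class is hypothetical, uses plain dict rows, and stubs out the warehouse write:

```python
class FactSketch:
    """Hypothetical skeleton mirroring the four-step fact workflow."""
    keys = ["order_id"]

    def insert(self, from_timestamp, to_timestamp):
        rows = self._prepare_records(from_timestamp, to_timestamp)  # step 1
        rows = self._bind_dimensions(rows)                          # step 2
        self._validate_keys(rows)                                   # step 3
        return self._upsert(rows)                                   # step 4

    def _prepare_records(self, from_timestamp, to_timestamp):
        return [{"order_id": 1001, "amount": 120.0},
                {"order_id": 1002, "amount": 75.5}]

    def _bind_dimensions(self, rows):
        return rows  # no dimensions in this sketch

    def _validate_keys(self, rows):
        seen = set()
        for row in rows:
            key = tuple(row[c] for c in self.keys)
            if key in seen:
                raise ValueError(f"duplicate fact key: {key}")
            seen.add(key)

    def _upsert(self, rows):
        return len(rows)  # stand-in for the warehouse write

loaded = FactSketch().insert(None, None)
```

Failing fast on duplicate keys in step 3 keeps a bad extraction from silently corrupting the fact table in step 4.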

Quickstart

This example shows a minimal pipeline using PostgreSQL, one Type 1 dimension, and one fact table.

1) Install

pip install easydw

2) Create a database connection

from easydw.database import PostgresDatabase

dwh = PostgresDatabase(
	name="warehouse",
	params={
		"user": "dw_user",
		"password": "dw_password",
		"host": "localhost",
		"port": 5432,
		"database": "analytics",
		"schema": "public",
	},
)

dwh.connect()

3) Implement a dimension

import polars as pl

from easydw.dimension import DimensionType1


class CustomerDimension(DimensionType1):
	def __init__(self, dwh):
		super().__init__(name="dim_customer", dwh=dwh)

	def bind(self, df: pl.DataFrame) -> pl.DataFrame:
		# Join business key with dimension surrogate key.
		dim_df = self.extract().select(["customer_code", "id"])
		return df.join(dim_df, on="customer_code", how="left")

Load dimension records:

incoming_customers = pl.DataFrame(
	{
		"customer_code": ["C001", "C002"],
		"customer_name": ["Alice", "Bob"],
		"country": ["IT", "FR"],
	}
)

customer_dim = CustomerDimension(dwh)
customer_dim.insert(incoming_customers, keys=["customer_code"])

4) Implement a fact

from datetime import datetime

import polars as pl

from easydw.fact import Fact


class SalesFact(Fact):
	keys = ["order_id"]
	dimensions = [CustomerDimension]

	def __init__(self, dwh):
		super().__init__(name="fct_sales", dwh=dwh)

	def _prepare_records(
		self, from_timestamp: datetime, to_timestamp: datetime
	) -> pl.DataFrame:
		# Replace this with your source extraction logic.
		raw = pl.DataFrame(
			{
				"order_id": [1001, 1002],
				"customer_code": ["C001", "C002"],
				"amount": [120.0, 75.5],
			}
		)

		return self._bind_dimensions(raw)

Load fact rows:

sales_fact = SalesFact(dwh)
sales_fact.insert(
	from_timestamp=datetime(2026, 3, 1),
	to_timestamp=datetime(2026, 3, 20),
)

dwh.dispose()

Development

Build package

python -m build
cp dist/*.whl /your/desired/location
pip install /your/desired/location/easydw-*.whl

Run tests

pytest

Requirements

  • Python 3.10+
  • SQLAlchemy 2.x
  • Polars 1.x
  • psycopg2 (for PostgreSQL usage)
  • oracledb (for Oracle usage)

