
๐Ÿฆ†๐Ÿ‹ Duwhal

High-Performance Bipartite Interaction Graph Engine, powered by DuckDB.

Tests · Coverage · Python ≥ 3.9 · License: MIT


Duwhal treats your data as a bipartite graph of Contexts (orders, sessions, sentences, patients) connected to Entities (products, genes, tokens, games), and gives you a complete toolkit to mine patterns, generate recommendations, and detect stable communities, all at the speed of DuckDB.



Why Duwhal?

Most recommendation and pattern-mining libraries either:

  • โŒ Require you to build a matrix first (memory bottleneck), or
  • โŒ Are designed for a single domain (e-commerce only), or
  • โŒ Don't give you an explanation for why an item was recommended.

Duwhal does things differently:

| Feature                   | Duwhal                                                         |
|---------------------------|----------------------------------------------------------------|
| Ingestion format          | Parquet, CSV, Pandas, Polars, Arrow (zero-copy via DuckDB)     |
| Recommendation strategies | Rules, ItemCF, Graph Path Integral, Popularity                 |
| Explainability            | Every recommendation includes the path that generated it       |
| Domain agnosticism        | Retail, Genomics, NLP, Music, Social, all through the same API |
| Community detection       | Tarjan SCC to find "equilibrium" communities and filter bubbles |
| Scale                     | 100k+ transactions in seconds; on-disk DuckDB for larger data  |

Core Concepts

Duwhal models interactions as a bipartite graph:

Context₁ ─── Entity_A
Context₁ ─── Entity_B
Context₂ ─── Entity_A
Context₂ ─── Entity_C

From this, it projects a unipartite co-occurrence graph where edges carry probabilistic weights derived from interaction frequency. Recommendations are then paths through this graph.
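The projection can be sketched in a few lines of plain Python. Conditional co-occurrence probability is one natural reading of "probabilistic weights"; Duwhal performs this projection in DuckDB SQL, and its exact weighting scheme may differ:

```python
from collections import defaultdict
from itertools import combinations

# Bipartite edges from the diagram above: each context maps to its entities.
interactions = {
    "Context1": {"Entity_A", "Entity_B"},
    "Context2": {"Entity_A", "Entity_C"},
}

# Unipartite projection: count contexts per entity and per entity pair.
occur = defaultdict(int)
cooccur = defaultdict(int)
for entities in interactions.values():
    for e in entities:
        occur[e] += 1
    for a, b in combinations(sorted(entities), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

# Directed edge weight as a conditional probability: P(b | a) = n(a, b) / n(a).
weights = {(a, b): n / occur[a] for (a, b), n in cooccur.items()}
print(weights[("Entity_B", "Entity_A")])  # 1.0: Entity_A is in every context containing Entity_B
print(weights[("Entity_A", "Entity_B")])  # 0.5: half of Entity_A's contexts contain Entity_B
```

A recommendation for a seed entity is then a walk along high-weight edges, and the sequence of hops is what makes the result explainable.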

This model is universal:

| Domain   | Context             | Entity          |
|----------|---------------------|-----------------|
| Retail   | Order               | Product         |
| Genomics | Patient / Sample    | Gene / Mutation |
| Music    | Playlist            | Song            |
| NLP      | Sentence / Document | Token / Concept |
| Social   | User Session        | Content Item    |

Installation

pip install duwhal

With uv (recommended):

uv add duwhal

Optional extras:

pip install "duwhal[pandas]"   # Pandas support
pip install "duwhal[polars]"   # Polars support

Quick Start

from duwhal import Duwhal
from duwhal.datasets import generate_retail_transactions

df = generate_retail_transactions()

with Duwhal() as db:
    # 1. Load your interactions
    db.load_interactions(df, set_col="order_id", node_col="item_name")

    # 2. Mine Association Rules
    rules = db.association_rules(min_support=0.2, min_confidence=0.5)
    print(rules.to_pandas().head())

    # 3. Recommend based on rules
    recs = db.recommend(["Pasta"], strategy="rules", n=3)
    print(recs.column("recommended_item").to_pylist())
    # → ['Tomato Sauce', 'Parmesan', ...]

    # 4. Or use Graph Path Integral for multi-hop discovery
    recs_graph = db.recommend(["iPhone 15"], strategy="graph", n=3)
    print(recs_graph.to_pandas()[["recommended_item", "reason"]])
    # Shows the discovery path for each recommendation

API Overview

Duwhal (main engine)

The unified entry point for all operations.

from duwhal import Duwhal

db = Duwhal()                          # in-memory (default)
db = Duwhal(database="store.duckdb")  # persistent

Loading Data

# From a DataFrame (Pandas or Polars)
db.load_interactions(df, set_col="order_id", node_col="item_id")

# From a Parquet file (zero-copy via DuckDB)
db.load_interactions("transactions.parquet", set_col="order_id", node_col="item_id")

# With a sort column for sequential mining
db.load_interactions(df, set_col="order_id", node_col="item_id", sort_col="timestamp")

# From an interaction matrix (rows = contexts, columns = items)
db.load_interaction_matrix(matrix_df)

Mining

# Frequent Itemsets
itemsets = db.frequent_itemsets(min_support=0.3)

# Association Rules
rules = db.association_rules(min_support=0.1, min_confidence=0.5, min_lift=1.2)

# Sequential Patterns (requires a timestamp column)
patterns = db.sequential_patterns(timestamp_col="ts", min_support=0.05, max_gap=1)

Recommendation Strategies

| Strategy    | Method                        | Best For                                          |
|-------------|-------------------------------|---------------------------------------------------|
| `"rules"`   | Association Rules             | High-confidence, interpretable                    |
| `"cf"`      | Item Collaborative Filtering  | Similarity-based ("users who liked X also liked Y") |
| `"graph"`   | Path Integral traversal       | Multi-hop discovery, sparse data                  |
| `"popular"` | Global / windowed popularity  | Cold-start, trending                              |
| `"auto"`    | Picks the best available      | General use                                       |

# Train models
db.association_rules(min_support=0.1, min_confidence=0.5)
db.fit_cf(metric="jaccard", min_cooccurrence=2)
db.fit_graph(alpha=0.1)
db.fit_popularity(strategy="global")

# Recommend
recs = db.recommend(["item_a"], strategy="cf", n=5)
recs = db.recommend(["item_a"], strategy="graph", scoring="probability", n=5)

# Score a basket's internal cohesion
score = db.score_basket(["Beer", "Diaper"])  # โ†’ float
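For intuition on the `"cf"` strategy, this is what Jaccard item-item similarity computes, shown in plain Python over hypothetical toy data (an illustrative sketch, not Duwhal's implementation):

```python
def jaccard(contexts_a: set, contexts_b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the contexts containing each item."""
    union = len(contexts_a | contexts_b)
    return len(contexts_a & contexts_b) / union if union else 0.0

# Hypothetical toy data: item -> set of context ids in which it appears.
item_contexts = {
    "Beer":   {1, 2, 3, 4},
    "Diaper": {1, 2, 3},
    "Milk":   {3, 5},
}

def recommend_cf(seed: str, n: int = 5) -> list:
    """Rank all other items by Jaccard similarity to the seed item."""
    scores = {
        other: jaccard(item_contexts[seed], ctxs)
        for other, ctxs in item_contexts.items()
        if other != seed
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend_cf("Beer", n=2))  # ['Diaper', 'Milk']
```

Jaccard rewards items whose context sets overlap heavily relative to their combined size, which is why the `min_cooccurrence` floor matters: with very few shared contexts the ratio becomes noisy.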

Sink SCC Detection

Identifies self-sustaining communities, i.e. nodes that collectively reinforce each other, by running Tarjan's algorithm over the probabilistic co-occurrence graph:

sccs = db.find_sink_sccs(min_cooccurrence=5, min_confidence=0.1)
# Returns: node, scc_id, scc_size, is_sink, members
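For readers unfamiliar with the algorithm, a generic Tarjan SCC sketch with a sink check looks like this (illustrative pure Python on a toy directed graph; Duwhal runs the analysis over the weighted co-occurrence graph inside DuckDB):

```python
def tarjan_sccs(graph: dict) -> list:
    """Tarjan's algorithm: strongly connected components of a directed graph.

    graph maps node -> iterable of successors; returns a list of SCCs.
    """
    index, lowlink, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of an SCC: pop its members
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in list(graph):
        if v not in index:
            strongconnect(v)
    return sccs

def is_sink(scc: list, graph: dict) -> bool:
    """A sink SCC has no edge leaving it: walks that enter never escape."""
    members = set(scc)
    return all(w in members for v in scc for w in graph.get(v, ()))

# Toy graph: A and B reinforce each other; C only feeds into them.
g = {"A": ["B"], "B": ["A"], "C": ["A"]}
for scc in tarjan_sccs(g):
    print(sorted(scc), "sink" if is_sink(scc, g) else "transient")
```

A sink SCC is the graph-theoretic picture of a filter bubble: once a recommendation walk enters the community, every outgoing edge points back inside it.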

InteractionGraph (graph interface)

A higher-level, node-centric API for graph analysis tasks.

from duwhal import InteractionGraph

with InteractionGraph() as graph:
    graph.load_interactions(df, context_col="user_id", node_col="game_title")
    graph.build_topology(min_interactions=2)

    # Multi-hop proximity ranking from seed nodes
    results = graph.rank_nodes(["Mario"], steps=3, scoring="probability", limit=5)
    # Returns: node, score, steps, reason (path)

    # Detect Filter Bubbles / Equilibrium Communities
    communities = graph.find_equilibrium_communities(min_cooccurrence=5, min_confidence=0.1)

Built-in Datasets

Duwhal ships with synthetic generators for every domain, featuring known ground-truth patterns so you can validate algorithms instantly:

from duwhal.datasets import (
    generate_retail_transactions,    # iPhone → Silicone Case, Pasta → Tomato Sauce
    generate_benchmark_patterns,     # Beer & Diaper (100% co-occurrence), Milk/Bread/Butter
    generate_playlist_data,          # Rock cluster ↔ Jazz cluster with bridge
    generate_genomics_data,          # BRCA1 ↔ TP53 co-mutation signal
    generate_nlp_corpus,             # Tech cluster ↔ Economy cluster with bridge sentence
    generate_filter_bubble_data,     # Retro Gaming sink & Modern FPS sink with transient bridge
    generate_large_scale_data,       # Power-law 100k+ transactions for benchmarking
    generate_3scc_dataset,           # Controlled 3-SCC graph for path-integral research
)

Each generator returns a pd.DataFrame with documented columns and an optional seed for reproducibility.


Use Cases

Explore the examples/use_cases/ directory:

| Example                     | Domain   | Key Technique                                        |
|-----------------------------|----------|------------------------------------------------------|
| retail_market_basket.py     | Retail   | Association Rules + Sequential Patterns              |
| benchmarking_models.py      | Any      | Model comparison: Rules vs CF vs Graph vs Popularity |
| genomics_trajectories.py    | Genomics | Graph Path Integral over gene co-mutation data       |
| nlp_token_cooccurrence.py   | NLP      | Token proximity + sequential n-gram discovery        |
| media_playlist_discovery.py | Music    | Multi-hop cross-genre discovery                      |
| ecosystem_equilibrium.py    | Social   | Sink SCC detection for filter bubble analysis        |
| evaluation_scaling.py       | Any      | Large-scale ingestion + benchmarking on 100k+ rows   |

Evaluation Toolkit

from duwhal.evaluation import temporal_split, random_split, evaluate_recommendations

# Split interactions temporally (respects time ordering)
train, test = temporal_split(df, test_fraction=0.2, timestamp_col="ts")

# Or randomly
train, test = random_split(df, test_fraction=0.2, seed=42)

# Evaluate recommendations
metrics = evaluate_recommendations(model_recs, ground_truth, k=10)
# Returns: precision@k, recall@k, MAP@k
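These metrics follow the standard top-k definitions, sketched below in plain Python (illustrative definitions, not Duwhal's code; `recs` is a ranked list and `truth` the held-out relevant set):

```python
def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of the relevant items captured in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)

def average_precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """AP@k: mean of precision at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

recs = ["a", "b", "c", "d"]   # ranked recommendations
truth = {"a", "c"}            # held-out ground truth
print(precision_at_k(recs, truth, 4))  # 0.5
print(recall_at_k(recs, truth, 4))     # 1.0
print(average_precision_at_k(recs, truth, 4))
```

MAP@k is the mean of AP@k across all evaluated contexts; because AP@k weights hits by rank, it rewards models that place relevant items early in the list.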

Architecture

duwhal/
├── api.py                  ← Duwhal: unified engine facade
├── graph.py                ← InteractionGraph: node-centric interface
├── core/
│   ├── connection.py       ← DuckDB connection management
│   └── ingestion.py        ← Multi-format data loading (Parquet, DF, Arrow)
├── mining/
│   ├── frequent_itemsets.py
│   ├── association_rules.py
│   ├── sequences.py        ← Sequential pattern mining
│   └── sink_sccs.py        ← Tarjan SCC + sink identification
├── recommenders/
│   ├── graph.py            ← Path Integral traversal
│   ├── item_cf.py          ← ItemCF (Jaccard / Cosine / Lift)
│   └── popularity.py       ← Global + time-windowed popularity
├── evaluation/
│   ├── metrics.py          ← Precision, Recall, MAP
│   └── splitting.py        ← Temporal and random splits
└── datasets/               ← Synthetic generators for 7 domains

License

MIT © Duwhal Contributors

