
High-Performance Bipartite Interaction Graph Engine powered by DuckDB

🦆🐋 Duwhal

High-Performance Bipartite Interaction Graph Engine, powered by DuckDB.

Python ≥ 3.9 | License: MIT


Duwhal treats your data as a bipartite graph, with Contexts (orders, sessions, sentences, patients) connected to Entities (products, genes, tokens, games), and gives you a complete toolkit to mine patterns, generate recommendations, and detect stable communities, all with the speed of DuckDB.


Why Duwhal?

Most recommendation and pattern-mining libraries either:

  • ❌ Require you to build a matrix first (memory bottleneck), or
  • ❌ Are designed for a single domain (e-commerce only), or
  • ❌ Don't give you an explanation for why an item was recommended.

Duwhal does things differently:

| Feature | Duwhal |
|---|---|
| Ingestion format | Parquet, CSV, Pandas, Polars, Arrow (zero-copy via DuckDB) |
| Recommendation strategies | Rules, ItemCF, Graph Path Integral, Popularity |
| Explainability | Every recommendation includes the path that generated it |
| Domain agnosticism | Retail, Genomics, NLP, Music, Social: all the same API |
| Community detection | Tarjan SCC to find "equilibrium" communities & filter bubbles |
| Scale | 100k+ transactions in seconds, on-disk DuckDB for larger data |

Core Concepts

Duwhal models interactions as a bipartite graph:

Context₁ ─── Entity_A
Context₁ ─── Entity_B
Context₂ ─── Entity_A
Context₂ ─── Entity_C

From this, it projects a unipartite co-occurrence graph where edges carry probabilistic weights derived from interaction frequency. Recommendations are then paths through this graph.
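The projection step can be illustrated with a minimal pure-Python sketch. It does not use Duwhal or DuckDB, and the weighting is an assumption: the text only says the weights are "derived from interaction frequency", so the edge a -> b is estimated here as the share of contexts containing a that also contain b. The function name `project_cooccurrence` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy interactions mirroring the diagram above: (context, entity) pairs.
interactions = [
    ("ctx1", "A"), ("ctx1", "B"),
    ("ctx2", "A"), ("ctx2", "C"),
]

def project_cooccurrence(pairs):
    """Project (context, entity) pairs onto a directed entity-entity graph.

    Assumed weighting (not confirmed by the library): the weight of a -> b
    is the fraction of contexts containing a that also contain b.
    """
    # Group entities by the context they appear in.
    contexts = {}
    for ctx, entity in pairs:
        contexts.setdefault(ctx, set()).add(entity)

    entity_count = Counter()  # number of contexts containing each entity
    pair_count = Counter()    # number of contexts containing both entities
    for members in contexts.values():
        entity_count.update(members)
        for a, b in combinations(sorted(members), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1

    return {(a, b): n / entity_count[a] for (a, b), n in pair_count.items()}

weights = project_cooccurrence(interactions)
```

With the four interactions from the diagram, A co-occurs with B in one of its two contexts, so the edge A -> B gets weight 0.5 while B -> A gets 1.0. B and C never share a context, so no edge connects them directly.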

This model is universal:

| Domain | Context | Entity |
|---|---|---|
| Retail | Order | Product |
| Genomics | Patient / Sample | Gene / Mutation |
| Music | Playlist | Song |
| NLP | Sentence / Document | Token / Concept |
| Social | User Session | Content Item |

Installation

pip install duwhal

With uv (recommended):

uv add duwhal

Quick Start

from duwhal import Duwhal
from duwhal.datasets import generate_retail_transactions

df = generate_retail_transactions()

with Duwhal() as db:
    # 1. Load your interactions
    db.load_interactions(df, set_col="order_id", node_col="item_name")

    # 2. Mine Association Rules
    rules = db.association_rules(min_support=0.2, min_confidence=0.5)
    print(rules.to_pandas().head())

    # 3. Recommend based on rules
    recs = db.recommend(["Pasta"], strategy="rules", n=3)
    print(recs.column("recommended_item").to_pylist())
    # → ['Tomato Sauce', 'Parmesan', ...]

    # 4. Or use Graph Path Integral for multi-hop discovery
    recs_graph = db.recommend(["iPhone 15"], strategy="graph", n=3)
    print(recs_graph.to_pandas()[["recommended_item", "reason"]])
    # Shows the discovery path for each recommendation

API Overview

Duwhal (main engine)

The unified entry point for all operations.

from duwhal import Duwhal

db = Duwhal()                          # in-memory (default)
db = Duwhal(database="store.duckdb")  # persistent

Loading Data

# From a DataFrame (Pandas or Polars)
db.load_interactions(df, set_col="order_id", node_col="item_id")

# From a Parquet file (zero-copy via DuckDB)
db.load_interactions("transactions.parquet", set_col="order_id", node_col="item_id")

# With a sort column for sequential mining
db.load_interactions(df, set_col="order_id", node_col="item_id", sort_col="timestamp")

# From an interaction matrix (rows = contexts, columns = items)
db.load_interaction_matrix(matrix_df)

Mining

# Frequent Itemsets
itemsets = db.frequent_itemsets(min_support=0.3)

# Association Rules
rules = db.association_rules(min_support=0.1, min_confidence=0.5, min_lift=1.2)

# Sequential Patterns (requires a timestamp column)
patterns = db.sequential_patterns(timestamp_col="ts", min_support=0.05, max_gap=1)
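For readers unfamiliar with the `min_support`, `min_confidence`, and `min_lift` thresholds, here is a small sketch of the standard definitions, restricted to single-item rules. It is plain Python and independent of Duwhal; the function `pairwise_rules` and the toy baskets are hypothetical and say nothing about how the library computes rules internally.

```python
from itertools import combinations

# Hypothetical toy baskets (one set of items per context).
baskets = [
    {"Beer", "Diaper"},
    {"Beer", "Diaper", "Milk"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
]

def pairwise_rules(baskets, min_support=0.0, min_confidence=0.0, min_lift=0.0):
    """Enumerate single-item rules A -> B with the textbook metrics."""
    n = len(baskets)

    def support(items):
        # Fraction of baskets that contain every item in `items`.
        return sum(1 for basket in baskets if items <= basket) / n

    rules = []
    for a, b in combinations(sorted(set().union(*baskets)), 2):
        for antecedent, consequent in ((a, b), (b, a)):
            sup = support({antecedent, consequent})
            if sup == 0 or sup < min_support:
                continue
            confidence = sup / support({antecedent})  # P(consequent | antecedent)
            lift = confidence / support({consequent})  # > 1: positive association
            if confidence >= min_confidence and lift >= min_lift:
                rules.append({
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": sup,
                    "confidence": confidence,
                    "lift": lift,
                })
    return rules

rules = pairwise_rules(baskets, min_support=0.5, min_confidence=0.5, min_lift=1.2)
```

On this data only the Beer/Diaper pair clears the thresholds: it appears in half the baskets (support 0.5), every Beer basket contains Diaper (confidence 1.0), and the lift is 1.0 / 0.5 = 2.0.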

Recommendation Strategies

| Strategy | Method | Best For |
|---|---|---|
| "rules" | Association Rules | High-confidence, interpretable |
| "cf" | Item Collaborative Filtering | Similarity-based ("users who liked X also liked Y") |
| "graph" | Path Integral traversal | Multi-hop discovery, sparse data |
| "popular" | Global / windowed popularity | Cold-start, trending |
| "auto" | Picks the best available | General use |

# Train models
db.association_rules(min_support=0.1, min_confidence=0.5)
db.fit_cf(metric="jaccard", min_cooccurrence=2)
db.fit_graph(alpha=0.1)
db.fit_popularity(strategy="global")

# Recommend
recs = db.recommend(["item_a"], strategy="cf", n=5)
recs = db.recommend(["item_a"], strategy="graph", scoring="probability", n=5)

# Score a basket's internal cohesion
score = db.score_basket(["Beer", "Diaper"])  # → float
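To make the "cf" strategy more concrete, the following sketch shows item-to-item Jaccard similarity with a co-occurrence floor, which is one plausible reading of `fit_cf(metric="jaccard", min_cooccurrence=2)`. It is plain Python, and the helper names `jaccard_similarities` and `recommend_cf` are hypothetical; Duwhal's actual implementation may differ.

```python
from collections import defaultdict
from itertools import combinations

def jaccard_similarities(baskets, min_cooccurrence=1):
    """Jaccard similarity between items, based on the contexts they share."""
    item_contexts = defaultdict(set)
    for idx, basket in enumerate(baskets):
        for item in basket:
            item_contexts[item].add(idx)

    sims = {}
    for a, b in combinations(sorted(item_contexts), 2):
        shared = item_contexts[a] & item_contexts[b]
        if len(shared) < min_cooccurrence:
            continue  # too little evidence to trust this pair
        union = item_contexts[a] | item_contexts[b]
        sims[(a, b)] = sims[(b, a)] = len(shared) / len(union)
    return sims

def recommend_cf(seed_items, sims, n=5):
    """Rank unseen items by their summed similarity to the seed items."""
    seeds = set(seed_items)
    scores = defaultdict(float)
    for (a, b), sim in sims.items():
        if a in seeds and b not in seeds:
            scores[b] += sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

baskets = [
    {"Beer", "Diaper"},
    {"Beer", "Diaper", "Milk"},
    {"Beer", "Milk"},
    {"Bread", "Milk"},
]
sims = jaccard_similarities(baskets, min_cooccurrence=2)
recs = recommend_cf(["Beer"], sims, n=5)
```

Here Beer and Diaper share two of three contexts (similarity 2/3) and Beer and Milk share two of four (0.5), so Diaper ranks ahead of Milk. Bread is dropped because it co-occurs with other items only once.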

Sink SCC Detection

Identifies self-sustaining communities, that is, groups of nodes that collectively reinforce each other, by running Tarjan's algorithm over the probabilistic co-occurrence graph:

sccs = db.find_sink_sccs(min_cooccurrence=5, min_confidence=0.1)
# Returns: node, scc_id, scc_size, is_sink, members
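As an illustration of the idea, the sketch below runs Tarjan's algorithm on a small directed graph and keeps only the "sink" components, meaning those with no edge leaving the component. It is a plain-Python teaching sketch: the adjacency-dict input and the function names are assumptions, and it ignores the `min_cooccurrence` and `min_confidence` filtering that the library applies when building edges.

```python
def tarjan_sccs(graph):
    """Return the strongly connected components of {node: [successors]}."""
    index_of, lowlink = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = 0

    def visit(node):
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index_of:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index_of[succ])
        if lowlink[node] == index_of[node]:
            # `node` is the root of a component: pop its members off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            sccs.append(component)

    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    for node in sorted(nodes):
        if node not in index_of:
            visit(node)
    return sccs

def sink_sccs(graph):
    """Keep only components whose edges all stay inside the component."""
    return [
        comp for comp in tarjan_sccs(graph)
        if all(succ in comp for node in comp for succ in graph.get(node, ()))
    ]

# Two self-reinforcing clusters joined by a transient "Bridge" node.
graph = {
    "Pong": ["Tetris"], "Tetris": ["Pong"],
    "Bridge": ["Pong", "Doom"],
    "Doom": ["Quake"], "Quake": ["Doom"],
}
sinks = sink_sccs(graph)
```

The Bridge node forms its own component but has outgoing edges, so it is not a sink; the two clusters it points into are. The recursive form is fine for small graphs, while large graphs would need an iterative variant to avoid Python's recursion limit.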

InteractionGraph (graph interface)

A higher-level, node-centric API for graph analysis tasks.

from duwhal import InteractionGraph

with InteractionGraph() as graph:
    graph.load_interactions(df, context_col="user_id", node_col="game_title")
    graph.build_topology(min_interactions=2)

    # Multi-hop proximity ranking from seed nodes
    results = graph.rank_nodes(["Mario"], steps=3, scoring="probability", limit=5)
    # Returns: node, score, steps, reason (path)

    # Detect Filter Bubbles / Equilibrium Communities
    communities = graph.find_equilibrium_communities(min_cooccurrence=5, min_confidence=0.1)
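One way to picture multi-hop ranking with `scoring="probability"` is to push probability mass outward from the seeds along weighted edges, sum the mass that reaches each node within `steps` hops, and remember the strongest single path as the "reason". The sketch below does exactly that in plain Python. It is an assumption about what "path integral" means here, not the library's algorithm, and the nested-dict `weights` input is a hypothetical format.

```python
def rank_nodes(weights, seeds, steps=3, limit=5):
    """Rank nodes by the summed probability of walks of up to `steps` hops.

    `weights` maps node -> {successor: transition probability}.
    Returns (node, score, reason) tuples, where reason is the strongest path.
    """
    seeds = set(seeds)
    frontier = {s: 1.0 for s in seeds}        # mass of walks ending at each node
    paths = {s: (1.0, [s]) for s in seeds}    # strongest walk ending at each node
    scores, best = {}, {}

    for _ in range(steps):
        next_frontier, next_paths = {}, {}
        for node, mass in frontier.items():
            prob, path = paths[node]
            for succ, w in weights.get(node, {}).items():
                next_frontier[succ] = next_frontier.get(succ, 0.0) + mass * w
                candidate = (prob * w, path + [succ])
                if succ not in next_paths or candidate[0] > next_paths[succ][0]:
                    next_paths[succ] = candidate
        for node, mass in next_frontier.items():
            if node in seeds:
                continue  # never recommend the seeds themselves
            scores[node] = scores.get(node, 0.0) + mass
            if node not in best or next_paths[node][0] > best[node][0]:
                best[node] = next_paths[node]
        frontier, paths = next_frontier, next_paths

    ranked = sorted(scores, key=lambda n: (-scores[n], n))[:limit]
    return [(n, scores[n], " -> ".join(best[n][1])) for n in ranked]

weights = {
    "Mario": {"Zelda": 0.5, "Sonic": 0.25},
    "Zelda": {"Metroid": 0.5},
    "Sonic": {"Metroid": 0.5},
}
results = rank_nodes(weights, ["Mario"], steps=2, limit=5)
```

Metroid is not directly connected to Mario, yet it outranks Sonic because two separate two-hop paths contribute to its score (0.25 via Zelda plus 0.125 via Sonic). This is the kind of multi-hop discovery the "graph" strategy is described as providing.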

Built-in Datasets

Duwhal ships with synthetic generators for every domain, featuring known ground-truth patterns so you can validate algorithms instantly:

from duwhal.datasets import (
    generate_retail_transactions,    # iPhone → Silicone Case, Pasta → Tomato Sauce
    generate_benchmark_patterns,     # Beer & Diaper (100% co-occurrence), Milk/Bread/Butter
    generate_playlist_data,          # Rock cluster ↔ Jazz cluster with bridge
    generate_genomics_data,          # BRCA1 ↔ TP53 co-mutation signal
    generate_nlp_corpus,             # Tech cluster ↔ Economy cluster with bridge sentence
    generate_filter_bubble_data,     # Retro Gaming sink & Modern FPS sink with transient bridge
    generate_large_scale_data,       # Power-law 100k+ transactions for benchmarking
    generate_3scc_dataset,           # Controlled 3-SCC graph for path-integral research
)

Each generator returns a pd.DataFrame with documented columns and accepts an optional seed for reproducibility.


Use Cases

Explore the examples/use_cases/ directory:

| Example | Domain | Key Technique |
|---|---|---|
| retail_market_basket.py | Retail | Association Rules + Sequential Patterns |
| benchmarking_models.py | Any | Model comparison: Rules vs CF vs Graph vs Popularity |
| genomics_trajectories.py | Genomics | Graph Path Integral over gene co-mutation data |
| nlp_token_cooccurrence.py | NLP | Token proximity + sequential n-gram discovery |
| media_playlist_discovery.py | Music | Multi-hop cross-genre discovery |
| ecosystem_equilibrium.py | Social | Sink SCC detection for filter bubble analysis |
| evaluation_scaling.py | Any | Large-scale ingestion + benchmarking on 100k+ rows |

Evaluation Toolkit

from duwhal.evaluation import temporal_split, random_split, evaluate_recommendations

# Split interactions temporally (respects time ordering)
train, test = temporal_split(df, test_fraction=0.2, timestamp_col="ts")

# Or randomly
train, test = random_split(df, test_fraction=0.2, seed=42)

# Evaluate recommendations
metrics = evaluate_recommendations(model_recs, ground_truth, k=10)
# Returns: precision@k, recall@k, MAP@k
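For reference, the three metrics can be computed for a single user as in the sketch below. Conventions vary between libraries, so treat the choices here as assumptions: precision divides by k, and average precision is normalised by min(number of relevant items, k). MAP@k is then the mean of the average-precision values across users. The function name is hypothetical and this is not Duwhal's implementation.

```python
def precision_recall_ap_at_k(recommended, relevant, k=10):
    """Precision@k, recall@k and average precision@k for one user."""
    relevant = set(relevant)
    if not relevant:
        return 0.0, 0.0, 0.0

    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank

    precision = hits / k
    recall = hits / len(relevant)
    average_precision = precision_sum / min(len(relevant), k)
    return precision, recall, average_precision

precision, recall, ap = precision_recall_ap_at_k(
    recommended=["a", "b", "c", "d"],
    relevant={"a", "c", "x"},
    k=4,
)
```

In this example two of the four recommendations are relevant (precision 0.5), two of the three relevant items are found (recall 2/3), and the hits at ranks 1 and 3 give an average precision of (1 + 2/3) / 3.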

Architecture

duwhal/
├── api.py                  ← Duwhal: unified engine facade
├── graph.py                ← InteractionGraph: node-centric interface
├── core/
│   ├── connection.py       ← DuckDB connection management
│   └── ingestion.py        ← Multi-format data loading (Parquet, DF, Arrow)
├── mining/
│   ├── frequent_itemsets.py
│   ├── association_rules.py
│   ├── sequences.py        ← Sequential pattern mining
│   └── sink_sccs.py        ← Tarjan SCC + sink identification
├── recommenders/
│   ├── graph.py            ← Path Integral traversal
│   ├── item_cf.py          ← ItemCF (Jaccard / Cosine / Lift)
│   └── popularity.py       ← Global + time-windowed popularity
├── evaluation/
│   ├── metrics.py          ← Precision, Recall, MAP
│   └── splitting.py        ← Temporal and random splits
└── datasets/               ← Synthetic generators for 7 domains

License

MIT © Duwhal Contributors
