High-Performance Bipartite Interaction Graph Engine powered by DuckDB
Duwhal
High-Performance Bipartite Interaction Graph Engine, powered by DuckDB.
Duwhal treats your data as a bipartite graph, with Contexts (orders, sessions, sentences, patients) connected to Entities (products, genes, tokens, games), and gives you a complete toolkit to mine patterns, generate recommendations, and detect stable communities, all with the speed of DuckDB.
Why Duwhal?
Most recommendation and pattern-mining libraries either:
- Require you to build a matrix first (memory bottleneck), or
- Are designed for a single domain (e-commerce only), or
- Don't give you an explanation for why an item was recommended.
Duwhal does things differently:
| Feature | Duwhal |
|---|---|
| Ingestion format | Parquet, CSV, Pandas, Polars, Arrow (zero-copy via DuckDB) |
| Recommendation strategies | Rules, ItemCF, Graph Path Integral, Popularity |
| Explainability | Every recommendation includes the path that generated it |
| Domain agnosticism | Retail, Genomics, NLP, Music, Social (all the same API) |
| Community detection | Tarjan SCC to find "equilibrium" communities & filter bubbles |
| Scale | 100k+ transactions in seconds, on-disk DuckDB for larger data |
Core Concepts
Duwhal models interactions as a bipartite graph:
Context_1 --> Entity_A
Context_1 --> Entity_B
Context_2 --> Entity_A
Context_2 --> Entity_C
From this, it projects a unipartite co-occurrence graph where edges carry probabilistic weights derived from interaction frequency. Recommendations are then paths through this graph.
This model is universal:
| Domain | Context | Entity |
|---|---|---|
| Retail | Order | Product |
| Genomics | Patient / Sample | Gene / Mutation |
| Music | Playlist | Song |
| NLP | Sentence / Document | Token / Concept |
| Social | User Session | Content Item |
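The projection from the bipartite graph to a co-occurrence graph can be sketched in plain Python. This is a minimal illustration, not Duwhal's implementation: the function name `project_cooccurrence` is hypothetical, and weighting the edge a -> b by the share of a's contexts that also contain b is an assumption about what "probabilistic weights derived from interaction frequency" means.

```python
from collections import defaultdict
from itertools import permutations


def project_cooccurrence(interactions):
    """Project (context, entity) pairs onto a directed, weighted entity graph.

    Assumption: the weight of edge a -> b is the fraction of contexts
    containing a that also contain b, i.e. an estimate of P(b | a).
    """
    # Group entities by the context they appear in.
    contexts = defaultdict(set)
    for context, entity in interactions:
        contexts[context].add(entity)

    entity_count = defaultdict(int)  # number of contexts containing each entity
    pair_count = defaultdict(int)    # number of contexts containing both a and b
    for entities in contexts.values():
        for entity in entities:
            entity_count[entity] += 1
        for a, b in permutations(sorted(entities), 2):
            pair_count[(a, b)] += 1

    return {(a, b): n / entity_count[a] for (a, b), n in pair_count.items()}


interactions = [
    ("order_1", "A"), ("order_1", "B"),
    ("order_2", "A"), ("order_2", "C"),
]
weights = project_cooccurrence(interactions)
```

In this toy input, B and C never share a context, so no edge connects them directly; they are only reachable from each other through A, which is the kind of multi-hop path a graph-based recommender can follow.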
Installation
pip install duwhal
With uv (recommended):
uv add duwhal
Quick Start
from duwhal import Duwhal
from duwhal.datasets import generate_retail_transactions
df = generate_retail_transactions()
with Duwhal() as db:
    # 1. Load your interactions
    db.load_interactions(df, set_col="order_id", node_col="item_name")
    # 2. Mine Association Rules
    rules = db.association_rules(min_support=0.2, min_confidence=0.5)
    print(rules.to_pandas().head())
    # 3. Recommend based on rules
    recs = db.recommend(["Pasta"], strategy="rules", n=3)
    print(recs.column("recommended_item").to_pylist())
    # -> ['Tomato Sauce', 'Parmesan', ...]
    # 4. Or use Graph Path Integral for multi-hop discovery
    recs_graph = db.recommend(["iPhone 15"], strategy="graph", n=3)
    print(recs_graph.to_pandas()[["recommended_item", "reason"]])
    # Shows the discovery path for each recommendation
API Overview
Duwhal (main engine)
The unified entry point for all operations.
from duwhal import Duwhal
db = Duwhal() # in-memory (default)
db = Duwhal(database="store.duckdb") # persistent
Loading Data
# From a DataFrame (Pandas or Polars)
db.load_interactions(df, set_col="order_id", node_col="item_id")
# From a Parquet file (zero-copy via DuckDB)
db.load_interactions("transactions.parquet", set_col="order_id", node_col="item_id")
# With a sort column for sequential mining
db.load_interactions(df, set_col="order_id", node_col="item_id", sort_col="timestamp")
# From an interaction matrix (rows = contexts, columns = items)
db.load_interaction_matrix(matrix_df)
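To make the matrix layout concrete, the sketch below converts a small wide matrix (rows = contexts, columns = items) into the long (context, item) pairs that the other loaders take. The helper `matrix_to_interactions` is hypothetical and only illustrates the data shape; it is not how `load_interaction_matrix` works internally.

```python
def matrix_to_interactions(rows, item_names):
    """Flatten a 0/1 interaction matrix into (context, item) pairs.

    rows: mapping of context id -> list of flags aligned with item_names.
    """
    pairs = []
    for context_id, flags in rows.items():
        for item, flag in zip(item_names, flags):
            if flag:  # a truthy cell means the context contains the item
                pairs.append((context_id, item))
    return pairs


item_names = ["Beer", "Diaper", "Milk"]
matrix = {"o1": [1, 1, 0], "o2": [0, 0, 1]}
pairs = matrix_to_interactions(matrix, item_names)
```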
Mining
# Frequent Itemsets
itemsets = db.frequent_itemsets(min_support=0.3)
# Association Rules
rules = db.association_rules(min_support=0.1, min_confidence=0.5, min_lift=1.2)
# Sequential Patterns (requires a timestamp column)
patterns = db.sequential_patterns(timestamp_col="ts", min_support=0.05, max_gap=1)
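For readers unfamiliar with the thresholds above, here is a small sketch of the standard textbook definitions of support, confidence, and lift for a rule. The function `rule_metrics` is illustrative only; that Duwhal's `min_support`, `min_confidence`, and `min_lift` follow exactly these definitions is an assumption.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for basket in baskets if a <= basket)         # baskets with antecedent
    n_c = sum(1 for basket in baskets if c <= basket)         # baskets with consequent
    n_ac = sum(1 for basket in baskets if (a | c) <= basket)  # baskets with both

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


baskets = [
    {"Pasta", "Tomato Sauce"},
    {"Pasta", "Tomato Sauce", "Parmesan"},
    {"Pasta", "Bread"},
    {"Bread", "Milk"},
]
support, confidence, lift = rule_metrics(baskets, ["Pasta"], ["Tomato Sauce"])
```

A lift above 1 means the consequent appears with the antecedent more often than its overall frequency would suggest, which is why a `min_lift` slightly above 1 is a common filter.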
Recommendation Strategies
| Strategy | Method | Best For |
|---|---|---|
| `"rules"` | Association Rules | High-confidence, interpretable |
| `"cf"` | Item Collaborative Filtering | Similarity-based ("users who liked X also liked Y") |
| `"graph"` | Path Integral traversal | Multi-hop discovery, sparse data |
| `"popular"` | Global / windowed popularity | Cold-start, trending |
| `"auto"` | Picks the best available | General use |
# Train models
db.association_rules(min_support=0.1, min_confidence=0.5)
db.fit_cf(metric="jaccard", min_cooccurrence=2)
db.fit_graph(alpha=0.1)
db.fit_popularity(strategy="global")
# Recommend
recs = db.recommend(["item_a"], strategy="cf", n=5)
recs = db.recommend(["item_a"], strategy="graph", scoring="probability", n=5)
# Score a basket's internal cohesion
score = db.score_basket(["Beer", "Diaper"])  # -> float
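As a rough illustration of what item collaborative filtering with a Jaccard metric computes, the sketch below scores items by the overlap of the contexts they appear in. The function `jaccard_similar_items` is hypothetical, and treating `min_cooccurrence` as a lower bound on the number of shared contexts is an assumption about that parameter.

```python
from collections import defaultdict


def jaccard_similar_items(interactions, seed, min_cooccurrence=1):
    """Rank items by the Jaccard similarity of their context sets to the seed's."""
    contexts_of = defaultdict(set)
    for context, item in interactions:
        contexts_of[item].add(context)

    seed_contexts = contexts_of.get(seed, set())
    scores = {}
    for item, item_contexts in contexts_of.items():
        if item == seed:
            continue
        shared = len(seed_contexts & item_contexts)
        if shared >= min_cooccurrence:
            scores[item] = shared / len(seed_contexts | item_contexts)

    # Highest similarity first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


interactions = [
    ("o1", "Beer"), ("o1", "Diaper"),
    ("o2", "Beer"), ("o2", "Diaper"),
    ("o3", "Beer"), ("o3", "Chips"),
    ("o4", "Milk"),
]
similar = jaccard_similar_items(interactions, "Beer")
strict = jaccard_similar_items(interactions, "Beer", min_cooccurrence=2)
```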
Sink SCC Detection
Identifies self-sustaining communities, i.e. nodes that collectively reinforce each other, using Tarjan's algorithm over the probabilistic co-occurrence graph:
sccs = db.find_sink_sccs(min_cooccurrence=5, min_confidence=0.1)
# Returns: node, scc_id, scc_size, is_sink, members
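The idea behind sink SCCs can be sketched with a textbook recursive Tarjan implementation: find the strongly connected components, then keep those with no edge leaving the component. The functions below are illustrative and assume the directed graph has already been thresholded; how Duwhal derives edges from `min_cooccurrence` and `min_confidence` is not shown here.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph: dict mapping node -> iterable of successor nodes.
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def sink_sccs(graph):
    """Keep only components whose edges all stay inside the component."""
    sccs = tarjan_scc(graph)
    component_of = {v: i for i, comp in enumerate(sccs) for v in comp}
    return [
        sorted(comp)
        for i, comp in enumerate(sccs)
        if all(component_of[w] == i for v in comp for w in graph.get(v, ()))
    ]


graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
sinks = sink_sccs(graph)
```

Here A and B form a component that leaks into C, so only the C/D component is a sink. The recursive form is fine for small graphs; very deep graphs would need an iterative variant to stay within Python's recursion limit.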
InteractionGraph (graph interface)
A higher-level, node-centric API for graph analysis tasks.
from duwhal import InteractionGraph
with InteractionGraph() as graph:
    graph.load_interactions(df, context_col="user_id", node_col="game_title")
    graph.build_topology(min_interactions=2)
    # Multi-hop proximity ranking from seed nodes
    results = graph.rank_nodes(["Mario"], steps=3, scoring="probability", limit=5)
    # Returns: node, score, steps, reason (path)
    # Detect Filter Bubbles / Equilibrium Communities
    communities = graph.find_equilibrium_communities(min_cooccurrence=5, min_confidence=0.1)
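One plausible reading of multi-hop ranking with `scoring="probability"` is to sum, over every simple path of at most `steps` hops from a seed, the product of the edge weights along that path, and to report the strongest path as the reason. The sketch below implements that reading; it is an assumption for illustration, not the library's traversal, and the standalone function `rank_nodes_sketch` is hypothetical.

```python
def rank_nodes_sketch(weights, seeds, steps=3, limit=5):
    """Rank nodes by accumulated path weight from the seeds.

    weights: dict mapping (src, dst) -> transition weight in [0, 1].
    Returns a list of (node, score, reason) tuples.
    """
    adjacency = {}
    for (src, dst), w in weights.items():
        adjacency.setdefault(src, []).append((dst, w))

    seed_set = set(seeds)
    scores = {}     # node -> sum of path weights reaching it
    best_path = {}  # node -> (weight, path) of the strongest single path
    frontier = [(seed, 1.0, (seed,)) for seed in seeds]

    for _ in range(steps):
        next_frontier = []
        for node, weight, path in frontier:
            for dst, w in adjacency.get(node, []):
                if dst in path:  # keep paths simple (no revisits)
                    continue
                new_weight = weight * w
                new_path = path + (dst,)
                scores[dst] = scores.get(dst, 0.0) + new_weight
                if dst not in best_path or new_weight > best_path[dst][0]:
                    best_path[dst] = (new_weight, new_path)
                next_frontier.append((dst, new_weight, new_path))
        frontier = next_frontier

    ranked = sorted(
        ((n, s) for n, s in scores.items() if n not in seed_set),
        key=lambda kv: (-kv[1], kv[0]),
    )[:limit]
    return [(n, s, " -> ".join(best_path[n][1])) for n, s in ranked]


weights = {
    ("Mario", "Zelda"): 0.5,
    ("Zelda", "Metroid"): 0.4,
    ("Mario", "Metroid"): 0.1,
    ("Metroid", "Mario"): 0.9,
}
ranking = rank_nodes_sketch(weights, ["Mario"], steps=2)
```

In this example Metroid is reached both directly and through Zelda, and the two-hop path contributes more than the direct edge, so it is the one reported as the reason. Enumerating paths like this grows quickly with `steps`, so it is only suitable as an explanation aid on small graphs.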
Built-in Datasets
Duwhal ships with synthetic generators for every domain, featuring known ground-truth patterns so you can validate algorithms instantly:
from duwhal.datasets import (
    generate_retail_transactions,  # iPhone -> Silicone Case, Pasta -> Tomato Sauce
    generate_benchmark_patterns,   # Beer & Diaper (100% co-occurrence), Milk/Bread/Butter
    generate_playlist_data,        # Rock cluster <-> Jazz cluster with bridge
    generate_genomics_data,        # BRCA1 <-> TP53 co-mutation signal
    generate_nlp_corpus,           # Tech cluster <-> Economy cluster with bridge sentence
    generate_filter_bubble_data,   # Retro Gaming sink & Modern FPS sink with transient bridge
    generate_large_scale_data,     # Power-law 100k+ transactions for benchmarking
    generate_3scc_dataset,         # Controlled 3-SCC graph for path-integral research
)
Each generator returns a pd.DataFrame with documented columns and optional seed for reproducibility.
Use Cases
Explore the examples/use_cases/ directory:
| Example | Domain | Key Technique |
|---|---|---|
| `retail_market_basket.py` | Retail | Association Rules + Sequential Patterns |
| `benchmarking_models.py` | Any | Model comparison: Rules vs CF vs Graph vs Popularity |
| `genomics_trajectories.py` | Genomics | Graph Path Integral over gene co-mutation data |
| `nlp_token_cooccurrence.py` | NLP | Token proximity + sequential n-gram discovery |
| `media_playlist_discovery.py` | Music | Multi-hop cross-genre discovery |
| `ecosystem_equilibrium.py` | Social | Sink SCC detection for filter bubble analysis |
| `evaluation_scaling.py` | Any | Large-scale ingestion + benchmarking on 100k+ rows |
Evaluation Toolkit
from duwhal.evaluation import temporal_split, random_split, evaluate_recommendations
# Split interactions temporally (respects time ordering)
train, test = temporal_split(df, test_fraction=0.2, timestamp_col="ts")
# Or randomly
train, test = random_split(df, test_fraction=0.2, seed=42)
# Evaluate recommendations
metrics = evaluate_recommendations(model_recs, ground_truth, k=10)
# Returns: precision@k, recall@k, MAP@k
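For reference, the three metrics can be computed for a single user as in the sketch below. The helper `precision_recall_ap_at_k` is hypothetical, and normalising average precision by `min(len(relevant), k)` is one common convention assumed here; MAP@k is then the mean of the per-user average precision. The exact conventions used by `evaluate_recommendations` may differ.

```python
def precision_recall_ap_at_k(recommended, relevant, k=10):
    """Return (precision@k, recall@k, AP@k) for one user.

    recommended: ranked list of item ids (assumed to contain no duplicates).
    relevant: collection of ground-truth item ids.
    """
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at each hit position

    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    ap = precision_sum / min(len(relevant), k) if relevant else 0.0
    return precision, recall, ap


precision, recall, ap = precision_recall_ap_at_k(
    ["a", "b", "c", "d"], {"a", "c", "x"}, k=4
)
```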
Architecture
duwhal/
├── api.py                    # Duwhal: unified engine facade
├── graph.py                  # InteractionGraph: node-centric interface
├── core/
│   ├── connection.py         # DuckDB connection management
│   └── ingestion.py          # Multi-format data loading (Parquet, DF, Arrow)
├── mining/
│   ├── frequent_itemsets.py
│   ├── association_rules.py
│   ├── sequences.py          # Sequential pattern mining
│   └── sink_sccs.py          # Tarjan SCC + sink identification
├── recommenders/
│   ├── graph.py              # Path Integral traversal
│   ├── item_cf.py            # ItemCF (Jaccard / Cosine / Lift)
│   └── popularity.py         # Global + time-windowed popularity
├── evaluation/
│   ├── metrics.py            # Precision, Recall, MAP
│   └── splitting.py          # Temporal and random splits
└── datasets/                 # Synthetic generators for 7 domains
License
MIT © Duwhal Contributors