FastDFS - Deep Feature Synthesis for Tabular Data
FastDFS is a Python library for automated feature engineering using Deep Feature Synthesis (DFS). Given a target dataframe and a related relational database, it augments the dataframe with rich aggregate features derived from the database's structure, producing powerful machine-learning features without manual feature engineering.
Core Concept
FastDFS treats feature engineering as a table augmentation process: given any target dataframe and a relational database (RDB) containing related tables, it automatically generates new features by aggregating information across relationships.
```python
# Your target dataframe (what you want to predict on)
target_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [100, 200, 300],
    "interaction_time": ["2024-01-01", "2024-01-02", "2024-01-03"]
})

# Your relational database (context for feature generation)
rdb = fastdfs.load_rdb("ecommerce_data/")  # Contains user, item, interaction tables

# Generate features automatically
enriched_df = fastdfs.compute_dfs_features(
    rdb=rdb,
    target_dataframe=target_df,
    key_mappings={"user_id": "user.user_id", "item_id": "item.item_id"},
    cutoff_time_column="interaction_time"
)

# Result: original columns + 50+ new features like user_avg_rating, item_count_purchases, etc.
```
Installation
```shell
pip install fastdfs
```
Or for development:
```shell
git clone https://github.com/dglai/fastdfs.git
cd fastdfs
pip install -e .
```
Quick Start
1. Prepare Your Data
FastDFS provides multiple ways to prepare your relational data.
Option A: Create from DataFrames (Recommended)
You can create an RDB directly from pandas DataFrames. FastDFS will automatically infer the schema.
```python
import fastdfs
import pandas as pd

# 1. Define your tables
users_df = pd.DataFrame(...)
items_df = pd.DataFrame(...)
interactions_df = pd.DataFrame(...)

# 2. Create the RDB with relationships
rdb = fastdfs.create_rdb(
    name="ecommerce",
    tables={
        "user": users_df,
        "item": items_df,
        "interaction": interactions_df
    },
    primary_keys={
        "user": "user_id",
        "item": "item_id"
    },
    # Each entry: (table, column, referenced table, referenced column)
    foreign_keys=[
        ("interaction", "user_id", "user", "user_id"),
        ("interaction", "item_id", "item", "item_id")
    ],
    time_columns={
        "interaction": "timestamp"
    }
)

# 3. Save for later use
rdb.save("ecommerce_rdb/")

# 4. Load it back
rdb = fastdfs.load_rdb("ecommerce_rdb/")
```
Option B: Adapt Existing Datasets
FastDFS includes adapters for popular relational dataset benchmarks.
RelBench
```python
from fastdfs.adapter.relbench import RelBenchAdapter

# Load and convert a RelBench dataset
adapter = RelBenchAdapter("rel-stack")
rdb = adapter.load()
rdb.save("rel-stack-rdb/")
```
DBInfer
```python
from fastdfs.adapter.dbinfer import DBInferAdapter

# Load and convert a DBInfer dataset
adapter = DBInferAdapter("diginetica")
rdb = adapter.load()
rdb.save("diginetica-rdb/")
```
Option C: Load from Relational Database
FastDFS supports loading data directly from SQL databases (SQLite, MySQL, PostgreSQL, DuckDB).
```python
from fastdfs.adapter.sqlite import SQLiteAdapter
# from fastdfs.adapter.mysql import MySQLAdapter
# from fastdfs.adapter.postgres import PostgreSQLAdapter

# Connect to the database
adapter = SQLiteAdapter(
    "ecommerce.db",
    time_columns={"orders": "created_at"},  # optional: the time column of each table
    type_hints={"users": {"age": "float"}}  # optional: desired column data types
)
# Or for MySQL/PostgreSQL:
# adapter = MySQLAdapter("mysql+pymysql://user:pass@host/db")

rdb = adapter.load()
```
2. Generate Features
```python
import fastdfs
import pandas as pd

# Prepare your rdb using any of the methods above
rdb = ...

# Create or load your target dataframe
target_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [10, 20, 30],
    "prediction_time": ["2024-01-01", "2024-01-02", "2024-01-03"]
})

# Generate features
features = fastdfs.compute_dfs_features(
    rdb=rdb,
    target_dataframe=target_df,
    key_mappings={
        "user_id": "user.user_id",
        "item_id": "item.item_id"
    },
    cutoff_time_column="prediction_time",
    config_overrides={"max_depth": 2}
)

print(f"Original columns: {len(target_df.columns)}")
print(f"With features: {len(features.columns)}")
```
3. Advanced Usage with Transforms
```python
# Apply preprocessing transforms before feature generation
from fastdfs.transform import (
    RDBTransformWrapper,
    RDBTransformPipeline,
    HandleDummyTable,
    FeaturizeDatetime,
)

pipeline = fastdfs.DFSPipeline(
    transform_pipeline=RDBTransformPipeline([
        HandleDummyTable(),
        RDBTransformWrapper(FeaturizeDatetime(features=["year", "month", "hour"]))
    ]),
    dfs_config=fastdfs.DFSConfig(max_depth=3, engine="dfs2sql")
)

features = pipeline.run(
    rdb=rdb,
    target_dataframe=target_df,
    key_mappings=key_mappings,
    cutoff_time_column="prediction_time"
)
```
Key Features
- Table-Centric Design: Augment any dataframe, not just predefined datasets
- Multiple DFS Engines: Choose between Featuretools (pandas) or DFS2SQL (high-performance)
- Temporal Consistency: Built-in cutoff time support prevents data leakage
- Flexible Key Mapping: Connect target data to RDB with simple column mappings
- Transform Pipeline: Composable preprocessing transforms for data cleaning
- Type Safety: Full type hints and runtime validation
- Minimal Dependencies: Focused, lightweight package
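To make the temporal-consistency point concrete, here is a plain-pandas sketch of what cutoff-time semantics mean (this illustrates the concept only, not FastDFS internals): a feature computed for a target row may only aggregate events that happened strictly before that row's prediction time, so no information from the future leaks into the feature.

```python
import pandas as pd

# Historical events, with timestamps.
interactions = pd.DataFrame({
    "user_id":   [1, 1, 1, 2],
    "rating":    [3, 5, 4, 2],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-10", "2024-01-02"]),
})

# Target rows, each with its own cutoff (prediction) time.
target = pd.DataFrame({
    "user_id": [1, 2],
    "prediction_time": pd.to_datetime(["2024-01-07", "2024-01-01"]),
})

def avg_rating_before_cutoff(row):
    # Only events strictly before the cutoff are visible to the feature.
    past = interactions[
        (interactions["user_id"] == row["user_id"])
        & (interactions["timestamp"] < row["prediction_time"])
    ]
    return past["rating"].mean()

target["user_avg_rating"] = target.apply(avg_rating_before_cutoff, axis=1)
# User 1 sees the ratings 3 and 5 (mean 4.0) but not the later rating 4;
# user 2 has no events before the cutoff, so the feature is NaN.
```

Without the cutoff filter, user 1's feature would include the 2024-01-10 rating, which lies after the prediction time and would leak future information into training.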
Engine Comparison
| Feature | Featuretools | DFS2SQL |
|---|---|---|
| Performance | Good for small data | Excellent for large data |
| Memory Usage | High (pandas) | Low (SQL-based) |
| Primitives | Rich set | Core primitives |
| Backend | Pandas | DuckDB |
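The practical difference between the two engines can be sketched in plain Python: a pandas-backed engine materializes intermediate frames in memory, while a SQL-backed engine pushes the aggregation into a database. The sketch below uses the stdlib `sqlite3` in place of DuckDB purely for illustration; the SQL shown is an assumption about the general shape of such queries, not FastDFS's generated SQL.

```python
import sqlite3
import pandas as pd

interactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "rating":  [4, 2, 5],
})

# Pandas path: the groupby happens in memory.
pandas_result = interactions.groupby("user_id")["rating"].mean()

# SQL path: a SQL-based engine would emit a query like this and read back
# only the aggregated result.
conn = sqlite3.connect(":memory:")
interactions.to_sql("interaction", conn, index=False)
sql_result = pd.read_sql(
    "SELECT user_id, AVG(rating) AS user_avg_rating "
    "FROM interaction GROUP BY user_id",
    conn,
)
# Both paths agree: user 1 -> 3.0, user 2 -> 5.0.
```

The memory profiles diverge as data grows: the pandas path holds every intermediate frame in RAM, while the SQL path keeps only the final aggregate per group.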
Documentation
- User Guide: Complete tutorial with concepts and examples
- API Reference: Detailed API documentation
- Examples: Runnable code examples
Why FastDFS?
Before FastDFS (manual feature engineering):
```python
# Manual aggregations for each feature
user_avg_rating = interactions.groupby('user_id')['rating'].mean()
user_total_purchases = interactions.groupby('user_id').size()
item_avg_rating = interactions.groupby('item_id')['rating'].mean()
# ... dozens more features ...
```
With FastDFS (automated):
```python
# Automatic generation of 50+ features
features = fastdfs.compute_dfs_features(rdb, target_df, key_mappings)
```
FastDFS automatically discovers relationships in your data and generates meaningful aggregation features, saving weeks of manual feature engineering work.
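The `max_depth` parameter seen earlier controls how many relationships a generated feature may cross. A plain-pandas sketch of the idea (feature names here are illustrative, not FastDFS's exact naming scheme): a depth-1 feature aggregates a directly related table, while a depth-2 feature aggregates an aggregate.

```python
import pandas as pd

interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "item_id": [10, 20, 10, 20, 30],
    "rating":  [5, 3, 4, 2, 1],
})

# Depth 1: aggregate interactions directly per user.
user_avg_rating = interactions.groupby("user_id")["rating"].mean()

# Depth 2: first aggregate per item, then roll those item-level values
# back up to each user ("average popularity of the items a user touched").
item_count = interactions.groupby("item_id").size().rename("item_count")
joined = interactions.merge(item_count, on="item_id")
user_avg_item_count = joined.groupby("user_id")["item_count"].mean()
```

Each extra level of depth multiplies the candidate feature space, which is why `max_depth` is the main knob trading feature richness against compute cost.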
Contributing
We welcome contributions! See our development logs for project history and architecture decisions.
License
Apache-2.0 License
File details

fastdfs-0.2.0.tar.gz (source distribution)

- Size: 86.6 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13

| Algorithm | Hash digest |
|---|---|
| SHA256 | 43b0a417526814ce582289e7418139b1e9fcff40d34745f2aca4405b0b11a984 |
| MD5 | d979a73853e5573b7ef6caf888fd3dfd |
| BLAKE2b-256 | 099f02573aeb930995219bc2e84f14ad83e48cd5d8817c3ac701de0ba1fe8202 |
fastdfs-0.2.0-py3-none-any.whl (built distribution, Python 3)

- Size: 76.6 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13

| Algorithm | Hash digest |
|---|---|
| SHA256 | c73b40759e7795967fd63822c99c005f252039c982b51123aa05e8487052e195 |
| MD5 | 0ab49693c4977d8cc89671654f98790e |
| BLAKE2b-256 | 75b3c56207b779ba44ee98563b25a62278cc9068a72bda17c13d0d6a19fbb5c0 |