
Fast Deep Feature Synthesis for tabular data

Project description

FastDFS - Deep Feature Synthesis for Tabular Data

FastDFS is a Python library for automated feature engineering using Deep Feature Synthesis (DFS). It augments target dataframes with rich features derived from relational database structures, making it easy to create powerful features for machine learning without manual feature engineering.

Core Concept

FastDFS treats feature engineering as a table augmentation process: given any target dataframe and a relational database (RDB) containing related tables, it automatically generates new features by aggregating information across relationships.

# Your target dataframe (what you want to predict on)
target_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [100, 200, 300], 
    "interaction_time": ["2024-01-01", "2024-01-02", "2024-01-03"]
})

# Your relational database (context for feature generation)
rdb = fastdfs.load_rdb("ecommerce_data/")  # Contains user, item, interaction tables

# Generate features automatically
enriched_df = fastdfs.compute_dfs_features(
    rdb=rdb,
    target_dataframe=target_df,
    key_mappings={"user_id": "user.user_id", "item_id": "item.item_id"},
    cutoff_time_column="interaction_time"
)
# Result: Original columns + 50+ new features like user_avg_rating, item_count_purchases, etc.

Installation

pip install fastdfs

Or for development:

git clone https://github.com/dglai/fastdfs.git
cd fastdfs
pip install -e .

Quick Start

1. Prepare Your Data

FastDFS provides multiple ways to prepare your relational data.

Option A: Create from DataFrames (Recommended)

You can create an RDB directly from pandas DataFrames. FastDFS will automatically infer the schema.

import fastdfs
import pandas as pd

# 1. Define your tables
users_df = pd.DataFrame(...)
items_df = pd.DataFrame(...)
interactions_df = pd.DataFrame(...)

# 2. Create RDB with relationships
rdb = fastdfs.create_rdb(
    name="ecommerce",
    tables={
        "user": users_df,
        "item": items_df,
        "interaction": interactions_df
    },
    primary_keys={
        "user": "user_id",
        "item": "item_id"
    },
    foreign_keys=[
        ("interaction", "user_id", "user", "user_id"),
        ("interaction", "item_id", "item", "item_id")
    ],
    time_columns={
        "interaction": "timestamp"
    }
)

# 3. Save for later use
rdb.save("ecommerce_rdb/")

# 4. Load it back
rdb = fastdfs.load_rdb("ecommerce_rdb/")
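For concreteness, the elided DataFrames above could look like the toy tables below (illustrative data only, not part of the library). The trailing asserts sanity-check that every foreign key in `interaction` references an existing primary key before the RDB is built:

```python
import pandas as pd

# Hypothetical toy tables matching the schema above (illustration only).
users_df = pd.DataFrame({"user_id": [1, 2], "age": [34, 28]})
items_df = pd.DataFrame({"item_id": [10, 20], "price": [9.99, 24.50]})
interactions_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 20, 10],
    "rating": [5, 3, 4],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# Check referential integrity: every interaction row should point at an
# existing user and item, or downstream joins will silently drop rows.
assert interactions_df["user_id"].isin(users_df["user_id"]).all()
assert interactions_df["item_id"].isin(items_df["item_id"]).all()
```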

Option B: Adapt Existing Datasets

FastDFS includes adapters for popular relational dataset benchmarks.

RelBench

from fastdfs.adapter.relbench import RelBenchAdapter

# Load and convert RelBench dataset
adapter = RelBenchAdapter("rel-stack")
rdb = adapter.load()
rdb.save("rel-stack-rdb/")

DBInfer

from fastdfs.adapter.dbinfer import DBInferAdapter

# Load and convert DBInfer dataset
adapter = DBInferAdapter("diginetica")
rdb = adapter.load()
rdb.save("diginetica-rdb/")

Option C: Load from Relational Database

FastDFS supports loading data directly from SQL databases (SQLite, MySQL, PostgreSQL, DuckDB).

from fastdfs.adapter.sqlite import SQLiteAdapter
# from fastdfs.adapter.mysql import MySQLAdapter
# from fastdfs.adapter.postgres import PostgreSQLAdapter

# Connect to database
adapter = SQLiteAdapter(
    "ecommerce.db",
    time_columns={"orders": "created_at"},   # optional: the time column for each table
    type_hints={"users": {"age": "float"}}   # optional: override an inferred column dtype
)

# Or for MySQL/PostgreSQL:
# adapter = MySQLAdapter("mysql+pymysql://user:pass@host/db")

rdb = adapter.load()
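To try the adapter locally, a database with the shape assumed above (a `users` table plus an `orders` table with a `created_at` time column) can be created with Python's standard `sqlite3` module. The schema here is hypothetical, chosen only to match the example arguments:

```python
import sqlite3

# Build a tiny SQLite database shaped like the adapter example expects.
conn = sqlite3.connect(":memory:")  # use "ecommerce.db" for a file on disk
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, age REAL);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        created_at TEXT
    );
    INSERT INTO users VALUES (1, 34.0), (2, 28.0);
    INSERT INTO orders VALUES (100, 1, '2024-01-01'), (101, 1, '2024-01-02');
""")
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```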

2. Generate Features

import fastdfs
import pandas as pd

# Prepare your rdb from the methods above
rdb = ...

# Create or load your target dataframe
target_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [10, 20, 30],
    "prediction_time": ["2024-01-01", "2024-01-02", "2024-01-03"]
})

# Generate features
features = fastdfs.compute_dfs_features(
    rdb=rdb,
    target_dataframe=target_df, 
    key_mappings={
        "user_id": "user.user_id",
        "item_id": "item.item_id"  
    },
    cutoff_time_column="prediction_time",
    config_overrides={"max_depth": 2}
)

print(f"Original columns: {len(target_df.columns)}")
print(f"With features: {len(features.columns)}")
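Cutoff columns are usually easier to work with as real datetimes than as strings. A small pandas sketch of that preparation step (the library's exact dtype expectations aren't documented here, so treat this as general hygiene rather than a FastDFS requirement):

```python
import pandas as pd

target_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [10, 20, 30],
    "prediction_time": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

# Parse the cutoff column into real datetimes so that "before the cutoff"
# comparisons during feature computation are well-defined.
target_df["prediction_time"] = pd.to_datetime(target_df["prediction_time"])
```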

3. Advanced Usage with Transforms

# Apply preprocessing transforms before feature generation
from fastdfs.transform import RDBTransformWrapper, RDBTransformPipeline, HandleDummyTable, FeaturizeDatetime

pipeline = fastdfs.DFSPipeline(
    transform_pipeline=RDBTransformPipeline([
        HandleDummyTable(),
        RDBTransformWrapper(FeaturizeDatetime(features=["year", "month", "hour"]))
    ]),
    dfs_config=fastdfs.DFSConfig(max_depth=3, engine="dfs2sql")
)

features = pipeline.run(
    rdb=rdb,
    target_dataframe=target_df,
    key_mappings=key_mappings,
    cutoff_time_column="prediction_time"
)
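Conceptually, a datetime featurizer like `FeaturizeDatetime` expands a timestamp into separate numeric parts. A minimal pure-Python sketch of that idea (not the library's implementation):

```python
from datetime import datetime

def featurize_datetime(ts: datetime, features=("year", "month", "hour")):
    """Expand one timestamp into the requested numeric parts."""
    return {name: getattr(ts, name) for name in features}

# 2024-01-02 15:30 -> separate year/month/hour columns.
parts = featurize_datetime(datetime(2024, 1, 2, 15, 30))
```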

Key Features

  • Table-Centric Design: Augment any dataframe, not just predefined datasets
  • Multiple DFS Engines: Choose between Featuretools (pandas) or DFS2SQL (high-performance)
  • Temporal Consistency: Built-in cutoff time support prevents data leakage
  • Flexible Key Mapping: Connect target data to RDB with simple column mappings
  • Transform Pipeline: Composable preprocessing transforms for data cleaning
  • Type Safety: Full type hints and runtime validation
  • Minimal Dependencies: Focused, lightweight package
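The Temporal Consistency bullet can be illustrated with a pure-Python sketch (hypothetical data): a feature computed for a target row may only aggregate events that happened strictly before that row's cutoff time, so information from the future never leaks into training data.

```python
from datetime import date

# Hypothetical events in a related table, keyed by user.
events = [
    {"user_id": 1, "time": date(2024, 1, 1), "rating": 5},
    {"user_id": 1, "time": date(2024, 1, 3), "rating": 1},
]

def avg_rating_before(user_id, cutoff):
    """Aggregate only events strictly before the cutoff time."""
    vals = [e["rating"] for e in events
            if e["user_id"] == user_id and e["time"] < cutoff]
    return sum(vals) / len(vals) if vals else None

# A row with cutoff Jan 2 sees only the Jan 1 event.
feat = avg_rating_before(1, date(2024, 1, 2))
```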

Engine Comparison

| Feature      | Featuretools        | DFS2SQL                  |
|--------------|---------------------|--------------------------|
| Performance  | Good for small data | Excellent for large data |
| Memory usage | High (pandas)       | Low (SQL-based)          |
| Primitives   | Rich set            | Core primitives          |
| Backend      | Pandas              | DuckDB                   |
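To make the comparison concrete: a SQL-based engine computes aggregation features as set-based queries pushed down to the database, rather than as in-memory pandas group-bys. The sketch below uses the standard-library `sqlite3` for portability (DFS2SQL itself targets DuckDB), and the query shape is an assumption for illustration, not the engine's actual generated SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interaction (user_id INTEGER, rating REAL);
    INSERT INTO interaction VALUES (1, 5.0), (1, 3.0), (2, 4.0);
""")

# One query produces a whole column of aggregation features at once,
# without materializing intermediate DataFrames in memory.
rows = conn.execute("""
    SELECT user_id,
           AVG(rating) AS user_avg_rating,
           COUNT(*)    AS user_count_interactions
    FROM interaction
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
```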


Why FastDFS?

Before FastDFS (manual feature engineering):

# Manual aggregations for each feature
user_avg_rating = interactions.groupby('user_id')['rating'].mean()
user_total_purchases = interactions.groupby('user_id').size()
item_avg_rating = interactions.groupby('item_id')['rating'].mean()
# ... dozens more features ...

With FastDFS (automated):

# Automatic generation of 50+ features
features = fastdfs.compute_dfs_features(rdb, target_df, key_mappings)

FastDFS automatically discovers relationships in your data and generates meaningful aggregation features, saving substantial manual feature engineering effort.

Contributing

We welcome contributions! See our development logs for project history and architecture decisions.

License

Apache-2.0 License

Download files

Download the file for your platform.

Source Distribution

fastdfs-0.2.0.tar.gz (86.6 kB)

Uploaded Source

Built Distribution


fastdfs-0.2.0-py3-none-any.whl (76.6 kB)

Uploaded Python 3

File details

Details for the file fastdfs-0.2.0.tar.gz.

File metadata

  • Download URL: fastdfs-0.2.0.tar.gz
  • Size: 86.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for fastdfs-0.2.0.tar.gz:

  • SHA256: 43b0a417526814ce582289e7418139b1e9fcff40d34745f2aca4405b0b11a984
  • MD5: d979a73853e5573b7ef6caf888fd3dfd
  • BLAKE2b-256: 099f02573aeb930995219bc2e84f14ad83e48cd5d8817c3ac701de0ba1fe8202


File details

Details for the file fastdfs-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: fastdfs-0.2.0-py3-none-any.whl
  • Size: 76.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for fastdfs-0.2.0-py3-none-any.whl:

  • SHA256: c73b40759e7795967fd63822c99c005f252039c982b51123aa05e8487052e195
  • MD5: 0ab49693c4977d8cc89671654f98790e
  • BLAKE2b-256: 75b3c56207b779ba44ee98563b25a62278cc9068a72bda17c13d0d6a19fbb5c0

