An automated feature engineering and pipeline design library
DataForgeML
Automated data profiling and splitting pipeline for ML datasets.
DataForgeML inspects your dataset, detects each column's semantic type (numeric, categorical, boolean, text, datetime, or identifier), computes per-column statistics and missingness, and produces a structured result ready for downstream feature engineering — no manual schema wrangling required.
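To make the idea of semantic type detection concrete, here is a minimal sketch that uses only the Python standard library. It is not DataForgeML's detection logic: the names `infer_semantic_type` and `missing_fraction`, the boolean token list, and the 0.5 uniqueness threshold are assumptions chosen for illustration. Identifier detection is left out because it usually needs dataset-level context such as row counts and key constraints.

```python
from datetime import datetime

# Hypothetical token list; a real profiler may recognise more spellings.
BOOLEAN_TOKENS = {"true", "false", "yes", "no"}


def _is_float(text):
    """Return True when the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_iso_datetime(text):
    """Return True when the text parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def _present(raw_values):
    """Strip values and drop the missing ones (None or empty strings)."""
    return [str(v).strip() for v in raw_values
            if v is not None and str(v).strip() != ""]


def missing_fraction(raw_values):
    """Share of values that are None or empty."""
    raw_values = list(raw_values)
    if not raw_values:
        return 0.0
    return 1 - len(_present(raw_values)) / len(raw_values)


def infer_semantic_type(raw_values, max_categorical_ratio=0.5):
    """Guess a coarse semantic type from the non-missing values of a column."""
    values = _present(raw_values)
    if not values:
        return "unknown"
    if all(v.lower() in BOOLEAN_TOKENS for v in values):
        return "boolean"
    if all(_is_float(v) for v in values):
        return "numeric"
    if all(_is_iso_datetime(v) for v in values):
        return "datetime"
    # Few distinct values relative to the column length suggests categories.
    unique_ratio = len(set(values)) / len(values)
    return "categorical" if unique_ratio <= max_categorical_ratio else "text"
```

The order of the checks matters in this sketch: booleans are tested before numbers and numbers before dates, so a column of year-like integers is reported as numeric rather than datetime.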
Installation
pip install dataforge-ml
Quick Start
from dataforge_ml import DataLoader, PipelineConfig, StructuralProfiler
df = DataLoader().load("titanic.csv")
config = PipelineConfig()
result = StructuralProfiler(config).profile(df)
print(result.columns["Age"].semantic_type) # SemanticType.Numeric
print(result.dataset.row_count) # total rows
DataLoader auto-detects encoding and delimiter. Supported formats: CSV, TSV, Parquet, JSON, NDJSON, JSONL, XLSX, XLS, Arrow, Feather.
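For readers curious how encoding and delimiter detection can work in general, the following is a rough standard-library sketch. It does not reflect how `DataLoader` is implemented; `decode_bytes`, `detect_delimiter`, the candidate encoding list, and the comma fallback are hypothetical choices for this example.

```python
import csv


def decode_bytes(raw, candidates=("utf-8-sig", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


def detect_delimiter(text, delimiters=",;\t|"):
    """Guess the field delimiter from a text sample with csv.Sniffer."""
    try:
        return csv.Sniffer().sniff(text, delimiters=delimiters).delimiter
    except csv.Error:
        # Assumed fallback: treat ambiguous samples as comma-separated.
        return ","
```

Because `latin-1` can decode any byte sequence, it only makes sense as the last candidate. A production loader would likely sample just the first few kilobytes of the file rather than decoding all of it.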
Column Type Overrides
Override the auto-detected type for any column before profiling:
config = PipelineConfig()
config.set_column_type("PassengerId", "identifier") # skip stats entirely
config.set_columns_type(["Survived", "Pclass"], "categorical")
result = StructuralProfiler(config).profile(df)
To drop a column from all processing entirely, use exclude_columns:
config = PipelineConfig(exclude_columns=["PassengerId", "Name"])
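Conceptually, overrides and exclusions act as a layer on top of the auto-detected types. The sketch below shows one plausible way to combine them with plain dictionaries. It is an assumption about the general pattern, not DataForgeML's internals, and `resolve_column_types` is a hypothetical helper. In particular, letting exclusion win over an override is a design choice made here for illustration.

```python
def resolve_column_types(detected, overrides=None, exclude=()):
    """Apply user overrides on top of detected types and drop excluded columns."""
    excluded = set(exclude)
    # Excluded columns are removed first, so they never reach later stages.
    resolved = {name: kind for name, kind in detected.items()
                if name not in excluded}
    # Overrides replace the detected type only for columns that remain.
    for name, kind in (overrides or {}).items():
        if name in resolved:
            resolved[name] = kind
    return resolved
```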
Splitting
from dataforge_ml import DataLoader, DataSplitter
df = DataLoader().load("titanic.csv")
splitter = DataSplitter(df, target="Survived", random_seed=42)
# Random train/test split (stratified by default when target is set)
split = splitter.random_split(test_size=0.2)
print(split.train.shape, split.test.shape)
# Chronological split (no temporal leakage)
split = splitter.time_split(time_column="date", test_size=0.2)
# K-fold cross-validation
for fold in splitter.kfold(k=5):
    print(f"Fold {fold.fold_index}: train={fold.train_size}, val={fold.val_size}")
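The three strategies above can be sketched in plain Python to show what each one guarantees. This is an illustrative approximation that works on lists of dictionaries, not the `DataSplitter` implementation. The function names and the rounding behaviour are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, test_size=0.2, seed=42):
    """Split rows so each target class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's rows are not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def time_split(rows, time_column, test_size=0.2):
    """Sort by time and hold out the most recent rows to avoid temporal leakage."""
    ordered = sorted(rows, key=lambda r: r[time_column])
    cut = len(ordered) - round(len(ordered) * test_size)
    return ordered[:cut], ordered[cut:]


def kfold_indices(n_rows, k=5, seed=42):
    """Yield (fold_index, train_indices, val_indices); each row is validated once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield i, train, val
```

The chronological split never shuffles: every training row precedes every test row in time, which is what prevents a model from being evaluated on data older than what it was trained on.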
License
MIT