DataGPU
Open-source data compiler for AI training datasets
Compile datasets like code: clean, rank, and optimize in one command.
Mission
To make data as programmable and optimized as compute.
DataGPU compiles raw, messy datasets into training-ready binaries, turning 10k+ lines of preprocessing scripts into a single declarative command.
Features
- Automatic Cleaning: Schema inference and normalization for text, numeric, and categorical data
- Fast Deduplication: Hash-based duplicate removal using xxHash
- Quality Ranking: TF-IDF and cosine similarity-based relevance scoring
- Smart Caching: Local cache with SQLite for reproducible compilations
- Unified Pipeline: Single command execution for all preprocessing steps
- Compiled Artifacts: Parquet + manifest format with versioning and metadata
- Framework Integration: Compatible with PyTorch DataLoader and Hugging Face Datasets
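The schema inference mentioned under Automatic Cleaning can be pictured with a small heuristic. This is a minimal sketch, not DataGPU's actual cleaner: the function name `infer_column_type` and the `categorical_max_unique` threshold are hypothetical, chosen only to illustrate how a column might be labeled `numeric`, `categorical`, or `text`.

```python
def infer_column_type(values, categorical_max_unique=10):
    """Classify a column of strings (hypothetical heuristic, not DataGPU's own)."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "text"

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    # Every non-empty value parses as a float: treat the column as numeric.
    if all(is_number(v) for v in non_empty):
        return "numeric"

    # Few distinct values, with at least one repeat, suggests a categorical column.
    distinct = len({v.lower() for v in non_empty})
    if distinct <= categorical_max_unique and distinct < len(non_empty):
        return "categorical"
    return "text"


columns = {
    "id": ["1", "2", "3", "4"],
    "text": ["a cat sat", "dogs bark loudly", "birds sing", "fish swim"],
    "category": ["news", "sports", "news", "sports"],
}
schema = {name: infer_column_type(vals) for name, vals in columns.items()}
print(schema)
```

The three labels match the ones that appear in the `schema` section of the manifest; the thresholds a real cleaner uses may differ.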
Quick Start
Installation from PyPI
Install the latest stable version directly from PyPI:
pip install datagpu
For production use, we recommend installing in a virtual environment:
# Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install DataGPU
pip install datagpu
Verify Installation
Check that DataGPU is installed correctly:
datagpu --version
# Output: DataGPU version 0.1.0
Install from Source (Development)
For development or to get the latest features:
git clone https://github.com/Jasiri-App/datagpu.git
cd datagpu
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install in development mode with all dependencies
pip install -e ".[dev]"
Basic Usage
Compile a Dataset
Process and optimize your dataset with a single command:
datagpu compile data/your_dataset.csv \
--rank \
--dedupe \
--cache \
--out compiled/
Example with sample data:
# First, download or generate sample data
python examples/generate_sample_data.py
# Process the sample data
datagpu compile examples/data/small_test.csv --out /tmp/compiled --verbose
Example output:
DataGPU v0.1.0
Compiling: examples/data/small_test.csv
Loading data from examples/data/small_test.csv...
Cleaning data...
Deduplicating...
Ranking by relevance...
Saving to /tmp/compiled/data.parquet...
Compilation complete!
Rows processed 100
Valid rows 100 (100.0%)
Duplicates removed 20 (20.0%)
Ranked samples 80
Processing time 0.1s
Output /tmp/compiled/data.parquet
Manifest /tmp/compiled/manifest.yaml
Dataset version: v0.1.0
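The "Duplicates removed" figure in the output above comes from hash-based deduplication. The sketch below shows the general technique under stated assumptions: it uses `hashlib.blake2b` from the standard library as a stand-in for xxHash so that it runs without extra dependencies, and the lowercase/strip normalization is a guess at what a cleaner might do before hashing, not documented DataGPU behavior.

```python
import hashlib


def row_fingerprint(row):
    """Return a short digest of a normalized row (blake2b stands in for xxHash)."""
    # Normalization here (strip + lowercase) is an assumption for illustration.
    normalized = "\x1f".join(str(field).strip().lower() for field in row)
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=8).digest()


def dedupe(rows):
    """Keep the first occurrence of each distinct row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = row_fingerprint(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ("1", "Hello world"),
    ("2", "Another sample"),
    ("1", "hello world "),    # duplicate once normalized
    ("3", "Third sample"),
    ("2", "Another sample"),  # exact duplicate
]
unique = dedupe(rows)
removed = len(rows) - len(unique)
print(f"Duplicates removed {removed} ({removed / len(rows):.1%})")
```

Storing fixed-size digests instead of full rows keeps the `seen` set small, which is the main reason hash-based deduplication scales better than comparing rows directly.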
View Dataset Information
Inspect compiled datasets:
datagpu info /tmp/compiled/manifest.yaml
Cache Management
List cached datasets:
datagpu cache-list
Clear cache:
datagpu cache-clear
Python API
You can also use DataGPU programmatically:
from datagpu import DataCompiler, load
from datagpu.types import CompilationConfig, RankMethod
# Configure the compilation
config = CompilationConfig(
    source_path="data/your_dataset.csv",
    output_path="compiled/",
    dedupe=True,  # Enable deduplication
    rank=True,  # Enable quality ranking
    rank_method=RankMethod.RELEVANCE,
    rank_target="high quality examples",  # Target for relevance ranking
    cache=True,  # Enable caching
    verbose=True,  # Show progress
)
# Create and run the compiler
compiler = DataCompiler(config)
output_path, manifest, stats = compiler.compile()
# Load the compiled dataset
dataset = load("compiled/manifest.yaml")
# Use with PyTorch DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Or convert to pandas/arrow
df = dataset.to_pandas()
table = dataset.to_arrow()
# Access compilation statistics
print(f"Processed {stats.total_rows} rows")
print(f"Removed {stats.duplicates_removed} duplicates")
print(f"Processing time: {stats.processing_time:.2f}s")
Architecture
┌───────────────────────────────┐
│     CLI Interface (Typer)     │
│ - datagpu compile ...         │
└──────────────┬────────────────┘
               │
┌──────────────┴────────────────┐
│    Compiler Core (Python)     │
│ - Loader (Polars/Arrow)       │
│ - Cleaner                     │
│ - Deduper (xxHash)            │
│ - Ranker (TF-IDF / cosine)    │
│ - Optimizer (Parquet Writer)  │
│ - Cache Manager (SQLite)      │
└──────────────┬────────────────┘
               │
┌──────────────┴────────────────┐
│        Storage Backend        │
│ - Local FS                    │
│ - Parquet / Arrow             │
│ - Optional S3 adapter (Phase2)│
└───────────────────────────────┘
CLI Commands
Compile
datagpu compile <source> [OPTIONS]
Options:
--out, -o PATH          Output directory [default: compiled]
--rank/--no-rank        Enable quality ranking [default: True]
--rank-method TEXT      Ranking method: relevance, tfidf, cosine
--rank-target TEXT      Target query for relevance ranking
--dedupe/--no-dedupe    Enable deduplication [default: True]
--cache/--no-cache      Enable caching [default: True]
--compression TEXT      Compression: zstd, snappy, gzip [default: zstd]
--verbose/--quiet       Verbose output [default: True]
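To make the `--rank-method` and `--rank-target` options more concrete, here is a minimal sketch of relevance ranking with TF-IDF vectors and cosine similarity, written with the standard library only. It illustrates the general technique; the function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions and may not match what DataGPU's ranker does.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_relevance(samples, target):
    """Return (score, sample) pairs, most similar to the target first."""
    # The target is vectorized together with the samples so they share one IDF table.
    vectors = tfidf_vectors(samples + [target])
    target_vec = vectors[-1]
    scored = [(cosine(v, target_vec), s) for v, s in zip(vectors[:-1], samples)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


samples = [
    "the weather is nice today",
    "high quality training examples improve models",
    "random noise text",
]
ranked = rank_by_relevance(samples, "high quality examples")
for score, sample in ranked:
    print(f"{score:.3f}  {sample}")
```

Samples that share no terms with the target score 0.0 and sink to the bottom, which is the behavior a relevance filter relies on.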
Info
# Display dataset information
datagpu info compiled/manifest.yaml
Cache Management
# List cached datasets
datagpu cache-list
# Clear cache
datagpu cache-clear --force
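The idea behind an SQLite-backed compilation cache can be sketched as follows. This is an illustrative design under assumptions, not DataGPU's actual cache schema: the `CompileCache` class, the table layout, and the choice to key entries on a SHA-256 of the input bytes plus the compile options are all hypothetical.

```python
import hashlib
import json
import sqlite3


def cache_key(source_bytes, options):
    """Hash the input bytes together with the compile options."""
    h = hashlib.sha256(source_bytes)
    h.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


class CompileCache:
    """Tiny key -> output path store (hypothetical schema)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, output_path TEXT NOT NULL)"
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT output_path FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def put(self, key, output_path):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, output_path) VALUES (?, ?)",
            (key, output_path),
        )
        self.conn.commit()

    def clear(self):
        self.conn.execute("DELETE FROM cache")
        self.conn.commit()


cache = CompileCache()
key = cache_key(b"id,text\n1,hello\n", {"dedupe": True, "rank": True})
print(cache.get(key))   # cache miss on the first compile
cache.put(key, "compiled/data.parquet")
print(cache.get(key))   # cache hit on the second compile
```

Including the options in the key means that changing a flag such as `--no-dedupe` produces a different key, so a stale artifact is never reused for a different configuration.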
Dataset Manifest
Each compiled dataset includes a manifest.yaml with metadata:
dataset_name: train
version: v0.1.0
rows: 1840200
columns: 12
dedup_ratio: 0.124
rank_method: cosine
created_at: 2025-11-11T14:03:21Z
hash: 7ac2fdf7a00f...
source_path: data/train.csv
compiled_path: compiled/data.parquet
cache_path: .datagpu/cache/
schema:
  id: numeric
  text: text
  category: categorical
stats:
  total_rows: 2400000
  valid_rows: 2367840
  duplicates_removed: 297600
  processing_time: 8.2
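The numbers in the sample manifest suggest that `dedup_ratio` is `duplicates_removed` divided by `total_rows`. That relationship is inferred from the example values rather than stated anywhere in this document, so treat the check below as a sanity test, not a specification.

```python
# Values copied from the stats section of the sample manifest.
stats = {
    "total_rows": 2_400_000,
    "valid_rows": 2_367_840,
    "duplicates_removed": 297_600,
}

# Assumed definition: share of input rows dropped as duplicates.
dedup_ratio = stats["duplicates_removed"] / stats["total_rows"]
valid_fraction = stats["valid_rows"] / stats["total_rows"]

print(f"dedup_ratio    = {dedup_ratio:.3f}")
print(f"valid_fraction = {valid_fraction:.4f}")
```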
Performance
Benchmarks (MVP)
| Metric | Target | Status |
|---|---|---|
| Cleaning throughput | ≥ 1M rows/sec | On track |
| Deduplication | 10× faster than Pandas | Achieved |
| Dataset compression | 40-70% smaller | Achieved |
| Ranking | ≤ 10ms per 1k rows | On track |
| Cache reuse | 5× faster | Implemented |
Example Performance
Dataset: 10k rows
Processing time: 0.8s
Throughput: 12,500 rows/sec
Compression: 65% (CSV → Parquet)
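The throughput figure follows directly from the row count and the processing time, as this short check shows. The per-1k-rows figure is derived here for convenience and covers the whole pipeline run, so it is not directly comparable to the ranking-only target in the benchmark table.

```python
rows = 10_000
seconds = 0.8

throughput = rows / seconds                  # rows per second
ms_per_1k = seconds * 1000 / (rows / 1000)   # milliseconds per 1,000 rows

print(f"Throughput: {throughput:,.0f} rows/sec")
print(f"Time per 1k rows: {ms_per_1k:.0f} ms")
```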
Integration Examples
PyTorch DataLoader
from datagpu import load
from torch.utils.data import DataLoader
dataset = load("compiled/manifest.yaml")
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    # Train your model
    pass
Hugging Face Datasets
from datagpu.loader import load_to_hf
dataset = load_to_hf("compiled/manifest.yaml")
# train_test_split returns a new DatasetDict with "train" and "test" splits
splits = dataset.train_test_split(test_size=0.2)
Development
Setup
# Clone repository
git clone https://github.com/Jasiri-App/datagpu.git
cd datagpu
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run benchmarks
python examples/generate_sample_data.py
python examples/benchmark.py
Project Structure
datagpu/
├── datagpu/ # Core package
│ ├── __init__.py
│ ├── cli.py # CLI interface
│ ├── compiler.py # Main compiler
│ ├── cleaner.py # Data cleaning
│ ├── deduper.py # Deduplication
│ ├── ranker.py # Quality ranking
│ ├── cache.py # Cache management
│ ├── loader.py # Dataset loader
│ ├── types.py # Type definitions
│ └── utils.py # Utilities
├── tests/ # Test suite
├── examples/ # Examples and benchmarks
├── pyproject.toml # Project configuration
└── README.md
Roadmap
Phase 0.2 - Semantic Deduplication
- Embedding-based near-duplicate removal
- FAISS integration for similarity search
Phase 0.3 - Parallel Compilation
- Distributed compilation with Ray/Dask
- Multi-core optimization
Phase 0.4 - Cloud Storage
- S3/GCS backend support
- Remote dataset compilation
Phase 0.5 - Web Dashboard
- Dataset visualization
- Quality metrics and stats
- Version comparison
Phase 0.6 - Rust Backend
- Rewrite core kernels in Rust
- 20× performance improvement target
Contributing
Contributions are welcome! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
DataGPU is released under the Apache 2.0 License.
Citation
If you use DataGPU in your research, please cite:
@software{datagpu2025,
title = {DataGPU: Open-source data compiler for AI training datasets},
author = {Celestino Kariuki},
organization = {Safariblocks Ltd.},
year = {2025},
url = {https://github.com/Jasiri-App/datagpu}
}
Support
- Documentation: GitHub README
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with a focus on data quality and reproducibility
File details
Details for the file datagpu-0.1.1.tar.gz.
File metadata
- Download URL: datagpu-0.1.1.tar.gz
- Upload date:
- Size: 27.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4c1fa25dd99e5b4b7edc7b4e725d4122288846613f7d440d7b53ed4d3a62504a |
| MD5 | bf6389dd48232e63822b580fa4d26018 |
| BLAKE2b-256 | ec82b751a0b8213efbde2de05f6195bd0d49dd558e5284209ffa1e1d7134cbc4 |
File details
Details for the file datagpu-0.1.1-py3-none-any.whl.
File metadata
- Download URL: datagpu-0.1.1-py3-none-any.whl
- Upload date:
- Size: 23.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c19c2405208a98bf9acb116846ae3cd72d158ff6181ceb3786aadd8af06c214 |
| MD5 | 575abc23eaccf6e55cdc56dc12261555 |
| BLAKE2b-256 | 71afbff879dda066b520b78ded40bff1b96cf56b62fd3bfb08c195fa733424fa |
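A downloaded file can be checked against the published SHA256 digests with the standard library. The helper names `sha256_of` and `verify` are illustrative; the expected digests are the ones listed in the file details for the source distribution and the wheel.

```python
import hashlib

# SHA256 digests published for the 0.1.1 release files.
EXPECTED_SHA256 = {
    "datagpu-0.1.1.tar.gz":
        "4c1fa25dd99e5b4b7edc7b4e725d4122288846613f7d440d7b53ed4d3a62504a",
    "datagpu-0.1.1-py3-none-any.whl":
        "3c19c2405208a98bf9acb116846ae3cd72d158ff6181ceb3786aadd8af06c214",
}


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Return True if the file at `path` matches the published digest for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, `verify("downloads/datagpu-0.1.1.tar.gz", "datagpu-0.1.1.tar.gz")` should return `True` for an intact download; the `downloads/` path is only a placeholder.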