
A data processing and analysis pipeline that handles jobs for data transformation, quality assessment, deduplication, and formatting. Pipelines are configured and executed using YAML configuration files.

Project description

🚀 DatasetPipeline

PyPI version · Python 3.10+ · License: MIT · Downloads

Transform messy datasets into ML-ready gold. A powerful, configurable pipeline for dataset processing, quality assessment, and standardization, built by an ML practitioner, for ML practitioners.


🎯 Why DatasetPipeline?

The Problem: You're drowning in data preprocessing chaos. Multiple formats, inconsistent schemas, duplicate records, quality issues, and you're spending more time wrangling data than training models.

The Solution: DatasetPipeline automates the entire journey from raw data to model-ready datasets with reproducible, configurable workflows.

Born from Real-World Pain 🔥

This project emerged from my experience as a data engineer and MLOps practitioner. I was constantly:

  • Ingesting diverse datasets for LLM fine-tuning
  • Converting everything to OpenAI-compatible formats
  • Writing repetitive preprocessing scripts
  • Juggling deduplication, quality checks, and format conversions
  • Maintaining brittle pipelines across multiple projects

What started as manageable became overwhelming. DatasetPipeline was built to solve these exact pain points, turning hours of manual work into minutes of configuration.


✨ Features

  • 🔌 Multi-Source Loading: Hugging Face datasets, local files, cloud storage
  • 🔄 Format Flexibility: SFT, DPO, conversational, text; convert between any format
  • 🧹 Smart Deduplication: Semantic similarity using embeddings, not just exact matches
  • 📊 Quality Analysis: Automated categorization and quality scoring
  • ⚙️ YAML Configuration: Reproducible workflows, version-controlled pipelines
  • 🖥️ CLI Interface: Simple commands, powerful automation
  • 🚀 GPU Acceleration: Optional GPU support for heavy processing

🚀 Quick Start

Installation

# Recommended: Use as isolated tool
uv tool install datasetpipeline

# Or with pip
pip install datasetpipeline

# For full features (embeddings, GPU support)
pip install "datasetpipeline[all]"

Your First Pipeline

# Generate a minimal sample configuration with comments
datasetpipeline sample my-first-job.yml --template minimal

# Or generate a full sample with all options and comments
datasetpipeline sample my-first-job.yml --template full

# Run the pipeline
datasetpipeline run my-first-job.yml

# That's it! 🎉

⚙️ Configuration Guidelines

🚨 Important Configuration Rule

When disabling pipeline components, you must keep the section keys present with null values. Never completely remove the top-level keys.

✅ Correct Way to Disable Components

load:
  huggingface:
    path: "teknium/OpenHermes-2.5"
    split: "train"

format:
  sft:
    use_openai: true

# Disable deduplication - keep the key with null
deduplicate: null

# Disable analysis - keep the key with null  
analyze: null

save:
  local:
    directory: "output"
    filetype: "jsonl"

โŒ Wrong Way (Will Cause Errors)

load:
  huggingface:
    path: "teknium/OpenHermes-2.5"
    split: "train"

format:
  sft:
    use_openai: true

# DON'T DO THIS - completely removing keys
# deduplicate: <-- missing entirely
# analyze: <-- missing entirely

save:
  local:
    directory: "output"
    filetype: "jsonl"

💡 Alternative: Comment Out Values, Keep Keys

load:
  huggingface:
    path: "teknium/OpenHermes-2.5"
    split: "train"

format:
  sft:
    use_openai: true

# Temporarily disable deduplication
deduplicate:
  # semantic:
  #   threshold: 0.85
  #   column: "messages"

# Disable analysis for now
analyze:
  # quality:
  #   column_name: "messages"
  #   categories: ["code", "reasoning"]

save:
  local:
    directory: "output"
    filetype: "jsonl"

Why This Matters

DatasetPipeline expects all major pipeline sections (load, format, deduplicate, analyze, save) to be present in the configuration. This design ensures:

  • Consistent pipeline structure across all jobs
  • Clear intent - you explicitly choose to skip steps rather than forgetting them
  • Easy re-enablement - uncomment values instead of rewriting sections
  • Better error messages - the pipeline knows what you intended

🎛️ Managing Configuration Complexity

Problem: The full sample configuration can be overwhelming, since it includes every option and comment.

Solutions:

  1. Start minimal - Use --template minimal as a starting point for clean, simple configs
  2. Use templates - Pre-built configurations for common use cases (--template sft, --template dpo, --template analysis)
  3. Progressive enhancement - Start simple, add complexity as needed
  4. Reference mode - Use --template full when you need to see all available options

📖 Real-World Example

Transform a Hugging Face dataset into training-ready format:

# jobs/sft-training.yml
load:
  huggingface:
    path: "teknium/OpenHermes-2.5"
    split: "train"
    take_rows: 10000

format:
  sft:
    use_openai: true
    column_role_map:
      system: "system"
      human: "user" 
      gpt: "assistant"

deduplicate:
  semantic:
    threshold: 0.85
    column: "messages"

analyze:
  quality:
    column_name: "messages"
    categories: ["code", "reasoning", "creative", "factual"]

save:
  local:
    directory: "training_data"
    filetype: "jsonl"

Then run it:

datasetpipeline run jobs/sft-training.yml

Result: Clean, deduplicated, standardized training data ready for your LLM fine-tuning pipeline.


🛠️ Core Commands & Sample Generation

Command Reference

  • list: Preview available jobs. Example: datasetpipeline list jobs/
  • run: Execute pipeline(s). Example: datasetpipeline run jobs/my-job.yml
  • sample: Generate template configs. Example: datasetpipeline sample new-job.yml --template=minimal

Batch Processing

# Process all jobs in a directory
datasetpipeline run jobs/

# Preview what will run
datasetpipeline list jobs/

🏗️ Pipeline Components

📥 Data Loading

  • Hugging Face: Direct dataset integration
  • Local Files: JSON, CSV, Parquet, JSONL
  • Cloud Storage: S3, GCS (coming soon)

🔧 Data Formatting

  • SFT (Supervised Fine-Tuning): OpenAI chat format
  • DPO (Direct Preference Optimization): Preference pairs
  • Conversational: Multi-turn dialogue format
  • Text: Simple text processing
  • Custom Merging: Combine multiple fields intelligently

🧹 Deduplication

  • Semantic: Embedding-based similarity detection (see the sketch after this list)
  • Exact: Hash-based duplicate removal
  • Fuzzy: Near-duplicate detection
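
The semantic strategy above boils down to embedding each record and dropping rows that are too similar to one already kept. The sketch below illustrates that general technique with sentence-transformers (the same model family used in the configuration examples further down) and a cosine-similarity threshold; it is a conceptual example, not DatasetPipeline's internal implementation.

# Conceptual sketch of embedding-based deduplication, not DatasetPipeline internals.
# Requires: pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_dedup(texts: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first occurrence from each group of semantically similar texts."""
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(texts, normalize_embeddings=True)
    similarity = cosine_similarity(embeddings)
    kept: list[int] = []
    for i in range(len(texts)):
        # Keep row i only if it stays below the threshold against every kept row.
        if all(similarity[i][j] < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]

print(semantic_dedup([
    "How do I sort a list in Python?",
    "What's the easiest way to sort a Python list?",
    "Explain gradient descent in simple terms.",
]))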

📊 Quality Analysis

  • Automated Categorization: Code, math, reasoning, creative writing
  • Quality Scoring: Length, complexity, coherence metrics (illustrated in the sketch below)
  • Custom Categories: Define your own quality dimensions
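
As a rough illustration of the length- and complexity-style metrics mentioned above, here is a toy scoring function. It only sketches the general idea; DatasetPipeline's actual scoring and categorization logic may rely on entirely different signals.

# Toy illustration of length/diversity-based quality scoring, not the library's actual metrics.
import re

def quality_score(text: str) -> float:
    """Return a 0..1 score from simple length and lexical-diversity heuristics."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    length_score = min(len(words) / 200, 1.0)       # saturates around ~200 words
    diversity_score = len(set(words)) / len(words)  # unique-to-total word ratio
    return round(0.5 * length_score + 0.5 * diversity_score, 3)

print(quality_score("Short answer."))
print(quality_score("A longer response that walks through the reasoning step by step, with varied vocabulary and concrete examples."))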

💾 Data Saving

  • Multiple Formats: Parquet, JSONL, CSV
  • Flexible Output: Local files, structured directories
  • Metadata: Pipeline provenance and statistics

📁 Project Structure

datasetpipeline/
├── 📦 app/
│   ├── 🔬 analyzer/       # Quality analysis & categorization
│   ├── 🧹 dedup/          # Deduplication algorithms
│   ├── 🔄 format/         # Data format transformations
│   ├── 📥 loader/         # Multi-source data loading
│   ├── 💾 saver/          # Output handling
│   └── 🛠️ helpers/        # Utilities & common functions
├── ⚙️ jobs/               # YAML configurations
├── 📊 processed/          # Pipeline outputs
└── 📜 scripts/            # Additional utilities

🎨 Advanced Configuration

Conditional Processing

load:
  huggingface:
    path: "my-dataset"
    filters:
      quality_score: ">= 0.8"
      language: "en"

format:
  sft:
    use_openai: true
    min_message_length: 10
    max_conversation_turns: 20

# Skip deduplication for this job
deduplicate: null

analyze:
  quality:
    column_name: "text"
    min_score: 0.7
    categories: ["technical", "creative"]
    save_analysis: true

save:
  local:
    directory: "filtered_data"
    filetype: "parquet"

Quality-Based Filtering

load:
  local:
    path: "raw_data.jsonl"

# Skip formatting - data is already in correct format
format: null

deduplicate:
  exact:
    column: "content"

analyze:
  quality:
    column_name: "text"
    min_score: 0.7
    categories: ["technical", "creative"]
    save_analysis: true

save:
  local:
    directory: "cleaned_data"
    filetype: "jsonl"

Custom Deduplication

load:
  huggingface:
    path: "my-dataset"

format:
  text:
    column: "content"

deduplicate:
  semantic:
    threshold: 0.9
    model: "sentence-transformers/all-MiniLM-L6-v2"
    batch_size: 32
    use_gpu: true

# Skip analysis for faster processing
analyze: null

save:
  local:
    directory: "deduped_data"
    filetype: "parquet"

🏗️ Extensible Architecture

DatasetPipeline is built with extensibility at its core. Each major component uses Abstract Base Classes (ABC), making it incredibly easy to add new functionality:

# Want a new data loader? Just extend BaseLoader
class MyCustomLoader(BaseLoader):
    def load(self) -> Dataset:
        # Your custom loading logic
        pass

# Need a specialized formatter? Extend BaseFormatter  
class MyFormatter(BaseFormatter):
    def format(self, dataset: Dataset) -> Dataset:
        # Your formatting logic
        pass

🔌 Pluggable Components

  • 📥 Loaders (BaseLoader): New data sources (APIs, databases, cloud storage)
  • 🔄 Formatters (BaseFormatter): Custom data transformations and schemas
  • 🧹 Deduplicators (BaseDeduplicator): Novel similarity algorithms
  • 📊 Analyzers (BaseAnalyzer): Domain-specific quality metrics
  • 💾 Savers (BaseSaver): New output formats and destinations

🚀 Contribution-Friendly

This architecture means:

  • Low barrier to entry: Add one component without touching others
  • Clean interfaces: Well-defined contracts between components
  • Easy testing: Mock and test components in isolation
  • Community growth: Contributors can focus on their expertise area

Example: Want to add PostgreSQL loading? Just implement BaseLoader and you're done!
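
A hypothetical PostgreSQL loader might look roughly like the sketch below. The BaseLoader import path and the constructor arguments are assumptions made for illustration; check the loader module in the repository for the real interface.

# Hypothetical sketch of a PostgreSQL loader built on the BaseLoader pattern shown above.
import pandas as pd
from datasets import Dataset
from sqlalchemy import create_engine

from datasetpipeline.app.loader import BaseLoader  # assumed import path, adjust to the real one


class PostgresLoader(BaseLoader):
    def __init__(self, connection_url: str, query: str):
        self.connection_url = connection_url
        self.query = query

    def load(self) -> Dataset:
        # Pull rows into a DataFrame, then wrap them as a Hugging Face Dataset.
        engine = create_engine(self.connection_url)
        df = pd.read_sql(self.query, engine)
        return Dataset.from_pandas(df)


# Example usage with a made-up connection string and query:
loader = PostgresLoader(
    "postgresql://user:password@localhost:5432/mydb",
    "SELECT prompt, response FROM training_samples",
)
dataset = loader.load()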


๐Ÿƒโ€โ™‚๏ธ Performance Tips

  • GPU Acceleration: Install with [gpu] extras for faster embeddings
  • Batch Processing: Use larger batch sizes for better throughput
  • Memory Management: Process large datasets in chunks
  • Caching: Embeddings are cached automatically
# High-performance setup
pip install "datasetpipeline[gpu]"
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

🤝 Contributing

We welcome contributions! Whether you're fixing bugs, adding features, or improving documentation:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Test your changes thoroughly
  4. Submit a pull request

Development Setup

git clone https://github.com/subhayu99/datasetpipeline
cd datasetpipeline
uv pip install -e ".[dev]"
pre-commit install

Areas Where We Need Help

  • 🌐 Cloud storage integrations (S3, GCS, Azure)
  • 🔍 Advanced quality metrics
  • 📈 Performance optimizations
  • 📚 Documentation and examples
  • 🧪 Test coverage improvements

📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

Built with love by the ML community, for the ML community. Special thanks to all contributors and users who help make dataset preparation less painful.

Star the repo if DatasetPipeline saves you time! ⭐


Made with ❤️ by Subhayu Kumar Bala



Download files


Source Distribution

datasetpipeline-0.2.0.tar.gz (281.9 kB)


Built Distribution


datasetpipeline-0.2.0-py3-none-any.whl (80.0 kB)


File details

Details for the file datasetpipeline-0.2.0.tar.gz.

File metadata

  • Download URL: datasetpipeline-0.2.0.tar.gz
  • Upload date:
  • Size: 281.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes

Hashes for datasetpipeline-0.2.0.tar.gz

  • SHA256: fd9c70028182298c558c7d767b293a97734ed4dcb38544fa05f1a6748c2685ce
  • MD5: 87d0edcb7b3e6f3e8106ffaf5963adbb
  • BLAKE2b-256: 8ef9f726990eb91031a2039a58ba70f4eb18cb516b5617cbba26ff8b0b63e660


File details

Details for the file datasetpipeline-0.2.0-py3-none-any.whl.


File hashes

Hashes for datasetpipeline-0.2.0-py3-none-any.whl

  • SHA256: 504a5c6e08d20e3c4ebddb9dc83fc462118c8b8f239cab6583dc0a01e44fbfb5
  • MD5: 10af0245b48d02934a0b5ab70f68bb70
  • BLAKE2b-256: 191ff6dba91fb22c906347e6c25008afdf33bd0b1f20040c3d6d1d1ba5b9672f

