DatasetPipeline

A data processing and analysis pipeline that handles jobs for data transformation, quality assessment, deduplication, and formatting. Pipelines are configured and executed through YAML configuration files.

Features

  • Multi-source data loading: Load from Hugging Face datasets, local files, and more
  • Flexible data formatting: Convert between different formats (SFT, DPO, conversational, text)
  • Advanced deduplication: Semantic deduplication using embeddings
  • Quality analysis: Automated quality assessment and categorization
  • Configurable pipeline: YAML-based configuration for reproducible workflows
  • CLI interface: Easy-to-use command-line interface

Installation

From PyPI (Recommended)

# Use as a uv tool (isolated environment)
uv tool install datasetpipeline

# Or install as a package with pip
pip install datasetpipeline

# Or install as a package with uv
uv pip install datasetpipeline

Optional Dependencies

# Full embeddings support
pip install "datasetpipeline[full]"

# GPU acceleration
pip install "datasetpipeline[gpu]"

# All features
pip install "datasetpipeline[all]"

# With uv tool
uv tool install "datasetpipeline[full]"

Quick Start

After installation, you can use the CLI tool directly:

# Check available commands
datasetpipeline --help

# Or use the short alias
dsp --help

Usage

Listing Jobs

To list the jobs defined in a configuration file or a directory of configurations:

datasetpipeline list jobs/
datasetpipeline list jobs/config.yml

Running the Pipeline

To run a pipeline based on configuration files:

# Run all jobs in a directory
datasetpipeline run jobs/

# Run a specific job configuration
datasetpipeline run jobs/aeroboros-conv.yml

Generating Sample Configuration

To generate a sample job configuration:

# Print to stdout
datasetpipeline sample

# Save to file
datasetpipeline sample my-job.yml
datasetpipeline sample my-job.json

Configuration

Job configurations are defined in YAML format. Each configuration specifies the complete pipeline: loading, formatting, deduplication, analysis, and saving.

Example Configuration

# jobs/example-job.yml
load:
  huggingface:
    path: "davanstrien/data-centric-ml-sft"
    split: "train"
    take_rows: 1000              # only pull the first 1000 rows

format:
  merger:
    user:
      # join these columns into a single "human" column, separated by newlines
      fields: ["book_id", "author", "text"]
      separator: "\n"
      merged_field: "human"
  sft:
    use_openai: false
    # map dataset columns to chat roles for SFT-style messages
    column_role_map:
      persona: "system"
      human: "user"
      summary: "assistant"

deduplicate:
  semantic:
    threshold: 0.8               # similarity threshold for flagging near-duplicates
    column: "messages"

analyze:
  quality:
    column_name: "messages"
    categories: ["code", "math", "science", "literature"]

save:
  local:
    directory: "processed"       # output directory
    filetype: "parquet"
Configuration Sections

  • load: Configure data sources (Hugging Face, local files)
  • format: Transform data between formats (SFT, DPO, conversational, text)
  • deduplicate: Remove duplicate entries using semantic similarity
  • analyze: Perform quality analysis and categorization
  • save: Save processed data locally or to cloud storage
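
Not every stage has to appear in a job. Assuming the format, deduplicate, and analyze sections can be omitted (an assumption, not something stated above), a minimal job might declare just a source and a sink, reusing the same keys as the example configuration:

# jobs/minimal-job.yml (hypothetical minimal configuration)
load:
  huggingface:
    path: "davanstrien/data-centric-ml-sft"
    split: "train"

save:
  local:
    directory: "processed"
    filetype: "parquet"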

Directory Structure

app/
├── analyzer/          # Data quality analysis modules
├── dedup/             # Deduplication logic
├── format/            # Data formatting transformations
├── helpers/           # Utility functions and helpers
├── loader/            # Data loading from various sources
├── models/            # Pydantic data models
├── saver/             # Data saving utilities
├── translators/       # Data translation modules
├── cli.py             # CLI entry point
├── constants.py       # Application constants
├── job.py             # Job configuration and execution
├── pipeline.py        # Pipeline orchestration
└── sample_job.py      # Sample configuration

jobs/                  # YAML job configurations (default)
processed/             # Output directory for processed data (default)
scripts/               # Additional utility scripts

Development

Setting up Development Environment

# Clone and install in development mode
git clone https://github.com/subhayu99/datasetpipeline
cd datasetpipeline
uv pip install -e ".[dev]"

# Install pre-commit hooks (optional)
pre-commit install

Running Tests

pytest
pytest --cov=app  # With coverage

Code Formatting

black app/
flake8 app/
mypy app/

Optional Dependencies

  • full: Complete embeddings support with transformers
  • dev: Development and testing tools
  • gpu: GPU acceleration for embeddings and deduplication
  • all: All optional dependencies

Install specific groups:

uv pip install "datasetpipeline[full,gpu]"

Examples

Basic Text Processing

# Create a simple job configuration
datasetpipeline sample simple-job.yml

# Edit the configuration as needed
# Then run it
datasetpipeline run simple-job.yml

Batch Processing

# Process multiple job configurations
datasetpipeline run jobs/

# List all jobs first to preview
datasetpipeline list jobs/

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and ensure code quality
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
