DatasetPipeline

A data processing and analysis pipeline that handles jobs for data transformation, quality assessment, deduplication, and formatting. Pipelines are configured and executed through YAML configuration files.

Features

  • Multi-source data loading: Load from Hugging Face datasets, local files, and more
  • Flexible data formatting: Convert between different formats (SFT, DPO, conversational, text)
  • Advanced deduplication: Semantic deduplication using embeddings
  • Quality analysis: Automated quality assessment and categorization
  • Configurable pipeline: YAML-based configuration for reproducible workflows
  • CLI interface: Easy-to-use command-line interface

Installation

From PyPI (Recommended)

# Use as a uv tool (isolated environment)
uv tool install datasetpipeline

# Or install as a package with pip
pip install datasetpipeline

# Or install as a package with uv
uv pip install datasetpipeline

Optional Dependencies

# Full embeddings support
pip install "datasetpipeline[full]"

# GPU acceleration
pip install "datasetpipeline[gpu]"

# All features
pip install "datasetpipeline[all]"

# With uv tool
uv tool install "datasetpipeline[full]"

Quick Start

After installation, you can use the CLI tool directly:

# Check available commands
datasetpipeline --help

# Or use the short alias
dsp --help

Usage

Listing Jobs

To list the jobs defined in a configuration file or a directory of configurations:

datasetpipeline list jobs/
datasetpipeline list jobs/config.yml

Running the Pipeline

To run a pipeline based on configuration files:

# Run all jobs in a directory
datasetpipeline run jobs/

# Run a specific job configuration
datasetpipeline run jobs/aeroboros-conv.yml

Generating Sample Configuration

To generate a sample job configuration:

# Print to stdout
datasetpipeline sample

# Save to file
datasetpipeline sample my-job.yml
datasetpipeline sample my-job.json

Configuration

Job configurations are defined in YAML format. Each configuration specifies the complete pipeline: loading, formatting, deduplication, analysis, and saving.

Example Configuration

# jobs/example-job.yml
load:
  huggingface:
    path: "davanstrien/data-centric-ml-sft"
    split: "train"
    take_rows: 1000              # only pull the first 1000 rows

format:
  merger:
    user:
      # join these columns into a single "human" column, separated by newlines
      fields: ["book_id", "author", "text"]
      separator: "\n"
      merged_field: "human"
  sft:
    use_openai: false
    # map dataset columns to chat roles for SFT-style messages
    column_role_map:
      persona: "system"
      human: "user"
      summary: "assistant"

deduplicate:
  semantic:
    threshold: 0.8               # similarity threshold for flagging near-duplicates
    column: "messages"

analyze:
  quality:
    column_name: "messages"
    categories: ["code", "math", "science", "literature"]

save:
  local:
    directory: "processed"       # output directory
    filetype: "parquet"
Configuration Sections

  • load: Configure data sources (Hugging Face, local files)
  • format: Transform data between formats (SFT, DPO, conversational, text)
  • deduplicate: Remove duplicate entries using semantic similarity
  • analyze: Perform quality analysis and categorization
  • save: Save processed data locally or to cloud storage
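
Not every stage has to appear in a job. Assuming the format, deduplicate, and analyze sections can be omitted (an assumption, not something stated above), a minimal job might declare just a source and a sink, reusing the same keys as the example configuration:

# jobs/minimal-job.yml (hypothetical minimal configuration)
load:
  huggingface:
    path: "davanstrien/data-centric-ml-sft"
    split: "train"

save:
  local:
    directory: "processed"
    filetype: "parquet"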

Directory Structure

app/
├── analyzer/          # Data quality analysis modules
├── dedup/             # Deduplication logic
├── format/            # Data formatting transformations
├── helpers/           # Utility functions and helpers
├── loader/            # Data loading from various sources
├── models/            # Pydantic data models
├── saver/             # Data saving utilities
├── translators/       # Data translation modules
├── cli.py             # CLI entry point
├── constants.py       # Application constants
├── job.py             # Job configuration and execution
├── pipeline.py        # Pipeline orchestration
└── sample_job.py      # Sample configuration

jobs/                  # YAML job configurations (default)
processed/             # Output directory for processed data (default)
scripts/               # Additional utility scripts

Development

Setting up Development Environment

# Clone and install in development mode
git clone https://github.com/subhayu99/datasetpipeline
cd datasetpipeline
uv pip install -e ".[dev]"

# Install pre-commit hooks (optional)
pre-commit install

Running Tests

pytest
pytest --cov=app  # With coverage

Code Formatting

black app/
flake8 app/
mypy app/

Optional Dependencies

  • full: Complete embeddings support with transformers
  • dev: Development and testing tools
  • gpu: GPU acceleration for embeddings and deduplication
  • all: All optional dependencies

Install specific groups:

uv pip install "datasetpipeline[full,gpu]"

Examples

Basic Text Processing

# Create a simple job configuration
datasetpipeline sample simple-job.yml

# Edit the configuration as needed
# Then run it
datasetpipeline run simple-job.yml

Batch Processing

# Process multiple job configurations
datasetpipeline run jobs/

# List all jobs first to preview
datasetpipeline list jobs/

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and ensure code quality
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
