DatasetPipeline
A data processing and analysis pipeline for data transformation, quality assessment, deduplication, and formatting. Pipelines are configured and executed through YAML configuration files.
Features
- Multi-source data loading: Load from Hugging Face datasets, local files, and more
- Flexible data formatting: Convert between different formats (SFT, DPO, conversational, text)
- Advanced deduplication: Semantic deduplication using embeddings
- Quality analysis: Automated quality assessment and categorization
- Configurable pipeline: YAML-based configuration for reproducible workflows
- CLI interface: Easy-to-use command-line interface
Installation
From PyPI (Recommended)
# Use as a uv tool (isolated environment)
uv tool install datasetpipeline
# Or install as a package with pip
pip install datasetpipeline
# Or install as a package with uv
uv pip install datasetpipeline
Optional Dependencies
# Full embeddings support
pip install "datasetpipeline[full]"
# GPU acceleration
pip install "datasetpipeline[gpu]"
# All features
pip install "datasetpipeline[all]"
# With uv tool
uv tool install "datasetpipeline[full]"
Quick Start
After installation, you can use the CLI tool directly:
# Check available commands
datasetpipeline --help
# Or use the short alias
dsp --help
Usage
Listing Jobs
To list all jobs in a pipeline configuration:
datasetpipeline list jobs/
datasetpipeline list jobs/config.yml
Running the Pipeline
To run a pipeline based on configuration files:
# Run all jobs in a directory
datasetpipeline run jobs/
# Run a specific job configuration
datasetpipeline run jobs/aeroboros-conv.yml
Generating Sample Configuration
To generate a sample job configuration:
# Print to stdout
datasetpipeline sample
# Save to file
datasetpipeline sample my-job.yml
datasetpipeline sample my-job.json
Configuration
Job configurations are defined in YAML format. Each configuration specifies the complete pipeline: loading, formatting, deduplication, analysis, and saving.
Example Configuration
# jobs/example-job.yml
load:
  huggingface:
    path: "davanstrien/data-centric-ml-sft"
    split: "train"
    take_rows: 1000
format:
  merger:
    user:
      fields: ["book_id", "author", "text"]
      separator: "\n"
      merged_field: "human"
  sft:
    use_openai: false
    column_role_map:
      persona: "system"
      human: "user"
      summary: "assistant"
deduplicate:
  semantic:
    threshold: 0.8
    column: "messages"
analyze:
  quality:
    column_name: "messages"
    categories: ["code", "math", "science", "literature"]
save:
  local:
    directory: "processed"
    filetype: "parquet"
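To make the format stage concrete, here is a sketch of how a single row might flow through the job above. The input columns and the exact shape of the resulting messages column are assumptions inferred from the config keys, not confirmed output:
# Hypothetical input row:
#   book_id: "bk-001", author: "Jane Doe", text: "Chapter one...",
#   persona: "You are a librarian.", summary: "A short synopsis."
# merger joins book_id, author, and text with "\n" into a new "human" column;
# sft then maps columns to chat roles via column_role_map, yielding (assumed):
messages:
  - role: "system"
    content: "You are a librarian."
  - role: "user"
    content: "bk-001\nJane Doe\nChapter one..."
  - role: "assistant"
    content: "A short synopsis."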
Configuration Sections
- load: Configure data sources (Hugging Face, local files)
- format: Transform data between formats (SFT, DPO, conversational, text)
- deduplicate: Remove duplicate entries using semantic similarity
- analyze: Perform quality analysis and categorization
- save: Save processed data locally or to cloud storage
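As a minimal sketch, a job might consist of just a source and a sink; this assumes the intermediate sections (format, deduplicate, analyze) are optional, which the documentation above does not state explicitly:
# jobs/minimal-job.yml (hypothetical minimal job)
load:
  huggingface:
    path: "davanstrien/data-centric-ml-sft"
    split: "train"
save:
  local:
    directory: "processed"
    filetype: "parquet"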
Directory Structure
app/
├── analyzer/ # Data quality analysis modules
├── dedup/ # Deduplication logic
├── format/ # Data formatting transformations
├── helpers/ # Utility functions and helpers
├── loader/ # Data loading from various sources
├── models/ # Pydantic data models
├── saver/ # Data saving utilities
├── translators/ # Data translation modules
├── cli.py # CLI entry point
├── constants.py # Application constants
├── job.py # Job configuration and execution
├── pipeline.py # Pipeline orchestration
└── sample_job.py # Sample configuration
jobs/ # YAML job configurations (default)
processed/ # Output directory for processed data (default)
scripts/ # Additional utility scripts
Development
Setting up Development Environment
# Clone and install in development mode
git clone https://github.com/subhayu99/datasetpipeline
cd datasetpipeline
uv pip install -e ".[dev]"
# Install pre-commit hooks (optional)
pre-commit install
Running Tests
pytest
pytest --cov=app # With coverage
Code Formatting
black app/
flake8 app/
mypy app/
Optional Dependencies
- full: Complete embeddings support with transformers
- dev: Development and testing tools
- gpu: GPU acceleration for embeddings and deduplication
- all: All optional dependencies
Install specific groups:
uv pip install "datasetpipeline[full,gpu]"
Examples
Basic Text Processing
# Create a simple job configuration
datasetpipeline sample simple-job.yml
# Edit the configuration as needed
# Then run it
datasetpipeline run simple-job.yml
Batch Processing
# Process multiple job configurations
datasetpipeline run jobs/
# List all jobs first to preview
datasetpipeline list jobs/
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests and ensure code quality
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.