
Production-grade data collection and processing pipeline for training LLMs and multimodal AI

Project description

Auralith Data Pipeline

Production-grade multimodal data processing pipeline for training RT-DLM and large-scale AI systems.

Python 3.10+ | License: Apache 2.0 | Code style: black | PyPI


Overview

Auralith Data Pipeline ingests raw text, images, audio, video, and code, applies production-quality curation (perplexity filtering, LLM-as-Judge scoring, FAISS deduplication, PII scrubbing, license detection), tokenizes everything through BPE + Vector Quantization, and outputs SafeTensors shards ready for distributed model training.

Pipeline Stages

Stage             What happens
Ingestion         Text (HuggingFace, Common Crawl, local), images (.npy/JPEG/PNG), audio (.wav/.npy), video (.mp4), code (TheStack)
Quality Curation  GPT-2 perplexity filter, LLM-as-Judge scoring, FAISS embedding dedup, license detection
Tokenization      BPE for text, patch + VQ for images, mel + VQ for audio, frame + VQ for video
Sharding          SafeTensors v2 schema with input_ids, attention_mask, modality_mask, targets
Observability     MLflow / W&B tracking, per-sample lineage, auto data cards
Orchestration     Argo Workflows, Helm/K8s, Ray, distributed coordinator + workers

Installation

# Core text pipeline
pip install auralith-data-pipeline

# With all extras (multimodal, cloud, distributed, dev tools)
pip install "auralith-data-pipeline[all]"

# Pick only what you need
pip install "auralith-data-pipeline[quality]"        # + perplexity filter + FAISS dedup
pip install "auralith-data-pipeline[distributed]"    # + Ray
pip install "auralith-data-pipeline[cloud,pdf]"      # + S3/GCS/Azure + PDF extraction
pip install "auralith-data-pipeline[multimodal]"     # + video/image/audio (PyTorch)
pip install "auralith-data-pipeline[tracking]"       # + MLflow + W&B

Quick Start

CLI

# List available datasets
auralith-pipeline list-datasets

# Process Wikipedia dataset
auralith-pipeline collect \
  --dataset wikipedia \
  --output ./data/shards \
  --max-samples 100000 \
  --preset production

End-to-End Workflow

# 1. Train tokenizers (BPE + VQ codebooks)
auralith-pipeline train-tokenizer all \
  --corpus  data/corpus/ \
  --images  data/images/ \
  --audio   data/audio/ \
  --videos  data/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024

# 2. Process raw data into SafeTensors shards
auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

# 3. Upload to cloud storage or HuggingFace Hub
auralith-pipeline upload --source shards/ --dest s3://my-bucket/training-data/

Python API

from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

config = PipelineConfig.from_preset("production")
pipeline = Pipeline(config)
source = create_source("wikipedia", streaming=True, max_samples=1_000_000)
pipeline.add_source(source)

stats = pipeline.run()
print(stats.summary())
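
Several corpora can be queued on one pipeline before a single run, in which case the weighted round-robin interleaving described under Key Features below applies. The sketch reuses only the calls shown above and assumes add_source may be called repeatedly, which the multi-source interleaving feature suggests; dataset names come from the Available Datasets table.

pipeline.add_source(create_source("c4", streaming=True, max_samples=500_000))
pipeline.add_source(create_source("the_stack", streaming=True, max_samples=250_000))

stats = pipeline.run()
print(stats.summary())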

Key Features

Data Processing

  • Multi-source ingestion (HuggingFace, Common Crawl, local files, video)
  • Weighted round-robin interleaving across multiple sources
  • MinHash + FAISS embedding deduplication (see the dedup sketch after this list)
  • Quality filtering (length, language, perplexity, LLM-as-Judge)
  • PII removal (multi-jurisdiction, 15+ countries)
  • License compliance scanning for code data
  • Document extraction (PDF, DOCX, HTML, Markdown)
  • SafeTensors sharding with Zstd compression and SHA-256 checksums
  • Streaming checkpointing with seeded reproducibility
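
The MinHash stage can be pictured with the datasketch library. This is an illustration only: the package ships its own deduplication internals, and the 0.8 similarity threshold is an assumed value.

from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """Build a MinHash signature from whitespace-tokenized text."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # assumed Jaccard threshold

def is_duplicate(doc_id, text):
    """True if a near-duplicate was already indexed; otherwise index this doc."""
    sig = minhash_of(text)
    if lsh.query(sig):
        return True
    lsh.insert(doc_id, sig)
    return False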

Tokenization

  • Custom BPE tokenizer with 16 special tokens and byte-level fallback
  • Vector quantization for images, audio, and video (see the sketch after this list)
  • Multimodal token fusion with encode_with_mask()
  • Configurable vocab size (32k-128k)
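
Vector quantization turns continuous image/audio/video features into discrete token ids by nearest-codebook lookup. A minimal NumPy sketch of that lookup; the trained codebooks come from the train-tokenizer step, and the shapes here are illustrative.

import numpy as np

def vq_tokenize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) float array (patch, mel-frame, or video-frame embeddings)
    codebook: (K, D) float array, e.g. K = 1024 as in --codebook-size above
    returns:  (N,) int array of discrete token ids in [0, K)
    """
    # Squared Euclidean distance from every feature to every code vector
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)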

Distributed Processing

  • Embedded mode — in-process coordinator + workers (no Redis needed)
  • External mode — multi-machine with Redis state store
  • Worker failure detection + automatic task requeue (sketched below)
  • Linear scaling up to 64+ workers
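
Failure detection plus requeue reduces to a heartbeat timeout on in-flight tasks. A toy sketch of the idea only; the real coordinator, timeout value, and data structures are internal to the package.

import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; assumed value for illustration

def requeue_stale_tasks(in_flight, pending, now=None):
    """Move tasks whose worker missed its heartbeat back onto the pending queue.

    in_flight: dict of task_id -> (worker_id, last_heartbeat_ts)
    pending:   list used as a FIFO of task ids awaiting assignment
    """
    now = time.time() if now is None else now
    for task_id, (_worker, last_beat) in list(in_flight.items()):
        if now - last_beat > HEARTBEAT_TIMEOUT:
            del in_flight[task_id]
            pending.append(task_id)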

Observability & Compliance

  • MLflow / Weights & Biases experiment tracking
  • Per-sample lineage (source to shard provenance)
  • Auto-generated data cards (HuggingFace-compatible)
  • Full audit logging (JSONL) for accept/reject decisions (see the sketch after this list)
  • Credential and secret sanitization
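
A JSONL audit trail is one record per accept/reject decision, appended as it happens. A minimal sketch; the field names are illustrative, not the package's actual schema.

import json
import time

def log_decision(path, sample_id, accepted, reason):
    """Append one accept/reject decision as a JSONL audit record."""
    record = {
        "ts": time.time(),
        "sample_id": sample_id,
        "accepted": accepted,
        "reason": reason,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_decision("audit.jsonl", "cc-000123", False, "perplexity > 1500")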

SafeTensors Schema (v2)

Every output shard is directly compatible with RT-DLM training.

Tensor          Dtype  Shape             Description
input_ids       int32  (batch, seq_len)  All tokens (text + image + audio + video + code)
attention_mask  uint8  (batch, seq_len)  1 = real token, 0 = padding
modality_mask   uint8  (batch, seq_len)  0 = text, 1 = image, 2 = audio, 3 = video, 4 = code
targets         int32  (batch, seq_len)  input_ids shifted by one position (next-token labels for causal LM)
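
Shards load with the stock safetensors reader. A sketch that reads one shard and checks the next-token relationship between input_ids and targets; the shard filename and the exact handling of padded positions are assumptions.

from safetensors.numpy import load_file

shard = load_file("shards/shard-00000.safetensors")  # hypothetical filename

input_ids = shard["input_ids"]       # (batch, seq_len) int32
attention = shard["attention_mask"]  # (batch, seq_len) uint8
targets   = shard["targets"]         # (batch, seq_len) int32

# Causal-LM check: targets[t] should hold the token that follows input_ids[t],
# ignoring padding positions (assumed convention).
real = attention[:, 1:] == 1
assert (targets[:, :-1] == input_ids[:, 1:])[real].all()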

Special Tokens

ID     Token                  Purpose
0      <PAD>                  Padding
1      <UNK>                  Unknown
2      <BOS>                  Beginning of sequence
3      <EOS>                  End of sequence
4-5    <IMG> / <IMG_END>      Image region
6-7    <AUDIO> / <AUDIO_END>  Audio region
8-9    <VIDEO> / <VIDEO_END>  Video region
10     <FUSE>                 Cross-modal fusion
11     <SEP>                  Separator
12     <MASK>                 Masked LM
13-14  <CODE> / <CODE_END>    Code block
15     <THINK>                Chain-of-thought
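
Using the ids above, a fused text-plus-image sequence wraps the image tokens between <IMG> and <IMG_END>. A hand-rolled sketch of that layout; the pipeline's encode_with_mask() produces this for you, and the exact ordering here is an assumption.

BOS, EOS = 2, 3
IMG, IMG_END = 4, 5

def wrap_image_region(text_ids, image_ids):
    """Lay out one text+image sample per the special-token table above."""
    return [BOS, *text_ids, IMG, *image_ids, IMG_END, EOS]

print(wrap_image_region([100, 101, 102], [2000, 2001]))
# -> [2, 100, 101, 102, 4, 2000, 2001, 5, 3]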

Available Datasets

Dataset      Size    Description
wikipedia    20 GB   English Wikipedia
c4           750 GB  Cleaned Common Crawl
redpajama    1.2 TB  Open reproduction of the LLaMA training data
openwebtext  40 GB   Text from pages linked on Reddit
bookcorpus   5 GB    11k books
the_stack    3 TB    Source code (deduplicated)

Performance

Operation              Speed
Text preprocessing     10k samples/sec
MinHash deduplication  5k samples/sec
FAISS dedup            3k samples/sec
BPE encoding           < 1 ms/sample
SafeTensors writing    50 MB/s
Image tokenization     50 ms/image
Video tokenization     200 ms/video

Configuration

# configs/production.yaml
pipeline:
  name: production-pipeline
  output_dir: ./data/shards
  deduplicate: true
  quality_filter: true
  remove_pii: true
  seed: 42
  checkpoint_every: 10000

advanced_quality:
  enabled: true
  perplexity_filter: true
  max_perplexity: 1500.0
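
The config is plain nested YAML, so it round-trips through PyYAML; whether PipelineConfig consumes this dict directly is not shown above and is left as an assumption.

import yaml  # PyYAML

with open("configs/production.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["pipeline"]["output_dir"])              # ./data/shards
print(cfg["advanced_quality"]["max_perplexity"])  # 1500.0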

Documentation

For the full documentation, architecture diagrams, distributed processing guide, and contributor guide, visit the GitHub repository.


License

Apache License 2.0 — see LICENSE.


Built by AuralithAI for RT-DLM.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

auralith_data_pipeline-0.1.11.tar.gz (143.2 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

auralith_data_pipeline-0.1.11-py3-none-any.whl (160.3 kB)


File details

Details for the file auralith_data_pipeline-0.1.11.tar.gz.

File metadata

  • Download URL: auralith_data_pipeline-0.1.11.tar.gz
  • Size: 143.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for auralith_data_pipeline-0.1.11.tar.gz
Algorithm Hash digest
SHA256 15a38ffbf4c73c054b3ca29b9fa4dcba82239deef29848b58e57e0b5bb4d6f67
MD5 a49457ea9f381d3665b18e2d3d06f9c1
BLAKE2b-256 64dd840276296e697ebb68a05e323703b174d062d03c89ac3a494d1a969bf712


Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.11.tar.gz:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file auralith_data_pipeline-0.1.11-py3-none-any.whl.

File metadata

  • Download URL: auralith_data_pipeline-0.1.11-py3-none-any.whl
  • Size: 160.3 kB
  • Tags: Python 3

File hashes

Hashes for auralith_data_pipeline-0.1.11-py3-none-any.whl
Algorithm Hash digest
SHA256 700c7959d61eafe8aca26360c21668a532f376d1b669b684f9ab18e52e7648b5
MD5 b8acafbea90214d62ec8973430f33e5d
BLAKE2b-256 14373a2d144a4e765197b3a84472b02d4431f8205bb0d8e4648873cfc557157f


Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.11-py3-none-any.whl:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
