# Auralith Data Pipeline
Production-grade multimodal data processing pipeline for training RT-DLM and large-scale AI systems.
## Overview
Auralith Data Pipeline ingests raw text, images, audio, video, and code; applies production-quality curation (perplexity filtering, LLM-as-Judge scoring, FAISS deduplication, PII scrubbing, license detection); tokenizes everything through BPE and vector quantization; and outputs SafeTensors shards ready for distributed model training.
## Pipeline Stages
| Stage | What happens |
|---|---|
| Ingestion | Text (HuggingFace, Common Crawl, local), images (.npy/JPEG/PNG), audio (.wav/.npy), video (.mp4), code (TheStack) |
| Quality Curation | GPT-2 perplexity filter, LLM-as-Judge scoring, FAISS embedding dedup, license detection |
| Tokenization | BPE for text, patch + VQ for images, mel + VQ for audio, frame + VQ for video |
| Sharding | SafeTensors v2 schema with input_ids, attention_mask, modality_mask, targets |
| Observability | MLflow / W&B tracking, per-sample lineage, auto data cards |
| Orchestration | Argo Workflows, Helm/K8s, Ray, distributed coordinator + workers |
## Installation

```bash
# Core text pipeline
pip install auralith-data-pipeline

# With all extras (multimodal, cloud, distributed, dev tools)
pip install "auralith-data-pipeline[all]"

# Pick only what you need
pip install "auralith-data-pipeline[quality]"      # + perplexity filter + FAISS dedup
pip install "auralith-data-pipeline[distributed]"  # + Ray
pip install "auralith-data-pipeline[cloud,pdf]"    # + S3/GCS/Azure + PDF extraction
pip install "auralith-data-pipeline[multimodal]"   # + video/image/audio (PyTorch)
pip install "auralith-data-pipeline[tracking]"     # + MLflow + W&B
```
## Quick Start

### CLI

```bash
# List available datasets
auralith-pipeline list-datasets

# Process Wikipedia dataset
auralith-pipeline collect \
  --dataset wikipedia \
  --output ./data/shards \
  --max-samples 100000 \
  --preset production
```
### End-to-End Workflow

```bash
# 1. Train tokenizers (BPE + VQ codebooks)
auralith-pipeline train-tokenizer all \
  --corpus data/corpus/ \
  --images data/images/ \
  --audio data/audio/ \
  --videos data/videos/ \
  --output tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024

# 2. Process raw data into SafeTensors shards
auralith-pipeline process \
  --input data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

# 3. Upload to cloud storage or HuggingFace Hub
auralith-pipeline upload --source shards/ --dest s3://my-bucket/training-data/
```
### Python API

```python
from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

config = PipelineConfig.from_preset("production")
pipeline = Pipeline(config)

source = create_source("wikipedia", streaming=True, max_samples=1_000_000)
pipeline.add_source(source)

stats = pipeline.run()
print(stats.summary())
```
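The same entry point handles multiple sources; per the feature list below, sources are interleaved with weighted round-robin. A minimal sketch, assuming `add_source` is called once per source; the `weight=` keyword is a hypothetical illustration of how mixing ratios might be expressed, not a confirmed parameter:

```python
from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

pipeline = Pipeline(PipelineConfig.from_preset("production"))

# Mix a large web corpus with a smaller curated corpus.
# NOTE: `weight=` is hypothetical; check the repository docs for the
# actual interleaving configuration.
for name, weight in [("c4", 0.8), ("wikipedia", 0.2)]:
    pipeline.add_source(create_source(name, streaming=True, weight=weight))

print(pipeline.run().summary())
```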
## Key Features
### Data Processing
- Multi-source ingestion (HuggingFace, Common Crawl, local files, video)
- Weighted round-robin interleaving across multiple sources
- MinHash + FAISS embedding deduplication
- Quality filtering (length, language, perplexity, LLM-as-Judge)
- PII removal (multi-jurisdiction, 15+ countries)
- License compliance scanning for code data
- Document extraction (PDF, DOCX, HTML, Markdown)
- SafeTensors sharding with Zstd compression and SHA-256 checksums (see the verification sketch below)
- Streaming checkpointing with seeded reproducibility
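The SHA-256 checksums on shards can be spot-checked outside the pipeline. A minimal sketch using only the standard library; the `shards/*.safetensors` layout is an assumption, not the pipeline's documented output structure:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large shards never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Assumed layout: one .safetensors file per shard under shards/.
for shard in sorted(Path("shards").glob("*.safetensors")):
    print(shard.name, sha256_of(shard))
```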
### Tokenization

- Custom BPE tokenizer with 16 special tokens and byte-level fallback
- Vector quantization for images, audio, and video
- Multimodal token fusion with `encode_with_mask()` (see the sketch below)
- Configurable vocab size (32k-128k)
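To make the fusion layout concrete, here is an illustration of the kind of sequence `encode_with_mask()` could produce: an `<IMG> ... <IMG_END>` region embedded in text, with a parallel modality mask (0=text, 1=image, matching the schema below). Only the special-token IDs come from the table later in this README; the ordinary token IDs are invented:

```python
# Illustrative only. IDs 2/3/4/5 are <BOS>/<EOS>/<IMG>/<IMG_END> from the
# Special Tokens table; the text and image-code IDs are made up.
text_ids  = [2, 1001, 1002, 1003]      # <BOS> plus three text tokens
image_ids = [4, 8201, 8202, 8203, 5]   # <IMG>, three VQ codes, <IMG_END>
tail_ids  = [1004, 3]                  # one more text token plus <EOS>

input_ids     = text_ids + image_ids + tail_ids
modality_mask = [0] * len(text_ids) + [1] * len(image_ids) + [0] * len(tail_ids)

assert len(input_ids) == len(modality_mask)
print(list(zip(input_ids, modality_mask)))
```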
### Distributed Processing
- Embedded mode — in-process coordinator + workers (no Redis needed)
- External mode — multi-machine with Redis state store
- Worker failure detection + automatic task requeue
- Linear scaling up to 64+ workers (see the Ray-based sketch below)
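Ray ships with the `[distributed]` extra. As one generic pattern for fanning shard work out across workers (not the package's coordinator API, which this README does not spell out), a minimal Ray sketch:

```python
import ray

ray.init()  # in-process cluster; pass address="auto" to join a multi-node cluster

@ray.remote
def process_shard(path: str) -> int:
    # Placeholder for the real per-shard work (tokenize, filter, write).
    return len(path)

# One task per shard; Ray retries failed tasks by default, which mirrors the
# failure-detection-plus-requeue behavior described in the list above.
futures = [process_shard.remote(p) for p in ["shard-000", "shard-001"]]
print(ray.get(futures))
```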
### Observability & Compliance
- MLflow / Weights & Biases experiment tracking
- Per-sample lineage (source to shard provenance)
- Auto-generated data cards (HuggingFace-compatible)
- Full audit logging (JSONL) for accept/reject decisions (see the format sketch below)
- Credential and secret sanitization
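The JSONL audit format itself is not specified here; a minimal sketch of what per-sample accept/reject records could look like, with field names assumed for illustration:

```python
import json
from datetime import datetime, timezone

def log_decision(log_path: str, sample_id: str, accepted: bool, reason: str) -> None:
    """Append one accept/reject decision as a single JSON line."""
    record = {
        "sample_id": sample_id,   # field names are assumed,
        "accepted": accepted,     # not the pipeline's documented schema
        "reason": reason,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_decision("audit.jsonl", "wiki-000123", False, "perplexity > 1500")
```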
## SafeTensors Schema (v2)

Every output shard is directly compatible with RT-DLM training.

| Tensor | Dtype | Shape | Description |
|---|---|---|---|
| `input_ids` | int32 | (batch, seq_len) | All tokens (text + image + audio + video + code) |
| `attention_mask` | uint8 | (batch, seq_len) | 1 = real token, 0 = padding |
| `modality_mask` | uint8 | (batch, seq_len) | 0=text, 1=image, 2=audio, 3=video, 4=code |
| `targets` | int32 | (batch, seq_len) | Right-shifted input_ids for causal LM |
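A minimal sketch of loading a shard against this schema with the `safetensors` library; the shard filename is an assumption:

```python
import numpy as np
from safetensors.numpy import load_file

tensors = load_file("shards/shard-00000.safetensors")  # filename assumed
input_ids = tensors["input_ids"]   # int32, (batch, seq_len)
targets   = tensors["targets"]     # int32, (batch, seq_len)

# Assuming "right-shifted" means next-token prediction, i.e.
# targets[t] == input_ids[t + 1], spot-check the first row:
print(np.array_equal(targets[0, :-1], input_ids[0, 1:]))
```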
## Special Tokens

| ID | Token | Purpose |
|---|---|---|
| 0 | `<PAD>` | Padding |
| 1 | `<UNK>` | Unknown |
| 2 | `<BOS>` | Beginning of sequence |
| 3 | `<EOS>` | End of sequence |
| 4-5 | `<IMG>` / `<IMG_END>` | Image region |
| 6-7 | `<AUDIO>` / `<AUDIO_END>` | Audio region |
| 8-9 | `<VIDEO>` / `<VIDEO_END>` | Video region |
| 10 | `<FUSE>` | Cross-modal fusion |
| 11 | `<SEP>` | Separator |
| 12 | `<MASK>` | Masked LM |
| 13-14 | `<CODE>` / `<CODE_END>` | Code block |
| 15 | `<THINK>` | Chain-of-thought |
## Available Datasets
| Dataset | Size | Description |
|---|---|---|
| wikipedia | 20 GB | English Wikipedia |
| c4 | 750 GB | Cleaned Common Crawl |
| redpajama | 1.2 TB | LLaMA training data |
| openwebtext | 40 GB | Reddit links |
| bookcorpus | 5 GB | 11k books |
| the_stack | 3 TB | Source code (deduplicated) |
## Performance
| Operation | Speed |
|---|---|
| Text preprocessing | 10k samples/sec |
| MinHash deduplication | 5k samples/sec |
| FAISS dedup | 3k samples/sec |
| BPE encoding | < 1 ms/sample |
| SafeTensors writing | 50 MB/s |
| Image tokenization | 50 ms/image |
| Video tokenization | 200 ms/video |
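As a rough sizing aid, and assuming the rates above hold and the heavy stages run back-to-back on a single worker (deliberately pessimistic, since a real pipeline overlaps stages):

```python
samples = 1_000_000
rates = {"preprocess": 10_000, "minhash_dedup": 5_000, "faiss_dedup": 3_000}  # samples/sec
total_s = sum(samples / r for r in rates.values())
print(f"~{total_s / 60:.0f} min for {samples:,} samples run sequentially")  # ~11 min
```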
## Configuration

```yaml
# configs/production.yaml
pipeline:
  name: production-pipeline
  output_dir: ./data/shards
  deduplicate: true
  quality_filter: true
  remove_pii: true
  seed: 42
  checkpoint_every: 10000

advanced_quality:
  enabled: true
  perplexity_filter: true
  max_perplexity: 1500.0
```
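Loading such a file from Python: a minimal sketch with PyYAML. Whether `PipelineConfig` exposes a YAML loader directly is not stated in this README, so this goes through a plain dict; the nesting mirrors the example above:

```python
import yaml

with open("configs/production.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Inspect a couple of values before wiring them into the pipeline.
print(cfg["pipeline"]["output_dir"])              # ./data/shards
print(cfg["advanced_quality"]["max_perplexity"])  # 1500.0
```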
## Documentation
For the full documentation, architecture diagrams, distributed processing guide, and contributor guide, visit the GitHub repository.
## License
Apache License 2.0 — see LICENSE.
Built by AuralithAI for RT-DLM.