Production-grade data collection and processing pipeline for training LLMs and multimodal AI

Auralith Data Pipeline

Production-grade multimodal data processing pipeline for training RT-DLM and large-scale AI systems.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Auralith Data Pipeline                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐   │
│  │   Text   │  │  Images   │  │  Audio    │  │  Video    │  │   Code    │   │
│  │  (HF/CC) │  │  (.npy)   │  │  (.wav)   │  │  (.mp4)   │  │(TheStack) │   │
│  └────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘   │
│       │              │              │              │              │         │
│       ▼              ▼              ▼              ▼              ▼         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                       Quality Curation                              │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────────┐     │    │
│  │  │Perplexity │  │ LLM-as-   │  │   FAISS   │  │   License     │     │    │
│  │  │  Filter   │  │  Judge    │  │  DeDup    │  │  Detection    │     │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────────┘     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                   Tokenization (BPE + VQ)                           │    │
│  │  Text → BPE  │  Img → Patch+VQ  │  Audio → Mel+VQ  │  Video → VQ    │    │
│  │                                                                     │    │
│  │  Special Tokens: <IMG> <AUDIO> <VIDEO> <FUSE> <CODE> <THINK>        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │              SafeTensors Shards (RT-DLM Compatible)                 │    │
│  │  ┌───────────┐ ┌───────────────┐ ┌──────────────┐ ┌────────────┐    │    │
│  │  │ input_ids │ │attention_mask │ │modality_mask │ │  targets   │    │    │
│  │  │  int32    │ │    uint8      │ │    uint8     │ │   int32    │    │    │
│  │  └───────────┘ └───────────────┘ └──────────────┘ └────────────┘    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│       │                                                                     │
│  ┌────┴──────────────────────────────────────────────────────────────┐      │
│  │  Observability            │  Orchestration                        │      │
│  │  MLflow / W&B / Local     │  Argo Workflows / Ray / Helm          │      │
│  │  Lineage + Data Cards     │  K8s + DGX Cloud                      │      │
│  └───────────────────────────┴───────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                           ┌─────────────────┐
                           │     RT-DLM      │
                           │  (JAX / Haiku)  │
                           │  Distributed    │
                           │  Training       │
                           └─────────────────┘

Capabilities

Category	Feature	Status
Schema	SafeTensors 4-tensor schema v2 (`targets`, uint8 masks)	✅
Schema	Video frame extraction + VQ tokenizer	✅
Schema	16 special tokens (`<IMG>`, `<VIDEO>`, `<FUSE>`, `<THINK>`, etc.)	✅
Quality	GPT-2 perplexity filter	✅
Quality	LLM-as-Judge quality scoring	✅
Quality	FAISS embedding deduplication	✅
Quality	Local data augmentation (sentence shuffle, noise, back-translate)	✅
Observability	MLflow / W&B experiment tracking	✅
Observability	Per-sample lineage (source → shard provenance)	✅
Observability	Auto data card generation	✅
Orchestration	Argo Workflows DAG orchestration	✅
Orchestration	Helm chart for K8s deployment	✅
Orchestration	Ray distributed pipeline runner	✅
Orchestration	Distributed coordinator + workers (Redis or in-memory)	✅
Orchestration	Embedded single-machine mode (no Redis)	✅
Orchestration	Worker failure detection + automatic task requeue	✅
Compliance	License detection (permissive/copyleft)	✅
Compliance	Full audit logging (JSONL)	✅
Compliance	E2E schema validation tests	✅
Security	Multi-jurisdiction PII scrubbing (15+ countries)	✅
Security	Credential / secret sanitization	✅
Security	IRSA / Workload Identity (no static keys)	✅
Pipeline	`process` command: raw data → production `.safetensors` shards	✅

Installation

Install from PyPI (no clone needed)

# Core text pipeline
pip install auralith-data-pipeline

# With all extras (multimodal, cloud, distributed, dev tools — ~3 GB)
pip install "auralith-data-pipeline[all]"

# Pick only what you need
pip install "auralith-data-pipeline[quality]"        # + perplexity filter + FAISS dedup
pip install "auralith-data-pipeline[distributed]"    # + Ray
pip install "auralith-data-pipeline[cloud,pdf]"      # + S3/GCS/Azure + PDF extraction

Once installed, the CLI is available globally:

auralith-pipeline --help
auralith-pipeline list-datasets
auralith-pipeline process --input data/ --output shards/

Developer setup (clone + editable install)

One-command setup (recommended)

git clone https://github.com/AuralithAI/Auralith-Data-Pipeline.git
cd Auralith-Data-Pipeline

chmod +x setup.sh
./setup.sh            # Creates .venv + installs ALL extras (~3 GB with PyTorch)

The setup script automatically:

Detects your OS (macOS, Linux/WSL, Windows via Git Bash)
Checks for Python 3.10+ — if missing, offers to install it via your system package manager (brew, apt, dnf, pacman, winget, etc.)
Installs system build dependencies (libffi, openssl, gcc, etc.) needed for native extensions
Creates a .venv virtual environment and activates it
Bootstraps pip (via ensurepip) if not available, then upgrades pip + setuptools + wheel
Installs the package in editable mode with all dependencies
Verifies the installation and CLI entry point

Lighter install profiles:

./setup.sh --core     # Core text pipeline only (~500 MB)
./setup.sh --dev      # Core + dev tools (pytest, black, ruff, mypy)
./setup.sh --help     # Show all options

After setup completes, activate the environment and you're ready:

source .venv/bin/activate    # Linux / macOS / WSL
# or .\.venv\Scripts\activate  # Windows (PowerShell)

auralith-pipeline --help

Manual installation

git clone https://github.com/AuralithAI/Auralith-Data-Pipeline.git
cd Auralith-Data-Pipeline

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or .\.venv\Scripts\activate  # Windows

# Recommended — install everything (includes multimodal, cloud, dev tools)
pip install -e ".[all]"            # ~3 GB with PyTorch

# Or install only what you need
pip install -e .                   # Core only (text pipeline)
pip install -e ".[quality]"        # + Perplexity filter + FAISS dedup
pip install -e ".[tracking]"       # + MLflow + W&B
pip install -e ".[distributed]"    # + Ray
pip install -e ".[multimodal]"     # + Video + image + audio (PyTorch)
pip install -e ".[cloud,pdf]"      # + Cloud storage + PDF extraction

Tip: If you plan to use the full CLI (tokenizer training, multimodal processing, distributed jobs), install with [all] to avoid missing dependency errors at runtime.

Quick Start

CLI Usage

When you run any auralith-pipeline command, a vibrant startup banner is displayed with version and environment info. Suppress it with --no-banner or set AURALITH_NO_BANNER=1.

# List available datasets
auralith-pipeline list-datasets

# Collect the Wikipedia dataset (production preset)
auralith-pipeline collect \
  --dataset wikipedia \
  --output ./data/shards \
  --max-samples 100000 \
  --preset production

End-to-End Workflow: Raw Data → Trained Tokenizers → Production Shards

The full pipeline has three stages:

Prepare raw data — gather text, images, audio, and video into a folder
Train tokenizers — learn BPE vocabulary + VQ codebooks from your data
Process — tokenize everything and produce .safetensors shards for RT-DLM

  data/raw/              tokenizers/                    shards/
  ├── docs/*.txt    ──►  ├── text/   (BPE)         ──►  ├── shard_000000.safetensors
  ├── imgs/*.npy    ──►  ├── image/  (VQ codebook) ──►  ├── shard_000001.safetensors
  ├── audio/*.npy   ──►  ├── audio/  (VQ codebook) ──►  └── ...
  └── videos/*.mp4  ──►  └── video/  (VQ codebook) ──►

Step 1 — Prepare Raw Data

Organise your data into a single directory. The pipeline auto-detects file types:

Modality	Accepted formats
Text	`.txt`, `.md`, `.rst`, `.csv`, `.json`, `.jsonl`, `.tsv`, `.xml`, `.html`, `.py`, `.rs`
Image	`.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.webp`
Audio	`.wav`, `.mp3`, `.flac`, `.ogg`, `.m4a`
Video	`.mp4`, `.avi`, `.mov`, `.mkv`, `.webm`
Image / Audio	`.npy` — resolved by parent directory name (see below)

.npy disambiguation: Because .npy arrays can represent either images (H, W, 3) or audio waveforms, the pipeline inspects the parent directory name. Place image arrays under a folder whose name contains image, img, photo, picture, or visual; place audio arrays under a folder whose name contains audio, speech, sound, music, or waveform. .npy files in directories whose names match none of these keywords are skipped.

data/raw/
├── corpus/
│   ├── wikipedia.txt
│   └── books.txt
├── images/
│   ├── img_001.npy
│   └── img_002.npy
├── audio/
│   ├── speech_001.npy
│   └── speech_002.npy
└── videos/
    └── lecture_001.mp4
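
The .npy disambiguation rule described above can be sketched in a few lines. This is an illustration only, not the package's code: the function name `resolve_npy_modality` is hypothetical, and both the image-before-audio matching order and the use of the immediate parent directory alone are assumptions.

```python
from pathlib import Path
from typing import Optional

# Keyword lists come from the description above. Checking image keywords
# first and looking only at the immediate parent folder are assumptions.
IMAGE_KEYWORDS = ("image", "img", "photo", "picture", "visual")
AUDIO_KEYWORDS = ("audio", "speech", "sound", "music", "waveform")


def resolve_npy_modality(path: str) -> Optional[str]:
    """Return "image", "audio", or None (file skipped) for a .npy path."""
    parent = Path(path).parent.name.lower()
    if any(keyword in parent for keyword in IMAGE_KEYWORDS):
        return "image"
    if any(keyword in parent for keyword in AUDIO_KEYWORDS):
        return "audio"
    return None  # no keyword matched, so the file would be skipped


print(resolve_npy_modality("data/raw/images/img_001.npy"))    # image
print(resolve_npy_modality("data/raw/audio/speech_001.npy"))  # audio
print(resolve_npy_modality("data/raw/misc/array.npy"))        # None
```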

Step 2 — Train Tokenizers

Train all modality tokenizers in one command:

auralith-pipeline train-tokenizer all \
  --corpus  data/raw/corpus/ \
  --images  data/raw/images/ \
  --audio   data/raw/audio/ \
  --videos  data/raw/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024 \
  --audio-codebook-size 512

This creates:

tokenizers/
├── text/          # BPE tokenizer (vocab.json, merges.txt, config.json)
├── image/         # Image VQ tokenizer (config.json, vq_codebook.json)
├── audio/         # Audio VQ tokenizer (config.json, vq_codebook.json)
└── video/         # Video VQ tokenizer (config.json, vq_codebook.json)

Or train each modality separately for finer control:

# Text BPE tokenizer
auralith-pipeline train-tokenizer text \
  --corpus data/raw/corpus/ \
  --output tokenizers/text \
  --vocab-size 32000

# Image VQ tokenizer
auralith-pipeline train-tokenizer image \
  --images data/raw/images/ \
  --output tokenizers/image \
  --codebook-size 1024 \
  --image-size 224 \
  --patch-size 16

# Audio VQ tokenizer
auralith-pipeline train-tokenizer audio \
  --audio data/raw/audio/ \
  --output tokenizers/audio \
  --codebook-size 512 \
  --sample-rate 16000

# Video VQ tokenizer
auralith-pipeline train-tokenizer video \
  --videos data/raw/videos/ \
  --output tokenizers/video \
  --codebook-size 1024 \
  --max-frames 32

Tip: Store trained tokenizers in version control or cold storage (S3/GCS). They are small (~2 MB each) and must stay frozen for the lifetime of a model.

Step 3 — Process Raw Data into Shards

auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

Each shard is a .safetensors file with the schema v2 tensors (input_ids, attention_mask, modality_mask, targets), ready for RT-DLM training.

Step 4 — Feed into RT-DLM

# Upload shards to cloud storage
auralith-pipeline upload --source shards/ --dest s3://my-bucket/training-data/

# Or upload to HuggingFace Hub
auralith-pipeline upload --source shards/ --dest hf://AuralithAI/training-shards

# Train RT-DLM
python src/train.py --data-dir shards/

Python API

from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

# Configure pipeline
config = PipelineConfig.from_preset("production")

# Create and run pipeline
pipeline = Pipeline(config)
source = create_source("wikipedia", streaming=True, max_samples=1_000_000)
pipeline.add_source(source)

stats = pipeline.run()
print(stats.summary())

Using Processed Shards with RT-DLM

from safetensors.numpy import load_file

shard = load_file("data/shards/shard_000000.safetensors")

input_ids      = shard["input_ids"]       # (batch, seq_len), int32
attention_mask = shard["attention_mask"]   # (batch, seq_len), uint8 (1=real, 0=pad)
modality_mask  = shard["modality_mask"]    # (batch, seq_len), uint8 (0=text,1=img,2=aud,3=vid,4=code)
targets        = shard["targets"]         # (batch, seq_len), int32, input_ids shifted by one position

# targets[:, t] == input_ids[:, t+1]  (causal LM next-token prediction)
# RT-DLM (JAX) uses attention_mask to zero out padding in the loss; there is no -100 ignore index.

# Feed directly to RT-DLM training
# python src/train.py --data-dir ./data/shards
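
To make the masking comment above concrete, here is a minimal sketch of a loss averaged over real tokens only. It is written in plain Python for readability and is not RT-DLM's training code; the function name `masked_mean_loss` and the per-token loss values are hypothetical.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float],
                     attention_mask: Sequence[int]) -> float:
    """Average per-token losses over positions where the mask is 1."""
    if len(token_losses) != len(attention_mask):
        raise ValueError("token_losses and attention_mask must have equal length")
    total = sum(loss * mask for loss, mask in zip(token_losses, attention_mask))
    count = sum(attention_mask)
    # An all-padding sequence contributes no loss.
    return total / count if count else 0.0


# Two real positions followed by two padded positions.
losses = [2.0, 4.0, 9.0, 9.0]
mask = [1, 1, 0, 0]
print(masked_mean_loss(losses, mask))  # 3.0
```

Multiplying by the mask removes padded positions from both the sum and the count, which is why no sentinel value such as -100 is needed in `targets`.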

SafeTensors Schema (v2)

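Every shard is RT-DLM compatible. All sequences are padded/truncated to a fixed seq_len (default 2048) for JAX compatibility. The process command's --max-seq-len option lists its own default of 4096, so pass the length explicitly when it matters.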

Tensor	Dtype	Shape	Description
`input_ids`	int32	(batch, seq_len)	All tokens (text + image + audio + video + code)
`attention_mask`	uint8	(batch, seq_len)	1 = real token, 0 = padding
`modality_mask`	uint8	(batch, seq_len)	0=text, 1=image, 2=audio, 3=video, 4=code
`targets`	int32	(batch, seq_len)	`input_ids` shifted by one position (`targets[t] = input_ids[t+1]`) for causal LM next-token prediction

Schema v2 changes (from v1): labels → targets (shifted by one position, no −100 ignore index), attention_mask dtype int32 → uint8 (4× memory savings for that tensor), fixed-length padding, SHA-256 checksums (was MD5).
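
As an illustration of the schema, the sketch below pads one text-only sample to a fixed length and derives the other three tensors as plain lists. The helper name `build_example` is hypothetical, and filling the final target position with the pad ID is an assumption, since the source does not say how the last position is handled.

```python
from typing import Dict, List

PAD_ID = 0  # <PAD> in the special-token table


def build_example(token_ids: List[int], seq_len: int) -> Dict[str, List[int]]:
    """Pad or truncate one text sample and derive the schema v2 fields."""
    real = token_ids[:seq_len]
    padding = seq_len - len(real)
    input_ids = real + [PAD_ID] * padding
    attention_mask = [1] * len(real) + [0] * padding
    # targets[t] == input_ids[t + 1]. The last position has no successor,
    # so it is filled with PAD_ID here (an assumption).
    targets = input_ids[1:] + [PAD_ID]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "modality_mask": [0] * seq_len,  # 0 = text for every position
        "targets": targets,
    }


example = build_example([2, 300, 301, 3], seq_len=6)
print(example["input_ids"])       # [2, 300, 301, 3, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0]
print(example["targets"])         # [300, 301, 3, 0, 0, 0]
```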

Token ID Layout

Range	Purpose
0–15	Special tokens (see below)
16–271	Byte tokens (`<byte_00>` – `<byte_ff>`) — lossless UTF-8 fallback
272+	BPE merge tokens (learned vocabulary)
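
The byte-token range can be read as a fixed offset over UTF-8 bytes: `<byte_00>` is ID 16 and `<byte_ff>` is ID 271. The sketch below shows that mapping and its inverse; the function names are hypothetical and the real tokenizer would apply BPE merges on top of this fallback.

```python
from typing import List

BYTE_OFFSET = 16  # <byte_00> has ID 16, <byte_ff> has ID 271


def bytes_to_ids(text: str) -> List[int]:
    """Map each UTF-8 byte of the text to its byte-token ID."""
    return [BYTE_OFFSET + b for b in text.encode("utf-8")]


def ids_to_text(ids: List[int]) -> str:
    """Invert bytes_to_ids; lossless for any valid UTF-8 input."""
    return bytes(i - BYTE_OFFSET for i in ids).decode("utf-8")


ids = bytes_to_ids("é")  # UTF-8 bytes 0xC3 0xA9
print(ids)               # [211, 185]
print(ids_to_text(ids))  # é
```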

Special Tokens (IDs 0–15)

ID	Token	Purpose
0	`<PAD>`	Padding
1	`<UNK>`	Unknown
2	`<BOS>`	Beginning of sequence
3	`<EOS>`	End of sequence
4	`<IMG>`	Image region start
5	`<IMG_END>`	Image region end
6	`<AUDIO>`	Audio region start
7	`<AUDIO_END>`	Audio region end
8	`<VIDEO>`	Video region start
9	`<VIDEO_END>`	Video region end
10	`<FUSE>`	Cross-modal fusion
11	`<SEP>`	Separator
12	`<MASK>`	Masked LM
13	`<CODE>`	Code block start
14	`<CODE_END>`	Code block end
15	`<THINK>`	Chain-of-thought

Features

Data Processing

Multi-source ingestion (HuggingFace, Common Crawl, local files, video)
Weighted round-robin interleaving across multiple sources
MinHash + FAISS embedding deduplication
Quality filtering (length, language, perplexity, LLM-as-Judge)
PII removal (automatic detection and redaction)
License compliance scanning for code data
Document extraction (PDF, DOCX, HTML, Markdown)
SafeTensors sharding with Zstd compression and SHA-256 checksums
Streaming checkpointing with seeded reproducibility (numpy + stdlib RNG)
Deterministic resumption from checkpoint (skip-ahead + RNG state restore)
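
One way to picture weighted round-robin interleaving is the generic "smooth" weighted round-robin scheme sketched below. This is a well-known scheduling pattern used here for illustration; whether the package uses this exact variant is not stated in the source, and the function name and weights are hypothetical.

```python
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Yield source names forever, in proportion to their integer weights."""
    if not weights or any(weight <= 0 for weight in weights.values()):
        raise ValueError("weights must be a non-empty mapping of positive integers")
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    while True:
        # Every source earns credit equal to its weight each round.
        for name, weight in weights.items():
            current[name] += weight
        # The source with the most credit is served and pays the total back.
        chosen = max(current, key=lambda name: current[name])
        current[chosen] -= total
        yield chosen


schedule = weighted_round_robin({"wikipedia": 2, "the_stack": 1})
print([next(schedule) for _ in range(6)])
```

With weights 2 and 1, the first source is drawn twice as often as the second while the draws stay spread out rather than arriving in bursts.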

Tokenization

Custom BPE tokenizer (16 special tokens, byte-level fallback, no external dependency)
256 byte tokens (IDs 16–271) for lossless UTF-8 encoding of any input
LRU-bounded merge cache (100k entries) for fast encoding
Vector quantization for images, audio, and video
Multimodal token fusion with encode_with_mask()
Configurable vocab size (32k–128k)
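
The combination of ordered merges and a bounded cache can be sketched with the standard library. This is a generic character-level BPE encoder, not the package's byte-level implementation; the merge table is a toy example and `bpe_encode_word` is a hypothetical name. The cache bound mirrors the 100k-entry figure quoted above.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Toy merge table in priority order (hypothetical, not a trained vocabulary).
MERGE_RANKS: Dict[Tuple[str, str], int] = {("l", "o"): 0, ("lo", "w"): 1}


@lru_cache(maxsize=100_000)  # least-recently-used entries are evicted first
def bpe_encode_word(word: str) -> Tuple[str, ...]:
    """Apply the best-ranked merge repeatedly until no known pair remains."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: MERGE_RANKS.get(pair, float("inf")))
        if best not in MERGE_RANKS:
            break  # no adjacent pair has a learned merge
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return tuple(symbols)


print(bpe_encode_word("lower"))  # ('low', 'e', 'r')
print(bpe_encode_word("lower"))  # served from the cache on the second call
```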

Quality & Compliance

Perplexity filter: GPT-2 based scoring with configurable thresholds
LLM-as-Judge: Score coherence, toxicity, educational value
FAISS dedup: Cosine similarity with IVFFlat/IVFPQ indexes
License detection: Permissive vs copyleft classification
Audit logging: Full accept/reject decisions to JSONL
Local augmentation: Sentence shuffle, paragraph extract, token noise, back-translate
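
For the perplexity filter, the underlying arithmetic is the exponential of the negative mean log-probability per token. The sketch below assumes the per-token natural-log probabilities have already been produced by a language model such as GPT-2; the function names are hypothetical, the threshold reuses the max_perplexity value from the sample configuration, and keeping samples at or below the threshold is an assumption.

```python
import math
from typing import Sequence

MAX_PERPLEXITY = 1500.0  # max_perplexity value from the sample configuration


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def passes_filter(token_log_probs: Sequence[float],
                  max_perplexity: float = MAX_PERPLEXITY) -> bool:
    """Keep a sample when its perplexity does not exceed the threshold."""
    return perplexity(token_log_probs) <= max_perplexity


fluent = [math.log(0.25)] * 8    # each token predicted with probability 1/4
garbled = [math.log(1e-4)] * 8   # each token predicted with probability 1/10000
print(round(perplexity(fluent), 6), passes_filter(fluent))
print(round(perplexity(garbled), 2), passes_filter(garbled))
```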

Observability

MLflow / W&B experiment tracking (params, metrics, artifacts)
Per-sample lineage — track every sample from source to shard
Auto data cards — HuggingFace-compatible README.md generation

Orchestration

Argo Workflows — DAG-based parallel dataset processing
Helm chart — deploy on any K8s cluster or DGX Cloud
Ray — horizontal scaling across machines

Storage & Deployment

Cloud storage (HuggingFace Hub, S3, GCS, Azure Blob)
Docker support for containerized deployment
CI via GitHub Actions (lint, test, build)

Configuration

# configs/production.yaml
pipeline:
  name: production-pipeline
  output_dir: ./data/shards
  deduplicate: true
  quality_filter: true
  remove_pii: true
  seed: 42                       # Reproducibility (numpy + stdlib RNG)
  checkpoint_every: 10000        # Save resume checkpoint every N accepted samples

advanced_quality:
  enabled: true
  perplexity_filter: true
  max_perplexity: 1500.0

deduplication:
  method: minhash    # or: embedding (FAISS)
  minhash_threshold: 0.85

tracking:
  enabled: true
  backend: local     # or: mlflow, wandb

compliance:
  enabled: true
  license_detection: true
  allow_copyleft: false
  audit_log_path: ./data/audit/audit.jsonl

video:
  enabled: false
  frame_strategy: uniform
  max_frames: 32

See configs/production.yaml for the full configuration reference.
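
To show what the minhash_threshold setting controls, here is a small MinHash sketch using only the standard library. It illustrates the general technique rather than the package's implementation: the word-trigram shingling, the 128-hash signature length, and all function names are assumptions. Only the 0.85 threshold comes from the configuration above.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 128  # signature length (an assumption)


def shingles(text: str, size: int = 3) -> Set[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash; the seed selects one hash function."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    """Keep the minimum hash value of the set under each hash function."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.85) -> bool:
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold


doc = "the quick brown fox jumps over the lazy dog near the river bank"
other = "stock markets closed higher today after a volatile trading session"
print(is_near_duplicate(doc, doc))
print(is_near_duplicate(doc, other))
```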

Deploy to DGX Cloud in 5 Steps

# 1. Build container
docker build -t auralith-pipeline:latest .

# 2. Push to registry
docker tag auralith-pipeline:latest nvcr.io/YOUR_ORG/auralith-pipeline:latest
docker push nvcr.io/YOUR_ORG/auralith-pipeline:latest

# 3. Install Helm chart
helm install auralith docker/kubernetes/helm/ \
  --set image.repository=nvcr.io/YOUR_ORG/auralith-pipeline \
  --set image.tag=latest \
  --set pipeline.config=production

# 4. Submit Argo workflow (parallel datasets)
argo submit docker/kubernetes/argo-workflow.yaml

# 5. Monitor with Ray dashboard
ray dashboard  # http://localhost:8265

Available Datasets

Dataset	Size	Description
wikipedia	20GB	English Wikipedia (verified)
c4	750GB	Cleaned Common Crawl
redpajama	1.2TB	Open reproduction of the LLaMA training data
openwebtext	40GB	Web pages linked from Reddit
bookcorpus	5GB	11k books
wikitext	500MB	Wikipedia subset
dolly	15MB	Instruction following
the_stack	3TB	Source code (deduplicated)

Tokenization

Training Tokenizers (Detailed Guide)

Before you can process raw files into shards, you need trained tokenizers for each modality your model will consume. These tokenizers are frozen artifacts: once trained, they must not change for the model's entire lifecycle.

Why train your own tokenizers?

Text (BPE): Learns subword units tuned to your domain vocabulary (e.g. medical, legal, code).
Image (VQ): Learns a discrete codebook that maps image patches → token IDs.
Audio (VQ): Learns a codebook over mel-spectrogram patches.
Video (VQ): Same as image, but trained on video frames for temporal consistency.
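
The idea of a codebook that maps patches to token IDs can be sketched as a nearest-centroid lookup. The example covers only the lookup step, not codebook training, and uses a toy two-dimensional codebook; the function names are hypothetical and squared Euclidean distance is an assumption about the metric.

```python
from typing import Sequence


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def quantize(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the nearest codebook entry."""
    def distance(entry: Sequence[float]) -> float:
        return sum((v - e) ** 2 for v, e in zip(vector, entry))

    return min(range(len(codebook)), key=lambda index: distance(codebook[index]))


codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy 3-entry codebook
patches = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]  # toy patch features
print(num_patches(224, 16))                      # 196
print([quantize(p, codebook) for p in patches])  # [0, 1, 2]
```

With the 224-pixel image size and 16-pixel patch size used in the commands in this guide, each image yields 14 × 14 = 196 patches, and therefore 196 code indices.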

Recommended Training Data Sizes

Modality	Minimum	Recommended	Notes
Text	1 MB	1–10 GB	More data = better subword coverage
Image	100 images	10k+ images	`.npy` arrays (H, W, 3) or JPEG/PNG
Audio	100 files	10k+ files	`.npy` waveforms or `.wav/.flac`
Video	50 videos	1k+ videos	`.mp4/.avi/.mov` — frames extracted automatically

Training Commands

# All at once (recommended)
auralith-pipeline train-tokenizer all \
  --corpus  data/corpus/ \
  --images  data/images/ \
  --audio   data/audio/ \
  --videos  data/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024 \
  --audio-codebook-size 512 \
  --image-size 224 \
  --patch-size 16 \
  --sample-rate 16000 \
  --max-frames 32

# Or individually
auralith-pipeline train-tokenizer text  --corpus data/corpus.txt --output tokenizers/text --vocab-size 32000
auralith-pipeline train-tokenizer image --images data/images/    --output tokenizers/image --codebook-size 1024
auralith-pipeline train-tokenizer audio --audio  data/audio/     --output tokenizers/audio --codebook-size 512
auralith-pipeline train-tokenizer video --videos data/videos/    --output tokenizers/video --codebook-size 1024

Output Structure

tokenizers/
├── text/
│   ├── vocab.json       # Token → ID mapping
│   ├── merges.txt       # BPE merge rules (ordered)
│   └── config.json      # Tokenizer hyperparameters
├── image/
│   ├── config.json      # image_size, patch_size, codebook_size
│   └── vq_codebook.json # Learned VQ centroids
├── audio/
│   ├── config.json      # sample_rate, n_fft, codebook_size
│   └── vq_codebook.json
└── video/
    ├── config.json      # image_size, patch_size, max_frames
    └── vq_codebook.json

Cold Storage: Archive tokenizers/ to S3/GCS alongside your model checkpoints. The process command reads these frozen tokenizers whenever it tokenizes raw data.

Processing Raw Data into Shards

Once tokenizers are trained, use process to convert raw files into production shards:

auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

Option	Default	Description
`--input`	—	Folder with raw `.txt/.jpg/.wav/.mp4` etc.
`--output`	—	Where `.safetensors` shards are written
`--tokenizers`	—	Root folder containing `text/`, `image/`, `audio/`, `video/` subdirs
`--max-seq-len`	4096	Maximum token sequence length per sample
`--shard-size`	10000	Maximum samples per shard file

Each shard contains 4 tensors matching the SafeTensors Schema v2: input_ids, attention_mask, modality_mask, and targets.

Token ID Layout

Range	Purpose
0–15	Special tokens (see Special Tokens above)
16–271	Byte tokens (`<byte_00>` – `<byte_ff>`) — lossless UTF-8 fallback
272+	BPE merge tokens (learned vocabulary)
100,000+	Image VQ codes (offset to avoid collisions)
200,000+	Audio VQ codes
300,000+	Video VQ codes
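
The modality offsets in the table above amount to adding a fixed base to each VQ code index. The sketch below shows the mapping in both directions; the function names are hypothetical, and treating each range as ending where the next one begins is an assumption.

```python
from typing import Tuple

# Offsets from the table above. Treating each range as ending where the
# next one begins is an assumption.
VQ_OFFSETS = {"image": 100_000, "audio": 200_000, "video": 300_000}


def to_global_id(modality: str, code_index: int) -> int:
    """Shift a VQ code index into its modality's token ID range."""
    if code_index < 0:
        raise ValueError("code_index must be non-negative")
    return VQ_OFFSETS[modality] + code_index


def from_global_id(token_id: int) -> Tuple[str, int]:
    """Recover (modality, local index) from a global token ID."""
    for modality in ("video", "audio", "image"):  # highest offset first
        offset = VQ_OFFSETS[modality]
        if token_id >= offset:
            return modality, token_id - offset
    return "text", token_id  # special, byte, and BPE tokens


print(to_global_id("audio", 511))  # 200511
print(from_global_id(300_005))     # ('video', 5)
print(from_global_id(272))         # ('text', 272)
```

Because BPE tokens start at 272 and image codes at 100,000, this mapping is only unambiguous while the text vocabulary stays below 100,000 entries.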

Distributed Processing

For large datasets, distribute tokenization and sharding across multiple CPU cores or machines. Two modes are supported:

Embedded Mode (single machine, no Redis)

Spins up a coordinator and N workers in-process using an in-memory state store — perfect for a beefy server with many cores:

auralith-pipeline submit-job \
  -c configs/distributed.yaml \
  -i data/raw/ \
  -o shards/ \
  -t tokenizers/ \
  --embedded \
  -w 8                   # 8 workers

External Mode (multi-machine, Redis)

Run the coordinator, workers, and job submission in separate terminals (or on separate machines). Requires a Redis instance for shared state.

# Terminal 1 — start the coordinator
auralith-pipeline coordinator \
  -c configs/distributed.yaml \
  --host 0.0.0.0 --port 8080

# Terminal 2..N — start workers (can be on different machines)
auralith-pipeline worker \
  -c configs/distributed.yaml \
  --coordinator 10.0.0.1:8080 \
  --worker-id worker-1

# Terminal N+1 — submit the job
auralith-pipeline submit-job \
  -c configs/distributed.yaml \
  -i data/raw/ -o shards/ -t tokenizers/ \
  --external

How it works

DistributedPipeline scans the input directory for raw files.
Files are split into tasks (--files-per-task, default 500).
Tasks are submitted to the Coordinator, which assigns them to workers via a pluggable strategy (round-robin, least-busy, or dynamic).
Each Worker tokenizes its assigned files and writes SafeTensors shards.
The coordinator monitors heartbeats; if a worker dies, its tasks are automatically requeued (up to max_retries).
The pipeline polls until all tasks complete (or time out).
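
The steps above can be sketched with plain data structures. This is a simplified illustration, not the coordinator's code: all function names are hypothetical, the 30-second heartbeat timeout is an assumed value, and retry counting (max_retries) is left out.

```python
from typing import Dict, List


def split_into_tasks(files: List[str], files_per_task: int = 500) -> List[List[str]]:
    """Chunk the scanned file list into tasks of at most files_per_task files."""
    if files_per_task <= 0:
        raise ValueError("files_per_task must be positive")
    return [files[i:i + files_per_task] for i in range(0, len(files), files_per_task)]


def assign_round_robin(num_tasks: int, workers: List[str]) -> Dict[str, List[int]]:
    """Hand task IDs to workers in turn (the round-robin strategy)."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: Dict[str, List[int]] = {worker: [] for worker in workers}
    for task_id in range(num_tasks):
        assignment[workers[task_id % len(workers)]].append(task_id)
    return assignment


def tasks_to_requeue(assignment: Dict[str, List[int]],
                     last_heartbeat: Dict[str, float],
                     now: float,
                     timeout: float = 30.0) -> List[int]:
    """Collect tasks held by workers whose heartbeat is older than timeout."""
    requeued: List[int] = []
    for worker, task_ids in assignment.items():
        # A worker that never reported a heartbeat counts as stale.
        if now - last_heartbeat.get(worker, float("-inf")) > timeout:
            requeued.extend(task_ids)
    return sorted(requeued)


files = [f"doc_{i:04d}.txt" for i in range(1200)]
tasks = split_into_tasks(files)
assignment = assign_round_robin(len(tasks), ["worker-1", "worker-2"])
print([len(task) for task in tasks])  # [500, 500, 200]
print(assignment)                     # {'worker-1': [0, 2], 'worker-2': [1]}
print(tasks_to_requeue(assignment, {"worker-1": 100.0, "worker-2": 60.0}, now=105.0))  # [1]
```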

Cloud Redis

For production multi-machine deployments, point the coordinator at a managed Redis instance (ElastiCache, Memorystore, etc.):

# configs/distributed.yaml
coordinator:
  state_store_type: redis
  state_store_host: my-redis.xxxx.cache.amazonaws.com
  state_store_port: 6379
  state_store_password: "<your-auth-token>"

Performance

Operation	Speed	Notes
Text preprocessing	10k samples/sec	Single core
MinHash deduplication	5k samples/sec	With LSH index
FAISS dedup	3k samples/sec	IVFFlat on CPU
Perplexity filter	500 samples/sec	GPT-2, GPU
BPE encoding	<1 ms/sample	LRU-cached (100k entries)
SafeTensors writing	50 MB/s	Zstd compressed
Image tokenization	50 ms/image	224×224, 196 patches
Video tokenization	200 ms/video	32 frames, uniform
Ray distributed	Linear scaling	Up to 64 workers

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run E2E schema validation
pytest tests/test_e2e_schema.py -v

# Code quality
black src/ tests/ scripts/
ruff check src/ tests/ scripts/
mypy src/

Project Structure

auralith_pipeline/
├── cli.py                        # CLI commands (collect, process, train-tokenizer, etc.)
├── pipeline.py                   # Main pipeline (tracking, compliance, security)
├── config/                       # Configuration management
│   └── pipeline_config.py        # All config dataclasses
├── sources/
│   ├── data_sources.py           # HuggingFace, local, JSONL sources
│   └── video.py                  # Video frame sampler
├── preprocessing/
│   ├── preprocessor.py           # Text normalization, MinHash, PII
│   ├── quality.py                # Perplexity + LLM judge
│   ├── deduplication.py          # FAISS embedding dedup
│   ├── synthetic.py              # Local data augmentation
│   └── compliance.py             # License detection + audit
├── tokenization/
│   ├── bpe_tokenizer.py          # Custom BPE (16 special tokens)
│   ├── multimodal_tokenizer.py   # Text+Image+Audio+Video fusion
│   ├── video_tokenizer.py        # Video VQ tokenizer
│   └── tokenizer.py              # TokenizedSample + pipeline wrapper
├── sharding/
│   └── shard_writer.py           # SafeTensors writer (4-tensor schema v2)
├── security/
│   ├── pii_scrubber.py           # Multi-jurisdiction PII detection
│   ├── data_sanitizer.py         # Credential / secret sanitization
│   ├── privacy_config.py         # Privacy policies + PII categories
│   └── audit.py                  # Privacy audit logger
├── storage/
│   └── backends.py               # HF Hub, S3, GCS, Azure
├── distributed/
│   ├── coordinator.py            # Job manager, task scheduling, failure recovery
│   ├── worker.py                 # Worker node (tokenize + shard writing)
│   ├── pipeline.py               # DistributedPipeline (embedded + external modes)
│   ├── state.py                  # StateStore ABC (Redis + InMemory)
│   ├── strategies.py             # Round-robin, least-busy, dynamic assignment
│   ├── client.py                 # Monitoring / control client
│   └── config.py                 # Distributed config dataclasses
├── spark/                        # Apache Spark transforms
└── utils/
    ├── helpers.py                # Formatting utilities
    └── tracking.py               # MLflow/W&B + lineage

docker/kubernetes/
├── argo-workflow.yaml            # Argo DAG
└── helm/                         # Helm chart
    ├── Chart.yaml
    ├── values.yaml
    └── templates/

tests/
├── test_cli.py                   # CLI command tests (process, train-tokenizer, etc.)
├── test_distributed.py           # Distributed module tests (50 tests)
├── test_pipeline.py              # Core pipeline tests
├── test_tokenization.py          # Tokenizer tests
├── test_e2e_schema.py            # E2E validation
└── test_security.py              # Security & PII tests

Environment Variables

# HuggingFace Hub (required for upload)
export HF_TOKEN=hf_xxxxxxxxxxxxx

# AWS S3 (optional)
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxx

# MLflow (optional)
export MLFLOW_TRACKING_URI=http://mlflow.internal:5000

# W&B (optional)
export WANDB_API_KEY=xxxxx

Releasing

Every merge to main automatically tests, builds, publishes to PyPI, and creates a GitHub Release — zero manual steps.

How it works

Test — Python 3.10 / 3.11 / 3.12
Build — sdist + universal py3-none-any wheel
Tag — auto-increments patch (v0.1.2 → v0.1.3)
Publish — pushes to PyPI via OIDC trusted publisher
Release — creates GitHub Release with changelog + wheel assets

Version is derived from git tags at build time via hatch-vcs — no hardcoded version strings anywhere.
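
The patch auto-increment in the Tag step is simple tag arithmetic. The sketch below shows the rule on a tag string; it is not the release workflow's code, and the function name `bump_patch` is hypothetical.

```python
import re


def bump_patch(tag: str) -> str:
    """Return the next patch tag, e.g. v0.1.2 -> v0.1.3."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"v{major}.{minor}.{patch + 1}"


print(bump_patch("v0.1.2"))  # v0.1.3
print(bump_patch("v0.1.9"))  # v0.1.10
```

Parsing the components as integers matters: the patch after 9 is 10, which a string-based increment would get wrong.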

Major / Minor bumps

Patch versions are automatic. For minor or major bumps, push a tag manually:

./scripts/bump-version.sh 0.2.0    # creates + pushes v0.2.0 tag
./scripts/bump-version.sh 1.0.0    # creates + pushes v1.0.0 tag

The pushed tag triggers the same release pipeline.

License

Apache License 2.0 — See LICENSE

Contributing

See CONTRIBUTING.md

Built by AuralithAI for RT-DLM

Release history

0.1.11

Mar 16, 2026

0.1.10

Mar 16, 2026

0.1.9

Mar 15, 2026

0.1.8

Mar 2, 2026

0.1.7

Mar 2, 2026

0.1.6

Mar 2, 2026

0.1.5

Mar 2, 2026

This version

0.1.4

Mar 2, 2026

Download files

Source Distribution

auralith_data_pipeline-0.1.4.tar.gz (124.7 kB view details)

Uploaded Mar 2, 2026 Source

Built Distribution

auralith_data_pipeline-0.1.4-py3-none-any.whl (133.4 kB view details)

Uploaded Mar 2, 2026 Python 3

File details

Details for the file auralith_data_pipeline-0.1.4.tar.gz.

File metadata

Download URL: auralith_data_pipeline-0.1.4.tar.gz
Upload date: Mar 2, 2026
Size: 124.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for auralith_data_pipeline-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`78b8f6d8b5c9b9981a5685cd00c48fa72c5416d0eba874c6593e9fe481874a56`
MD5	`184096afa38656cba5a93ba7ec7ea683`
BLAKE2b-256	`16e76aaa7274d0f6f437f7a2be62493c1d1b2b35701908d523930fea65d33b3e`

Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.4.tar.gz:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: auralith_data_pipeline-0.1.4.tar.gz
- Subject digest: 78b8f6d8b5c9b9981a5685cd00c48fa72c5416d0eba874c6593e9fe481874a56
- Sigstore transparency entry: 1008321105
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: AuralithAI/Auralith-Data-Pipeline@ecd87e7873bea3ce4946e811cc8b06d12fd40af9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/AuralithAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ecd87e7873bea3ce4946e811cc8b06d12fd40af9
- Trigger Event: push

File details

Details for the file auralith_data_pipeline-0.1.4-py3-none-any.whl.

File metadata

Download URL: auralith_data_pipeline-0.1.4-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 133.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for auralith_data_pipeline-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bc0e789e5eb2341df81d2296d45eaf58231f844b625d1bdd33f58743ca56ab1e`
MD5	`4bc9ac97ee6ee23a40c2bd872f1692d9`
BLAKE2b-256	`f6529c171d8ca6f594eca6e0a550784262d710dad3330a768425df078e536701`

Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.4-py3-none-any.whl:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: auralith_data_pipeline-0.1.4-py3-none-any.whl
- Subject digest: bc0e789e5eb2341df81d2296d45eaf58231f844b625d1bdd33f58743ca56ab1e
- Sigstore transparency entry: 1008321106
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: AuralithAI/Auralith-Data-Pipeline@ecd87e7873bea3ce4946e811cc8b06d12fd40af9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/AuralithAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ecd87e7873bea3ce4946e811cc8b06d12fd40af9
- Trigger Event: push
