
Auralith Data Pipeline

Production-grade multimodal data processing pipeline for training RT-DLM and large-scale AI systems.

Python 3.10+ · License: Apache 2.0 · Code style: black

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Auralith Data Pipeline                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐   │
│  │   Text   │  │  Images   │  │  Audio    │  │  Video    │  │   Code    │   │
│  │  (HF/CC) │  │  (.npy)   │  │  (.wav)   │  │  (.mp4)   │  │(TheStack) │   │
│  └────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘   │
│       │              │              │              │              │         │
│       ▼              ▼              ▼              ▼              ▼         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                       Quality Curation                              │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────────┐     │    │
│  │  │Perplexity │  │ LLM-as-   │  │   FAISS   │  │   License     │     │    │
│  │  │  Filter   │  │  Judge    │  │  DeDup    │  │  Detection    │     │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────────┘     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                   Tokenization (BPE + VQ)                           │    │
│  │  Text → BPE  │  Img → Patch+VQ  │  Audio → Mel+VQ  │  Video → VQ │  │    │
│  │                                                                     │    │
│  │  Special Tokens: <IMG> <AUDIO> <VIDEO> <FUSE> <CODE> <THINK>        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │              SafeTensors Shards (RT-DLM Compatible)                 │    │
│  │  ┌───────────┐ ┌───────────────┐ ┌──────────────┐ ┌────────────┐    │    │
│  │  │ input_ids │ │attention_mask │ │modality_mask │ │  targets   │    │    │
│  │  │  int32    │ │    uint8      │ │    uint8     │ │   int32    │    │    │
│  │  └───────────┘ └───────────────┘ └──────────────┘ └────────────┘    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│       │                                                                     │
│  ┌────┴──────────────────────────────────────────────────────────────┐      │
│  │  Observability            │  Orchestration                        │      │
│  │  MLflow / W&B / Local     │  Argo Workflows / Ray / Helm          │      │
│  │  Lineage + Data Cards     │  K8s + DGX Cloud                      │      │
│  └───────────────────────────┴───────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                           ┌─────────────────┐
                           │     RT-DLM      │
                           │  (JAX / Haiku)  │
                           │  Distributed    │
                           │  Training       │
                           └─────────────────┘

Capabilities

| Category | Feature |
|---|---|
| Schema | SafeTensors 4-tensor schema v2 (targets, uint8 masks) |
| Schema | Video frame extraction + VQ tokenizer |
| Schema | 16 special tokens (<IMG>, <VIDEO>, <FUSE>, <THINK>, etc.) |
| Quality | GPT-2 perplexity filter |
| Quality | LLM-as-Judge quality scoring |
| Quality | FAISS embedding deduplication |
| Quality | Local data augmentation (sentence shuffle, noise, back-translate) |
| Observability | MLflow / W&B experiment tracking |
| Observability | Per-sample lineage (source → shard provenance) |
| Observability | Auto data card generation |
| Orchestration | Argo Workflows DAG orchestration |
| Orchestration | Helm chart for K8s deployment |
| Orchestration | Ray distributed pipeline runner |
| Orchestration | Distributed coordinator + workers (Redis or in-memory) |
| Orchestration | Embedded single-machine mode (no Redis) |
| Orchestration | Worker failure detection + automatic task requeue |
| Compliance | License detection (permissive/copyleft) |
| Compliance | Full audit logging (JSONL) |
| Compliance | E2E schema validation tests |
| Security | Multi-jurisdiction PII scrubbing (15+ countries) |
| Security | Credential / secret sanitization |
| Security | IRSA / Workload Identity (no static keys) |
| Pipeline | process command: raw data → production .safetensors shards |

Installation

Install from PyPI (no clone needed)

# Core text pipeline
pip install auralith-data-pipeline

# With all extras (multimodal, cloud, distributed, dev tools — ~3 GB)
pip install "auralith-data-pipeline[all]"

# Pick only what you need
pip install "auralith-data-pipeline[quality]"        # + perplexity filter + FAISS dedup
pip install "auralith-data-pipeline[distributed]"    # + Ray
pip install "auralith-data-pipeline[cloud,pdf]"      # + S3/GCS/Azure + PDF extraction

Once installed, the CLI is available globally:

auralith-pipeline --help
auralith-pipeline list-datasets
auralith-pipeline process --input data/ --output shards/

Developer setup (clone + editable install)

One-command setup (recommended)

git clone https://github.com/AuralithAI/Auralith-Data-Pipeline.git
cd Auralith-Data-Pipeline

chmod +x setup.sh
./setup.sh            # Creates .venv + installs ALL extras (~3 GB with PyTorch)

The setup script automatically:

  • Detects your OS (macOS, Linux/WSL, Windows via Git Bash)
  • Checks for Python 3.10+ — if missing, offers to install it via your system package manager (brew, apt, dnf, pacman, winget, etc.)
  • Installs system build dependencies (libffi, openssl, gcc, etc.) needed for native extensions
  • Creates a .venv virtual environment and activates it
  • Bootstraps pip (via ensurepip) if not available, then upgrades pip + setuptools + wheel
  • Installs the package in editable mode with all dependencies
  • Verifies the installation and CLI entry point

Lighter install profiles:

./setup.sh --core     # Core text pipeline only (~500 MB)
./setup.sh --dev      # Core + dev tools (pytest, black, ruff, mypy)
./setup.sh --help     # Show all options

After setup completes, activate the environment and you're ready:

source .venv/bin/activate    # Linux / macOS / WSL
# or .\.venv\Scripts\activate  # Windows (PowerShell)

auralith-pipeline --help

Manual installation

git clone https://github.com/AuralithAI/Auralith-Data-Pipeline.git
cd Auralith-Data-Pipeline

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or .\.venv\Scripts\activate  # Windows

# Recommended — install everything (includes multimodal, cloud, dev tools)
pip install -e ".[all]"            # ~3 GB with PyTorch

# Or install only what you need
pip install -e .                   # Core only (text pipeline)
pip install -e ".[quality]"        # + Perplexity filter + FAISS dedup
pip install -e ".[tracking]"       # + MLflow + W&B
pip install -e ".[distributed]"    # + Ray
pip install -e ".[multimodal]"     # + Video + image + audio (PyTorch)
pip install -e ".[cloud,pdf]"      # + Cloud storage + PDF extraction

Tip: If you plan to use the full CLI (tokenizer training, multimodal processing, distributed jobs), install with [all] to avoid missing dependency errors at runtime.

Quick Start

CLI Usage

When you run any auralith-pipeline command, a vibrant startup banner is displayed with version and environment info. Suppress it with --no-banner or set AURALITH_NO_BANNER=1.

# List available datasets
auralith-pipeline list-datasets

# Process Wikipedia dataset (production preset)
auralith-pipeline collect \
  --dataset wikipedia \
  --output ./data/shards \
  --max-samples 100000 \
  --preset production

End-to-End Workflow: Raw Data → Trained Tokenizers → Production Shards

The full pipeline has three stages:

  1. Prepare raw data — gather text, images, audio, and video into a folder
  2. Train tokenizers — learn BPE vocabulary + VQ codebooks from your data
  3. Process — tokenize everything and produce .safetensors shards for RT-DLM
  data/raw/              tokenizers/                    shards/
  ├── docs/*.txt    ──►  ├── text/   (BPE)         ──►  ├── shard_000000.safetensors
  ├── imgs/*.npy    ──►  ├── image/  (VQ codebook) ──►  ├── shard_000001.safetensors
  ├── audio/*.npy   ──►  ├── audio/  (VQ codebook) ──►  └── ...
  └── videos/*.mp4  ──►  └── video/  (VQ codebook) ──►

Step 1 — Prepare Raw Data

Organise your data into a single directory. The pipeline auto-detects file types:

| Modality | Accepted formats |
|---|---|
| Text | .txt, .md, .rst, .csv, .json, .jsonl, .tsv, .xml, .html, .py, .rs |
| Image | .jpg, .jpeg, .png, .bmp, .tiff, .webp |
| Audio | .wav, .mp3, .flac, .ogg, .m4a |
| Video | .mp4, .avi, .mov, .mkv, .webm |
| Image / Audio | .npy, resolved by parent directory name (see below) |

.npy disambiguation: Because .npy arrays can represent either images (H, W, 3) or audio waveforms, the pipeline inspects the parent directory name. Place image arrays under a folder whose name contains image, img, photo, picture, or visual; place audio arrays under a folder containing audio, speech, sound, music, or waveform. Files in directories that match neither keyword are skipped.

data/raw/
├── corpus/
│   ├── wikipedia.txt
│   └── books.txt
├── images/
│   ├── img_001.npy
│   └── img_002.npy
├── audio/
│   ├── speech_001.npy
│   └── speech_002.npy
└── videos/
    └── lecture_001.mp4

Step 2 — Train Tokenizers

Train all modality tokenizers in one command:

auralith-pipeline train-tokenizer all \
  --corpus  data/raw/corpus/ \
  --images  data/raw/images/ \
  --audio   data/raw/audio/ \
  --videos  data/raw/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024 \
  --audio-codebook-size 512

This creates:

tokenizers/
├── text/          # BPE tokenizer (vocab.json, merges.txt, config.json)
├── image/         # Image VQ tokenizer (config.json, vq_codebook.json)
├── audio/         # Audio VQ tokenizer (config.json, vq_codebook.json)
└── video/         # Video VQ tokenizer (config.json, vq_codebook.json)

Or train each modality separately for finer control:

# Text BPE tokenizer
auralith-pipeline train-tokenizer text \
  --corpus data/raw/corpus/ \
  --output tokenizers/text \
  --vocab-size 32000

# Image VQ tokenizer
auralith-pipeline train-tokenizer image \
  --images data/raw/images/ \
  --output tokenizers/image \
  --codebook-size 1024 \
  --image-size 224 \
  --patch-size 16

# Audio VQ tokenizer
auralith-pipeline train-tokenizer audio \
  --audio data/raw/audio/ \
  --output tokenizers/audio \
  --codebook-size 512 \
  --sample-rate 16000

# Video VQ tokenizer
auralith-pipeline train-tokenizer video \
  --videos data/raw/videos/ \
  --output tokenizers/video \
  --codebook-size 1024 \
  --max-frames 32

Tip: Store trained tokenizers in version control or cold storage (S3/GCS). They are small (~2 MB each) and must stay frozen for the lifetime of a model.

Step 3 — Process Raw Data into Shards

auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

Each shard is a .safetensors file with the schema v2 tensors (input_ids, attention_mask, modality_mask, targets), ready for RT-DLM training.

Step 4 — Feed into RT-DLM

# Upload shards to cloud storage
auralith-pipeline upload --source shards/ --dest s3://my-bucket/training-data/

# Or upload to HuggingFace Hub
auralith-pipeline upload --source shards/ --dest hf://AuralithAI/training-shards

# Train RT-DLM
python src/train.py --data-dir shards/

Python API

from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

# Configure pipeline
config = PipelineConfig.from_preset("production")

# Create and run pipeline
pipeline = Pipeline(config)
source = create_source("wikipedia", streaming=True, max_samples=1_000_000)
pipeline.add_source(source)

stats = pipeline.run()
print(stats.summary())

Using Processed Shards with RT-DLM

from safetensors.numpy import load_file

shard = load_file("data/shards/shard_00000.safetensors")

input_ids      = shard["input_ids"]       # (batch, seq_len) — int32
attention_mask = shard["attention_mask"]   # (batch, seq_len) — uint8 (1=real, 0=pad)
modality_mask  = shard["modality_mask"]    # (batch, seq_len) — uint8 (0=text,1=img,2=aud,3=vid,4=code)
targets        = shard["targets"]         # (batch, seq_len) — int32, right-shifted input_ids

# targets[:, t] == input_ids[:, t+1]  (causal LM next-token prediction)
# JAX uses attention_mask to zero out padding in the loss — no -100 ignore index.

# Feed directly to RT-DLM training
# python src/train.py --data-dir ./data/shards

SafeTensors Schema (v2)

Every shard is RT-DLM compatible. All sequences are padded/truncated to a fixed seq_len (default 2048) for JAX compatibility.

| Tensor | Dtype | Shape | Description |
|---|---|---|---|
| input_ids | int32 | (batch, seq_len) | All tokens (text + image + audio + video + code) |
| attention_mask | uint8 | (batch, seq_len) | 1 = real token, 0 = padding |
| modality_mask | uint8 | (batch, seq_len) | 0=text, 1=image, 2=audio, 3=video, 4=code |
| targets | int32 | (batch, seq_len) | Right-shifted input_ids for causal LM (next-token prediction) |

Schema v2 changes (from v1): labels → targets (right-shifted, no −100 ignore index), attention_mask dtype int32 → uint8 (4× memory savings), fixed-length padding, SHA-256 checksums (was MD5).
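The target construction described above (targets[t] == input_ids[t+1], padding instead of a −100 ignore index) can be sketched in a few lines; the attention mask, not a sentinel value, is what excludes the trailing pad position from the loss:

```python
PAD_ID = 0  # <PAD> special token, per the token ID layout below

def make_targets(input_ids: list[int]) -> list[int]:
    """Shift input_ids by one position for next-token prediction.

    The final position has no successor, so it is filled with PAD
    and masked out of the loss by attention_mask."""
    return input_ids[1:] + [PAD_ID]
```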

Token ID Layout

| Range | Purpose |
|---|---|
| 0–15 | Special tokens (see below) |
| 16–271 | Byte tokens (<byte_00>–<byte_ff>), lossless UTF-8 fallback |
| 272+ | BPE merge tokens (learned vocabulary) |
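The byte-fallback range works by offsetting each raw byte value by 16, so any UTF-8 input round-trips losslessly even when no BPE merge covers it. A minimal sketch of the mapping (function names are illustrative, not the tokenizer's API):

```python
BYTE_OFFSET = 16  # byte tokens occupy IDs 16-271

def bytes_to_token_ids(text: str) -> list[int]:
    """Encode text purely with byte-fallback tokens."""
    return [BYTE_OFFSET + b for b in text.encode("utf-8")]

def token_ids_to_text(ids: list[int]) -> str:
    """Invert the byte-fallback encoding."""
    return bytes(i - BYTE_OFFSET for i in ids).decode("utf-8")
```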

Special Tokens (IDs 0–15)

| ID | Token | Purpose |
|---|---|---|
| 0 | <PAD> | Padding |
| 1 | <UNK> | Unknown |
| 2 | <BOS> | Beginning of sequence |
| 3 | <EOS> | End of sequence |
| 4 | <IMG> | Image region start |
| 5 | <IMG_END> | Image region end |
| 6 | <AUDIO> | Audio region start |
| 7 | <AUDIO_END> | Audio region end |
| 8 | <VIDEO> | Video region start |
| 9 | <VIDEO_END> | Video region end |
| 10 | <FUSE> | Cross-modal fusion |
| 11 | <SEP> | Separator |
| 12 | <MASK> | Masked LM |
| 13 | <CODE> | Code block start |
| 14 | <CODE_END> | Code block end |
| 15 | <THINK> | Chain-of-thought |

Features

Data Processing

  • Multi-source ingestion (HuggingFace, Common Crawl, local files, video)
  • Weighted round-robin interleaving across multiple sources
  • MinHash + FAISS embedding deduplication
  • Quality filtering (length, language, perplexity, LLM-as-Judge)
  • PII removal (automatic detection and redaction)
  • License compliance scanning for code data
  • Document extraction (PDF, DOCX, HTML, Markdown)
  • SafeTensors sharding with Zstd compression and SHA-256 checksums
  • Streaming checkpointing with seeded reproducibility (numpy + stdlib RNG)
  • Deterministic resumption from checkpoint (skip-ahead + RNG state restore)

Tokenization

  • Custom BPE tokenizer (16 special tokens, byte-level fallback, no external dependency)
  • 256 byte tokens (IDs 16–271) for lossless UTF-8 encoding of any input
  • LRU-bounded merge cache (100k entries) for fast encoding
  • Vector quantization for images, audio, and video
  • Multimodal token fusion with encode_with_mask()
  • Configurable vocab size (32k–128k)
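The LRU-bounded merge cache noted above can be illustrated with the standard library's lru_cache; the real tokenizer's cache is internal, and the byte-fallback body here is only a stand-in so the sketch stays self-contained:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)  # mirrors the 100k-entry bound noted above
def encode_word(word: str) -> tuple[int, ...]:
    # Stand-in for BPE merge application: fall back to byte tokens
    # (IDs 16-271). Repeated words hit the cache instead of re-encoding.
    return tuple(16 + b for b in word.encode("utf-8"))
```

Because frequent words dominate natural text, a bounded cache like this keeps per-sample encoding well under a millisecond without unbounded memory growth.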

Quality & Compliance

  • Perplexity filter: GPT-2 based scoring with configurable thresholds
  • LLM-as-Judge: Score coherence, toxicity, educational value
  • FAISS dedup: Cosine similarity with IVFFlat/IVFPQ indexes
  • License detection: Permissive vs copyleft classification
  • Audit logging: Full accept/reject decisions to JSONL
  • Local augmentation: Sentence shuffle, paragraph extract, token noise, back-translate

Observability

  • MLflow / W&B experiment tracking (params, metrics, artifacts)
  • Per-sample lineage — track every sample from source to shard
  • Auto data cards — HuggingFace-compatible README.md generation

Orchestration

  • Argo Workflows — DAG-based parallel dataset processing
  • Helm chart — deploy on any K8s cluster or DGX Cloud
  • Ray — horizontal scaling across machines

Storage & Deployment

  • Cloud storage (HuggingFace Hub, S3, GCS, Azure Blob)
  • Docker support for containerized deployment
  • CI via GitHub Actions (lint, test, build)

Configuration

# configs/production.yaml
pipeline:
  name: production-pipeline
  output_dir: ./data/shards
  deduplicate: true
  quality_filter: true
  remove_pii: true
  seed: 42                       # Reproducibility (numpy + stdlib RNG)
  checkpoint_every: 10000        # Save resume checkpoint every N accepted samples

advanced_quality:
  enabled: true
  perplexity_filter: true
  max_perplexity: 1500.0

deduplication:
  method: minhash    # or: embedding (FAISS)
  minhash_threshold: 0.85

tracking:
  enabled: true
  backend: local     # or: mlflow, wandb

compliance:
  enabled: true
  license_detection: true
  allow_copyleft: false
  audit_log_path: ./data/audit/audit.jsonl

video:
  enabled: false
  frame_strategy: uniform
  max_frames: 32

See configs/production.yaml for the full configuration reference.
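With deduplication.method set to minhash, documents are shingled and compared via MinHash signatures; two documents whose estimated Jaccard similarity exceeds minhash_threshold (0.85 above) are treated as duplicates. A minimal stdlib sketch of the idea (the pipeline's actual implementation lives in preprocessing/preprocessor.py and may differ):

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(doc: set[str], num_perm: int = 128) -> list[int]:
    """One keyed hash family per 'permutation'; keep the minimum hash."""
    sig = []
    for seed in range(num_perm):
        key = seed.to_bytes(2, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, key=key).digest(),
                "big",
            )
            for s in doc
        ))
    return sig

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots estimates set similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```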

Deploy to DGX Cloud in 5 Steps

# 1. Build container
docker build -t auralith-pipeline:latest .

# 2. Push to registry
docker tag auralith-pipeline:latest nvcr.io/YOUR_ORG/auralith-pipeline:latest
docker push nvcr.io/YOUR_ORG/auralith-pipeline:latest

# 3. Install Helm chart
helm install auralith docker/kubernetes/helm/ \
  --set image.repository=nvcr.io/YOUR_ORG/auralith-pipeline \
  --set image.tag=latest \
  --set pipeline.config=production

# 4. Submit Argo workflow (parallel datasets)
argo submit docker/kubernetes/argo-workflow.yaml

# 5. Monitor with Ray dashboard
ray dashboard  # http://localhost:8265

Available Datasets

| Dataset | Size | Description |
|---|---|---|
| wikipedia | 20GB | English Wikipedia (verified) |
| c4 | 750GB | Cleaned Common Crawl |
| redpajama | 1.2TB | LLaMA training data |
| openwebtext | 40GB | Reddit links |
| bookcorpus | 5GB | 11k books |
| wikitext | 500MB | Wikipedia subset |
| dolly | 15MB | Instruction following |
| the_stack | 3TB | Source code (deduplicated) |

Tokenization

Training Tokenizers (Detailed Guide)

Before you can process raw files into shards, you need trained tokenizers for each modality your model will consume. These tokenizers are frozen artifacts — once trained, they must not change for the entire model's lifecycle.

Why train your own tokenizers?

  • Text (BPE): Learns subword units tuned to your domain vocabulary (e.g. medical, legal, code).
  • Image (VQ): Learns a discrete codebook that maps image patches → token IDs.
  • Audio (VQ): Learns a codebook over mel-spectrogram patches.
  • Video (VQ): Same as image, but trained on video frames for temporal consistency.

Recommended Training Data Sizes

| Modality | Minimum | Recommended | Notes |
|---|---|---|---|
| Text | 1 MB | 1–10 GB | More data = better subword coverage |
| Image | 100 images | 10k+ images | .npy arrays (H, W, 3) or JPEG/PNG |
| Audio | 100 files | 10k+ files | .npy waveforms or .wav/.flac |
| Video | 50 videos | 1k+ videos | .mp4/.avi/.mov, frames extracted automatically |

Training Commands

# All at once (recommended)
auralith-pipeline train-tokenizer all \
  --corpus  data/corpus/ \
  --images  data/images/ \
  --audio   data/audio/ \
  --videos  data/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024 \
  --audio-codebook-size 512 \
  --image-size 224 \
  --patch-size 16 \
  --sample-rate 16000 \
  --max-frames 32

# Or individually
auralith-pipeline train-tokenizer text  --corpus data/corpus.txt --output tokenizers/text --vocab-size 32000
auralith-pipeline train-tokenizer image --images data/images/    --output tokenizers/image --codebook-size 1024
auralith-pipeline train-tokenizer audio --audio  data/audio/     --output tokenizers/audio --codebook-size 512
auralith-pipeline train-tokenizer video --videos data/videos/    --output tokenizers/video --codebook-size 1024

Output Structure

tokenizers/
├── text/
│   ├── vocab.json       # Token → ID mapping
│   ├── merges.txt       # BPE merge rules (ordered)
│   └── config.json      # Tokenizer hyperparameters
├── image/
│   ├── config.json      # image_size, patch_size, codebook_size
│   └── vq_codebook.json # Learned VQ centroids
├── audio/
│   ├── config.json      # sample_rate, n_fft, codebook_size
│   └── vq_codebook.json
└── video/
    ├── config.json      # image_size, patch_size, max_frames
    └── vq_codebook.json

Cold Storage: Archive tokenizers/ to S3/GCS alongside your model checkpoints. The process command reads these frozen tokenizers each time it converts raw data into shards.

Processing Raw Data into Shards

Once tokenizers are trained, use process to convert raw files into production shards:

auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

| Option | Default | Description |
|---|---|---|
| --input | | Folder with raw .txt/.jpg/.wav/.mp4 etc. |
| --output | | Where .safetensors shards are written |
| --tokenizers | | Root folder containing text/, image/, audio/, video/ subdirs |
| --max-seq-len | 4096 | Maximum token sequence length per sample |
| --shard-size | 10000 | Maximum samples per shard file |

Each shard contains 4 tensors matching the SafeTensors Schema v2: input_ids, attention_mask, modality_mask, and targets.

Token ID Layout

| Range | Purpose |
|---|---|
| 0–15 | Special tokens (see below) |
| 16–271 | Byte tokens (<byte_00>–<byte_ff>), lossless UTF-8 fallback |
| 272+ | BPE merge tokens (learned vocabulary) |
| 100,000+ | Image VQ codes (offset to avoid collisions) |
| 200,000+ | Audio VQ codes |
| 300,000+ | Video VQ codes |
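The offsets above keep each modality's codebook in a disjoint region of the shared ID space, so a token ID alone identifies its modality. A sketch of the mapping, using the offset values from the table (function names are illustrative):

```python
# Offsets taken from the token ID layout table above.
MODALITY_OFFSETS = {"image": 100_000, "audio": 200_000, "video": 300_000}

def vq_code_to_token_id(code: int, modality: str) -> int:
    """Lift a raw VQ codebook index into the shared token ID space."""
    return MODALITY_OFFSETS[modality] + code

def token_id_to_modality(token_id: int) -> str:
    """Recover the modality from a token ID's range."""
    if token_id >= 300_000:
        return "video"
    if token_id >= 200_000:
        return "audio"
    if token_id >= 100_000:
        return "image"
    return "text"  # special, byte, or BPE token
```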

Distributed Processing

For large datasets, distribute tokenization and sharding across multiple CPU cores or machines. Two modes are supported:

Embedded Mode (single machine, no Redis)

Spins up a coordinator and N workers in-process using an in-memory state store — perfect for a beefy server with many cores:

auralith-pipeline submit-job \
  -c configs/distributed.yaml \
  -i data/raw/ \
  -o shards/ \
  -t tokenizers/ \
  --embedded \
  -w 8                   # 8 workers

External Mode (multi-machine, Redis)

Run the coordinator, workers, and job submission in separate terminals (or on separate machines). Requires a Redis instance for shared state.

# Terminal 1 — start the coordinator
auralith-pipeline coordinator \
  -c configs/distributed.yaml \
  --host 0.0.0.0 --port 8080

# Terminal 2..N — start workers (can be on different machines)
auralith-pipeline worker \
  -c configs/distributed.yaml \
  --coordinator 10.0.0.1:8080 \
  --worker-id worker-1

# Terminal N+1 — submit the job
auralith-pipeline submit-job \
  -c configs/distributed.yaml \
  -i data/raw/ -o shards/ -t tokenizers/ \
  --external

How it works

  1. DistributedPipeline scans the input directory for raw files.
  2. Files are split into tasks (--files-per-task, default 500).
  3. Tasks are submitted to the Coordinator, which assigns them to workers via a pluggable strategy (round-robin, least-busy, or dynamic).
  4. Each Worker tokenizes its assigned files and writes SafeTensors shards.
  5. The coordinator monitors heartbeats; if a worker dies, its tasks are automatically requeued (up to max_retries).
  6. The pipeline polls until all tasks complete (or time out).
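The round-robin strategy from step 3 can be sketched in a few lines; the real strategies live in distributed/strategies.py, and this shape is illustrative only:

```python
from itertools import cycle

def round_robin_assign(tasks: list[str], workers: list[str]) -> dict[str, list[str]]:
    """Assign each task to the next worker in rotation."""
    assignment: dict[str, list[str]] = {w: [] for w in workers}
    rotation = cycle(workers)
    for task in tasks:
        assignment[next(rotation)].append(task)
    return assignment
```

Least-busy and dynamic strategies refine this by consulting each worker's current queue depth or heartbeat-reported load before assigning.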

Cloud Redis

For production multi-machine deployments, point the coordinator at a managed Redis instance (ElastiCache, Memorystore, etc.):

# configs/distributed.yaml
coordinator:
  state_store_type: redis
  state_store_host: my-redis.xxxx.cache.amazonaws.com
  state_store_port: 6379
  state_store_password: "<your-auth-token>"

Performance

| Operation | Speed | Notes |
|---|---|---|
| Text preprocessing | 10k samples/sec | Single core |
| MinHash deduplication | 5k samples/sec | With LSH index |
| FAISS dedup | 3k samples/sec | IVFFlat on CPU |
| Perplexity filter | 500 samples/sec | GPT-2, GPU |
| BPE encoding | <1 ms/sample | LRU-cached (100k entries) |
| SafeTensors writing | 50 MB/s | Zstd compressed |
| Image tokenization | 50 ms/image | 224×224, 196 patches |
| Video tokenization | 200 ms/video | 32 frames, uniform |
| Ray distributed | Linear scaling | Up to 64 workers |

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run E2E schema validation
pytest tests/test_e2e_schema.py -v

# Code quality
black src/ tests/ scripts/
ruff check src/ tests/ scripts/
mypy src/

Project Structure

auralith_pipeline/
├── cli.py                        # CLI commands (collect, process, train-tokenizer, etc.)
├── pipeline.py                   # Main pipeline (tracking, compliance, security)
├── config/                       # Configuration management
│   └── pipeline_config.py        # All config dataclasses
├── sources/
│   ├── data_sources.py           # HuggingFace, local, JSONL sources
│   └── video.py                  # Video frame sampler
├── preprocessing/
│   ├── preprocessor.py           # Text normalization, MinHash, PII
│   ├── quality.py                # Perplexity + LLM judge
│   ├── deduplication.py          # FAISS embedding dedup
│   ├── synthetic.py              # Local data augmentation
│   └── compliance.py             # License detection + audit
├── tokenization/
│   ├── bpe_tokenizer.py          # Custom BPE (16 special tokens)
│   ├── multimodal_tokenizer.py   # Text+Image+Audio+Video fusion
│   ├── video_tokenizer.py        # Video VQ tokenizer
│   └── tokenizer.py              # TokenizedSample + pipeline wrapper
├── sharding/
│   └── shard_writer.py           # SafeTensors writer (4-tensor schema v2)
├── security/
│   ├── pii_scrubber.py           # Multi-jurisdiction PII detection
│   ├── data_sanitizer.py         # Credential / secret sanitization
│   ├── privacy_config.py         # Privacy policies + PII categories
│   └── audit.py                  # Privacy audit logger
├── storage/
│   └── backends.py               # HF Hub, S3, GCS, Azure
├── distributed/
│   ├── coordinator.py            # Job manager, task scheduling, failure recovery
│   ├── worker.py                 # Worker node (tokenize + shard writing)
│   ├── pipeline.py               # DistributedPipeline (embedded + external modes)
│   ├── state.py                  # StateStore ABC (Redis + InMemory)
│   ├── strategies.py             # Round-robin, least-busy, dynamic assignment
│   ├── client.py                 # Monitoring / control client
│   └── config.py                 # Distributed config dataclasses
├── spark/                        # Apache Spark transforms
└── utils/
    ├── helpers.py                # Formatting utilities
    └── tracking.py               # MLflow/W&B + lineage

docker/kubernetes/
├── argo-workflow.yaml            # Argo DAG
└── helm/                         # Helm chart
    ├── Chart.yaml
    ├── values.yaml
    └── templates/

tests/
├── test_cli.py                   # CLI command tests (process, train-tokenizer, etc.)
├── test_distributed.py           # Distributed module tests (50 tests)
├── test_pipeline.py              # Core pipeline tests
├── test_tokenization.py          # Tokenizer tests
├── test_e2e_schema.py            # E2E validation
└── test_security.py              # Security & PII tests

Environment Variables

# HuggingFace Hub (required for upload)
export HF_TOKEN=hf_xxxxxxxxxxxxx

# AWS S3 (optional)
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxx

# MLflow (optional)
export MLFLOW_TRACKING_URI=http://mlflow.internal:5000

# W&B (optional)
export WANDB_API_KEY=xxxxx

Releasing

Every merge to main automatically tests, builds, publishes to PyPI, and creates a GitHub Release — zero manual steps.

How it works

  1. Test — Python 3.10 / 3.11 / 3.12
  2. Build — sdist + universal py3-none-any wheel
  3. Tag — auto-increments patch (v0.1.2 → v0.1.3)
  4. Publish — pushes to PyPI via OIDC trusted publisher
  5. Release — creates GitHub Release with changelog + wheel assets

Version is derived from git tags at build time via hatch-vcs — no hardcoded version strings anywhere.

Major / Minor bumps

Patch versions are automatic. For minor or major bumps, push a tag manually:

./scripts/bump-version.sh 0.2.0    # creates + pushes v0.2.0 tag
./scripts/bump-version.sh 1.0.0    # creates + pushes v1.0.0 tag

The pushed tag triggers the same release pipeline.

License

Apache License 2.0 — See LICENSE

Contributing

See CONTRIBUTING.md


Built by AuralithAI for RT-DLM

Download files


Source Distribution

auralith_data_pipeline-0.1.4.tar.gz (124.7 kB)

Uploaded Source

Built Distribution


auralith_data_pipeline-0.1.4-py3-none-any.whl (133.4 kB)

Uploaded Python 3

File details

Details for the file auralith_data_pipeline-0.1.4.tar.gz.

File metadata

  • Download URL: auralith_data_pipeline-0.1.4.tar.gz
  • Upload date:
  • Size: 124.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for auralith_data_pipeline-0.1.4.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 78b8f6d8b5c9b9981a5685cd00c48fa72c5416d0eba874c6593e9fe481874a56 |
| MD5 | 184096afa38656cba5a93ba7ec7ea683 |
| BLAKE2b-256 | 16e76aaa7274d0f6f437f7a2be62493c1d1b2b35701908d523930fea65d33b3e |


Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.4.tar.gz:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline


File details

Details for the file auralith_data_pipeline-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for auralith_data_pipeline-0.1.4-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc0e789e5eb2341df81d2296d45eaf58231f844b625d1bdd33f58743ca56ab1e |
| MD5 | 4bc9ac97ee6ee23a40c2bd872f1692d9 |
| BLAKE2b-256 | f6529c171d8ca6f594eca6e0a550784262d710dad3330a768425df078e536701 |


Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.4-py3-none-any.whl:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

