
Production-grade data collection and processing pipeline for training LLMs and multimodal AI

Project description

Auralith Data Pipeline

Production-grade multimodal data processing pipeline for training RT-DLM and large-scale AI systems.

Python 3.10+ | License: Apache 2.0 | Code style: black | Available on PyPI


Overview

Auralith Data Pipeline ingests raw text, images, audio, video, and code, applies production-quality curation (perplexity filtering, LLM-as-Judge scoring, FAISS deduplication, PII scrubbing, license detection), tokenizes everything through BPE + Vector Quantization, and outputs SafeTensors shards ready for distributed model training.

Pipeline Stages

  • Ingestion: Text (HuggingFace, Common Crawl, local), images (.npy/JPEG/PNG), audio (.wav/.npy), video (.mp4), code (TheStack)
  • Quality Curation: GPT-2 perplexity filter, LLM-as-Judge scoring, FAISS embedding dedup, license detection
  • Tokenization: BPE for text, patch + VQ for images, mel + VQ for audio, frame + VQ for video
  • Sharding: SafeTensors v2 schema with input_ids, attention_mask, modality_mask, targets
  • Observability: MLflow / W&B tracking, per-sample lineage, auto data cards
  • Orchestration: Argo Workflows, Helm/K8s, Ray, distributed coordinator + workers

Installation

# Core text pipeline
pip install auralith-data-pipeline

# With all extras (multimodal, cloud, distributed, dev tools)
pip install "auralith-data-pipeline[all]"

# Pick only what you need
pip install "auralith-data-pipeline[quality]"        # + perplexity filter + FAISS dedup
pip install "auralith-data-pipeline[distributed]"    # + Ray
pip install "auralith-data-pipeline[cloud,pdf]"      # + S3/GCS/Azure + PDF extraction
pip install "auralith-data-pipeline[multimodal]"     # + video/image/audio (PyTorch)
pip install "auralith-data-pipeline[tracking]"       # + MLflow + W&B

Quick Start

CLI

# List available datasets
auralith-pipeline list-datasets

# Process Wikipedia dataset
auralith-pipeline collect \
  --dataset wikipedia \
  --output ./data/shards \
  --max-samples 100000 \
  --preset production

End-to-End Workflow

# 1. Train tokenizers (BPE + VQ codebooks)
auralith-pipeline train-tokenizer all \
  --corpus  data/corpus/ \
  --images  data/images/ \
  --audio   data/audio/ \
  --videos  data/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024

# 2. Process raw data into SafeTensors shards
auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

# 3. Upload to cloud storage or HuggingFace Hub
auralith-pipeline upload --source shards/ --dest s3://my-bucket/training-data/

Python API

from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

config = PipelineConfig.from_preset("production")
pipeline = Pipeline(config)
source = create_source("wikipedia", streaming=True, max_samples=1_000_000)
pipeline.add_source(source)

stats = pipeline.run()
print(stats.summary())

Key Features

Data Processing

  • Multi-source ingestion (HuggingFace, Common Crawl, local files, video)
  • Weighted round-robin interleaving across multiple sources
  • MinHash + FAISS embedding deduplication
  • Quality filtering (length, language, perplexity, LLM-as-Judge)
  • PII removal (multi-jurisdiction, 15+ countries)
  • License compliance scanning for code data
  • Document extraction (PDF, DOCX, HTML, Markdown)
  • SafeTensors sharding with Zstd compression and SHA-256 checksums
  • Streaming checkpointing with seeded reproducibility
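Weighted interleaving across sources can be sketched in a few lines. This is a toy version that draws each source in proportion to its weight via seeded random sampling rather than a strict round-robin schedule; the function name and signature are illustrative, not the package's actual API.

```python
import random

def interleave(sources, weights, seed=42):
    """Yield samples from several iterables, drawing each source in
    proportion to its weight. Exhausted sources drop out; the seed
    makes the interleaving reproducible."""
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    alive = list(range(len(sources)))
    w = [float(x) for x in weights]
    while alive:
        # Pick a still-live source index, weighted by its share.
        i = rng.choices(alive, weights=[w[j] for j in alive])[0]
        try:
            yield next(iters[i])
        except StopIteration:
            alive.remove(i)

wiki = (f"wiki-{i}" for i in range(3))
code = (f"code-{i}" for i in range(3))
samples = list(interleave([wiki, code], weights=[3, 1]))
print(len(samples))  # 6 -- every sample emitted, wiki drawn more often early
```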

Tokenization

  • Custom BPE tokenizer with 16 special tokens and byte-level fallback
  • Vector quantization for images, audio, and video
  • Multimodal token fusion with encode_with_mask()
  • Configurable vocab size (32k-128k)
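The VQ step for non-text modalities boils down to nearest-codebook lookup: each patch, mel frame, or video frame embedding is replaced by the index of its closest codebook vector. A pure-Python sketch of that core idea (the real pipeline's codebooks are trained and the lookup is vectorized):

```python
def quantize(vectors, codebook):
    """Map each embedding to the index of its nearest codebook entry,
    by squared Euclidean distance. The returned indices are the
    discrete 'tokens' that enter the shared sequence."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy 3-entry codebook
tokens = quantize([[0.1, 0.0], [0.9, 1.2]], codebook)
print(tokens)  # [0, 1]
```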

Distributed Processing

  • Embedded mode — in-process coordinator + workers (no Redis needed)
  • External mode — multi-machine with Redis state store
  • Worker failure detection + automatic task requeue
  • Linear scaling up to 64+ workers
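The embedded mode described above can be pictured as a shared in-process queue with worker threads and failed tasks requeued, which is sketched below. This is illustrative only, not the package's coordinator API; the simulated failure and retry limit are stand-ins for real worker-failure detection.

```python
import queue
import threading

def run_embedded(tasks, num_workers=4, max_retries=2):
    """In-process coordinator + workers: a shared queue feeds worker
    threads, and a task that fails is requeued up to max_retries times."""
    q = queue.Queue()
    for t in tasks:
        q.put((t, 0))  # (task, attempt count)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task, attempts = q.get_nowait()
            except queue.Empty:
                return
            try:
                if task == "bad" and attempts == 0:
                    raise RuntimeError("simulated worker failure")
                with lock:
                    results.append(task)
            except RuntimeError:
                if attempts < max_retries:
                    q.put((task, attempts + 1))  # requeue for retry
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

done = run_embedded(["a", "bad", "b"])
print(sorted(done))  # ['a', 'b', 'bad'] -- 'bad' succeeds on its retry
```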

Observability & Compliance

  • MLflow / Weights & Biases experiment tracking
  • Per-sample lineage (source to shard provenance)
  • Auto-generated data cards (HuggingFace-compatible)
  • Full audit logging (JSONL) for accept/reject decisions
  • Credential and secret sanitization
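An accept/reject audit log in JSONL form, as described above, is just one JSON object per decision. A minimal sketch, with field names that are illustrative rather than the package's actual schema:

```python
import hashlib
import json

def audit_record(sample_id, text, decision, reason):
    """Serialize one accept/reject lineage record as a JSONL line:
    a stable sample id, a content hash for provenance, the decision,
    and the filter that triggered it."""
    return json.dumps({
        "sample_id": sample_id,
        "sha256": hashlib.sha256(text.encode()).hexdigest()[:16],
        "decision": decision,      # "accept" or "reject"
        "reason": reason,          # e.g. "perplexity>1500"
    })

line = audit_record("wiki-00042", "some document text", "reject",
                    "perplexity>1500")
print(json.loads(line)["decision"])  # reject
```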

SafeTensors Schema (v2)

Every output shard is directly compatible with RT-DLM training.

  • input_ids (int32, shape (batch, seq_len)): all tokens (text + image + audio + video + code)
  • attention_mask (uint8, shape (batch, seq_len)): 1 = real token, 0 = padding
  • modality_mask (uint8, shape (batch, seq_len)): 0 = text, 1 = image, 2 = audio, 3 = video, 4 = code
  • targets (int32, shape (batch, seq_len)): shifted input_ids for causal LM
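How the four tensors relate can be shown for a single sample. This is a pure-Python sketch of the schema's invariants, not the pipeline's writer code; the shift convention (position i predicts the next token) is the standard causal-LM one.

```python
def build_shard_row(tokens, modalities, seq_len, pad_id=0):
    """Pad one sample to seq_len and derive the four schema tensors.
    targets[i] is the token the model should predict after seeing
    position i, i.e. input_ids shifted by one, padded at the end."""
    n = len(tokens)
    pad = seq_len - n
    input_ids = tokens + [pad_id] * pad
    attention_mask = [1] * n + [0] * pad
    modality_mask = modalities + [0] * pad
    targets = input_ids[1:] + [pad_id]
    return input_ids, attention_mask, modality_mask, targets

# <BOS>=2, two text tokens, <EOS>=3, padded to length 6
input_ids, attention_mask, modality_mask, targets = \
    build_shard_row([2, 7, 8, 3], [0, 0, 0, 0], seq_len=6)
print(targets)  # [7, 8, 3, 0, 0, 0]
```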

Special Tokens

  • 0 <PAD>: Padding
  • 1 <UNK>: Unknown
  • 2 <BOS>: Beginning of sequence
  • 3 <EOS>: End of sequence
  • 4-5 <IMG> / <IMG_END>: Image region
  • 6-7 <AUDIO> / <AUDIO_END>: Audio region
  • 8-9 <VIDEO> / <VIDEO_END>: Video region
  • 10 <FUSE>: Cross-modal fusion
  • 11 <SEP>: Separator
  • 12 <MASK>: Masked LM
  • 13-14 <CODE> / <CODE_END>: Code block
  • 15 <THINK>: Chain-of-thought
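The region tokens above bracket non-text spans in the fused sequence, with a parallel modality mask marking which positions are which. A sketch of that layout, in the spirit of encode_with_mask(); the code offset and the choice to mark the bracket tokens as text are assumptions, not the library's documented behavior.

```python
PAD, BOS, EOS, IMG, IMG_END = 0, 2, 3, 4, 5
IMAGE_CODE_OFFSET = 32000  # hypothetical id range where VQ codes start

def fuse_text_and_image(text_ids, image_codes):
    """Interleave text tokens with a VQ-coded image region, bracketing
    it with <IMG>/<IMG_END>, and emit a parallel modality mask
    (0 = text, 1 = image)."""
    ids = ([BOS] + text_ids + [IMG]
           + [IMAGE_CODE_OFFSET + c for c in image_codes]
           + [IMG_END, EOS])
    mods = ([0] * (1 + len(text_ids) + 1)   # BOS, text, <IMG> as text
            + [1] * len(image_codes)         # image codes
            + [0, 0])                        # <IMG_END>, EOS as text
    return ids, mods

ids, mods = fuse_text_and_image([101, 102], [7, 9])
print(mods)  # [0, 0, 0, 0, 1, 1, 0, 0]
```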

Available Datasets

  • wikipedia (20 GB): English Wikipedia
  • c4 (750 GB): Cleaned Common Crawl
  • redpajama (1.2 TB): Open reproduction of the LLaMA training data
  • openwebtext (40 GB): Web text from Reddit-submitted links
  • bookcorpus (5 GB): 11k books
  • the_stack (3 TB): Source code (deduplicated)

Performance

  • Text preprocessing: 10k samples/sec
  • MinHash deduplication: 5k samples/sec
  • FAISS dedup: 3k samples/sec
  • BPE encoding: < 1 ms/sample
  • SafeTensors writing: 50 MB/s
  • Image tokenization: 50 ms/image
  • Video tokenization: 200 ms/video

Configuration

# configs/production.yaml
pipeline:
  name: production-pipeline
  output_dir: ./data/shards
  deduplicate: true
  quality_filter: true
  remove_pii: true
  seed: 42
  checkpoint_every: 10000

advanced_quality:
  enabled: true
  perplexity_filter: true
  max_perplexity: 1500.0
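The YAML above maps naturally onto small typed config objects. A minimal sketch that mirrors the keys shown, assuming the actual PipelineConfig differs in structure and defaults:

```python
from dataclasses import dataclass

@dataclass
class AdvancedQuality:
    """Mirrors the advanced_quality: block above."""
    enabled: bool = True
    perplexity_filter: bool = True
    max_perplexity: float = 1500.0

@dataclass
class PipelineSettings:
    """Mirrors the pipeline: block above."""
    name: str = "production-pipeline"
    output_dir: str = "./data/shards"
    deduplicate: bool = True
    quality_filter: bool = True
    remove_pii: bool = True
    seed: int = 42
    checkpoint_every: int = 10000

cfg = PipelineSettings()
quality = AdvancedQuality()
print(cfg.seed, quality.max_perplexity)  # 42 1500.0
```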

Documentation

For the full documentation, architecture diagrams, distributed processing guide, and contributor guide, visit the GitHub repository.


License

Apache License 2.0 — see LICENSE.


Built by AuralithAI for RT-DLM.

Project details


Download files

Download the file for your platform.

Source Distribution

auralith_data_pipeline-0.1.9.tar.gz (123.5 kB)

Uploaded Source

Built Distribution


auralith_data_pipeline-0.1.9-py3-none-any.whl (138.8 kB)

Uploaded Python 3

File details

Details for the file auralith_data_pipeline-0.1.9.tar.gz.

File metadata

  • Download URL: auralith_data_pipeline-0.1.9.tar.gz
  • Upload date:
  • Size: 123.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for auralith_data_pipeline-0.1.9.tar.gz
Algorithm Hash digest
SHA256 01bc22684b45ea6114347c299c79fc5385c6a086258b63dbf1a8289aef336005
MD5 6e84a1cc8dea851bec2b800e557b4646
BLAKE2b-256 d09c329f21366df7c52af52c29864cca38c3e274d250fea8f7de79b576fd548e

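A downloaded archive can be checked against the SHA256 digest published above with the standard library alone. A small sketch that streams the file in chunks so large archives never need to fit in memory; the throwaway temp file stands in for the real .tar.gz:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading in 1 MiB
    chunks. Compare the result with the published digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example with a throwaway file (replace with the downloaded archive):
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
digest = sha256_of(path)
os.remove(path)
print(digest[:12])  # 2cf24dba5fb0
```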

Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.9.tar.gz:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file auralith_data_pipeline-0.1.9-py3-none-any.whl.

File metadata

File hashes

Hashes for auralith_data_pipeline-0.1.9-py3-none-any.whl
Algorithm Hash digest
SHA256 91f4eb1289acd2ba90d738b4707f6a8ca0cc8e32fcf231ca7a446db241492490
MD5 df5964050fc3d11dbe45d1f3061e5e9e
BLAKE2b-256 c34fc4a4d5ddd51b130331400698fae6f2648b99a19e7157f1dadd1c4d9ad90b


Provenance

The following attestation bundles were made for auralith_data_pipeline-0.1.9-py3-none-any.whl:

Publisher: release.yml on AuralithAI/Auralith-Data-Pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
