Cureta: Unified Data Curation Framework by Mira — a pipeline of Taggers that enrich a standard Document object.
cureta
Unified Data Curation Framework — enrich text datasets at scale with a composable pipeline of feature-extraction Taggers.
Install
pip install -e ".[ray,llm,dev]" # full install (Ray + LLM taggers + dev tools)
pip install -e ".[ray,dev]" # heuristic taggers only (no GPU required)
Requires: Python ≥ 3.10
Quick example
from cureta import quick_tag
tags = quick_tag(
"यह एक नमूना दस्तावेज़ है।",
pipeline_config="pipelines/example/pipeline_config.yaml",
)
# {'cureta_id': {'cureta_id': 'a3f8d2c...'}, 'num_words': {'num_words': 5}}
Or from the command line:
# Create a small sample dataset
python -c "
import json, pathlib
rows = [{'text': f'This is sample document number {i}.'} for i in range(200)]
pathlib.Path('sample.jsonl').write_text('\n'.join(json.dumps(r) for r in rows))
print('Created sample.jsonl with 200 rows')
"
# Run the example pipeline
cureta run --pipeline example --dataset ./sample.jsonl --limit 100
For HuggingFace datasets, see docs/how_to/load_huggingface_data.md.
Documentation
| Tutorial: Your first pipeline | Get a working result in 5 minutes |
| Tutorial: Writing a custom tagger | Build and run your own tagger |
| Tutorial: Deploying on cloud | Run on GPU cluster with SkyPilot |
| Reference: CLI | cureta command reference |
| Reference: Python API | run_pipeline, quick_tag, Pipeline, ... |
| Reference: Taggers | All 45 built-in taggers with output schemas |
| Concepts: Execution model | SIMD vs SDMI, two-phase model, Ray Data |
| Writing taggers | Tier-1 and Tier-2 tagger authoring guide |
Examples
| 01_quickstart | Local data, four heuristic taggers, Parquet output |
| 02_custom_tagger | Write and run a custom tagger |
| 03_huggingface | Load from HuggingFace, use text_column |
| 04_programmatic_api | All four Python API surfaces |
| 05_streaming | stream_tagged_batches → in-memory / Kafka |
Key concepts
- Document: canonical data envelope with raw_content, document_id, metadata, and tags
- Tagger: atomic feature-extraction unit; subclass CPUTagger or GPUTagger, implement process_document (Tier 1) or run (Tier 2)
- Pipeline: orchestrator that loads config, resolves dependencies, and dispatches taggers
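To make the three concepts concrete, here is a minimal, self-contained sketch of how a document envelope, two dependent taggers, and a dependency-ordered dispatch loop could fit together. It does not import cureta and is not its implementation: the `Document` and `Tagger` classes below are stand-ins, and `depends_on`, `IsShort`, and `run_taggers` are hypothetical names chosen for illustration. Only the field names and the nested tag shape (`{'num_words': {'num_words': ...}}`) are taken from this README.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Any


@dataclass
class Document:
    """Stand-in envelope; field names follow the Key concepts list."""
    raw_content: str
    document_id: str
    metadata: dict[str, Any] = field(default_factory=dict)
    tags: dict[str, dict[str, Any]] = field(default_factory=dict)


class Tagger:
    """Hypothetical base class: each tagger produces one entry in doc.tags."""
    name = "base"
    depends_on: tuple[str, ...] = ()  # assumed way to declare dependencies

    def process_document(self, doc: Document) -> dict[str, Any]:
        raise NotImplementedError


class NumWords(Tagger):
    name = "num_words"

    def process_document(self, doc: Document) -> dict[str, Any]:
        # Whitespace tokenization; a real tagger may count words differently.
        return {"num_words": len(doc.raw_content.split())}


class IsShort(Tagger):
    name = "is_short"
    depends_on = ("num_words",)

    def process_document(self, doc: Document) -> dict[str, Any]:
        # Reads the output of an upstream tagger instead of recomputing it.
        return {"is_short": doc.tags["num_words"]["num_words"] < 10}


def run_taggers(doc: Document, taggers: list[Tagger]) -> Document:
    """Run taggers in dependency order and store each result under its name."""
    by_name = {t.name: t for t in taggers}
    graph = {t.name: set(t.depends_on) for t in taggers}
    for name in TopologicalSorter(graph).static_order():
        doc.tags[name] = by_name[name].process_document(doc)
    return doc


# Taggers are passed out of order on purpose; the sorter fixes the order.
doc = run_taggers(
    Document("This is sample document number 7.", "doc-7"),
    [IsShort(), NumWords()],
)
print(doc.tags)
```

Declaring dependencies on the tagger and sorting them topologically means the caller never has to list taggers in the right order, which is one plausible reading of "resolves dependencies" above.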
45 built-in taggers ship with the library across three categories:
Heuristic taggers (CPU, no model weights) — word count, line/paragraph/word-length statistics, Gopher/C4/FineWeb quality filters, compression ratio, vocabulary diversity, MinHash/SimHash/exact dedup signals, PII regex, HTML density, Markdown structure, content type, URL parsing, domain blocklist, date extraction, and readability metrics.
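As a rough illustration of what model-free signals like these look like, the sketch below computes a few of them with the standard library only. It is not the code of the built-in taggers: the function name `heuristic_signals`, the output keys, and the exact definitions (whitespace tokenization, compression ratio as compressed bytes divided by raw bytes, vocabulary diversity as unique lowercase words divided by total words) are assumptions made for this example.

```python
import zlib


def heuristic_signals(text: str) -> dict[str, float]:
    """Compute a few cheap, CPU-only quality signals for one text."""
    words = text.split()
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw)
    return {
        "num_words": len(words),
        "mean_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Share of distinct words; low values suggest repetitive text.
        "vocab_diversity": len({w.lower() for w in words}) / len(words) if words else 0.0,
        # Compressed size over raw size; highly repetitive text compresses well.
        "compression_ratio": len(compressed) / len(raw) if raw else 0.0,
    }


repetitive = heuristic_signals("spam " * 200)
varied = heuristic_signals("The quick brown fox jumps over the lazy dog.")
print(repetitive["compression_ratio"] < varied["compression_ratio"])
```

Signals of this kind are cheap enough to run on every document, which is why they typically sit ahead of any model-based tagger in a curation pipeline.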
Lightweight ML taggers — GlotLID language ID (1665 languages), paragraph-level language ID, FastText quality scoring, programming language detection, and Microsoft Presidio PII detection.
GPU taggers — FineWeb educational quality (0–5 score), domain classification (26 domains), multi-label toxicity, dense sentence embeddings, LLM-as-judge scoring, and perplexity.
See Reference: Taggers for the full list with output schemas.
Contributing
See CONTRIBUTING.md and docs/contributing/setup.md.
License
MIT
File details
Details for the file cureta-0.1.0.tar.gz.
File metadata
- Download URL: cureta-0.1.0.tar.gz
- Upload date:
- Size: 274.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16b5b9e4c71691f875a233b44c3fdf7202c6b246a185a3c1c540d9d3d2413e51 |
| MD5 | e8d29a4723529d9ea6b4abb7676a6508 |
| BLAKE2b-256 | bd961c80dc4b571d646e291e8495501c7709c808ef5038dae109d9a431b78de0 |
File details
Details for the file cureta-0.1.0-py3-none-any.whl.
File metadata
- Download URL: cureta-0.1.0-py3-none-any.whl
- Upload date:
- Size: 124.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3ededd018eba0f732a3ccc156cac38b3ce3fb6460efb959c907770e29d20360a |
| MD5 | 4489731579ece2d8aaea8eecbf14ae0b |
| BLAKE2b-256 | 06b381d24d70f0c7292ba6801905dfc59fb261bc4748ae08ee1d88b791bb2732 |