Cureta: Unified Data Curation Framework by Mira — a pipeline of Taggers that enrich a standard Document object.

cureta

Unified Data Curation Framework — enrich text datasets at scale with a composable pipeline of feature-extraction Taggers.

Install

pip install -e ".[ray,llm,dev]"   # full install (Ray + LLM taggers + dev tools)
pip install -e ".[ray,dev]"        # heuristic taggers only (no GPU required)

Requires: Python ≥ 3.10

Quick example

from cureta import quick_tag

tags = quick_tag(
    "यह एक नमूना दस्तावेज़ है।",  # Hindi for "This is a sample document."
    pipeline_config="pipelines/example/pipeline_config.yaml",
)
# {'cureta_id': {'cureta_id': 'a3f8d2c...'}, 'num_words': {'num_words': 5}}

Or from the command line:

# Create a small sample dataset
python -c "
import json, pathlib
rows = [{'text': f'This is sample document number {i}.'} for i in range(200)]
pathlib.Path('sample.jsonl').write_text('\n'.join(json.dumps(r) for r in rows))
print('Created sample.jsonl with 200 rows')
"

# Run the example pipeline
cureta run --pipeline example --dataset ./sample.jsonl --limit 100

For HuggingFace datasets see docs/how_to/load_huggingface_data.md
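
The commented output in the quick example suggests that `quick_tag` returns a mapping from tagger name to a dict of that tagger's fields. Assuming that shape holds (it is inferred only from the comment above, not from the cureta reference), a small helper can flatten it into a single-level record for logging or a DataFrame row. `flatten_tags` is a hypothetical helper and not part of cureta.

```python
from typing import Any


def flatten_tags(tags: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Flatten {tagger: {field: value}} into {"tagger.field": value}."""
    flat: dict[str, Any] = {}
    for tagger_name, fields in tags.items():
        for field_name, value in fields.items():
            flat[f"{tagger_name}.{field_name}"] = value
    return flat


# Hand-written sample mirroring the shape of the quick-example comment.
# The id value is a placeholder, not a real cureta_id.
example = {
    "cureta_id": {"cureta_id": "placeholder-id"},
    "num_words": {"num_words": 5},
}
flat = flatten_tags(example)
print(flat)
```

Prefixing each field with its tagger name keeps fields from different taggers apart even if two taggers happen to emit the same field name.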

Documentation


Tutorial: Your first pipeline	Get a working result in 5 minutes
Tutorial: Writing a custom tagger	Build and run your own tagger
Tutorial: Deploying on cloud	Run on GPU cluster with SkyPilot
Reference: CLI	`cureta` command reference
Reference: Python API	`run_pipeline`, `quick_tag`, `Pipeline`, ...
Reference: Taggers	All 45 built-in taggers with output schemas
Concepts: Execution model	SIMD vs SDMI, two-phase model, Ray Data
Writing taggers	Tier-1 and Tier-2 tagger authoring guide

Examples


01_quickstart	Local data, four heuristic taggers, Parquet output
02_custom_tagger	Write and run a custom tagger
03_huggingface	Load from HuggingFace, use text_column
04_programmatic_api	All four Python API surfaces
05_streaming	stream_tagged_batches → in-memory / Kafka

Key concepts

Document — canonical data envelope with raw_content, document_id, metadata, and tags
Tagger — atomic feature-extraction unit; subclass CPUTagger or GPUTagger, implement process_document (Tier 1) or run (Tier 2)
Pipeline — orchestrator that loads config, resolves dependencies, and dispatches taggers
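
To make the three concepts concrete, here is a minimal, self-contained sketch. It does not import cureta and is not its implementation: the `Document` fields and the `process_document` method name come from the list above, while the `Tagger` base class, the `name` and `depends_on` attributes, the `IsShort` tagger, and `run_taggers` are assumptions made for illustration. The real base classes are `CPUTagger` and `GPUTagger`, whose signatures may differ.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Any


@dataclass
class Document:
    """Stand-in for the Document envelope, using the four fields named above."""
    raw_content: str
    document_id: str
    metadata: dict[str, Any] = field(default_factory=dict)
    tags: dict[str, dict[str, Any]] = field(default_factory=dict)


class Tagger:
    """Hypothetical base class; `name` and `depends_on` are illustrative."""
    name: str = "tagger"
    depends_on: tuple[str, ...] = ()

    def process_document(self, doc: Document) -> dict[str, Any]:
        raise NotImplementedError


class NumWords(Tagger):
    name = "num_words"

    def process_document(self, doc: Document) -> dict[str, Any]:
        return {"num_words": len(doc.raw_content.split())}


class IsShort(Tagger):
    name = "is_short"
    depends_on = ("num_words",)  # reads the tag written by NumWords

    def process_document(self, doc: Document) -> dict[str, Any]:
        return {"is_short": doc.tags["num_words"]["num_words"] < 10}


def run_taggers(doc: Document, taggers: list[Tagger]) -> Document:
    """Order taggers so dependencies run first, then apply each in turn."""
    by_name = {t.name: t for t in taggers}
    graph = {t.name: set(t.depends_on) for t in taggers}
    for name in TopologicalSorter(graph).static_order():
        doc.tags[name] = by_name[name].process_document(doc)
    return doc


doc = run_taggers(
    Document(raw_content="This is a sample document.", document_id="doc-0"),
    [IsShort(), NumWords()],  # deliberately listed out of order
)
print(doc.tags)
```

Each tagger writes under its own key in `tags`, which matches the nested shape shown in the quick example, and the topological sort is one straightforward way to honour dependencies between taggers.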

45 built-in taggers ship with the library across three categories:

Heuristic taggers (CPU, no model weights) — word count, line/paragraph/word-length statistics, Gopher/C4/FineWeb quality filters, compression ratio, vocabulary diversity, MinHash/SimHash/exact dedup signals, PII regex, HTML density, Markdown structure, content type, URL parsing, domain blocklist, date extraction, and readability metrics.

Lightweight ML taggers — GlotLID language ID (1665 languages), paragraph-level language ID, FastText quality scoring, programming language detection, and Microsoft Presidio PII detection.

GPU taggers — FineWeb educational quality (0–5 score), domain classification (26 domains), multi-label toxicity, dense sentence embeddings, LLM-as-judge scoring, and perplexity.

See Reference: Taggers for the full list with output schemas.
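
As a rough illustration of what the heuristic category computes, the sketch below derives three of the listed signals (word count, compression ratio, vocabulary diversity) using only the standard library. The function name, the output keys, and the exact definitions are assumptions. Compression ratio here is raw bytes divided by zlib-compressed bytes, and vocabulary diversity is a simple type-token ratio. cureta's built-in taggers may define these differently.

```python
import zlib


def heuristic_signals(text: str) -> dict[str, float]:
    """Compute a few cheap, model-free quality signals for one document."""
    words = text.split()
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw)
    unique_words = {w.lower() for w in words}
    return {
        "num_words": len(words),
        # Higher ratio means more redundancy (for example, repeated boilerplate).
        "compression_ratio": len(raw) / len(compressed) if raw else 0.0,
        # Share of distinct words; values near 0 indicate heavy repetition.
        "type_token_ratio": len(unique_words) / len(words) if words else 0.0,
    }


repetitive = heuristic_signals("spam " * 200)
varied = heuristic_signals("The quick brown fox jumps over the lazy dog.")
print(repetitive)
print(varied)
```

Repetitive text compresses far better and has a much lower type-token ratio than ordinary prose, which is why signals like these are commonly used as inexpensive quality filters.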

Contributing

See CONTRIBUTING.md and docs/contributing/setup.md.

License

MIT

Project details

Release history

This version

0.1.0

Jun 10, 2026

Download files

Source Distribution

cureta-0.1.0.tar.gz (274.7 kB)

Uploaded Jun 10, 2026 Source

Built Distribution

cureta-0.1.0-py3-none-any.whl (124.3 kB)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file cureta-0.1.0.tar.gz.

File metadata

Download URL: cureta-0.1.0.tar.gz
Upload date: Jun 10, 2026
Size: 274.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for cureta-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`16b5b9e4c71691f875a233b44c3fdf7202c6b246a185a3c1c540d9d3d2413e51`
MD5	`e8d29a4723529d9ea6b4abb7676a6508`
BLAKE2b-256	`bd961c80dc4b571d646e291e8495501c7709c808ef5038dae109d9a431b78de0`

File details

Details for the file cureta-0.1.0-py3-none-any.whl.

File metadata

Download URL: cureta-0.1.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 124.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for cureta-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3ededd018eba0f732a3ccc156cac38b3ce3fb6460efb959c907770e29d20360a`
MD5	`4489731579ece2d8aaea8eecbf14ae0b`
BLAKE2b-256	`06b381d24d70f0c7292ba6801905dfc59fb261bc4748ae08ee1d88b791bb2732`
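
The SHA256 digests listed for the two files can be used to check a download before installing it. The following is a minimal sketch using Python's standard `hashlib`. The helper names `sha256_of` and `matches_published` are made up for this example, and the file path is whatever location you saved the archive to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published(Path("cureta-0.1.0.tar.gz"), "<SHA256 value from the table>")` should return `True` for an intact download of the source distribution.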
