Cureta: Unified Data Curation Framework by Mira — a pipeline of Taggers that enrich a standard Document object.

cureta

Unified Data Curation Framework — enrich text datasets at scale with a composable pipeline of feature-extraction Taggers.


Install

pip install -e ".[ray,llm,dev]"   # full install (Ray + LLM taggers + dev tools)
pip install -e ".[ray,dev]"        # heuristic taggers only (no GPU required)

Requires: Python ≥ 3.10


Quick example

from cureta import quick_tag

tags = quick_tag(
    "यह एक नमूना दस्तावेज़ है।",  # Hindi for "This is a sample document."
    pipeline_config="pipelines/example/pipeline_config.yaml",
)
# {'cureta_id': {'cureta_id': 'a3f8d2c...'}, 'num_words': {'num_words': 5}}

Or from the command line:

# Create a small sample dataset
python -c "
import json, pathlib
rows = [{'text': f'This is sample document number {i}.'} for i in range(200)]
pathlib.Path('sample.jsonl').write_text('\n'.join(json.dumps(r) for r in rows))
print('Created sample.jsonl with 200 rows')
"

# Run the example pipeline
cureta run --pipeline example --dataset ./sample.jsonl --limit 100

For HuggingFace datasets, see docs/how_to/load_huggingface_data.md.


Documentation

Tutorial: Your first pipeline       Get a working result in 5 minutes
Tutorial: Writing a custom tagger   Build and run your own tagger
Tutorial: Deploying on cloud        Run on GPU cluster with SkyPilot
Reference: CLI                      cureta command reference
Reference: Python API               run_pipeline, quick_tag, Pipeline, ...
Reference: Taggers                  All 45 built-in taggers with output schemas
Concepts: Execution model           SIMD vs SDMI, two-phase model, Ray Data
Writing taggers                     Tier-1 and Tier-2 tagger authoring guide

Examples

01_quickstart         Local data, four heuristic taggers, Parquet output
02_custom_tagger      Write and run a custom tagger
03_huggingface        Load from HuggingFace, use text_column
04_programmatic_api   All four Python API surfaces
05_streaming          stream_tagged_batches → in-memory / Kafka

Key concepts

  • Document — canonical data envelope with raw_content, document_id, metadata, and tags
  • Tagger — atomic feature-extraction unit; subclass CPUTagger or GPUTagger, implement process_document (Tier 1) or run (Tier 2)
  • Pipeline — orchestrator that loads config, resolves dependencies, and dispatches taggers
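
The three concepts above can be illustrated with a minimal, self-contained sketch. It does not import cureta and is not the library's implementation: SketchDocument, SketchTagger, depends_on, and run_sketch_pipeline are hypothetical names chosen for illustration. Only the field names (raw_content, document_id, metadata, tags) and the method name process_document come from the descriptions above. Dependency ordering is shown with the standard library's graphlib, which is an assumption about the general approach and not a statement of how Pipeline resolves dependencies.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SketchDocument:
    """Hypothetical stand-in for the Document envelope."""
    raw_content: str
    document_id: str
    metadata: dict = field(default_factory=dict)
    tags: dict = field(default_factory=dict)


class SketchTagger:
    """Hypothetical base class: one named feature extractor."""
    name = "base"
    depends_on: tuple = ()  # names of taggers that must run first (assumed)

    def process_document(self, doc: SketchDocument) -> dict:
        raise NotImplementedError


class NumWords(SketchTagger):
    name = "num_words"

    def process_document(self, doc):
        # Whitespace-delimited word count.
        return {"num_words": len(doc.raw_content.split())}


class IsShort(SketchTagger):
    name = "is_short"
    depends_on = ("num_words",)

    def process_document(self, doc):
        # Reads the output of an earlier tagger from doc.tags.
        return {"is_short": doc.tags["num_words"]["num_words"] < 10}


def run_sketch_pipeline(doc, taggers):
    """Order taggers so dependencies run first, then apply each one."""
    by_name = {t.name: t for t in taggers}
    graph = {t.name: set(t.depends_on) for t in taggers}
    for name in TopologicalSorter(graph).static_order():
        doc.tags[name] = by_name[name].process_document(doc)
    return doc


doc = run_sketch_pipeline(
    SketchDocument(raw_content="This is a sample document.", document_id="doc-0"),
    [IsShort(), NumWords()],  # deliberately listed out of order
)
print(doc.tags)
```

Each tagger writes its result under its own name in tags, mirroring the nested shape shown in the quick example output; the is_short tagger exists only to demonstrate a dependency.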

45 built-in taggers ship with the library across three categories:

Heuristic taggers (CPU, no model weights) — word count, line/paragraph/word-length statistics, Gopher/C4/FineWeb quality filters, compression ratio, vocabulary diversity, MinHash/SimHash/exact dedup signals, PII regex, HTML density, Markdown structure, content type, URL parsing, domain blocklist, date extraction, and readability metrics.
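
To give a sense of what model-free signals of this kind look like, here is a small sketch of three of them in plain Python. The function names and the exact definitions (zlib-based compression ratio, type-token ratio for vocabulary diversity, whitespace tokenization) are assumptions made for illustration; the built-in taggers may define and name these metrics differently.

```python
import zlib


def word_count(text: str) -> int:
    """Number of whitespace-delimited tokens."""
    return len(text.split())


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size.

    Higher values indicate more repetitive text.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def vocabulary_diversity(text: str) -> float:
    """Type-token ratio: unique lowercased words over total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


repetitive = "buy now " * 200
varied = "The quick brown fox jumps over the lazy dog near a quiet river."

for label, text in [("repetitive", repetitive), ("varied", varied)]:
    print(
        label,
        word_count(text),
        round(compression_ratio(text), 2),
        round(vocabulary_diversity(text), 3),
    )
```

Repetitive text scores high on compression ratio and low on vocabulary diversity, which is why signals like these are commonly used as cheap quality filters before any model is involved.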

Lightweight ML taggers — GlotLID language ID (1665 languages), paragraph-level language ID, FastText quality scoring, programming language detection, and Microsoft Presidio PII detection.

GPU taggers — FineWeb educational quality (0–5 score), domain classification (26 domains), multi-label toxicity, dense sentence embeddings, LLM-as-judge scoring, and perplexity.

See Reference: Taggers for the full list with output schemas.


Contributing

See CONTRIBUTING.md and docs/contributing/setup.md.


License

MIT
