catchfly

Catch the structured data.

These details have not been verified by PyPI

Project links

Project description

catchfly

Catch the structured data.

Catchfly automates schema discovery → structured extraction → normalization from unstructured text at scale. Interchangeable strategies at each stage let you go from raw documents to clean, normalized data with minimal effort.

Quick Start

pip install catchfly[openai,clustering]

from catchfly import Pipeline
from catchfly.demo import load_samples

docs = load_samples("product_reviews")

pipeline = Pipeline.quick(model="gpt-5.4-mini")
results = pipeline.run(
    documents=docs,
    domain_hint="Electronics product reviews",
    normalize_fields=["pros"],
)

results.to_dataframe()  # → pandas DataFrame

Strategies at a Glance

Stage	Strategy	Description
Discovery	`SinglePassDiscovery`	One LLM call → JSON Schema from sample docs
	`ThreeStageDiscovery`	3-stage progressive refinement (initial → refine → expand)
	`SchemaOptimizer`	PARSE-style iterative field enrichment (descriptions, examples, synonyms)
Extraction	`LLMDirectExtraction`	Per-document extraction with tool calling, retries, chunking
Normalization	`CascadeNormalization`	Chain strategies with confidence-based routing + self-learning
	`OntologyMapping`	Embed → NN search → LLM rerank against HPO/custom ontologies
	`LLMCanonicalization`	LLM groups synonyms, map-reduce for large sets (>200 values)
	`EmbeddingClustering`	Embed → HDBSCAN/agglomerative → canonical selection
Infrastructure	`SchemaRegistry`	Version, diff, and persist schemas across runs

Biomedical Normalization

Map clinical terms to ontology entries using local SapBERT embeddings (zero API cost) with optional LLM reranking:

from catchfly.normalization import CascadeNormalization, OntologyMapping
from catchfly.providers import SentenceTransformerEmbeddingClient

# SapBERT embeddings — 0.802 Acc@1 on BC5CDR, beats OpenAI embeddings
embed_client = SentenceTransformerEmbeddingClient()  # default: SapBERT

normalizer = OntologyMapping(
    ontology="hpo",
    embedding_client=embed_client,
    augment_queries=True,  # LLM generates alternative phrasings (+10-20pp recall)
)
result = await normalizer.anormalize(
    ["seizures", "high temperature", "low muscle tone"],
    context_field="phenotype",
)
# result.mapping: {"seizures": "Seizure", "high temperature": "Fever", ...}

# Self-learning cascade — learns from results, cheaper on re-runs
cascade = CascadeNormalization.default(
    dictionary={"ALT": "Alanine aminotransferase"},
    ontology="hpo",
    use_confidence=True,  # confidence-based routing between steps
)
result = await cascade.anormalize(values, context_field="phenotype")
cascade.learn(result)  # next run resolves known mappings instantly ($0)

Requires: pip install catchfly[embeddings,medical]

Local Models (Ollama)

pipeline = Pipeline.quick(
    model="qwen3.5",
    base_url="http://localhost:11434/v1",
)

Works with any OpenAI-compatible endpoint: Ollama, vLLM, LMStudio, llama.cpp.

Modular Usage

Each stage works independently — use one, two, or all three:

# Discovery
from catchfly.discovery.single_pass import SinglePassDiscovery
schema = SinglePassDiscovery(model="gpt-5.4-mini").discover(docs, domain_hint="...")

# Extraction (bring your own schema)
from catchfly.extraction.llm_direct import LLMDirectExtraction
records = LLMDirectExtraction(model="gpt-5.4-mini").extract(schema=MyModel, documents=docs)

# Normalization (bring your own data)
from catchfly.normalization.embedding_cluster import EmbeddingClustering
mapping = EmbeddingClustering().normalize(values=["NYC", "New York", "NY"], context_field="city")

Schema Optimizer (PARSE-style)

Iteratively enrich field descriptions for better extraction and normalization:

from catchfly.discovery.optimizer import SchemaOptimizer

optimizer = SchemaOptimizer(model="gpt-5.4-mini", num_iterations=3)
enriched = optimizer.optimize(schema=MyModel, test_documents=docs[:10])
# enriched.field_metadata has descriptions, examples, synonyms per field

kLLMmeans with Schema-Seeded Warmstart

The core novel contribution — bridge schema optimization and normalization:

from catchfly.normalization.kllmeans import KLLMeansClustering

normalizer = KLLMeansClustering(
    num_clusters=5,
    seed_from_schema=True,         # use enriched field descriptions as initial centroids
    summarize_every=3,             # LLM generates textual centroids every 3 iterations
)
result = normalizer.normalize(
    values=messy_values,
    context_field="medication",
    field_metadata=enriched.field_metadata["medication"],
)

Production Features

# Cost control
results = pipeline.run(documents=docs, max_cost_usd=20.0)

# Checkpoint/resume (for 1000+ documents)
results = pipeline.run(documents=large_corpus, checkpoint_dir="./state/")

# Error handling
extractor = LLMDirectExtraction(model="gpt-5.4-mini", on_error="collect")
results = extractor.extract(schema=MyModel, documents=docs)
print(results.errors)  # failed documents collected, not raised

# Export
results.to_dataframe()
results.to_csv("output.csv")
results.to_parquet("output.parquet")

Async Support

All strategies are async-first with sync wrappers (Jupyter-safe):

# Async
results = await pipeline.arun(documents=docs)

# Sync (auto-detects running event loop in notebooks)
results = pipeline.run(documents=docs)

Installation

pip install catchfly                  # Core only (~5 MB)
pip install catchfly[openai]          # + OpenAI SDK
pip install catchfly[embeddings]      # + sentence-transformers (SapBERT, local)
pip install catchfly[clustering]      # + scikit-learn, numpy, umap
pip install catchfly[export]          # + pandas, pyarrow
pip install catchfly[medical]         # + ontology loaders (HPO)
pip install catchfly[all]             # Everything

Or with uv:

uv add catchfly[openai,clustering,export]

Architecture

catchfly
├── discovery/
│   ├── SinglePassDiscovery        # 1-shot schema from samples
│   ├── ThreeStageDiscovery        # Progressive 3-stage refinement
│   └── SchemaOptimizer            # PARSE-style field enrichment
├── extraction/
│   └── LLMDirectExtraction        # Tool calling + retry + chunking
├── normalization/
│   ├── CascadeNormalization       # Chain strategies, confidence routing, learn()
│   ├── OntologyMapping            # Embed → NN → LLM rerank (RAG augmentation)
│   ├── LLMCanonicalization        # LLM synonym grouping (map-reduce)
│   ├── LearnedDictionaryCache     # Persist mappings for reuse across runs
│   └── EmbeddingClustering        # Embed → cluster → canonicalize
├── providers/
│   ├── OpenAICompatibleClient     # Any OpenAI-compatible LLM endpoint
│   ├── OpenAIEmbeddingClient      # API embeddings with caching
│   └── SentenceTransformerEmbeddingClient  # Local embeddings (SapBERT)
├── schema/
│   ├── SchemaRegistry             # Version + diff + persist
│   └── converters                 # JSON Schema ↔ Pydantic roundtrip
└── Pipeline                       # Orchestrator: quick(), run(), arun()

Requirements

Python 3.10+
An OpenAI-compatible LLM endpoint (OpenAI, Anthropic, Mistral, Ollama, vLLM)

License

Apache 2.0 — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.1.4

Apr 7, 2026

1.1.3

Apr 6, 2026

This version

1.1.2

Apr 5, 2026

1.1.1

Mar 26, 2026

1.0.3

Mar 25, 2026

1.0.2

Mar 25, 2026

1.0.1

Mar 25, 2026

1.0.0

Mar 24, 2026

0.8.1

Mar 24, 2026

0.8.0

Mar 23, 2026

0.5.0

Mar 23, 2026

0.3.0

Mar 23, 2026

0.1.0

Mar 23, 2026

0.0.1

Mar 19, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

catchfly-1.1.2.tar.gz (633.4 kB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

catchfly-1.1.2-py3-none-any.whl (100.7 kB view details)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file catchfly-1.1.2.tar.gz.

File metadata

Download URL: catchfly-1.1.2.tar.gz
Upload date: Apr 5, 2026
Size: 633.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.19

File hashes

Hashes for catchfly-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`852cb66e10988d7b81b758633f4467762279c52497f948ff182d71efb3a5555c`
MD5	`e68162f903fab3ec23319c24a8ef15c9`
BLAKE2b-256	`9ed639001263b9323fbd46d775f67dc874fe3a3cb0b950e5172b5cfdf7ea4722`

See more details on using hashes here.

File details

Details for the file catchfly-1.1.2-py3-none-any.whl.

File metadata

Download URL: catchfly-1.1.2-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 100.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.19

File hashes

Hashes for catchfly-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b2a7824b7609bf862e3a548dec1c2d9661c0bb93279b4580fa3df50c1ff8a30`
MD5	`d20a068728dff036727f9259c4b2b389`
BLAKE2b-256	`fc20aa1db3d2eaf711d4cad28cf5554c1cf82e6fc2a78b4ba3e86b7aa7134bbc`

See more details on using hashes here.

catchfly 1.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Quick Start

Strategies at a Glance

Biomedical Normalization

Local Models (Ollama)

Modular Usage

Schema Optimizer (PARSE-style)

kLLMmeans with Schema-Seeded Warmstart

Production Features

Async Support

Installation

Architecture

Requirements

License

Links

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes