Catch the structured data.
Documentation • GitHub • PyPI
Catchfly automates schema discovery → structured extraction → normalization from unstructured text at scale. Interchangeable strategies at each stage let you go from raw documents to clean, normalized data with minimal effort.
## Quick Start

```shell
pip install catchfly[openai,clustering]
```

```python
from catchfly import Pipeline
from catchfly.demo import load_samples

docs = load_samples("product_reviews")
pipeline = Pipeline.quick(model="gpt-5.4-mini")
results = pipeline.run(
    documents=docs,
    domain_hint="Electronics product reviews",
    normalize_fields=["pros"],
)
results.to_dataframe()  # → pandas DataFrame
```
## Strategies at a Glance

| Stage | Strategy | Description |
|---|---|---|
| Discovery | `SinglePassDiscovery` | One LLM call → JSON Schema from sample docs |
| | `ThreeStageDiscovery` | 3-stage progressive refinement (initial → refine → expand) |
| | `SchemaOptimizer` | PARSE-style iterative field enrichment (descriptions, examples, synonyms) |
| Extraction | `LLMDirectExtraction` | Per-document extraction with tool calling, retries, chunking |
| Normalization | `EmbeddingClustering` | Embed → HDBSCAN/agglomerative → canonical selection |
| | `LLMCanonicalization` | LLM groups synonyms, map-reduce for large sets (>200 values) |
| | `KLLMeansClustering` | k-means + LLM textual centroids, schema-seeded warmstart |
| Infrastructure | `SchemaRegistry` | Version, diff, and persist schemas across runs |
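The map-reduce behavior of `LLMCanonicalization` on large value sets can be pictured as batching plus a merge step: group synonyms within each batch, then union the per-batch groupings. A minimal pure-Python sketch, with a toy lowercasing function standing in for the per-batch LLM call (none of these helper names are from catchfly):

```python
from itertools import islice

def batched(values, size):
    """Split a list of values into fixed-size batches (the 'map' inputs)."""
    it = iter(values)
    while batch := list(islice(it, size)):
        yield batch

def merge_groupings(groupings):
    """'Reduce' step: union per-batch synonym groups keyed by canonical form."""
    merged = {}
    for grouping in groupings:
        for canonical, synonyms in grouping.items():
            merged.setdefault(canonical, set()).update(synonyms)
    return merged

def toy_group(batch):
    """Stand-in for the per-batch LLM call: group by lowercase form."""
    groups = {}
    for v in batch:
        groups.setdefault(v.lower(), set()).add(v)
    return groups

values = ["NYC", "nyc", "New York", "new york", "NY"]
per_batch = [toy_group(b) for b in batched(values, 2)]  # map
mapping = merge_groupings(per_batch)                    # reduce
```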
## Local Models (Ollama)

```python
pipeline = Pipeline.quick(
    model="qwen3.5",
    base_url="http://localhost:11434/v1",
)
```

Works with any OpenAI-compatible endpoint: Ollama, vLLM, LM Studio, llama.cpp.
## Modular Usage

Each stage works independently — use one, two, or all three:

```python
# Discovery
from catchfly.discovery.single_pass import SinglePassDiscovery
schema = SinglePassDiscovery(model="gpt-5.4-mini").discover(docs, domain_hint="...")

# Extraction (bring your own schema)
from catchfly.extraction.llm_direct import LLMDirectExtraction
records = LLMDirectExtraction(model="gpt-5.4-mini").extract(schema=MyModel, documents=docs)

# Normalization (bring your own data)
from catchfly.normalization.embedding_cluster import EmbeddingClustering
mapping = EmbeddingClustering().normalize(values=["NYC", "New York", "NY"], context_field="city")
```
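A normalization mapping is typically applied back to extracted records. A sketch assuming the mapping is a plain raw-value to canonical-value dict (the actual return shape of catchfly's normalizers may differ):

```python
def apply_mapping(records, field, mapping):
    """Rewrite one field in each record using a raw -> canonical mapping.
    Values missing from the mapping are kept as-is."""
    return [
        {**record, field: mapping.get(record[field], record[field])}
        for record in records
    ]

# Hypothetical mapping shape for the "city" example above.
city_mapping = {"NYC": "New York", "NY": "New York"}
records = [{"city": "NYC"}, {"city": "New York"}, {"city": "Boston"}]
normalized = apply_mapping(records, "city", city_mapping)
```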
## Schema Optimizer (PARSE-style)

Iteratively enrich field descriptions for better extraction and normalization:

```python
from catchfly.discovery.optimizer import SchemaOptimizer

optimizer = SchemaOptimizer(model="gpt-5.4-mini", num_iterations=3)
enriched = optimizer.optimize(schema=MyModel, test_documents=docs[:10])
# enriched.field_metadata has descriptions, examples, synonyms per field
```
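As an illustration of what per-field metadata carrying descriptions, examples, and synonyms might look like, here is a hypothetical shape (the real catchfly class may differ):

```python
from dataclasses import dataclass, field

@dataclass
class FieldMetadata:
    """Illustrative shape only, not catchfly's actual class."""
    description: str
    examples: list[str] = field(default_factory=list)
    synonyms: list[str] = field(default_factory=list)

meta = FieldMetadata(
    description="Drug name as prescribed, brand or generic.",
    examples=["metformin", "Lipitor"],
    synonyms=["drug", "prescription", "med"],
)
```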
## kLLMmeans with Schema-Seeded Warmstart

The core novel contribution: bridging schema optimization and normalization.

```python
from catchfly.normalization.kllmeans import KLLMeansClustering

normalizer = KLLMeansClustering(
    num_clusters=5,
    seed_from_schema=True,  # use enriched field descriptions as initial centroids
    summarize_every=3,      # LLM generates textual centroids every 3 iterations
)
result = normalizer.normalize(
    values=messy_values,
    context_field="medication",
    field_metadata=enriched.field_metadata["medication"],
)
```
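The warmstart idea is that instead of random initial centroids, k-means starts from schema-derived points. A toy 1-D k-means in pure Python, where the numeric seeds stand in for embedded field descriptions and the mean-update stands in for the LLM's textual centroid summarization (a sketch of the idea, not catchfly's implementation):

```python
def kmeans_1d(points, centroids, iters=10):
    """Plain k-means on numbers; `centroids` is the warmstart seed."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: centroids move to cluster means (kLLMmeans would instead
        # ask the LLM to summarize each cluster into a textual centroid).
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# "Schema-derived" starting centroids near the two true value groups.
points = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
seeds = [1.0, 10.0]
centroids, clusters = kmeans_1d(points, seeds)
```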
## Production Features

```python
# Cost control
results = pipeline.run(documents=docs, max_cost_usd=20.0)

# Checkpoint/resume (for 1000+ documents)
results = pipeline.run(documents=large_corpus, checkpoint_dir="./state/")

# Error handling
extractor = LLMDirectExtraction(model="gpt-5.4-mini", on_error="collect")
results = extractor.extract(schema=MyModel, documents=docs)
print(results.errors)  # failed documents collected, not raised

# Export
results.to_dataframe()
results.to_csv("output.csv")
results.to_parquet("output.parquet")
```
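Checkpoint/resume can be sketched as an id ledger: record each processed document's id, and skip recorded ids on the next run. A minimal pattern in pure Python (not catchfly's actual implementation; the helper name is invented for illustration):

```python
import json
import os
import tempfile

def run_with_checkpoint(documents, process, checkpoint_path):
    """Process documents one by one, skipping ids already recorded in the
    checkpoint file, so an interrupted run resumes where it left off."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = {json.loads(line)["id"] for line in f}
    results = []
    with open(checkpoint_path, "a") as f:
        for doc in documents:
            if doc["id"] in done:
                continue
            results.append(process(doc))
            # Append the id only after a successful process() call.
            f.write(json.dumps({"id": doc["id"]}) + "\n")
    return results

path = os.path.join(tempfile.mkdtemp(), "checkpoint.jsonl")
docs = [{"id": 1}, {"id": 2}, {"id": 3}]
first = run_with_checkpoint(docs[:2], lambda d: d["id"] * 10, path)   # interrupted run
second = run_with_checkpoint(docs, lambda d: d["id"] * 10, path)      # resume: only id 3
```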
## Async Support

All strategies are async-first with sync wrappers (Jupyter-safe):

```python
# Async
results = await pipeline.arun(documents=docs)

# Sync (auto-detects running event loop in notebooks)
results = pipeline.run(documents=docs)
```
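The "auto-detects running event loop" behavior follows a common pattern: try `asyncio.get_running_loop()`, and if a loop is already running (as in Jupyter), execute the coroutine somewhere else instead of calling `asyncio.run` directly. One way to sketch it, not necessarily how catchfly implements it:

```python
import asyncio
import concurrent.futures

def run_sync(coro):
    """Run a coroutine from sync code; falls back to a fresh thread when an
    event loop is already running (as in Jupyter), rather than raising."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running: the plain script case.
        return asyncio.run(coro)
    # A loop is running: run the coroutine on its own loop in a worker thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def add(a, b):
    return a + b

result = run_sync(add(2, 3))
```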
## Installation

```shell
pip install catchfly              # Core only (~5 MB)
pip install catchfly[openai]      # + OpenAI SDK
pip install catchfly[clustering]  # + scikit-learn, numpy, umap
pip install catchfly[export]      # + pandas, pyarrow
pip install catchfly[medical]     # + ontology loaders (HPO)
pip install catchfly[all]         # Everything
```

Or with uv:

```shell
uv add catchfly[openai,clustering,export]
```
## Architecture

```
catchfly
├── discovery/
│   ├── SinglePassDiscovery      # 1-shot schema from samples
│   ├── ThreeStageDiscovery      # Progressive 3-stage refinement
│   └── SchemaOptimizer          # PARSE-style field enrichment
├── extraction/
│   └── LLMDirectExtraction      # Tool calling + retry + chunking
├── normalization/
│   ├── EmbeddingClustering      # Embed → cluster → canonicalize
│   ├── LLMCanonicalization      # LLM synonym grouping (map-reduce)
│   └── KLLMeansClustering       # k-means + LLM centroids + schema seed
├── providers/
│   ├── OpenAICompatibleClient   # Any OpenAI-compatible LLM endpoint
│   └── OpenAIEmbeddingClient    # Embeddings with caching
├── schema/
│   ├── SchemaRegistry           # Version + diff + persist
│   └── converters               # JSON Schema ↔ Pydantic roundtrip
└── Pipeline                     # Orchestrator: quick(), run(), arun()
```
## Requirements

- Python 3.10+
- An OpenAI-compatible LLM endpoint (OpenAI, Anthropic, Mistral, Ollama, vLLM)

## License

Apache 2.0 — see LICENSE.