From chaos to categories. Schema discovery + term normalization for unstructured text data.
catchfly
Named after Silene (catchfly) — plants that secrete a sticky substance to capture insects. Catchfly captures scattered terms and groups them into canonical categories.
Part of the Silene Systems ecosystem for rare disease research.
The problem
You extracted 3,000 mentions from 200 scientific papers. Or an LLM classified 3 million emails into 4,000 labels. Now you have:
- No schema — you don't know what categories should exist
- Duplicates everywhere — "miglustat" and "Zavesca" are the same drug
- Ambiguous boundaries — is "cognitive decline" the same as "cognitive impairment"?
- Related but distinct entities — "ALT" and "AST" are similar but clinically different
To our knowledge, no existing tool offers both schema discovery and term normalization as a composable Python library.
What catchfly does
Two operations, one pipeline:
Discover — "What categories exist in my data?"
```python
from catchfly import Resolver

resolver = Resolver(embed_provider="gemini", llm_provider="openai/gpt-5.4")

schema = resolver.discover(
    mentions=["miglustat", "splenomegaly", "ALT: 120", "ataxia", ...],
    contexts={"miglustat": ["Patient received miglustat 200mg daily"], ...},
)
# schema.categories → ["Treatments", "Symptoms", "Lab Values", ...]
# schema.examples → {"Treatments": ["miglustat", "arimoclomol"], ...}

schema.rename("Lab Values", "Laboratory Findings")
schema.merge("Symptoms", "Clinical Signs")
```
Normalize — "Which terms belong where, and which are synonyms?"
```python
result = resolver.normalize(
    mentions=all_mentions,
    schema=schema,
    contexts=contexts,
)
# result.groups → [NormGroup(canonical="miglustat", members=["Zavesca", "NB-DNJ"])]
# result.ambiguous → [AmbiguousPair("cognitive decline", "cognitive impairment")]
```
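One simple way to turn pairwise synonym decisions into groups like the `result.groups` shown above is to treat each synonym pair as an edge and take connected components. The sketch below does this with a small union-find. It is not catchfly's implementation: the function name `group_synonyms` and the choice of the first-seen member as the canonical name are assumptions made for illustration.

```python
from collections import defaultdict


def group_synonyms(mentions, synonym_pairs):
    """Merge mentions linked by synonym pairs into groups (connected components).

    Every term in synonym_pairs must also appear in mentions.
    """
    parent = {m: m for m in mentions}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in synonym_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    members = defaultdict(list)
    for m in mentions:
        members[find(m)].append(m)

    # Assumption: the first-seen member of each group is used as its canonical name.
    return {group[0]: group[1:] for group in members.values()}


groups = group_synonyms(
    ["miglustat", "Zavesca", "NB-DNJ", "ALT", "AST"],
    [("miglustat", "Zavesca"), ("Zavesca", "NB-DNJ")],
)
```

Because no synonym pair links "ALT" and "AST", they stay in separate groups, which is the behaviour wanted for related but distinct entities.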
Or all-in-one:
```python
result = resolver.resolve(mentions, contexts=contexts)
```
Self-improving prompts
Give catchfly 20–50 labeled examples and it optimizes its own prompts for your domain. Inspired by GEPA, implemented from scratch with zero external dependencies.
```python
optimized = resolver.optimize(
    mentions=mentions,
    contexts=contexts,
    ground_truth={"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"},
    iterations=20,  # ~$1–2, ~15 min
)
result = optimized.resolve(new_mentions)  # 20–30% better accuracy on your domain
```
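To make the idea of optimizing prompts against labeled examples concrete, here is a minimal sketch of a greedy search loop. It is not GEPA and not catchfly's optimizer: it only keeps a proposed prompt when it scores higher on the labeled set. The names `optimize_prompt`, `fake_run`, and `fake_propose` are hypothetical, and the two `fake_*` functions are deterministic stand-ins for real LLM calls.

```python
def accuracy(predict, ground_truth):
    """Fraction of labeled mentions for which predict(mention) equals the gold label."""
    hits = sum(predict(m) == gold for m, gold in ground_truth.items())
    return hits / len(ground_truth)


def optimize_prompt(seed_prompt, propose, run, ground_truth, iterations=20):
    """Greedy search: adopt a proposed prompt only if it beats the current best score."""
    best = seed_prompt
    best_score = accuracy(lambda m: run(best, m), ground_truth)
    for _ in range(iterations):
        candidate = propose(best, best_score)
        score = accuracy(lambda m: run(candidate, m), ground_truth)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical stand-ins for LLM calls, for demonstration only.
BRAND_NAMES = {"Zavesca": "miglustat"}


def fake_run(prompt, mention):
    # Pretend the model resolves brand names only when the prompt asks for it.
    if "brand names" in prompt:
        return BRAND_NAMES.get(mention, mention)
    return mention


def fake_propose(prompt, score):
    # A real proposer would ask an LLM to revise the prompt; this one appends a fixed hint.
    return prompt + " Map brand names to generic drug names."


ground_truth = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
best_prompt, best_score = optimize_prompt(
    "Normalize each term.", fake_propose, fake_run, ground_truth, iterations=3
)
```

In practice the score should be measured on examples held out from the ones used to propose changes, otherwise the prompt can overfit a small labeled set.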
How it works
Four components, composable as a pipeline:
| Component | What it does | Cost |
|---|---|---|
| EmbeddingSimilarity | Pre-filter: finds candidate pairs via cosine similarity | Embedding API only |
| LLMGrouper | Core engine: LLM analyzes clusters, proposes categories/synonyms with 4-way relation typing (synonym / hierarchy / related / distinct) | LLM API |
| LLMClassifier | Propagation: assigns remaining mentions to approved categories via few-shot classification | LLM API |
| UserSeeded | User guidance: seed mappings override and anchor discovery + normalization | Embedding only |
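The pre-filter idea in the first row can be sketched in a few lines: compute cosine similarity for every pair of term embeddings and keep only the pairs above a threshold, so that the more expensive LLM step sees fewer candidates. This is an illustration under stated assumptions, not catchfly's code: the function names, the 0.8 threshold, and the toy 3-dimensional vectors are all made up.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings, threshold=0.8):
    """Return (term_a, term_b, similarity) for each pair at or above the threshold."""
    pairs = []
    for (a, vec_a), (b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((a, b, round(sim, 3)))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])


# Toy vectors, invented for illustration; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "miglustat": [0.9, 0.1, 0.0],
    "Zavesca": [0.85, 0.15, 0.05],
    "ataxia": [0.0, 0.2, 0.95],
}
pairs = candidate_pairs(toy_embeddings)
```

Comparing all pairs is quadratic in the number of terms, so for large vocabularies an approximate nearest-neighbour index would be the usual replacement for the inner loop.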
Tiered quality
| Tier | What runs | Cost / 1K mentions | Accuracy | Best for |
|---|---|---|---|---|
| Tier 1 (free) | Embedding only | $0 (local models) | 60–70% | Exploration |
| Tier 2 (standard) | Embedding + LLM | $0.15–0.50 | 82–90% | Production |
| Tier 3 (optimized) | Tier 2 + optimize() | $1–3 one-time | 90–95%+ | Systematic reviews |
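As a rough worked example of the per-1K figures in the table, the sketch below scales the Tier 1 and Tier 2 ranges to a given number of mentions. It assumes cost grows linearly with mention count, which the table does not state. Tier 3 is left out because its cost is listed as one-time rather than per mention. The names `TIER_COST_PER_1K` and `estimate_cost` are hypothetical.

```python
# Per-1K-mention cost ranges in USD, copied from the tier table above.
TIER_COST_PER_1K = {
    "tier1": (0.0, 0.0),
    "tier2": (0.15, 0.50),
}


def estimate_cost(n_mentions, tier):
    """Return a (low, high) USD estimate, assuming cost scales linearly with mentions."""
    low, high = TIER_COST_PER_1K[tier]
    scale = n_mentions / 1000
    return round(low * scale, 2), round(high * scale, 2)


low, high = estimate_cost(3000, "tier2")
print(f"Tier 2, 3,000 mentions: ${low:.2f} to ${high:.2f}")
# prints "Tier 2, 3,000 mentions: $0.45 to $1.50"
```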
Use cases
- Medical literature — normalize symptoms, drugs, genetic variants from systematic reviews
- Email categorization — collapse 4,000 LLM-generated labels into 150 clean groups
- E-commerce tags — build taxonomy from 50,000 user-generated product tags
- Any domain — if you have messy text labels, catchfly cleans them up
Provider support
```python
# Embeddings
Resolver(embed_provider="gemini")       # Google Gemini Embedding 2
Resolver(embed_provider="openai")       # OpenAI text-embedding-3
Resolver(embed_provider="local")        # sentence-transformers (offline, free)
Resolver(embed_provider=my_function)    # any callable

# LLM
Resolver(llm_provider="openai/gpt-5.4")
Resolver(llm_provider="gemini/flash")
Resolver(llm_provider=my_function)      # any callable
```
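The examples above accept any callable as a provider, but the expected signature is not spelled out here. The sketch below assumes an embedding callable takes a list of strings and returns one vector per string. That signature is a guess, and `trigram_embedder` is a hypothetical toy that hashes character trigrams into a fixed-size vector. It is useful for offline tests and is not a substitute for a real embedding model.

```python
import math
import zlib


def trigram_embedder(texts, dim=64):
    """Map each text to a unit-length vector of hashed character-trigram counts."""
    vectors = []
    for text in texts:
        padded = f"  {text.lower()} "
        vec = [0.0] * dim
        for i in range(len(padded) - 2):
            # crc32 is stable across runs, unlike Python's built-in hash() for strings.
            bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


vectors = trigram_embedder(["cognitive decline", "cognitive impairment"])
```

Under that assumed signature, such a function would be passed as `Resolver(embed_provider=trigram_embedder)`.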
Evaluation
```python
ground_truth = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
metrics = resolver.evaluate(result, ground_truth)
# → precision, recall, false_merges, missed_merges, category_accuracy
```
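One common way to define merge-based metrics like these is to compare every pair of terms: a pair merged in both the prediction and the gold mapping is a true positive, a pair merged only in the prediction is a false merge, and a pair merged only in the gold mapping is a missed merge. The sketch below implements that pairwise definition. Whether catchfly defines its metrics this way is an assumption, `pairwise_metrics` is a hypothetical name, and `category_accuracy` is not covered.

```python
from itertools import combinations


def pairwise_metrics(predicted, ground_truth):
    """Compare predicted and gold canonical labels over all pairs of shared terms."""
    terms = sorted(set(predicted) & set(ground_truth))
    true_merges = false_merges = missed_merges = 0
    for a, b in combinations(terms, 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = ground_truth[a] == ground_truth[b]
        if same_pred and same_gold:
            true_merges += 1
        elif same_pred:
            false_merges += 1
        elif same_gold:
            missed_merges += 1
    predicted_total = true_merges + false_merges
    gold_total = true_merges + missed_merges
    return {
        # By convention here, precision and recall are 1.0 when there is nothing to get wrong.
        "precision": true_merges / predicted_total if predicted_total else 1.0,
        "recall": true_merges / gold_total if gold_total else 1.0,
        "false_merges": false_merges,
        "missed_merges": missed_merges,
    }


gold = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
# A prediction that wrongly merges AST into ALT.
pred = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "ALT"}
scores = pairwise_metrics(pred, gold)
# → {"precision": 0.5, "recall": 1.0, "false_merges": 1, "missed_merges": 0}
```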
Installation
```bash
pip install catchfly
```
Version 0.0.1 is a name reservation only. Active development is underway, and the first functional release (v0.1.0) is expected in Q2 2026.
Part of the Silene ecosystem
| Project | What it is |
|---|---|
| Silene Systems | Computational phenotyping platform for rare diseases |
| Campion | Agentic batch literature extraction platform (SaaS) |
| catchfly | Schema discovery + term normalization library (open source, this repo) |
Silene tomentosa (Gibraltar campion) is one of the rarest plants in the world — thought extinct in 1992, rediscovered in 1994 on the Rock of Gibraltar. Like rare diseases, it hides in plain sight, waiting for someone to look carefully enough.
License
Apache-2.0
Citation
If you use catchfly in academic work:
```bibtex
@software{catchfly2026,
  author  = {Michalski, Adrian},
  title   = {catchfly: Schema discovery and category normalization for unstructured text data},
  year    = {2026},
  url     = {https://github.com/silene-systems/catchfly},
  license = {Apache-2.0}
}
```