catchfly

From chaos to categories.

Schema discovery + term normalization for unstructured text data.

Named after Silene (catchfly) — plants that secrete a sticky substance to capture insects. Catchfly captures scattered terms and groups them into canonical categories.

Part of the Silene Systems ecosystem for rare disease research.

The problem

You extracted 3,000 mentions from 200 scientific papers. Or an LLM classified 3 million emails into 4,000 labels. Now you have:

No schema — you don't know what categories should exist
Duplicates everywhere — "miglustat" and "Zavesca" are the same drug
Ambiguous boundaries — is "cognitive decline" the same as "cognitive impairment"?
Related but distinct entities — "ALT" and "AST" are similar but clinically different

No existing tool does both schema discovery and term normalization as a composable Python library.

What catchfly does

Two operations, one pipeline:

Discover — "What categories exist in my data?"

from catchfly import Resolver

resolver = Resolver(embed_provider="gemini", llm_provider="openai/gpt-5.4")

schema = resolver.discover(
    mentions=["miglustat", "splenomegaly", "ALT: 120", "ataxia"],  # ... more mentions
    contexts={"miglustat": ["Patient received miglustat 200mg daily"]},  # ... more contexts
)

# schema.categories → ["Treatments", "Symptoms", "Lab Values", ...]
# schema.examples  → {"Treatments": ["miglustat", "arimoclomol"], ...}

schema.rename("Lab Values", "Laboratory Findings")
schema.merge("Symptoms", "Clinical Signs")

Normalize — "Which terms belong where, and which are synonyms?"

result = resolver.normalize(
    mentions=all_mentions,
    schema=schema,
    contexts=contexts,
)

# result.groups    → [NormGroup(canonical="miglustat", members=["Zavesca", "NB-DNJ"])]
# result.ambiguous → [AmbiguousPair("cognitive decline", "cognitive impairment")]

Or all-in-one:

result = resolver.resolve(mentions, contexts=contexts)
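One way to picture how pairwise judgments could turn into the groups and ambiguous pairs shown above is a union-find merge. The sketch below is an illustration only, not catchfly's implementation: the `Relation` enum, the `group_mentions` function, and the rule that only synonym pairs merge while related pairs are flagged as ambiguous are all assumptions made for this example.

```python
from collections import defaultdict
from enum import Enum


class Relation(Enum):
    """Hypothetical 4-way relation labels (names assumed for this sketch)."""
    SYNONYM = "synonym"
    HIERARCHY = "hierarchy"
    RELATED = "related"
    DISTINCT = "distinct"


def group_mentions(mentions, judged_pairs):
    """Merge SYNONYM pairs with union-find; collect RELATED pairs as ambiguous."""
    parent = {m: m for m in mentions}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ambiguous = []
    for a, b, relation in judged_pairs:
        if relation is Relation.SYNONYM:
            # The first term of the pair keeps the root, so it stays canonical.
            parent[find(b)] = find(a)
        elif relation is Relation.RELATED:
            ambiguous.append((a, b))
        # HIERARCHY and DISTINCT pairs are left unmerged in this sketch.

    groups = defaultdict(list)
    for m in mentions:
        groups[find(m)].append(m)
    return dict(groups), ambiguous


mentions = ["miglustat", "Zavesca", "NB-DNJ", "ALT", "AST",
            "cognitive decline", "cognitive impairment"]
# In a real pipeline these labels would come from an LLM; here they are hard-coded.
judged_pairs = [
    ("miglustat", "Zavesca", Relation.SYNONYM),
    ("miglustat", "NB-DNJ", Relation.SYNONYM),
    ("ALT", "AST", Relation.DISTINCT),
    ("cognitive decline", "cognitive impairment", Relation.RELATED),
]
groups, ambiguous = group_mentions(mentions, judged_pairs)
```

With these hard-coded judgments, "Zavesca" and "NB-DNJ" end up under "miglustat", "ALT" and "AST" stay separate, and the two cognitive terms are returned as an ambiguous pair for human review.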

Self-improving prompts

Give catchfly 20–50 labeled examples and it optimizes its own prompts for your domain. Inspired by GEPA, implemented from scratch with zero external dependencies.

optimized = resolver.optimize(
    mentions=mentions,
    contexts=contexts,
    ground_truth={"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"},
    iterations=20,  # ~$1–2, ~15 min
)

result = optimized.resolve(new_mentions)  # 20–30% better accuracy on your domain

How it works

Four components, composable as a pipeline:

Component	What it does	Cost
EmbeddingSimilarity	Pre-filter: finds candidate pairs via cosine similarity	Embedding API only
LLMGrouper	Core engine: LLM analyzes clusters, proposes categories/synonyms with 4-way relation typing (synonym / hierarchy / related / distinct)	LLM API
LLMClassifier	Propagation: assigns remaining mentions to approved categories via few-shot classification	LLM API
UserSeeded	User guidance: seed mappings override and anchor discovery + normalization	Embedding only
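To make the pre-filter row in the table above concrete, here is a minimal sketch of candidate-pair selection by cosine similarity. It is not catchfly's code: the function names, the 0.8 threshold, and the three-dimensional toy vectors are assumptions chosen for illustration, and real embeddings would come from one of the embedding providers.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings, threshold=0.8):
    """Return (term_a, term_b, score) for every pair at or above the threshold."""
    pairs = []
    for (a, u), (b, v) in combinations(embeddings.items(), 2):
        score = cosine(u, v)
        if score >= threshold:
            pairs.append((a, b, score))
    # Highest-scoring pairs first, so the most likely synonyms are reviewed first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors standing in for real embeddings (values invented for this example).
toy_embeddings = {
    "miglustat": [0.9, 0.1, 0.0],
    "Zavesca": [0.8, 0.2, 0.1],
    "splenomegaly": [0.0, 0.1, 0.9],
}
pairs = candidate_pairs(toy_embeddings)
```

Only the "miglustat" / "Zavesca" pair clears the threshold here, so only that pair would be passed on to a more expensive LLM step. The all-pairs loop is quadratic in the number of mentions; a larger corpus would likely need an approximate nearest-neighbour index instead.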

Tiered quality

Tier	What runs	Cost / 1K mentions	Accuracy	Best for
Tier 1 (free)	Embedding only	$0 (local models)	60–70%	Exploration
Tier 2 (standard)	Embedding + LLM	$0.15–0.50	82–90%	Production
Tier 3 (optimized)	Tier 2 + optimize()	$1–3 one-time	90–95%+	Systematic reviews
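As a rough worked example of the figures in the table above, the helper below scales the per-1K price range to a given number of mentions. It is a back-of-the-envelope sketch: `estimate_cost` is a hypothetical helper, and it assumes that Tier 3 is billed as Tier 2 per mention plus the one-time optimize() cost, which is one reading of the table rather than a documented pricing rule.

```python
# Price ranges in USD copied from the table; Tier 3 per-mention cost is assumed
# to equal Tier 2, with the one-time optimize() cost added on top.
COST_PER_1K = {1: (0.0, 0.0), 2: (0.15, 0.50), 3: (0.15, 0.50)}
OPTIMIZE_ONE_TIME = (1.0, 3.0)


def estimate_cost(n_mentions, tier):
    """Return a (low, high) USD estimate for normalizing n_mentions at a tier."""
    low_rate, high_rate = COST_PER_1K[tier]
    scale = n_mentions / 1000
    low, high = low_rate * scale, high_rate * scale
    if tier == 3:
        low += OPTIMIZE_ONE_TIME[0]
        high += OPTIMIZE_ONE_TIME[1]
    return round(low, 2), round(high, 2)


tier2 = estimate_cost(3000, tier=2)
tier3 = estimate_cost(3000, tier=3)
```

For the 3,000-mention scenario from the problem statement, this gives roughly $0.45 to $1.50 at Tier 2 and $1.45 to $4.50 at Tier 3 under the stated assumption.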

Use cases

Medical literature — normalize symptoms, drugs, genetic variants from systematic reviews
Email categorization — collapse 4,000 LLM-generated labels into 150 clean groups
E-commerce tags — build taxonomy from 50,000 user-generated product tags
Any domain — if you have messy text labels, catchfly cleans them up

Provider support

# Embeddings
Resolver(embed_provider="gemini")       # Google Gemini Embedding 2
Resolver(embed_provider="openai")        # OpenAI text-embedding-3
Resolver(embed_provider="local")         # sentence-transformers (offline, free)
Resolver(embed_provider=my_function)     # any callable

# LLM
Resolver(llm_provider="openai/gpt-5.4")
Resolver(llm_provider="gemini/flash")
Resolver(llm_provider=my_function)       # any callable
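The "any callable" option is not specified further here, so the following is only a guess at what a custom embedding function might look like. It assumes the callable receives a list of strings and returns one vector per string; that signature, the function name `hashed_ngram_embed`, and its parameters are assumptions for this sketch, not part of a documented interface.

```python
import hashlib
import math


def hashed_ngram_embed(texts, dim=64, n=3):
    """Toy offline embedding: hash character n-grams into a fixed-size vector.

    Assumed contract (not confirmed by the source): list[str] -> list[list[float]].
    """
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        padded = f" {text.lower()} "  # pad so word boundaries form n-grams too
        for i in range(max(len(padded) - n + 1, 1)):
            gram = padded[i:i + n]
            digest = hashlib.sha256(gram.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % dim
            vec[bucket] += 1.0
        # L2-normalize so that a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors


vectors = hashed_ngram_embed(["miglustat", "Miglustat", "ataxia"])
```

If the assumed contract holds, a function like this could be passed as `embed_provider=hashed_ngram_embed`. It captures spelling overlap only, so it would not relate "miglustat" to "Zavesca"; a trained embedding model is needed for that.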

Evaluation

ground_truth = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
metrics = resolver.evaluate(result, ground_truth)
# → precision, recall, false_merges, missed_merges, category_accuracy
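For intuition about the metric names above, here is one common way to score a normalization against ground truth: compare every pair of mentions and check whether both mappings agree on putting them together. This is a sketch under assumptions, not the definition catchfly uses; `pairwise_metrics` is a hypothetical function, the predicted mapping is invented, and category_accuracy is not covered.

```python
from itertools import combinations


def pairwise_metrics(predicted, ground_truth):
    """Pairwise precision/recall over mentions present in both mappings."""
    shared = sorted(set(predicted) & set(ground_truth))
    true_merges = false_merges = missed_merges = 0
    for a, b in combinations(shared, 2):
        same_pred = predicted[a] == predicted[b]
        same_true = ground_truth[a] == ground_truth[b]
        if same_pred and same_true:
            true_merges += 1
        elif same_pred:
            false_merges += 1   # merged by the system, distinct in ground truth
        elif same_true:
            missed_merges += 1  # synonyms in ground truth, left apart by the system
    merged = true_merges + false_merges
    expected = true_merges + missed_merges
    return {
        "precision": true_merges / merged if merged else 1.0,
        "recall": true_merges / expected if expected else 1.0,
        "false_merges": false_merges,
        "missed_merges": missed_merges,
    }


ground_truth = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
# Invented system output that wrongly merges the two liver enzymes.
predicted = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "ALT"}
scores = pairwise_metrics(predicted, ground_truth)
```

In this example the drug synonyms are merged correctly, but collapsing "AST" into "ALT" counts as one false merge, so precision drops to 0.5 while recall stays at 1.0.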

Installation

pip install catchfly

v0.0.1 is a name reservation. Active development is underway. First functional release (v0.1.0) expected Q2 2026.

Part of the Silene ecosystem

Project	What it is
Silene Systems	Computational phenotyping platform for rare diseases
Campion	Agentic batch literature extraction platform (SaaS)
catchfly	Schema discovery + term normalization library (open source, this repo)

Silene tomentosa (Gibraltar campion) is one of the rarest plants in the world — thought extinct in 1992, rediscovered in 1994 on the Rock of Gibraltar. Like rare diseases, it hides in plain sight, waiting for someone to look carefully enough.

License

Apache-2.0

Citation

If you use catchfly in academic work:

@software{catchfly2026,
  author = {Michalski, Adrian},
  title = {catchfly: Schema discovery and category normalization for unstructured text data},
  year = {2026},
  url = {https://github.com/silene-systems/catchfly},
  license = {Apache-2.0}
}

Release history

Version	Released
1.1.4	Apr 7, 2026
1.1.3	Apr 6, 2026
1.1.2	Apr 5, 2026
1.1.1	Mar 26, 2026
1.0.3	Mar 25, 2026
1.0.2	Mar 25, 2026
1.0.1	Mar 25, 2026
1.0.0	Mar 24, 2026
0.8.1	Mar 24, 2026
0.8.0	Mar 23, 2026
0.5.0	Mar 23, 2026
0.3.0	Mar 23, 2026
0.1.0	Mar 23, 2026
0.0.1 (this version)	Mar 19, 2026

