catchfly

From chaos to categories.

Schema discovery + term normalization for unstructured text data.


Named after Silene (catchfly) — plants that secrete a sticky substance to capture insects. Catchfly captures scattered terms and groups them into canonical categories.

Part of the Silene Systems ecosystem for rare disease research.

The problem

You extracted 3,000 mentions from 200 scientific papers. Or an LLM classified 3 million emails into 4,000 labels. Now you have:

  • No schema — you don't know what categories should exist
  • Duplicates everywhere — "miglustat" and "Zavesca" are the same drug
  • Ambiguous boundaries — is "cognitive decline" the same as "cognitive impairment"?
  • Related but distinct entities — "ALT" and "AST" are similar but clinically different

No existing tool does both schema discovery and term normalization as a composable Python library.

What catchfly does

Two operations, one pipeline:

Discover: "What categories exist in my data?"

from catchfly import Resolver

resolver = Resolver(embed_provider="gemini", llm_provider="openai/gpt-5.4")

schema = resolver.discover(
    mentions=["miglustat", "splenomegaly", "ALT: 120", "ataxia", ...],
    contexts={"miglustat": ["Patient received miglustat 200mg daily"], ...},
)

# schema.categories → ["Treatments", "Symptoms", "Lab Values", ...]
# schema.examples  → {"Treatments": ["miglustat", "arimoclomol"], ...}

schema.rename("Lab Values", "Laboratory Findings")
schema.merge("Symptoms", "Clinical Signs")

Normalize: "Which terms belong where, and which are synonyms?"

result = resolver.normalize(
    mentions=all_mentions,
    schema=schema,
    contexts=contexts,
)

# result.groups    → [NormGroup(canonical="miglustat", members=["Zavesca", "NB-DNJ"])]
# result.ambiguous → [AmbiguousPair("cognitive decline", "cognitive impairment")]
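
To illustrate why relation typing matters for grouping, here is a minimal sketch, not catchfly's implementation. It assumes pairs have already been labeled with one of the four relation types (synonym / hierarchy / related / distinct) and merges only the synonym pairs with a small union-find. `Relation`, `Group`, and `build_groups` are hypothetical names, and the first term seen in a group is chosen as canonical purely for simplicity.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    SYNONYM = "synonym"
    HIERARCHY = "hierarchy"
    RELATED = "related"
    DISTINCT = "distinct"


@dataclass
class Group:
    # Hypothetical stand-in for a normalization group; not catchfly's class.
    canonical: str
    members: list = field(default_factory=list)


def build_groups(terms, typed_pairs):
    """Merge terms linked by SYNONYM pairs; every other relation keeps them apart."""
    parent = {term: term for term in terms}

    def find(term):
        # Walk up to the root, halving the path along the way.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b, relation in typed_pairs:
        if relation is Relation.SYNONYM:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # the left-hand root stays canonical

    clusters = {}
    for term in terms:
        clusters.setdefault(find(term), []).append(term)
    return [
        Group(canonical=root, members=[m for m in members if m != root])
        for root, members in clusters.items()
    ]


terms = ["miglustat", "Zavesca", "NB-DNJ", "ALT", "AST"]
typed_pairs = [
    ("miglustat", "Zavesca", Relation.SYNONYM),
    ("miglustat", "NB-DNJ", Relation.SYNONYM),
    ("ALT", "AST", Relation.RELATED),  # similar, but must not be merged
]
groups = build_groups(terms, typed_pairs)
```

In this toy input the three drug names end up in one group, while "ALT" and "AST" stay in separate single-term groups because their pair is typed as related rather than synonym.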

Or all-in-one:

result = resolver.resolve(mentions, contexts=contexts)

Self-improving prompts

Give catchfly 20–50 labeled examples and it optimizes its own prompts for your domain. Inspired by GEPA, implemented from scratch with zero external dependencies.

optimized = resolver.optimize(
    mentions=mentions,
    contexts=contexts,
    ground_truth={"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"},
    iterations=20,  # ~$1–2, ~15 min
)

result = optimized.resolve(new_mentions)  # 20–30% better accuracy on your domain

How it works

Four components, composable as a pipeline:

| Component | What it does | Cost |
|---|---|---|
| EmbeddingSimilarity | Pre-filter: finds candidate pairs via cosine similarity | Embedding API only |
| LLMGrouper | Core engine: LLM analyzes clusters, proposes categories/synonyms with 4-way relation typing (synonym / hierarchy / related / distinct) | LLM API |
| LLMClassifier | Propagation: assigns remaining mentions to approved categories via few-shot classification | LLM API |
| UserSeeded | User guidance: seed mappings override and anchor discovery + normalization | Embedding only |
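
The pre-filter step is described as finding candidate pairs via cosine similarity. The sketch below shows that idea in plain Python under stated assumptions: `cosine`, `candidate_pairs`, the 0.8 threshold, and the toy vectors are all hypothetical and are not taken from catchfly.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(embeddings, threshold=0.8):
    """Return (term_a, term_b, similarity) for every pair at or above threshold."""
    pairs = []
    for (a, u), (b, v) in combinations(sorted(embeddings.items()), 2):
        similarity = cosine(u, v)
        if similarity >= threshold:
            pairs.append((a, b, similarity))
    # Most similar pairs first, so a downstream step can review them in order.
    return sorted(pairs, key=lambda pair: -pair[2])


# Toy 3-dimensional vectors, made up for illustration only.
toy_embeddings = {
    "miglustat": [0.9, 0.1, 0.0],
    "Zavesca": [0.8, 0.2, 0.1],
    "ataxia": [0.0, 0.1, 0.9],
}
pairs = candidate_pairs(toy_embeddings, threshold=0.8)
```

Comparing all pairs is quadratic in the number of terms, which is acceptable for a few thousand mentions; larger inputs would typically need an approximate nearest-neighbour index instead.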

Tiered quality

| Tier | What runs | Cost / 1K mentions | Accuracy | Best for |
|---|---|---|---|---|
| Tier 1 (free) | Embedding only | $0 (local models) | 60–70% | Exploration |
| Tier 2 (standard) | Embedding + LLM | $0.15–0.50 | 82–90% | Production |
| Tier 3 (optimized) | Tier 2 + optimize() | $1–3 one-time | 90–95%+ | Systematic reviews |
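
As a rough worked example of the per-1K figures in the Tier 1 and Tier 2 rows, the sketch below scales them to a given number of mentions. It assumes cost grows linearly with mention count, which the source does not state; `COST_PER_1K` and `estimate_cost` are hypothetical names.

```python
# Cost ranges in USD per 1,000 mentions, copied from the Tier 1 and Tier 2 rows.
COST_PER_1K = {"tier1": (0.0, 0.0), "tier2": (0.15, 0.50)}


def estimate_cost(n_mentions, tier):
    """Return a (low, high) USD estimate, assuming linear scaling with mentions."""
    low, high = COST_PER_1K[tier]
    scale = n_mentions / 1000
    return round(low * scale, 2), round(high * scale, 2)


# The 3,000 extracted mentions from the problem statement, at Tier 2.
estimate = estimate_cost(3000, "tier2")  # (0.45, 1.5)
```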

Use cases

  • Medical literature — normalize symptoms, drugs, genetic variants from systematic reviews
  • Email categorization — collapse 4,000 LLM-generated labels into 150 clean groups
  • E-commerce tags — build taxonomy from 50,000 user-generated product tags
  • Any domain — if you have messy text labels, catchfly cleans them up

Provider support

# Embeddings
Resolver(embed_provider="gemini")       # Google Gemini Embedding 2
Resolver(embed_provider="openai")        # OpenAI text-embedding-3
Resolver(embed_provider="local")         # sentence-transformers (offline, free)
Resolver(embed_provider=my_function)     # any callable

# LLM
Resolver(llm_provider="openai/gpt-5.4")
Resolver(llm_provider="gemini/flash")
Resolver(llm_provider=my_function)       # any callable
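
The source says an embedding provider can be "any callable" but does not specify the expected signature. The sketch below assumes a function that takes a list of strings and returns one vector per string. `trigram_embed` is a hypothetical, dependency-free example that hashes character trigrams into a fixed-size vector; it is a crude stand-in for a real embedding model.

```python
import hashlib
import math


def trigram_embed(texts, dim=64):
    """Map each text to a unit-length vector of hashed character-trigram counts."""
    vectors = []
    for text in texts:
        padded = f"  {text.lower()} "  # padding so word boundaries form trigrams
        vec = [0.0] * dim
        for i in range(len(padded) - 2):
            gram = padded[i:i + 3]
            # hashlib gives a stable hash across runs, unlike the built-in hash().
            digest = hashlib.sha256(gram.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors


vectors = trigram_embed(["cognitive decline", "cognitive impairment"])
```

If the assumed signature matches, such a function would be passed as `Resolver(embed_provider=trigram_embed)`. Surface-form features like these cannot link "miglustat" to "Zavesca", which is one reason a learned embedding model or an LLM step is needed for true synonyms.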

Evaluation

ground_truth = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
metrics = resolver.evaluate(result, ground_truth)
# → precision, recall, false_merges, missed_merges, category_accuracy
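
The source does not define how these metrics are computed. One common convention is to score a normalization by the pairs of terms it merges, and the sketch below follows that convention as an assumption. `pairwise_metrics` and the `predicted` mapping are hypothetical, and `category_accuracy` is left out because it depends on schema details not given here.

```python
from itertools import combinations


def pairwise_metrics(predicted, ground_truth):
    """Compare two term -> canonical mappings on the pairs of terms they merge."""
    terms = sorted(set(predicted) & set(ground_truth))
    true_merges = false_merges = missed_merges = 0
    for a, b in combinations(terms, 2):
        same_predicted = predicted[a] == predicted[b]
        same_truth = ground_truth[a] == ground_truth[b]
        if same_predicted and same_truth:
            true_merges += 1
        elif same_predicted:
            false_merges += 1  # merged, but should have stayed apart
        elif same_truth:
            missed_merges += 1  # kept apart, but should have been merged
    n_predicted = true_merges + false_merges
    n_truth = true_merges + missed_merges
    return {
        "precision": true_merges / n_predicted if n_predicted else 1.0,
        "recall": true_merges / n_truth if n_truth else 1.0,
        "false_merges": false_merges,
        "missed_merges": missed_merges,
    }


ground_truth = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "AST"}
# A hypothetical prediction that wrongly merges AST into ALT.
predicted = {"miglustat": "miglustat", "Zavesca": "miglustat", "ALT": "ALT", "AST": "ALT"}
metrics = pairwise_metrics(predicted, ground_truth)
```

Here the prediction finds the one true merge (miglustat / Zavesca) and adds one false merge (ALT / AST), so precision is 0.5 and recall is 1.0.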

Installation

pip install catchfly

v0.0.1 is a name reservation; the API shown above is not yet available. Active development is underway, and the first functional release (v0.1.0) is expected in Q2 2026.

Part of the Silene ecosystem

| Project | What it is |
|---|---|
| Silene Systems | Computational phenotyping platform for rare diseases |
| Campion | Agentic batch literature extraction platform (SaaS) |
| catchfly | Schema discovery + term normalization library (open source, this repo) |

Silene tomentosa (Gibraltar campion) is one of the rarest plants in the world — thought extinct in 1992, rediscovered in 1994 on the Rock of Gibraltar. Like rare diseases, it hides in plain sight, waiting for someone to look carefully enough.

License

Apache-2.0

Citation

If you use catchfly in academic work:

@software{catchfly2026,
  author = {Michalski, Adrian},
  title = {catchfly: Schema discovery and category normalization for unstructured text data},
  year = {2026},
  url = {https://github.com/silene-systems/catchfly},
  license = {Apache-2.0}
}
