
LangCore hybrid provider — combines deterministic rule/regex extraction with LLM fallback for cost savings


LangCore Hybrid Provider

A provider plugin for LangCore that combines deterministic rule-based extraction (regex, callable functions) with LLM fallback. On well-structured documents it can cut LLM costs by 50–80%.

Note: This is a third-party provider plugin for LangCore. For the main LangCore library, visit google/langcore.

Installation

Install from source:

git clone <repo-url>
cd langcore-hybrid
pip install -e .

For optional spaCy NER support:

pip install -e ".[spacy]"

Features

  • Rules first, LLM fallback — deterministic extraction for patterns that don't need an LLM
  • Regex rules — extract dates, amounts, reference numbers via named capture groups
  • Callable rules — plug in any Python function for custom extraction logic
  • Confidence thresholds — optionally fall back to LLM when rule confidence is low
  • Batch-aware async — only prompts that miss all rules are batched for LLM inference
  • Observability — built-in counters for rule hits vs LLM fallbacks
  • Thread-safe counters — rule-hit and LLM-fallback counters are protected by a lock for safe concurrent use
  • Text extractor hook — optional text_extractor callable in RuleConfig isolates document text from prompt instructions before rule evaluation
  • Output formatter hook — optional output_formatter callable in RuleConfig normalises rule outputs (e.g. wrap in JSON)
  • Zero overhead on hits — rule evaluation is pure Python, no network calls

Usage

Regex Rules for Contract Extraction

import langcore as lx
from langcore_hybrid import (
    HybridLanguageModel,
    RegexRule,
    RuleConfig,
)

# Create the fallback LLM provider
inner_config = lx.factory.ModelConfig(
    model_id="litellm/azure/gpt-4o",
    provider="LiteLLMLanguageModel",
)
inner_model = lx.factory.create_model(inner_config)

# Define rules for common patterns
rules = RuleConfig(rules=[
    RegexRule(
        r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})",
        description="ISO date extraction",
    ),
    RegexRule(
        r"Amount:\s*\$(?P<amount>[\d,.]+)",
        description="USD amount extraction",
    ),
    RegexRule(
        r"Ref(?:erence)?[:\s]+(?P<ref>[A-Z]+-\d+)",
        description="Reference number extraction",
    ),
])

# Create hybrid provider
hybrid_model = HybridLanguageModel(
    model_id="hybrid/gpt-4o",
    inner=inner_model,
    rule_config=rules,
)

# Use as normal
result = lx.extract(
    text_or_documents="Contract Ref: ABC-123, Date: 2026-01-15...",
    model=hybrid_model,
    prompt_description="Extract contract metadata.",
)

# Check cost savings
print(f"Rule hits: {hybrid_model.rule_hits}")
print(f"LLM fallbacks: {hybrid_model.llm_fallbacks}")

Callable Rules

import json
import re

from langcore_hybrid import CallableRule, RuleConfig

def extract_email(prompt: str) -> str | None:
    """Extract email addresses deterministically."""
    emails = re.findall(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b', prompt)
    if emails:
        return json.dumps({"emails": emails})
    return None  # Signal a miss so the provider falls back to the LLM

rules = RuleConfig(rules=[
    CallableRule(
        func=extract_email,
        description="email extraction",
    ),
])

Custom Output Templates

import json
from langcore_hybrid import RegexRule

rule = RegexRule(
    r"Amount:\s*\$(?P<amount>[\d,.]+)",
    output_template=lambda groups: json.dumps({
        "amount": float(groups["amount"].replace(",", "")),
        "currency": "USD",
    }),
)
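To see what such a template produces, the same transformation can be run on the groups of a plain `re` match. This is a standalone sketch: it assumes, as the lambda above suggests, that the template receives a dict of named groups, and `to_usd` is a hypothetical name for the template function.

```python
import json
import re

AMOUNT = re.compile(r"Amount:\s*\$(?P<amount>[\d,.]+)")


def to_usd(groups: dict[str, str]) -> str:
    """Same logic as the output_template lambda above."""
    return json.dumps({
        "amount": float(groups["amount"].replace(",", "")),
        "currency": "USD",
    })


match = AMOUNT.search("Amount: $1,234.50")
formatted = to_usd(match.groupdict())
print(formatted)  # {"amount": 1234.5, "currency": "USD"}
```

One caveat: the character class `[\d,.]+` also swallows a sentence-ending period, so `"Amount: $1,234.50."` would capture `1,234.50.` and `float()` would raise `ValueError`. Stripping trailing punctuation inside the template is one way to guard against that.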

Confidence Thresholds

When fallback_on_low_confidence is enabled, rule results with confidence below min_confidence trigger LLM fallback instead:

from langcore_hybrid import RegexRule, RuleConfig

rules = RuleConfig(
    rules=[
        RegexRule(r"(?P<date>\d{1,2}/\d{1,2}/\d{2,4})", confidence=0.6),
        RegexRule(r"(?P<date>\d{4}-\d{2}-\d{2})", confidence=1.0),
    ],
    fallback_on_low_confidence=True,
    min_confidence=0.8,
)
# Ambiguous dates (MM/DD/YY) fall back to LLM; ISO dates are trusted

Rule Evaluation Order

Rules are evaluated in list order. The first rule that hits and meets the confidence threshold wins — later rules are not evaluated.

When fallback_on_low_confidence=True, a rule that hits below min_confidence is skipped and evaluation continues to the next rule. If no subsequent rule produces a confident hit, the prompt falls through to the LLM.

This means rule ordering matters:

  1. Place high-confidence, specific rules first (e.g. ISO dates).
  2. Follow with lower-confidence, broader rules (e.g. ambiguous date formats).
  3. The LLM acts as the final catch-all.

from langcore_hybrid import RegexRule, RuleConfig

rules = RuleConfig(
    rules=[
        RegexRule(r"(?P<date>\d{4}-\d{2}-\d{2})", confidence=1.0),   # ① specific
        RegexRule(r"(?P<date>\d{1,2}/\d{1,2}/\d{2,4})", confidence=0.6),  # ② broad
    ],
    fallback_on_low_confidence=True,
    min_confidence=0.8,
)
# "2026-01-15" → rule ① (confidence 1.0 ≥ 0.8) → instant result
# "1/15/26"   → rule ① miss, rule ② hit (0.6 < 0.8) → LLM fallback
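The evaluation order described above can be sketched in plain Python. This is not the library's implementation: `SketchRule`, `SketchResult`, and `first_confident_hit` are hypothetical stand-ins. The behaviour with `fallback_on_low_confidence=False` (accept any hit) is an assumption inferred from the description.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchResult:
    hit: bool
    output: Optional[str] = None
    confidence: float = 1.0


@dataclass
class SketchRule:
    pattern: str
    confidence: float = 1.0

    def evaluate(self, text: str) -> SketchResult:
        match = re.search(self.pattern, text)
        if match is None:
            return SketchResult(hit=False)
        return SketchResult(True, match.group("date"), self.confidence)


def first_confident_hit(
    rules: list[SketchRule],
    text: str,
    min_confidence: float,
    fallback_on_low_confidence: bool = True,
) -> Optional[SketchResult]:
    """Return the first acceptable hit, or None to signal LLM fallback."""
    for rule in rules:
        result = rule.evaluate(text)
        if not result.hit:
            continue  # miss: try the next rule
        if fallback_on_low_confidence and result.confidence < min_confidence:
            continue  # low-confidence hit: skip it and keep looking
        return result  # first confident hit wins; later rules never run
    return None  # nothing confident: the caller would ask the LLM


sketch_rules = [
    SketchRule(r"(?P<date>\d{4}-\d{2}-\d{2})", confidence=1.0),
    SketchRule(r"(?P<date>\d{1,2}/\d{1,2}/\d{2,4})", confidence=0.6),
]

print(first_confident_hit(sketch_rules, "2026-01-15", 0.8))  # confident hit
print(first_confident_hit(sketch_rules, "1/15/26", 0.8))     # None
```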

Async Usage

The async path is batch-optimised — only prompts that miss all rules are sent to the LLM in a single batch:

import asyncio

async def main() -> None:
    results = await hybrid_model.async_infer([
        "Date: 2026-01-15",       # Rule hit: instant
        "Complex clause text...",  # Rule miss: LLM
        "Ref: ABC-123",           # Rule hit: instant
    ])
    # Only 1 LLM call for the 1 miss, not 3

asyncio.run(main())
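The batching idea can be illustrated without the library: evaluate rules for every prompt, collect the indices that missed, send only those to the model in one call, then put the answers back in their original positions. Everything below is a hypothetical sketch; `fake_llm_batch` stands in for a real batched LLM call and `llm_calls` exists only to show how many calls were made.

```python
import asyncio
import re
from typing import Optional

PATTERNS = [
    re.compile(r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})"),
    re.compile(r"Ref(?:erence)?[:\s]+(?P<ref>[A-Z]+-\d+)"),
]

llm_calls: list[list[str]] = []  # records each batch sent to the fake LLM


def try_rules(prompt: str) -> Optional[str]:
    """Return the first rule capture, or None on a miss."""
    for pattern in PATTERNS:
        match = pattern.search(prompt)
        if match:
            return match.group(1)
    return None


async def fake_llm_batch(prompts: list[str]) -> list[str]:
    """Placeholder for one batched LLM request."""
    llm_calls.append(prompts)
    await asyncio.sleep(0)
    return [f"llm:{p}" for p in prompts]


async def infer_batch(prompts: list[str]) -> list[str]:
    outputs: dict[int, str] = {}
    misses: list[int] = []
    for i, prompt in enumerate(prompts):
        hit = try_rules(prompt)
        if hit is None:
            misses.append(i)
        else:
            outputs[i] = hit
    if misses:  # one batched call covering every miss
        answers = await fake_llm_batch([prompts[i] for i in misses])
        outputs.update(zip(misses, answers))
    return [outputs[i] for i in range(len(prompts))]  # original order


demo = asyncio.run(infer_batch([
    "Date: 2026-01-15",
    "Complex clause text...",
    "Ref: ABC-123",
]))
print(demo)  # ['2026-01-15', 'llm:Complex clause text...', 'ABC-123']
```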

Observability

Counters are per-instance and cumulative — they track totals since the provider was created (or last reset). In long-running applications where a single HybridLanguageModel handles unrelated jobs, call reset_counters() between jobs or use get_counters() to take point-in-time snapshots for differential reporting.

print(f"Rule hits: {hybrid_model.rule_hits}")
print(f"LLM fallbacks: {hybrid_model.llm_fallbacks}")

# Atomic snapshot (thread-safe dict copy)
snapshot = hybrid_model.get_counters()
# {"rule_hits": 42, "llm_fallbacks": 7}

# Reset counters (also thread-safe)
hybrid_model.reset_counters()
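For differential reporting, two snapshots can be subtracted to get per-job numbers without resetting the shared counters. The helpers below are a hypothetical sketch that operates on plain dicts shaped like the `get_counters()` output shown above; `counter_delta` and `rule_hit_rate` are not part of the library.

```python
def counter_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-key difference between two counter snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}


def rule_hit_rate(counters: dict[str, int]) -> float:
    """Fraction of prompts answered by rules (0.0 if nothing was processed)."""
    total = counters["rule_hits"] + counters["llm_fallbacks"]
    return counters["rule_hits"] / total if total else 0.0


# In real use these would come from hybrid_model.get_counters()
# taken before and after a job; the values here are made up.
before = {"rule_hits": 42, "llm_fallbacks": 7}
after = {"rule_hits": 60, "llm_fallbacks": 9}

job = counter_delta(before, after)
print(job)                 # {'rule_hits': 18, 'llm_fallbacks': 2}
print(rule_hit_rate(job))  # 0.9
```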

Text Extractor

When prompts contain instructions followed by document text, rules may match instruction fragments by mistake. Use text_extractor to isolate the document:

from langcore_hybrid import RegexRule, RuleConfig

def extract_after_marker(prompt: str) -> str:
    """Return text after '---DOCUMENT---' marker."""
    marker = "---DOCUMENT---"
    idx = prompt.find(marker)
    return prompt[idx + len(marker) :].strip() if idx >= 0 else prompt

rules = RuleConfig(
    rules=[RegexRule(r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})")],
    text_extractor=extract_after_marker,
)
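The failure mode is easy to reproduce with the standard `re` module alone. In this sketch the prompt layout (instructions, then a `---DOCUMENT---` marker, then the document) is a made-up example; the date pattern and the extractor match the ones above.

```python
import re

DATE = re.compile(r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})")


def extract_after_marker(prompt: str) -> str:
    """Return text after the '---DOCUMENT---' marker, or the whole prompt."""
    marker = "---DOCUMENT---"
    idx = prompt.find(marker)
    return prompt[idx + len(marker):].strip() if idx >= 0 else prompt


prompt = (
    "Return fields in the form Date: 1999-12-31.\n"
    "---DOCUMENT---\n"
    "Contract signed. Date: 2026-01-15"
)

# Without isolation, the rule matches the example date in the instructions.
naive = DATE.search(prompt).group("date")
# With isolation, only the document text is searched.
isolated = DATE.search(extract_after_marker(prompt)).group("date")

print(naive)     # 1999-12-31
print(isolated)  # 2026-01-15
```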

Output Formatter

Normalise rule outputs for downstream consumers:

import json

from langcore_hybrid import RegexRule, RuleConfig

rules = RuleConfig(
    rules=[RegexRule(r"Ref:\s*(?P<ref>[A-Z]+-\d+)")],
    output_formatter=lambda raw: json.dumps({"result": json.loads(raw)}),
)

Custom Rules

Implement the ExtractionRule interface:

import json

from langcore_hybrid.rules import ExtractionRule, RuleResult

class SpacyNERRule(ExtractionRule):
    def __init__(self, nlp_model: str = "en_core_web_sm") -> None:
        # Imported lazily because spaCy is an optional dependency
        import spacy
        self._nlp = spacy.load(nlp_model)

    def evaluate(self, prompt: str) -> RuleResult:
        doc = self._nlp(prompt)
        entities = [
            {"text": ent.text, "label": ent.label_}
            for ent in doc.ents
        ]
        if entities:
            return RuleResult(
                hit=True,
                output=json.dumps(entities),
                confidence=0.85,
            )
        return RuleResult(hit=False)

Development

pip install -e ".[dev]"
pytest

License

Apache 2.0
