lede-spacy

spaCy-powered enrichment backend for lede. Adds proper named-entity recognition (PERSON / ORG / GPE) and richer phrase / fact extraction by registering itself as the "spacy" backend for lede.extract.metadata, lede.extract.phrases, and lede.extract.correlate_facts.

| | regex backend (default in lede) | spaCy backend (this package) |
|---|---|---|
| Entities (PERSON / ORG / GPE) | always empty | populated from en_core_web_sm |
| Phrases | repeated multi-word n-grams | syntactically-grounded noun chunks (still count-filtered) |
| Correlate facts | regex pattern → entity↔number | dependency-parse → wider net of entity↔number relationships |
| Latency | sub-millisecond | ~5 ms after warmup, ~50 ms first call (model load) |
| Determinism | byte-identical Python ↔ Rust | spaCy is deterministic but Python-only and not byte-comparable to Rust |
| Install footprint | stdlib only | spacy>=3.8 + en_core_web_sm (~50 MB model) |

When you actually want this

lede's regex backend covers the majority of structured-extract use cases for free — dates, amounts, URLs, numeric facts with sentence context, and repeated-phrase mining all work with zero dependencies. You want lede-spacy specifically when you need named entities — people, companies, places — pulled out of arbitrary text. That's where the regex backend explicitly returns nothing.

You also get richer correlate_facts: the dependency parser walks the syntax tree to find entity↔number relationships even for entities mentioned only once, where the regex backend requires repetition.

Side-by-side: the same input, both backends

from lede.extract import metadata
import lede_spacy  # side effect: registers the 'spacy' backend

The text:

Acme Corp announced today a partnership with Yonk Labs to integrate deterministic summarization into their RAG pipeline. The deal, brokered by CEO Lin Wu and signed in San Francisco on 2024-11-15, covers $2.4M in annual licensing through 2027. Sarah Jones from Acme's engineering team and Marcus Chen from Yonk Labs will lead the joint integration. The first deployment is targeted for European customers, including teams in London, Berlin, and Paris.

backend="regex" (lede default — zero-dep)

m = metadata(text)
m.dates     # ('2024-11-15', '2027')
m.amounts   # ('$2.4M',)
m.urls      # ()
m.entities  # ()    ← regex backend returns nothing here

The regex backend caught the structured stuff (ISO date, year, dollar amount). It can't do entities — that's not a regex job.
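
As a rough illustration of why these fields are cheap to extract, the sketch below pulls ISO dates, bare years, dollar amounts, and URLs using only the standard library's re module. The patterns and the extract_structured name are assumptions made for this example, not lede's actual implementation, and they only cover the formats that appear in the sample text.

```python
import re

# Illustrative patterns only; lede's real regex backend may differ.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b(?:19|20)\d{2}\b")
AMOUNT_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?[KMB]?")
URL_RE = re.compile(r"https?://\S+")


def extract_structured(text: str) -> dict[str, tuple[str, ...]]:
    """Return dates, amounts, and URLs found in text, in order of appearance."""
    return {
        "dates": tuple(DATE_RE.findall(text)),
        "amounts": tuple(AMOUNT_RE.findall(text)),
        "urls": tuple(URL_RE.findall(text)),
    }


sample = (
    "The deal, signed in San Francisco on 2024-11-15, "
    "covers $2.4M in annual licensing through 2027."
)
result = extract_structured(sample)
print(result["dates"])
print(result["amounts"])
```

Nothing in these patterns can tell "Lin Wu" from "San Francisco", which is the gap the spaCy backend fills.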

backend="spacy" (this package)

m = metadata(text, backend="spacy")
m.dates     # ('2024-11-15', '2027')        — same regex stage runs
m.amounts   # ('$2.4M',)                    — same regex stage runs
m.urls      # ()                            — same regex stage runs
m.entities  # ('Acme Corp', 'Yonk Labs', 'RAG', 'Lin Wu',
            #  'San Francisco', 'Sarah Jones', 'Acme',
            #  'Marcus Chen', 'London', 'Berlin', 'Paris')

The spaCy backend pulls 11 entities out of the same input: PERSON ('Lin Wu', 'Sarah Jones', 'Marcus Chen'), ORG ('Acme Corp', 'Yonk Labs', 'Acme', 'RAG'), and GPE ('San Francisco', 'London', 'Berlin', 'Paris'). The dates / amounts / URLs fields are unchanged, because the spaCy backend runs the same regex stages and adds the spaCy NER stage on top.
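
m.entities is shown above as a flat tuple, so the PERSON / ORG / GPE breakdown has to come from somewhere else if you need it. One option, sketched here under the assumption that you run spaCy directly alongside lede, is to collect (ent.text, ent.label_) pairs from doc.ents and group them. The group_by_label helper is hypothetical, and the pairs below are hand-written stand-ins so the example runs without a model installed.

```python
from collections import defaultdict
from typing import Iterable

KEEP_LABELS = frozenset({"PERSON", "ORG", "GPE"})


def group_by_label(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (text, label) pairs by label, dropping duplicates and other labels."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for text, label in pairs:
        if label in KEEP_LABELS and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With spaCy installed, the pairs would come from a parsed document:
#   nlp = spacy.load("en_core_web_sm")
#   pairs = [(ent.text, ent.label_) for ent in nlp(text).ents]
pairs = [
    ("Acme Corp", "ORG"),
    ("Lin Wu", "PERSON"),
    ("San Francisco", "GPE"),
    ("2024-11-15", "DATE"),   # ignored: not one of the three kept labels
    ("Acme Corp", "ORG"),     # ignored: duplicate
]
grouped = group_by_label(pairs)
print(grouped)
```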

Use case: correlate_facts finds different relationships

For some inputs the two backends produce different but overlapping fact relationships. From the same paragraph above:

from lede.extract import correlate_facts

correlate_facts(text)                        # regex: 2 pairs anchored on Acme Corp
correlate_facts(text, backend="spacy")       # spaCy: 4 pairs, also catches Yonk Labs and customer churn

The dep-parser approach catches relationships the regex misses, especially for entities that appear once. (And vice versa — sometimes the regex backend catches a pattern the dep-parser doesn't. The two are complementary, not strictly better/worse. Switch backend per call if you need it.)
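
Since the two backends are complementary, one way to use them together is to call both and keep the union of their results. The sketch below assumes each call returns an iterable of hashable items, such as (entity, number) tuples; the text above does not confirm that return shape, so treat union_in_order and the placeholder data as hypothetical and adapt them to whatever correlate_facts actually returns.

```python
from typing import Hashable, Iterable, TypeVar

T = TypeVar("T", bound=Hashable)


def union_in_order(*results: Iterable[T]) -> list[T]:
    """Merge several result sets, keeping first-seen order and dropping repeats."""
    seen: set[T] = set()
    merged: list[T] = []
    for result in results:
        for item in result:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Placeholder stand-ins for correlate_facts(text) and
# correlate_facts(text, backend="spacy"); not real library output.
regex_pairs = [("entity_a", "10"), ("entity_a", "20")]
spacy_pairs = [("entity_a", "10"), ("entity_b", "30")]
merged = union_in_order(regex_pairs, spacy_pairs)
print(merged)
```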

Use this when…

  • Your callers want PERSON / ORG / GPE entities in the output. Default lede can't help.
  • You want richer entity-number correlations on documents where each entity is mentioned once.
  • You're already running spaCy in your pipeline and want to consolidate.
  • Latency budgets are in the 5–50 ms range per chunk, not sub-millisecond.

Don't use this when…

  • You need byte-identical Python ↔ Rust output. spaCy is Python-only and isn't on the parity contract.
  • You're on a sub-millisecond hot path. spaCy is ~5 ms per call after warmup, ~50 ms first call.
  • You don't actually need entities. The default lede regex backend already handles dates / amounts / URLs / numeric facts with sentence context.
  • You're shipping lede inside a constrained environment (Lambda cold-start, embedded, no-egress) — the 50 MB en_core_web_sm model has real cost.
  • You can't tolerate spaCy's transitive dependency graph (NumPy, Cython, blis, thinc, etc.) in your env.

Install

pip install lede-spacy
python -m spacy download en_core_web_sm

The first command pulls lede and spacy>=3.8,<3.9. The second pulls the ~50 MB en_core_web_sm 3.8.0 model. PyPI does not allow direct-URL dependencies, so the model is a separate install step (the same convention spaCy itself uses).

If you want a single reproducible install, pin the model wheel in requirements.txt:

lede-spacy==0.3.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

From source (in the lede repo):

pip install -e packages/lede-spacy
python -m spacy download en_core_web_sm
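
Because the model is a separate install step, it can be worth checking for it before the first extraction call. spaCy models install as ordinary Python packages, so a standard-library lookup is enough. This is a minimal sketch; the model_installed helper is a name invented for this example and is not part of lede-spacy.

```python
import importlib.util


def model_installed(package: str = "en_core_web_sm") -> bool:
    """Return True if the named top-level package is on the import path."""
    return importlib.util.find_spec(package) is not None


if not model_installed():
    print("en_core_web_sm is missing; run: python -m spacy download en_core_web_sm")
```

The check only confirms that the package is importable; it does not verify that the installed model version matches the spaCy version.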

Use

import lede_spacy            # side-effect: registers the "spacy" backend
from lede.extract import metadata, phrases, correlate_facts

# Per-call backend override:
m = metadata(text, backend="spacy")

# Or set the global default once:
import lede
lede.set_default_backend("auto")   # spaCy if registered, else regex

Pre-load the model once at startup to avoid the ~50 ms first-call model load:

from lede_spacy import warmup
warmup()
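
In a service that should keep running when lede-spacy or its model is missing, the warmup call can double as a startup probe. The sketch below is one possible pattern, not something the package prescribes: pick_backend is a hypothetical helper, and it assumes that a missing model surfaces from warmup() as the OSError that spacy.load raises.

```python
def pick_backend() -> str:
    """Return "spacy" if lede-spacy imports and warms up, else "regex"."""
    try:
        from lede_spacy import warmup  # the import also registers the backend
    except ImportError:
        return "regex"
    try:
        warmup()  # pays the model load once, at startup
    except OSError:
        # Assumption: a missing model raises OSError, as spacy.load does.
        return "regex"
    return "spacy"


BACKEND = pick_backend()
print(f"extraction backend: {BACKEND}")
```

The resulting name can then be passed as backend=BACKEND on each call.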

Performance

| call | latency (typical) |
|---|---|
| First call (cold model) | 50–80 ms |
| Subsequent calls (warm) | ~5 ms |
| Default lede regex backend (for comparison) | <1 ms |

Call warmup() (from lede_spacy import warmup) at app startup to pay the cold model load (50–80 ms in the table above) once instead of on the first user request.
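
These figures depend on hardware and input length, so it can be useful to measure them on your own chunks. The helper below is a small sketch using time.perf_counter; the measure name is invented for this example, and the demo call uses str.split as a stand-in workload so that it runs without lede installed.

```python
import time
from typing import Callable


def measure(fn: Callable[[str], object], text: str, warm_runs: int = 20) -> tuple[float, float]:
    """Return (first_call_ms, mean_warm_ms) for fn(text)."""
    start = time.perf_counter()
    fn(text)
    first_ms = (time.perf_counter() - start) * 1000.0

    start = time.perf_counter()
    for _ in range(warm_runs):
        fn(text)
    warm_ms = (time.perf_counter() - start) * 1000.0 / warm_runs
    return first_ms, warm_ms


# Stand-in workload; to measure the real backend, pass
# lambda t: metadata(t, backend="spacy") in a fresh process.
first_ms, warm_ms = measure(str.split, "Acme Corp signed in San Francisco.")
print(f"first call: {first_ms:.3f} ms, warm mean: {warm_ms:.3f} ms")
```

The first-call number only reflects the model load if nothing in the process has touched the spaCy backend yet.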

What's registered

When you import lede_spacy, three backends register themselves into lede.extract._backends:

| lede primitive | spaCy backend |
|---|---|
| metadata(text, backend="spacy") | runs regex dates/amounts/urls + spaCy NER for entities |
| phrases(text, backend="spacy") | doc.noun_chunks filtered to repeated multi-word chunks (matches the regex backend's count semantics) |
| correlate_facts(text, backend="spacy") | DepMatcher-based entity↔number pairing |

The regex backend stays the default — import lede_spacy is purely additive. Existing callers using backend="regex" (or no backend= kwarg) see no behavior change.
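
To make the "purely additive" claim concrete, here is a toy model of import-time registration. It is a sketch of the general pattern only: the registry layout, register, and resolve are assumptions for illustration and do not describe lede's real internals, and the "spacy" function is a fake that stands in for actual NER.

```python
from typing import Callable

Backend = Callable[[str], tuple[str, ...]]

# primitive name -> backend name -> implementation (hypothetical layout)
_backends: dict[str, dict[str, Backend]] = {"metadata": {}}


def register(primitive: str, name: str) -> Callable[[Backend], Backend]:
    """Decorator that stores an implementation under (primitive, name)."""
    def decorator(fn: Backend) -> Backend:
        _backends.setdefault(primitive, {})[name] = fn
        return fn
    return decorator


def resolve(primitive: str, backend: str = "regex") -> Backend:
    """Look up a backend; "auto" prefers spaCy when it has been registered."""
    registered = _backends[primitive]
    if backend == "auto":
        backend = "spacy" if "spacy" in registered else "regex"
    return registered[backend]


@register("metadata", "regex")
def regex_entities(text: str) -> tuple[str, ...]:
    return ()  # mirrors the regex backend returning no entities


auto_before = resolve("metadata", "auto")  # only regex is registered so far


@register("metadata", "spacy")
def fake_spacy_entities(text: str) -> tuple[str, ...]:
    # Stand-in, not real NER: treats capitalized words as entities.
    return tuple(word for word in text.split() if word[:1].isupper())


auto_after = resolve("metadata", "auto")
default_after = resolve("metadata")  # default is still the regex backend
```

Registering a second backend changes what "auto" resolves to, while callers that use the default keep getting the regex implementation.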

Determinism + parity

spaCy is deterministic per-version: en_core_web_sm 3.8.0 produces the same entities for the same input on every call. It is not on lede's Python ↔ Rust parity contract, by design. The Rust port has no spaCy equivalent and Metadata.entities stays empty under any Rust call. See docs/lede-spacy-integration.md for the cross-language policy.
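
Per-version determinism can be pinned down in a regression test by hashing the extracted entities and comparing digests. The sketch below shows one way to do that with the standard library. The fingerprint and is_stable names are hypothetical, and the extractor passed in the demo is a stand-in; in practice it would wrap metadata(text, backend="spacy").entities.

```python
import hashlib
import json
from typing import Callable, Sequence


def fingerprint(entities: Sequence[str]) -> str:
    """Return a SHA-256 hex digest of the entity sequence, order included."""
    payload = json.dumps(list(entities), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_stable(extract: Callable[[str], Sequence[str]], text: str, runs: int = 5) -> bool:
    """True if repeated calls on the same text all produce the same digest."""
    digests = {fingerprint(extract(text)) for _ in range(runs)}
    return len(digests) == 1


def stand_in_extract(text: str) -> tuple[str, ...]:
    # Stand-in extractor: capitalized words, in order of appearance.
    return tuple(word for word in text.split() if word[:1].isupper())


stable = is_stable(stand_in_extract, "Lin Wu signed in San Francisco")
print(stable)
```

Storing a digest next to the pinned model version means a changed digest after an upgrade points to changed model output rather than to a code regression.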

If you need NER from a Rust service today: call out to a Python lede-spacy worker, or to a hosted NER endpoint. A future lede-rust-ner companion crate is on the roadmap if there's demand — file an issue.
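
One simple shape for such a Python worker is a process that reads one JSON request per line on stdin and writes one JSON response per line on stdout, which a Rust service can spawn and talk to over pipes. The sketch below is an assumption about how this could look, not a protocol that lede defines: serve, the "text" and "entities" keys, and the stand-in extractor are all invented for illustration.

```python
import io
import json
from typing import Callable, Sequence, TextIO


def serve(extract: Callable[[str], Sequence[str]], stdin: TextIO, stdout: TextIO) -> None:
    """Answer each non-empty JSON request line with one JSON response line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        entities = list(extract(request["text"]))
        stdout.write(json.dumps({"entities": entities}) + "\n")
        stdout.flush()  # the caller is waiting on the pipe for each reply


def stand_in_extract(text: str) -> tuple[str, ...]:
    # Stand-in extractor; a real worker would wrap
    # metadata(text, backend="spacy").entities instead.
    return tuple(word for word in text.split() if word[:1].isupper())


# In-memory demo; a real worker would pass sys.stdin and sys.stdout.
requests = io.StringIO('{"text": "Berlin and Paris"}\n\n{"text": "no names here"}\n')
responses = io.StringIO()
serve(stand_in_extract, requests, responses)
print(responses.getvalue(), end="")
```

Keeping the worker alive across requests means the model load is paid once per process, not once per call.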

License

Apache-2.0, same as lede.
