spaCy-powered enrichment backend for lede — PERSON/ORG/GPE entity extraction via en_core_web_sm.

Project description

lede-spacy

spaCy-powered enrichment backend for lede. Adds proper named-entity recognition (PERSON / ORG / GPE) and richer phrase / fact extraction by registering itself as the "spacy" backend for lede.extract.metadata, lede.extract.phrases, and lede.extract.correlate_facts.

Aspect	regex backend (default in `lede`)	spaCy backend (this package)
Entities (PERSON / ORG / GPE)	always empty	populated from `en_core_web_sm`
Phrases	repeated multi-word n-grams	syntactically-grounded noun chunks (still count-filtered)
Correlate facts	regex pattern → entity↔number	dependency-parse → wider net of entity↔number relationships
Latency	sub-millisecond	~5 ms after warmup, ~50 ms first call (model load)
Determinism	byte-identical Python ↔ Rust	spaCy is deterministic but Python-only and not byte-comparable to Rust
Install footprint	stdlib only	`spacy>=3.8` + `en_core_web_sm` (~50 MB model)
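To make the "repeated multi-word n-grams" row concrete, here is a minimal stdlib sketch of count-filtered phrase mining. It is not lede's implementation: the function name `repeated_ngrams`, the tokenizer, and the thresholds are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def repeated_ngrams(text, n=2, min_count=2):
    """Return n-grams (as strings) that occur at least min_count times."""
    # Hypothetical tokenizer: lowercase alphanumeric runs.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    grams = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # Counter keeps first-insertion order, so results follow the text.
    return tuple(g for g, count in grams.items() if count >= min_count)

sample = "Yonk Labs builds tools. Acme licenses tools from Yonk Labs."
print(repeated_ngrams(sample))  # ('yonk labs',)
```

A phrase mentioned only once never passes the count filter. The same count semantics apply to the spaCy backend's noun chunks.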

When you actually want this

lede's regex backend covers the majority of structured-extract use cases for free — dates, amounts, URLs, numeric facts with sentence context, and repeated-phrase mining all work with zero dependencies. You want lede-spacy specifically when you need named entities — people, companies, places — pulled out of arbitrary text. That's where the regex backend explicitly returns nothing.

You also get richer correlate_facts (the dep-parser walks the syntax tree to find entity↔number relationships even for entities mentioned once, where the regex backend requires repetition).

Side-by-side: the same input, both backends

from lede.extract import metadata
import lede_spacy  # side effect: registers the 'spacy' backend

The text:

Acme Corp announced today a partnership with Yonk Labs to integrate deterministic summarization into their RAG pipeline. The deal, brokered by CEO Lin Wu and signed in San Francisco on 2024-11-15, covers $2.4M in annual licensing through 2027. Sarah Jones from Acme's engineering team and Marcus Chen from Yonk Labs will lead the joint integration. The first deployment is targeted for European customers, including teams in London, Berlin, and Paris.

`backend="regex"` (lede default — zero-dep)

m = metadata(text)
m.dates     # ('2024-11-15', '2027')
m.amounts   # ('$2.4M',)
m.urls      # ()
m.entities  # ()    ← regex backend returns nothing here

The regex backend caught the structured stuff (ISO date, year, dollar amount). It can't do entities — that's not a regex job.
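As a rough illustration of why dates, amounts, and URLs suit regexes, the sketch below pulls the same kinds of fields with the standard library. The patterns and the `structured_fields` helper are assumptions for this example. lede's actual patterns are not shown in this README and are likely more thorough.

```python
import re

# Illustrative patterns only, not lede's real ones.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# A bare year that is not part of a longer number or an ISO date.
YEAR = re.compile(r"(?<![\d-])(?:19|20)\d{2}(?![\d-])")
AMOUNT = re.compile(r"\$\d+(?:\.\d+)?[KMB]?")
URL = re.compile(r"https?://\S+")

def structured_fields(text):
    """Extract dates, dollar amounts, and URLs as tuples of strings."""
    dates = tuple(ISO_DATE.findall(text)) + tuple(YEAR.findall(text))
    return {
        "dates": dates,
        "amounts": tuple(AMOUNT.findall(text)),
        "urls": tuple(URL.findall(text)),
    }

snippet = ("The deal, signed in San Francisco on 2024-11-15, "
           "covers $2.4M in annual licensing through 2027.")
print(structured_fields(snippet))
# {'dates': ('2024-11-15', '2027'), 'amounts': ('$2.4M',), 'urls': ()}
```

Nothing comparable works for "Lin Wu" or "Berlin": names have no fixed surface shape, which is why entities need a trained model.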

`backend="spacy"` (this package)

m = metadata(text, backend="spacy")
m.dates     # ('2024-11-15', '2027')        — same regex stage runs
m.amounts   # ('$2.4M',)                    — same regex stage runs
m.urls      # ()                            — same regex stage runs
m.entities  # ('Acme Corp', 'Yonk Labs', 'RAG', 'Lin Wu',
            #  'San Francisco', 'Sarah Jones', 'Acme',
            #  'Marcus Chen', 'London', 'Berlin', 'Paris')

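The spaCy backend pulls 11 entities out of the same input: PERSON ('Lin Wu', 'Sarah Jones', 'Marcus Chen'), ORG ('Acme Corp', 'Yonk Labs', 'Acme', 'RAG'), and GPE ('San Francisco', 'London', 'Berlin', 'Paris'). Note that 'RAG' is tagged ORG even though the text uses it as a technical term. en_core_web_sm is a small model, so expect occasional mislabels like this. The dates, amounts, and URLs fields are unchanged because the spaCy backend runs the same regex stages and adds the spaCy NER stage on top.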

Use case: `correlate_facts` finds different relationships

For some inputs the two backends produce different but overlapping fact relationships. From the same paragraph above:

from lede.extract import correlate_facts

correlate_facts(text)                        # regex: 2 pairs anchored on Acme Corp
correlate_facts(text, backend="spacy")       # spaCy: 4 pairs, also catches Yonk Labs and customer churn

The dep-parser approach catches relationships the regex misses, especially for entities that appear once. (And vice versa — sometimes the regex backend catches a pattern the dep-parser doesn't. The two are complementary, not strictly better/worse. Switch backend per call if you need it.)
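Because the two backends are complementary, one option is to call both and take the union of their results. The sketch below assumes each result is a sequence of hashable pairs. This README does not show the real return shape of `correlate_facts`, so `merge_pairs` and the placeholder data are hypothetical.

```python
def merge_pairs(*results):
    """Union several backends' results, keeping first-seen order."""
    seen = set()
    merged = []
    for pairs in results:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return tuple(merged)

# Placeholder data standing in for two backends' outputs (assumed shape).
regex_pairs = (("Entity A", "10"), ("Entity A", "20"))
spacy_pairs = (("Entity A", "10"), ("Entity B", "30"))

print(merge_pairs(regex_pairs, spacy_pairs))
# (('Entity A', '10'), ('Entity A', '20'), ('Entity B', '30'))
```

With the real API this would be roughly `merge_pairs(correlate_facts(text), correlate_facts(text, backend="spacy"))`, provided the returned items are hashable.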

Use this when…

- Your callers want PERSON / ORG / GPE entities in the output. Default lede can't help.
- You want richer entity-number correlations on documents where each entity is mentioned once.
- You're already running spaCy in your pipeline and want to consolidate.
- Your latency budget is in the 5–50 ms range per chunk, not sub-millisecond.

Don't use this when…

- You need byte-identical Python ↔ Rust output. spaCy is Python-only and isn't on the parity contract.
- You're on a sub-millisecond hot path. spaCy is ~5 ms per call after warmup, ~50 ms first call.
- You don't actually need entities. The default lede regex backend already handles dates / amounts / URLs / numeric facts with sentence context.
- You're shipping lede inside a constrained environment (Lambda cold-start, embedded, no-egress), where the 50 MB en_core_web_sm model has real cost.
- You can't tolerate spaCy's transitive dependency graph (NumPy, Cython, blis, thinc, etc.) in your env.

Install

pip install lede-spacy
python -m spacy download en_core_web_sm

The first command pulls lede and spacy>=3.8,<3.9. The second pulls the ~50 MB en_core_web_sm 3.8.0 model. PyPI does not allow direct-URL dependencies, so the model is a separate install step (the same convention spaCy itself uses).

If you want a single reproducible install, pin the model wheel in requirements.txt:

lede-spacy==0.3.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

From source (in the lede repo):

pip install -e packages/lede-spacy
python -m spacy download en_core_web_sm

Use

import lede_spacy            # side-effect: registers the "spacy" backend
from lede.extract import metadata, phrases, correlate_facts

# Per-call backend override:
m = metadata(text, backend="spacy")

# Or set the global default once:
import lede
lede.set_default_backend("auto")   # spaCy if registered, else regex

Pre-load the model once at startup to avoid the ~50 ms first-call model load:

from lede_spacy import warmup
warmup()
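If lede-spacy is an optional dependency in your deployment, you may prefer to choose the backend name explicitly instead of relying on "auto". The helper below is a hypothetical sketch, not part of lede. It only checks whether a module can be found, using the standard library's `importlib.util.find_spec`.

```python
import importlib.util

def pick_backend(module_name="lede_spacy"):
    """Return "spacy" if the plugin module is importable, else "regex"."""
    # find_spec locates a top-level module without importing it.
    found = importlib.util.find_spec(module_name) is not None
    return "spacy" if found else "regex"

print(pick_backend("json"))                     # spacy (module exists)
print(pick_backend("no_such_module_for_demo"))  # regex
```

Since `find_spec` does not import anything, the caller still needs `import lede_spacy` before passing `backend="spacy"`, because registration happens as an import side effect.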

Performance

Call	Latency (typical)
First call (cold model)	50–80 ms
Subsequent calls (warm)	~5 ms
Default lede regex backend (for comparison)	<1 ms

Run `from lede_spacy import warmup; warmup()` at app startup to pay the model-load cost once instead of on the first user request.
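To check these numbers on your own hardware, a small timing harness is enough. The sketch below measures the first call separately from later calls. `fake_extract` is a stand-in that simulates a one-time model load so the example runs without spaCy, and `time_calls` is a hypothetical helper name.

```python
import time

def time_calls(fn, arg, repeats=5):
    """Return (first_call_ms, mean_later_ms) for fn(arg)."""
    start = time.perf_counter()
    fn(arg)
    first_ms = (time.perf_counter() - start) * 1000.0

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        later.append((time.perf_counter() - start) * 1000.0)
    return first_ms, sum(later) / len(later)

# Stand-in for a backend with a lazy one-time load.
_cache = {}

def fake_extract(text):
    if "model" not in _cache:
        time.sleep(0.05)  # simulated model load
        _cache["model"] = True
    return text.split()

first_ms, warm_ms = time_calls(fake_extract, "Acme Corp signed in Paris")
print(f"first: {first_ms:.1f} ms, warm: {warm_ms:.3f} ms")
```

To measure the real backend, pass `lambda t: metadata(t, backend="spacy")` as `fn` in a fresh process, so the model has not already been loaded.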

What's registered

When you import lede_spacy, three backends register themselves into lede.extract._backends:

lede primitive	spaCy backend
`metadata(text, backend="spacy")`	runs regex `dates`/`amounts`/`urls` + spaCy NER for `entities`
`phrases(text, backend="spacy")`	`doc.noun_chunks` filtered to repeated multi-word chunks (matches the regex backend's count semantics)
`correlate_facts(text, backend="spacy")`	DepMatcher-based entity↔number pairing

The regex backend stays the default — import lede_spacy is purely additive. Existing callers using backend="regex" (or no backend= kwarg) see no behavior change.
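The "purely additive" behavior follows naturally from a name-keyed registry. The toy below is an assumption-laden sketch of that pattern: `register`, `resolve`, and the dictionary layout are illustrative and do not describe the real contents of lede.extract._backends.

```python
# Toy registry; names here are illustrative, not lede's internals.
_backends = {"metadata": {}}

def register(primitive, name, fn):
    """Record fn as the implementation of `primitive` for backend `name`."""
    _backends[primitive][name] = fn

def resolve(primitive, backend="regex"):
    """Look up a backend; "auto" prefers spaCy when it has been registered."""
    table = _backends[primitive]
    if backend == "auto":
        backend = "spacy" if "spacy" in table else "regex"
    if backend not in table:
        raise ValueError(
            f"backend {backend!r} is not registered for {primitive!r}"
        )
    return table[backend]

# The core package registers its own regex backend.
register("metadata", "regex", lambda text: "regex result")
before = resolve("metadata", "auto")("some text")   # no plugin yet

# What a plugin's import-time side effect would amount to.
register("metadata", "spacy", lambda text: "spacy result")
after = resolve("metadata", "auto")("some text")
default = resolve("metadata")("some text")          # default is unchanged

print(before, "|", after, "|", default)
# regex result | spacy result | regex result
```

Registering a second name never touches the existing "regex" entry, which is why callers that omit `backend=` see no change.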

Determinism + parity

spaCy is deterministic per-version: en_core_web_sm 3.8.0 produces the same entities for the same input on every call. It is not on lede's Python ↔ Rust parity contract, by design. The Rust port has no spaCy equivalent and Metadata.entities stays empty under any Rust call. See docs/lede-spacy-integration.md for the cross-language policy.

If you need NER from a Rust service today: call out to a Python lede-spacy worker, or to a hosted NER endpoint. A future lede-rust-ner companion crate is on the roadmap if there's demand — file an issue.

License

Apache-2.0, same as lede.

Project details

Maintainers

TheYonk


Release history

This version

0.3.0

Apr 28, 2026

Download files

Source Distribution

lede_spacy-0.3.0.tar.gz (15.9 kB)

Uploaded Apr 28, 2026 Source

Built Distribution

lede_spacy-0.3.0-py3-none-any.whl (15.7 kB)

Uploaded Apr 28, 2026 Python 3

File details

Details for the file lede_spacy-0.3.0.tar.gz.

File metadata

Download URL: lede_spacy-0.3.0.tar.gz
Upload date: Apr 28, 2026
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lede_spacy-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`4a153c3c6446250ae7b52ac99f5dfbb2369a963f7a28e06f1cb94d01774ebbff`
MD5	`5864b520395a63bdc6f32156f0e1b056`
BLAKE2b-256	`b4e58988c5ea4cdab32e850533bd1fafc1bab7fb03a6fe490a0128cd95289a77`
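A downloaded archive can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the file has already been downloaded to a local path.

```python
import hashlib

# SHA256 digest listed above for lede_spacy-0.3.0.tar.gz.
EXPECTED = "4a153c3c6446250ae7b52ac99f5dfbb2369a963f7a28e06f1cb94d01774ebbff"

def sha256_of(path, chunk_size=1 << 16):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected=EXPECTED):
    """Return True when the file's SHA256 matches the expected digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("lede_spacy-0.3.0.tar.gz")` should return True for an intact download in the current directory. pip can enforce the same check at install time through its hash-checking mode.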


Provenance

The following attestation bundles were made for lede_spacy-0.3.0.tar.gz:

Publisher: publish-pypi.yml on yonk-labs/lede

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lede_spacy-0.3.0.tar.gz
- Subject digest: 4a153c3c6446250ae7b52ac99f5dfbb2369a963f7a28e06f1cb94d01774ebbff
- Sigstore transparency entry: 1396671573
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: yonk-labs/lede@ec381a8bacdb7ac7fc8bdf137da1ebea58607b5f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yonk-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@ec381a8bacdb7ac7fc8bdf137da1ebea58607b5f
- Trigger Event: workflow_dispatch

File details

Details for the file lede_spacy-0.3.0-py3-none-any.whl.

File metadata

Download URL: lede_spacy-0.3.0-py3-none-any.whl
Upload date: Apr 28, 2026
Size: 15.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lede_spacy-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51ddeb020138184527302c4d2a08948fe475864e9f56a9482e2a6eb3a4367f0b`
MD5	`81fec96b2beb3308db4a94e88c098504`
BLAKE2b-256	`ec7d41d5a2ce405e78c3ded0946ea15328dd3b2425af7df8fc68461d3c0a8131`


Provenance

The following attestation bundles were made for lede_spacy-0.3.0-py3-none-any.whl:

Publisher: publish-pypi.yml on yonk-labs/lede

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lede_spacy-0.3.0-py3-none-any.whl
- Subject digest: 51ddeb020138184527302c4d2a08948fe475864e9f56a9482e2a6eb3a4367f0b
- Sigstore transparency entry: 1396671581
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: yonk-labs/lede@ec381a8bacdb7ac7fc8bdf137da1ebea58607b5f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yonk-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@ec381a8bacdb7ac7fc8bdf137da1ebea58607b5f
- Trigger Event: workflow_dispatch
