Standalone, open-source EuroVoc classifier: tag any text into the EU's official subject space (multilingual, incl. Catalan). A modern successor to PyEuroVoc.
eurovoc
A standalone, open-source EuroVoc classifier. It tags any text into the EU's official subject space: 7,029 descriptors (IDs) rolled up to 127 microthesauri (MT) and 21 domains (DO), multilingual including Catalan.
It is a modern, open successor to the 2021 PyEuroVoc package (which fine-tuned a per-language BERT, truncated documents to 512 tokens, and no longer loads under transformers 5.x), and it is the natural home for a retrained long-context model.
How it works
The default backend is zero-shot retrieval: a multilingual sentence-transformers model (intfloat/multilingual-e5-base) embeds the input text and the EuroVoc descriptor labels (the label embeddings are computed once and cached), ranks the descriptors by cosine similarity, and returns the top K that pass a confidence gate. The gate prefers returning nothing over returning wrong tags, so proper-noun-heavy or off-topic text yields []. This is a port of Brubru's production classifier, not a trained multi-label model.
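The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the package's code: the function names, the thresholds (min_score, min_margin), and the exact gate rules are assumptions, and plain lists stand in for real embedding vectors so the sketch runs with the standard library alone.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_with_gate(text_vec, label_vecs, k=5, min_score=0.8, min_margin=0.05):
    """Return up to k (label_index, score) pairs, or [] if the gate rejects.

    Hypothetical gate (an assumption, not the package's rule):
    the best score must reach min_score and must beat the mean
    score over all labels by at least min_margin.
    """
    scores = [cosine(text_vec, v) for v in label_vecs]
    if not scores:
        return []
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    best = ranked[0][1]
    mean = sum(scores) / len(scores)
    if best < min_score or best - mean < min_margin:
        return []  # prefer no tags over wrong tags
    return [(i, s) for i, s in ranked[:k] if s >= min_score]


# Toy 3-dimensional "embeddings" for three labels.
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(top_k_with_gate([0.9, 0.1, 0.0], labels, k=2))  # one clear match
print(top_k_with_gate([1.0, 1.0, 1.0], labels, k=2))  # no label stands out
```

With real embeddings the same ranking is usually done as one matrix product over L2-normalised vectors; the gate logic stays the same.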
Install
pip install "brubru-eurovoc[local]" # [local] pulls sentence-transformers
The distribution is named brubru-eurovoc (the name eurovoc was already taken on PyPI), but the import name is eurovoc:
import eurovoc
The package itself only needs numpy; the [local] extra adds sentence-transformers, which loads the model. The first classify() call downloads the model (~1 GB) and computes the label-embedding matrix once, caching it under ~/.cache/eurovoc/.
Usage
import eurovoc
for d in eurovoc.classify("Markets in crypto-assets regulation"):
    print(d.label, "|", d.domain, d.domain_label, "|", round(d.score, 3))
    # financial instrument | 24 FINANCE | 0.88 ...
eurovoc.classify("Regulació de la protecció de dades personals") # multilingual
eurovoc.classify("xyzzy plugh") # -> [] (gate rejects noise)
Each result is a Descriptor(id, label, score, mt, domain, domain_label). The 21 domains are in eurovoc.DOMAINS.
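Because every result carries its domain, a common follow-up is to group tags by domain. The sketch below shows one way to do that. The Descriptor class here is a stand-in that only mirrors the six field names listed above; the field types, the group_by_domain helper, and the sample values (ids, microthesaurus codes, scores) are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Stand-in with the six documented field names; types are assumed."""
    id: str
    label: str
    score: float
    mt: str
    domain: str
    domain_label: str


def group_by_domain(descriptors):
    """Map (domain, domain_label) to its descriptors, best score first."""
    groups = defaultdict(list)
    for d in descriptors:
        groups[(d.domain, d.domain_label)].append(d)
    return {
        key: sorted(items, key=lambda d: d.score, reverse=True)
        for key, items in groups.items()
    }


# Illustrative data only; a real run would pass eurovoc.classify(...) output.
results = [
    Descriptor("id-1", "financial instrument", 0.88, "mt-a", "24", "FINANCE"),
    Descriptor("id-2", "financial market", 0.91, "mt-a", "24", "FINANCE"),
    Descriptor("id-3", "environmental policy", 0.86, "mt-b", "52", "ENVIRONMENT"),
]
for (code, name), items in group_by_domain(results).items():
    print(code, name, [d.label for d in items])
```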
Interop with the Brubru API
Enrich the raw descriptors the API (or the brubru SDK) returns:
import eurovoc
tags = eurovoc.from_descriptors(extract_item["eurovoc_descriptors"])
Or classify a live EU URL through Brubru's hosted extract engine (needs the brubru SDK and a key):
pip install "brubru-eurovoc[brubru]"
tags = eurovoc.classify_url("https://environment.ec.europa.eu/news_en", api_key="brubru_live_...")
Configuration (env vars)
EUROVOC_ST_MODEL, EUROVOC_TOPK, EUROVOC_MIN_SCORE, EUROVOC_MIN_MARGIN, EUROVOC_MIN_CLUSTER, EUROVOC_CACHE_DIR.
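A minimal sketch of overriding these variables from Python. The values are illustrative, not the package's defaults, and their accepted formats are assumptions. It is also not stated here whether the variables are read at import time or on each call, so the sketch sets them before the import, which works in either case.

```python
import os
import tempfile

# Environment values are always strings. The numbers below are examples,
# not documented defaults.
overrides = {
    "EUROVOC_TOPK": "3",
    "EUROVOC_MIN_SCORE": "0.85",
    "EUROVOC_CACHE_DIR": os.path.join(tempfile.gettempdir(), "eurovoc-cache"),
}
os.environ.update(overrides)

# Import only after the environment is set (requires the package installed):
# import eurovoc
# eurovoc.classify("Markets in crypto-assets regulation")
```

The same settings can be exported in the shell before starting Python instead.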
Tests
pip install -e '.[test,local]'
pytest -m "not model" # fast: enrichment, pruning, packaging (no model)
pytest -m model # loads the model and classifies real text
MIT licensed. Built by Beresol BV. EuroVoc is a trademark of the Publications Office of the European Union.