Standalone, open-source EuroVoc classifier: tag any text into the EU's official subject space (multilingual, incl. Catalan). A modern successor to PyEuroVoc.

eurovoc

A standalone, open-source EuroVoc classifier. It tags any text into the EU's official subject space: 7,029 descriptors (IDs) rolled up to 127 microthesauri (MT) and 21 domains (DO), multilingual including Catalan.

It is a modern, open successor to the 2021 PyEuroVoc package, which fine-tuned a per-language BERT, truncated documents to 512 tokens, and no longer loads under transformers 5.x. It is also the natural home for a retrained long-context model.

How it works

The default backend is zero-shot retrieval: a multilingual sentence-transformers model (intfloat/multilingual-e5-base) embeds the input text and the EuroVoc descriptor labels (the label embeddings are computed once and cached), then returns the top-K descriptors by cosine similarity, subject to a confidence gate. The gate prefers returning nothing over returning wrong tags, so proper-noun-heavy or off-topic text yields []. This is a port of Brubru's production classifier, not a trained multi-label model.
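The retrieve-then-gate idea can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the toy 3-dimensional vectors stand in for real model embeddings, and the names top_k_gated, k, and min_score are assumptions made for the example. The real gate may combine more signals than a single score threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_gated(query_vec, label_vecs, k=3, min_score=0.8):
    """Rank labels by cosine similarity; return [] if nothing clears the gate."""
    scored = sorted(
        ((cosine(query_vec, vec), label) for label, vec in label_vecs.items()),
        reverse=True,
    )
    # The gate: keep only confident matches, so weak queries yield [].
    return [(label, round(score, 3)) for score, label in scored[:k] if score >= min_score]

# Toy label "embeddings" (hypothetical; real ones come from the embedding model).
LABELS = {
    "financial instrument": [0.9, 0.1, 0.0],
    "data protection": [0.0, 0.9, 0.1],
    "water policy": [0.1, 0.0, 0.9],
}

print(top_k_gated([0.8, 0.2, 0.0], LABELS))   # on-topic query: only the closest label passes
print(top_k_gated([0.5, 0.5, 0.5], LABELS))   # ambiguous query: nothing clears the gate
```

Applying the threshold after ranking means a query that is equally close to everything is rejected outright rather than being given its least-bad label.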

Install

pip install "brubru-eurovoc[local]"      # [local] pulls sentence-transformers

The distribution is named brubru-eurovoc (the name eurovoc was already taken on PyPI), but the import name is eurovoc:

import eurovoc

The package itself only needs numpy; the [local] extra adds sentence-transformers, which runs the model. The first classify() call downloads the model (~1 GB) and computes the label-embedding matrix once, caching it under ~/.cache/eurovoc/.
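The compute-once-then-cache behaviour follows a common pattern, sketched below under stated assumptions. The helper load_or_build, the JSON file format, and the fake_embed_labels stand-in are all hypothetical; this page does not describe how the package actually stores its matrix.

```python
import json
import tempfile
from pathlib import Path

def load_or_build(cache_dir, name, build):
    """Return cached data for `name`, calling `build()` only on a cache miss."""
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = build()  # the expensive step, e.g. embedding every label
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data

calls = []

def fake_embed_labels():
    """Stand-in for the costly label-embedding pass (hypothetical)."""
    calls.append(1)
    return {"financial instrument": [0.9, 0.1, 0.0]}

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, "labels", fake_embed_labels)
    second = load_or_build(tmp, "labels", fake_embed_labels)  # served from disk
    print(first == second, len(calls))
```

The second call never reaches fake_embed_labels, which is why only the first classify() pays the embedding cost.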

Usage

import eurovoc

for d in eurovoc.classify("Markets in crypto-assets regulation"):
    print(d.label, "|", d.domain, d.domain_label, "|", round(d.score, 3))
# financial instrument | 24 FINANCE | 0.88 ...

eurovoc.classify("Regulació de la protecció de dades personals")  # multilingual
eurovoc.classify("xyzzy plugh")   # -> []  (gate rejects noise)

Each result is a Descriptor(id, label, score, mt, domain, domain_label). The 21 domains are in eurovoc.DOMAINS.
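Because every result carries its domain, results can be rolled up by domain with a few lines of plain Python. The sketch below uses a stand-in NamedTuple that mirrors the documented field names so it runs without the model. The field types, the sample IDs and MT codes, and the helper group_by_domain are assumptions, not part of the package.

```python
from collections import defaultdict
from typing import NamedTuple

class Descriptor(NamedTuple):
    """Stand-in mirroring the documented fields; not the package's class."""
    id: str
    label: str
    score: float
    mt: str
    domain: str
    domain_label: str

def group_by_domain(descriptors):
    """Map each domain label to its descriptor labels, best score first."""
    groups = defaultdict(list)
    for d in sorted(descriptors, key=lambda d: d.score, reverse=True):
        groups[d.domain_label].append(d.label)
    return dict(groups)

# Hypothetical results; IDs and MT codes are placeholders, not real EuroVoc values.
results = [
    Descriptor("id-1", "financial instrument", 0.88, "mt-a", "24", "FINANCE"),
    Descriptor("id-2", "financial market", 0.91, "mt-a", "24", "FINANCE"),
    Descriptor("id-3", "data protection", 0.86, "mt-b", "12", "LAW"),
]
print(group_by_domain(results))
```

With real output, the same function could take the list returned by eurovoc.classify(...) directly, provided the attribute names match those listed above.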

Interop with the Brubru API

Enrich the raw descriptors the API (or the brubru SDK) returns:

import eurovoc
tags = eurovoc.from_descriptors(extract_item["eurovoc_descriptors"])

Or classify a live EU URL through Brubru's hosted extract engine (needs the brubru SDK and a key):

pip install "brubru-eurovoc[brubru]"

import eurovoc
tags = eurovoc.classify_url("https://environment.ec.europa.eu/news_en", api_key="brubru_live_...")

Configuration (env vars)

EUROVOC_ST_MODEL, EUROVOC_TOPK, EUROVOC_MIN_SCORE, EUROVOC_MIN_MARGIN, EUROVOC_MIN_CLUSTER, EUROVOC_CACHE_DIR.
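Environment-based settings like these are usually read with a typed fallback. The sketch below shows that general pattern only: the helper env_setting and the default values 10 and 0.8 are placeholders invented for the example, and the meaning of each variable is inferred from its name. This page does not state when the package reads these variables, so setting them before importing eurovoc is the cautious choice.

```python
import os

def env_setting(name, default, cast=str):
    """Read `name` from the environment; fall back to `default` if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None or raw == "":
        return default
    try:
        return cast(raw)
    except ValueError:
        return default

# Simulate a user configuration: one valid value, one malformed value.
os.environ["EUROVOC_TOPK"] = "5"
os.environ["EUROVOC_MIN_SCORE"] = "not-a-number"

top_k = env_setting("EUROVOC_TOPK", 10, int)              # parsed from the environment
min_score = env_setting("EUROVOC_MIN_SCORE", 0.8, float)  # malformed, so the placeholder default is used
print(top_k, min_score)
```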

Tests

pip install -e '.[test,local]'
pytest -m "not model"     # fast: enrichment, pruning, packaging (no model)
pytest -m model           # loads the model and classifies real text

MIT licensed. Built by Beresol BV. EuroVoc is a trademark of the Publications Office of the European Union.
