A modern EUR-Lex parser for Python - fetch and parse EU legal documents

eurlxp

A modern EUR-Lex parser for Python. Fetch and parse EU legal documents with async support, type hints, and a CLI.

Note: This is a modern rewrite inspired by kevin91nl/eurlex, built with UV, httpx, Pydantic, and Typer.

Features

Modern Python - Supports Python 3.10-3.14
Async support - Fetch multiple documents concurrently
Type hints - Full type annotations for IDE support
CLI - Command-line interface with Typer
Pydantic models - Validated, structured data
Drop-in compatible - Same API as the original eurlex package
Bot detection handling - Browser-like headers and WAF challenge detection
Rate limiting - Configurable delays between requests
SPARQL support - Alternative data source that bypasses HTML scraping
PDF extraction - Automatic text extraction from PDF for older documents without HTML

Installation

# Using pip
pip install eurlxp

# Using uv
uv add eurlxp

# With SPARQL support (required for get_celex_dataframe, run_query, get_regulations, etc.)
pip install eurlxp[sparql]
# or
uv add eurlxp[sparql]

Note: SPARQL functions (get_celex_dataframe, run_query, get_regulations, get_documents, guess_celex_ids_via_eurlex) require the optional sparql dependencies. If you see ImportError: SPARQL dependencies not installed, install with pip install eurlxp[sparql].

PDF extraction: Included by default (no extra install needed). Older documents without HTML are automatically extracted from PDF.

How It Works

This package fetches EU legal documents from EUR-Lex using their public HTML endpoints:

https://eur-lex.europa.eu/legal-content/{LANG}/TXT/HTML/?uri=CELEX:{CELEX_ID}
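As a rough illustration, the URL template above can be filled in with a few lines of Python. `build_eurlex_url` and `EURLEX_HTML_URL` are hypothetical names written for this example; they are not part of the eurlxp API, and the library may build its URLs differently.

```python
# Template taken from the endpoint shown above.
EURLEX_HTML_URL = (
    "https://eur-lex.europa.eu/legal-content/{lang}/TXT/HTML/?uri=CELEX:{celex_id}"
)


def build_eurlex_url(celex_id: str, language: str = "en") -> str:
    """Fill in the public HTML endpoint template (hypothetical helper)."""
    # The examples in this README use upper-case language codes (EN, DE).
    return EURLEX_HTML_URL.format(lang=language.upper(), celex_id=celex_id)


print(build_eurlex_url("32019R0947", "de"))
```

The printed URL is the same one used in the German curl example.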

You can verify this manually with curl:

# Fetch a regulation (EU Drone Regulation 2019/947)
curl -s "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019R0947" | head -50

# Or with a different language (German)
curl -s "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:32019R0947" | head -50

The equivalent using this package's CLI:

# Fetch as HTML
uvx eurlxp fetch 32019R0947 --format html | head -50

# Fetch and parse to JSON
uvx eurlxp fetch 32019R0947 --format json | head -30

# Fetch and parse to CSV
uvx eurlxp fetch 32019R0947 --format csv | head -10

# Get document info (shows row count, articles, etc.)
uvx eurlxp info 32019R0947

Quick Start

from eurlxp import get_html_by_celex_id, parse_html, WAFChallengeError

# Fetch and parse a regulation
celex_id = "32019R0947"
try:
    html = get_html_by_celex_id(celex_id)
    df = parse_html(html)

    # Get Article 1
    df_article_1 = df[df.article == "1"]
    print(df_article_1.iloc[0].text)
    # "This Regulation lays down detailed provisions for the operation of unmanned aircraft systems..."
except WAFChallengeError:
    print("Bot detection triggered - try using SPARQL functions instead")

Async Usage

import asyncio
from eurlxp import AsyncEURLexClient, parse_html

async def fetch_documents():
    # Use rate limiting to avoid bot detection
    async with AsyncEURLexClient(request_delay=2.0) as client:
        # Fetch multiple documents concurrently
        docs = await client.fetch_multiple(["32019R0947", "32019R0945"])
        for celex_id, html in docs.items():
            df = parse_html(html)
            print(f"{celex_id}: {len(df)} rows")

asyncio.run(fetch_documents())
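For readers who want to see what concurrent fetching looks like underneath, here is a minimal sketch of bounded concurrency with plain asyncio. It does not use eurlxp at all: `fake_fetch`, `fetch_all`, and `max_concurrency` are assumptions made up for this example, and `fake_fetch` only returns placeholder HTML instead of calling EUR-Lex.

```python
import asyncio


async def fake_fetch(celex_id: str) -> str:
    """Stand-in for a network call; returns placeholder HTML."""
    await asyncio.sleep(0.01)
    return f"<html><body>{celex_id}</body></html>"


async def fetch_all(celex_ids: list[str], max_concurrency: int = 2) -> dict[str, str]:
    """Fetch several documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(celex_id: str) -> tuple[str, str]:
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return celex_id, await fake_fetch(celex_id)

    pairs = await asyncio.gather(*(fetch_one(cid) for cid in celex_ids))
    return dict(pairs)


docs = asyncio.run(fetch_all(["32019R0947", "32019R0945"]))
print(sorted(docs))
```

Capping concurrency is one way to stay polite towards a remote server; whether `AsyncEURLexClient` does this internally is not stated here.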

Handling Bot Detection

EUR-Lex uses AWS WAF (Web Application Firewall) with JavaScript challenges to detect automated requests. This cannot be bypassed in pure Python because it requires JavaScript execution to solve a cryptographic puzzle. The library provides several strategies:

from eurlxp import EURLexClient, ClientConfig, WAFChallengeError, get_html_by_celex_id

# Strategy 1: Automatic SPARQL fallback (recommended)
# When WAF blocks HTML scraping, automatically fetch metadata via SPARQL
config = ClientConfig(sparql_fallback=True)
with EURLexClient(config=config) as client:
    html = client.get_html_by_celex_id("32019R0947")  # Falls back to SPARQL if blocked

# Strategy 2: Use rate limiting to avoid triggering WAF
with EURLexClient(request_delay=2.0) as client:  # 2 second delay between requests
    html = client.get_html_by_celex_id("32019R0947")

# Strategy 3: Use custom configuration
config = ClientConfig(
    request_delay=3.0,           # Delay between requests
    use_browser_headers=True,    # Use browser-like headers (default)
    referer="https://eur-lex.europa.eu/",  # Add referer header
)
with EURLexClient(config=config) as client:
    html = client.get_html_by_celex_id("32019R0947")

# Strategy 4: Handle WAF challenges manually
try:
    html = get_html_by_celex_id("32019R0947")
except WAFChallengeError:
    # Fall back to SPARQL manually
    from eurlxp import get_documents
    docs = get_documents(types=["REG"], limit=10)

# Strategy 5: Disable WAF exception (get raw challenge HTML)
config = ClientConfig(raise_on_waf=False)
with EURLexClient(config=config) as client:
    html = client.get_html_by_celex_id("32019R0947")  # Returns challenge HTML if blocked

Why can't we bypass WAF in Python? AWS WAF requires a real browser to execute JavaScript that solves a cryptographic challenge and sets a cookie. HTTP libraries like httpx can't execute JavaScript. For browser automation, consider Playwright or Selenium, but SPARQL is the cleaner solution.
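To make the idea behind `request_delay` concrete, the following is a minimal sketch of a throttle that enforces a minimum gap between consecutive requests. `RequestThrottle` is a hypothetical class written for this example; it is not how eurlxp implements rate limiting, which is not documented here.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests (illustrative only)."""

    def __init__(
        self,
        request_delay: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.request_delay = request_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Pause until request_delay has passed since the last call; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            remaining = self.request_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept


throttle = RequestThrottle(request_delay=0.05)
for celex_id in ["32019R0947", "32019R0945"]:
    throttle.wait()  # the first call returns immediately, later calls may pause
    print(f"would fetch {celex_id} now")
```

The clock and sleep functions are injectable so the behaviour can be checked without real waiting.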

Using SPARQL (Recommended for Bulk Data)

The SPARQL endpoint (https://publications.europa.eu/webapi/rdf/sparql) doesn't trigger bot detection and is ideal for bulk operations. It's the recommended approach when HTML scraping is blocked.

from eurlxp import get_documents, get_regulations, run_query, guess_celex_ids_via_eurlex

# Convert slash notation to CELEX ID (uses SPARQL, not HTML scraping)
celex_ids = guess_celex_ids_via_eurlex("2019/947")
# Returns: ['32019R0947']

# Get list of regulations (returns CELLAR IDs)
cellar_ids = get_regulations(limit=100)

# Get documents with metadata
docs = get_documents(types=["REG", "DIR"], limit=50)
for doc in docs:
    print(f"{doc['celex']}: {doc['date']} - {doc['type']}")

# Run custom SPARQL queries
results = run_query("""
    SELECT ?doc ?celex WHERE {
        ?doc cdm:resource_legal_id_celex ?celex .
    } LIMIT 10
""")

SPARQL functions include automatic retry with exponential backoff for handling temporary 503 errors:

from eurlxp import run_query, SPARQLServiceError

# Any SPARQL query string; this one is only a placeholder
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

try:
    # Automatic retry: 3 attempts with 2s, 4s, 8s delays
    results = run_query(query)

    # Or customize retry behavior
    results = run_query(query, max_retries=5, retry_delay=3.0)
except SPARQLServiceError as e:
    print(f"SPARQL endpoint unavailable: {e}")

Note: SPARQL functions require pip install eurlxp[sparql]
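The retry schedule mentioned above (2s, 4s, 8s) is ordinary exponential backoff. Here is a small sketch of how such a schedule can be computed and applied. `backoff_delays` and `call_with_retry` are hypothetical helpers, `ConnectionError` stands in for a transient 503, and the sketch assumes `max_retries` counts retries after the first attempt; the library's own behaviour may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(
    max_retries: int = 3, retry_delay: float = 2.0, backoff: float = 2.0
) -> list[float]:
    """Return the wait before each retry: retry_delay * backoff**attempt."""
    return [retry_delay * backoff**attempt for attempt in range(max_retries)]


def call_with_retry(
    func: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 2.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with exponentially growing waits."""
    for delay in backoff_delays(max_retries, retry_delay, backoff):
        try:
            return func()
        except ConnectionError:
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return func()


print(backoff_delays())  # [2.0, 4.0, 8.0]
```

With `max_retries=5` and `retry_delay=3.0`, the same formula gives waits of 3, 6, 12, 24, and 48 seconds.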

Fetching Documents by Date (Bulk Downloads)

The most reliable way to bulk download EUR-Lex documents is to query by date range, which returns both the document IDs and direct cellar URLs:

from eurlxp import get_ids_and_urls_via_date, get_html_by_cellar_url, parse_html, DateType

# Get documents published on a specific date
docs = get_ids_and_urls_via_date("2026-01-15")

# Or find documents MODIFIED in a date range (catches updates to old documents)
docs = get_ids_and_urls_via_date(
    "2026-01-01", "2026-01-31",
    date_type=DateType.MODIFIED
)

# Process each document
for doc in docs:
    print(f"ID: {doc.raw_id}")
    print(f"Valid CELEX: {doc.celex_id}")  # None if format is non-standard
    print(f"Cellar URL: {doc.cellar_url}")  # Always works for fetching

    # Fetch using the cellar URL (always works)
    html = get_html_by_cellar_url(doc.cellar_url)
    df = parse_html(html)

Date type options:

DateType.DOCUMENT (default) - Publication date
DateType.MODIFIED - Last modification date (finds amendments to old documents)
DateType.CREATED - Creation date in CELLAR
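When downloading a long period, it may be convenient to walk the range one day at a time, for example to checkpoint progress between days. The sketch below only generates the date strings; `iter_days` is a hypothetical helper, and the assumption that dates are passed as ISO strings comes from the examples above.

```python
from datetime import date, timedelta
from typing import Iterator


def iter_days(start: str, end: str) -> Iterator[str]:
    """Yield every ISO date (YYYY-MM-DD) from start to end, inclusive."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        yield current.isoformat()
        current += timedelta(days=1)


for day in iter_days("2026-01-30", "2026-02-01"):
    # Each value could be passed to get_ids_and_urls_via_date(day).
    print(day)
```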

Understanding Document Identifiers

EUR-Lex uses several identifier formats. This package handles them all:

Format	Example	Description
CELEX ID	`32019R0947`	Standard format: `[sector][year][type][number]`
CELEX with suffix	`32012L0029R(06)`	CELEX + corrigendum indicator
Cellar URL	`http://publications.europa.eu/resource/cellar/abc123`	Direct URL (always works)
Cellar ID	`cellar:abc-123-def` or `abc-123-def`	UUID-based identifier
OJ Reference	`C/2026/00064`	Official Journal reference (not a CELEX)

from eurlxp import detect_id_type, get_html, fetch_documents, parse_celex_id

# Detect identifier type
detect_id_type("32019R0947")  # Returns: "celex"
detect_id_type("http://publications.europa.eu/resource/cellar/abc")  # Returns: "cellar_url"
detect_id_type("C/2026/00064")  # Returns: "oj_reference"

# Parse CELEX ID into components
parse_celex_id("32019R0947")
# Returns: {'sector': '3', 'year': '2019', 'doc_type': 'R', 'number': '0947', 'suffix': None}

# Fetch a document by any identifier type (auto-detects)
html = get_html("32019R0947")  # CELEX
html = get_html("http://publications.europa.eu/resource/cellar/abc123")  # URL
html = get_html("C/2026/00064")  # OJ reference - uses SPARQL to find cellar URL

# Batch fetch documents with mixed identifier types
results = fetch_documents([
    "32019R0947",  # CELEX
    "http://publications.europa.eu/resource/cellar/abc123",  # URL
    "C/2026/00064",  # OJ reference - looked up via SPARQL
])

CELEX ID structure:

Sector (1 char): 3 = legislation, 5 = preparatory docs, 6 = case law, etc.
Year (4 digits): Publication year
Type (1-3 chars): R = regulation, L = directive, D = decision, etc.
Number (2-5 digits): Document number

See the official EUR-Lex documentation for complete details.
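The structure listed above can be approximated with a regular expression. This is a simplified sketch, not the library's `parse_celex_id`: `split_celex_id` and `CELEX_PATTERN` are hypothetical names, and the pattern only encodes the four components and lengths given in this section, treating anything after the number as a suffix.

```python
import re
from typing import Optional

# sector (1 char), year (4 digits), type (1-3 letters), number (2-5 digits), optional suffix
CELEX_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])"
    r"(?P<year>\d{4})"
    r"(?P<doc_type>[A-Z]{1,3})"
    r"(?P<number>\d{2,5})"
    r"(?P<suffix>.+)?$"
)


def split_celex_id(celex_id: str) -> Optional[dict]:
    """Split a CELEX ID into its components, or return None if it does not match."""
    match = CELEX_PATTERN.match(celex_id)
    return match.groupdict() if match else None


print(split_celex_id("32019R0947"))
print(split_celex_id("32012L0029R(06)"))
print(split_celex_id("C/2026/00064"))  # an OJ reference, so no match
```

Real CELEX numbers have more variants than this pattern covers, so the official documentation remains the reference.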

CLI Usage

# Fetch a document
eurlxp fetch 32019R0947 -o regulation.html

# Parse and convert to CSV
eurlxp fetch 32019R0947 -f csv -o regulation.csv

# Get document info
eurlxp info 32019R0947

# Convert slash notation to CELEX ID
eurlxp celex 2019/947
# Output: 32019R0947
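The conversion shown above amounts to string assembly: sector, year, document type, then the number padded to four digits. Here is a minimal sketch under the assumption that the slash notation is written year first, as in `2019/947`. `slash_to_celex` is a hypothetical helper, not the library's `get_celex_id`, and it does not handle acts cited number first.

```python
def slash_to_celex(slash_notation: str, document_type: str = "R", sector_id: str = "3") -> str:
    """Build a CELEX ID from 'year/number' notation (illustrative only)."""
    parts = slash_notation.strip().split("/")
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected 'year/number', got {slash_notation!r}")
    year, number = parts
    # Zero-pad the document number to four digits, e.g. 947 -> 0947.
    return f"{sector_id}{year}{document_type}{int(number):04d}"


print(slash_to_celex("2019/947"))
print(slash_to_celex("2012/29", document_type="L"))
```

The document type cannot be derived from the slash notation alone, which is presumably why the package also offers `get_possible_celex_ids` and a SPARQL-based lookup.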

API Reference

Functions

Function	Description
`get_html(identifier, language="en")`	Fetch HTML by any identifier (auto-detects type, uses SPARQL fallback)
`get_html_by_celex_id(celex_id, language="en")`	Fetch HTML by CELEX ID
`get_html_by_cellar_id(cellar_id, language="en")`	Fetch HTML by CELLAR ID
`get_html_by_cellar_url(cellar_url)`	Fetch HTML by cellar URL
`fetch_documents(identifiers, language="en", on_error="skip")`	Batch fetch documents (uses SPARQL fallback)
`detect_id_type(identifier)`	Detect identifier type
`lookup_cellar_url(identifier)`	Look up cellar URL for any identifier via SPARQL
`parse_html(html)`	Parse HTML to DataFrame
`get_celex_id(slash_notation, document_type="R", sector_id="3")`	Convert slash notation to CELEX ID
`get_possible_celex_ids(slash_notation)`	Get all possible CELEX IDs
`parse_celex_id(celex_id)`	Parse CELEX ID into components
`is_valid_celex_id(celex_id)`	Check if string is valid CELEX format
`get_ids_and_urls_via_date(from_date, to_date, date_type)`	Get document refs by date range

Classes

Class	Description
`EURLexClient`	Synchronous HTTP client with rate limiting and WAF detection
`AsyncEURLexClient`	Asynchronous HTTP client with rate limiting and WAF detection
`ClientConfig`	Configuration dataclass for client behavior
`WAFChallengeError`	Exception raised when bot detection is triggered
`SPARQLServiceError`	Exception raised when the SPARQL endpoint is unavailable

ClientConfig Options

Option	Type	Default	Description
`timeout`	float	30.0	Request timeout in seconds
`headers`	dict	None	Custom headers to merge with defaults
`request_delay`	float	0.0	Delay between requests (rate limiting)
`use_browser_headers`	bool	True	Use browser-like headers to avoid detection
`referer`	str	None	Optional referer header
`raise_on_waf`	bool	True	Raise exception on WAF challenge
`sparql_fallback`	bool	True	Auto-fallback to SPARQL when WAF blocks requests
`max_retries`	int	3	Max retry attempts for transient HTTP errors (500/502/503/504)
`retry_delay`	float	2.0	Initial delay between retries (seconds)
`retry_backoff`	float	2.0	Exponential backoff multiplier

DataFrame Columns

Column	Description
`text`	The text content
`type`	Content type (text, link, etc.)
`document`	Document title
`article`	Article number
`article_subtitle`	Article subtitle
`paragraph`	Paragraph number
`group`	Group heading
`section`	Section heading
`ref`	Reference path (e.g., `["(1)", "(a)"]`)

Development

# Clone the repository
git clone https://github.com/morrieinmaas/eurlxp.git
cd eurlxp

# Install with dev dependencies
just dev

# Run tests
just test-unit

# Run all checks (lint + type check)
just check

# Format code
just format

# Run live tests with real documents (all ID formats)
just test-live

# See all available commands
just --list

Publishing to PyPI

# Build the package
just build

# Publish to PyPI (requires PYPI_TOKEN)
just publish

License

MIT License - see LICENSE for details.

Credits

Inspired by kevin91nl/eurlex.

Release history

Version	Release date
0.6.0 (this version)	Mar 16, 2026
0.5.0	Feb 13, 2026
0.4.1	Feb 10, 2026
0.4.0	Feb 4, 2026
0.3.3	Jan 14, 2026
0.3.2	Jan 14, 2026
0.3.1	Jan 14, 2026
0.3.0	Jan 14, 2026
0.2.5	Jan 8, 2026
0.2.4	Jan 8, 2026
0.2.3	Jan 8, 2026
0.2.2	Jan 8, 2026
0.2.1	Jan 8, 2026
0.2.0	Jan 8, 2026
0.1.0	Jan 8, 2026

Download files

Source Distribution

eurlxp-0.6.0.tar.gz (29.3 kB)

Uploaded Mar 16, 2026 Source

Built Distribution

eurlxp-0.6.0-py3-none-any.whl (31.6 kB)

Uploaded Mar 16, 2026 Python 3

File details

Details for the file eurlxp-0.6.0.tar.gz.

File metadata

Download URL: eurlxp-0.6.0.tar.gz
Upload date: Mar 16, 2026
Size: 29.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.10 (uv publish, Ubuntu 24.04 noble, CI)

File hashes

Hashes for eurlxp-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`3d1355af11a43a834b74bc3368d922aa4f365172af79cf91f070783321fa910f`
MD5	`0143dc402bd0cf6d6c54ca55cd892e85`
BLAKE2b-256	`9f246b92b5d205ddb288cf56ddc2fb96fa200def8c61dc61ef6114d2ec1a32b0`

File details

Details for the file eurlxp-0.6.0-py3-none-any.whl.

File metadata

Download URL: eurlxp-0.6.0-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 31.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.10 (uv publish, Ubuntu 24.04 noble, CI)

File hashes

Hashes for eurlxp-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`71caec63b41125ab35910a03b7d3889eb07e16c3ac7af84f4e85d27da4823730`
MD5	`6a75b5900266ababe02935ff0e96cf94`
BLAKE2b-256	`3f65eb8ad8312cd86def9625e1d2bc7bc35bcf757b08c1729adea2b5122715fa`
