Cellar Extractor
A Python library for extracting CELLAR case law data from EUR-Lex.
This library contains functions to get CELLAR case law data from the EUR-Lex SPARQL endpoint and enrich it with additional information from InfoCuria and CELLAR item sources.
Version
Python 3.9+
Contributors
- Pranav Bapat
- Piotr Lewandowski
- shashankmc
- gijsvd
- venvis
- davidwickerhf
How to install?
```
pip install cellar-extractor
```
What The Project Does
cellar-extractor builds enriched EUR-Lex / CELLAR case-law datasets.
It starts from CELLAR metadata and then enriches:
- citation edges
- summaries and keywords
- full text
- sector-specific metadata
- graph-ready node/edge projections
The extractor is currently centered on:
- sector 6 case law: CJEU-style material via InfoCuria
- sector 8 case law: mixed / national-case-law material via CELLAR RDF + item downloads
The main workflow has two stages.
- `get_cellar(...)` fetches the base CELLAR corpus and returns CSV-like dataframe output or JSON-like dictionary output
- `get_cellar_extra(...)` enriches that corpus with citations, full text, summaries, keywords, provenance, and missing-data flags
The citation graph is now extracted through the public CELLAR SPARQL endpoint. Legacy EUR-Lex SOAP webservice support is kept only for validation tests and is not part of the production path anymore.
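Citation edges are pulled with SPARQL queries against the public endpoint. As a rough illustration, a query for works cited by a given CELEX number could be built like the sketch below; the `cdm:work_cites_work` property and the exact query shape are assumptions about the CELLAR data model, not necessarily the query this library issues.

```python
def build_citation_query(celex: str) -> str:
    """Build a SPARQL query for works cited by the given CELEX number.

    The cdm:work_cites_work property and the CELEX lookup pattern are
    assumptions for illustration, not the library's exact query text.
    """
    return f"""
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT DISTINCT ?cited WHERE {{
  ?work cdm:resource_legal_id_celex "{celex}"^^<http://www.w3.org/2001/XMLSchema#string> .
  ?work cdm:work_cites_work ?cited .
}}"""

query = build_citation_query("62019CJ0001")
```

A query like this would be POSTed to the public SPARQL endpoint listed under "Upstream Endpoints Used" below.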
Data Sources By Type
| Need | Source |
|---|---|
| Base corpus metadata | CELLAR SPARQL |
| Citation edges (`citing`, `cited_by`) | CELLAR SPARQL |
| Sector 6 full text and structured metadata | InfoCuria |
| Sector 8 full text and summaries | CELLAR RDF + downloadable item manifestations |
| Legacy citation comparison only | EUR-Lex SOAP webservice |
Quick Start
1. Fetch Base CELLAR Metadata
```python
import cellar_extractor as cell

df = cell.get_cellar(
    save=False,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
)
```
Returns a dataframe with base metadata such as CELEX, ECLI, type, dates, and subject-matter-related fields.
You can also save explicitly to a custom path instead of the default data/ location:
```python
cell.get_cellar(
    save=True,
    file_format="csv",
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    output_path="exports/cellar_january.csv",
)
```
2. Fetch The Enriched Dataset
```python
import cellar_extractor as cell

extra_df, fulltext = cell.get_cellar_extra(
    save=False,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=100,
    threads=4,
)
```
Returns:
- `extra_df`: enriched dataframe
- `fulltext`: list of JSON rows containing extracted text and provenance
You can independently control where the enriched CSV and fulltext JSON are written:
```python
cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    metadata_output_path="exports/cellar_extra.csv",
    fulltext_output_path="exports/cellar_fulltext.json",
    threads=4,
)
```
3. Build A Citation Graph
```python
import cellar_extractor as cell

nodes, edges = cell.get_nodes_and_edges_lists(extra_df, only_local=True)
```
`only_local=True` keeps only edges whose target CELEX is also present in `extra_df`.
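In spirit, the `only_local` filter amounts to dropping every edge whose target falls outside the node set. A minimal stand-alone sketch of that logic (the tuple edge representation here is an assumption for illustration, not the library's internal format):

```python
def filter_local_edges(edges, node_celex_ids):
    """Keep only edges whose target CELEX is also in the node set,
    mirroring the described behaviour of only_local=True."""
    local = set(node_celex_ids)
    return [(src, tgt) for src, tgt in edges if tgt in local]

# C1 -> C9 is dropped because C9 is not in the local node set.
edges = [("C1", "C2"), ("C1", "C9"), ("C2", "C1")]
local_edges = filter_local_edges(edges, ["C1", "C2"])
print(local_edges)  # [('C1', 'C2'), ('C2', 'C1')]
```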
4. Filter By Subject Matter
```python
filtered = cell.filter_subject_matter(extra_df, "competition")
```
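Conceptually this is a case-insensitive phrase match over the subject-matter fields. A self-contained sketch of that idea on plain dictionaries (the `subject_matter` key is an assumed column name for illustration, not necessarily the one the library matches on):

```python
def filter_by_phrase(rows, phrase):
    """Keep rows whose subject-matter field contains the phrase,
    case-insensitively. 'subject_matter' is an assumed field name."""
    needle = phrase.lower()
    return [r for r in rows if needle in r.get("subject_matter", "").lower()]

rows = [
    {"celex": "A", "subject_matter": "Competition policy"},
    {"celex": "B", "subject_matter": "Taxation"},
]
print(filter_by_phrase(rows, "competition"))  # keeps only row A
```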
Full-Scrape Strategy
If you want the largest reproducible scrape, do not run one enormous date range blindly. Use bounded windows and persist each window.
Recommended approach:
- choose a date window via `sd`/`ed`
- run `get_cellar(...)` or `get_cellar_extra(...)`
- save outputs to disk
- repeat for the next window
- concatenate downstream
Practical guidance:
- use month-sized or week-sized windows for stability
- keep `threads` moderate, typically 4 to 10
- use `save=True` for long runs
- keep the fulltext JSON files; they are the canonical extracted text output
Example file-based run:
```python
import cellar_extractor as cell

cell.get_cellar_extra(
    save=True,
    sd="2025-01-01",
    ed="2025-01-31T23:59:59",
    max_ecli=5000,
    threads=6,
)
```
By default this writes into data/:
- a CSV with the enriched tabular dataset
- a `_fulltext.json` file with the text rows
Main Outputs
get_cellar_extra(...) produces:
- an enriched dataframe / CSV
- a fulltext JSON list / file
Important Enriched DataFrame Columns
- `citing`
- `cited_by`
- `celex_summary`
- `celex_keywords`
- `celex_directory_codes`
- `celex_eurovoc`
- `advocate_general`
- `judge_rapporteur`
- `affecting_ids`
- `affecting_strings`
- `citations_extra_info`
- `fulltext_source`
- `summary_source`
- `missing_reasons`
Important Fulltext JSON Fields
- `celex`
- `ecli`
- `text`
- `text_source`
- `text_language`
- `text_format`
- `missing_reasons`
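When consuming the fulltext JSON downstream, it can help to validate that each row carries the fields listed above before processing. A small sketch using exactly that field list (the example row values are invented):

```python
REQUIRED_FIELDS = {"celex", "ecli", "text", "text_source",
                   "text_language", "text_format", "missing_reasons"}

def missing_fields(row: dict) -> set:
    """Return the required fulltext-row fields absent from `row`."""
    return REQUIRED_FIELDS - row.keys()

row = {
    "celex": "62019CJ0001", "ecli": "ECLI:EU:C:2020:1", "text": "...",
    "text_source": "infocuria", "text_language": "EN",
    "text_format": "html", "missing_reasons": [],
}
print(missing_fields(row))  # set()
```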
Completeness Rules
The extractor does not treat empty values as silent success.
Important cases:
- if citation data exists, it should populate `citing`/`cited_by`
- if a document has no citation edges, the columns still exist and are empty
- if full text or summary is not available upstream, `missing_reasons` should reflect that
Typical `missing_reasons` values:
- `FULLTEXT_UNAVAILABLE_UPSTREAM`
- `SUMMARY_UNAVAILABLE_UPSTREAM`
- `UNAVAILABLE_UPSTREAM`
Sector 8 is still best effort because upstream availability is uneven, but the extractor now flags absence explicitly instead of implying completeness.
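Because absence is flagged explicitly, completeness can be audited after a run. A sketch that tallies the flags across fulltext rows (the row shapes are invented examples; the flag values are the ones listed above):

```python
from collections import Counter

def tally_missing_reasons(rows):
    """Count missing_reasons flags across fulltext rows, so upstream
    gaps are visible rather than silently implied complete."""
    counter = Counter()
    for row in rows:
        for reason in row.get("missing_reasons", []):
            counter[reason] += 1
    return counter

rows = [
    {"celex": "A", "missing_reasons": []},
    {"celex": "B", "missing_reasons": ["FULLTEXT_UNAVAILABLE_UPSTREAM"]},
    {"celex": "C", "missing_reasons": ["SUMMARY_UNAVAILABLE_UPSTREAM",
                                       "FULLTEXT_UNAVAILABLE_UPSTREAM"]},
]
print(tally_missing_reasons(rows))
```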
Public API Reference
Root-Level Package API
Imported from cellar_extractor/__init__.py:
| Function / class | Purpose |
|---|---|
| `get_cellar(...)` | Fetch base CELLAR metadata |
| `get_cellar_extra(...)` | Fetch enriched metadata + full text |
| `get_nodes_and_edges_lists(df, only_local=False)` | Build citation graph lists |
| `filter_subject_matter(df, phrase)` | Filter dataframe by subject phrase |
| `FetchOperativePart` | Extract operative part from a single case document |
| `Writing` | Write operative-part outputs to CSV / JSON / TXT |
Core Modules
cellar_extractor/cellar.py
- `get_cellar(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", file_format="csv", output_dir="data", output_path=None, return_data=None, save=None)`
- `get_cellar_extra(ed=None, save_file=<deprecated>, max_ecli=100, sd="2022-05-01", threads=10, username="", password="", output_dir="data", metadata_output_path=None, fulltext_output_path=None, save_metadata=None, save_fulltext=None, return_data=None, save=None)`
- `get_nodes_and_edges_lists(df=None, only_local=False)`
- `filter_subject_matter(df=None, phrase=None)`
Notes:
- `username`/`password` are legacy compatibility parameters and no longer change the extraction path
- `save` is the preferred save toggle; `save_file` is kept as a deprecated compatibility alias
- `output_path`, `metadata_output_path`, and `fulltext_output_path` let callers choose exact output locations instead of relying on fixed folders
- when save flags are disabled, the package returns in-memory objects without writing files
cellar_extractor/citations_adder.py
- `add_citations_separate(data, threads)`: production citation enrichment
- `add_citations_separate_webservice(data, username, password)`: deprecated legacy comparison path
- `add_citations(data, threads)`: older citation replacement helper
cellar_extractor/fulltext_saving.py
- `add_sections(data, threads, output_path=None, json_filepath=None, fulltext_output_path=None)`: enriches summaries, keywords, text metadata, provenance, and missing-data flags
cellar_extractor/eurlex_scraping.py
Main higher-level adapter functions:
- `get_case_data_by_celex_id(celex, language="EN")`
- `get_html_text_by_celex_id(id)`
- `get_summary_html(celex)`
- `get_full_text_from_html(html_text)`
This module contains the sector-aware source logic for InfoCuria and CELLAR item retrieval.
cellar_extractor/sparql.py
- `get_citations(source_celex, cites_depth=1, cited_depth=1, max_retries=3)`
- `get_citations_csv(celex, max_retries=3)`
- `get_citing(celex, cites_depth, max_retries=3)`
- `get_cited(celex, cited_depth, max_retries=3)`
- `run_eurlex_webservice_query(query_input, username, password)`: for legacy SOAP validation only
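The `cites_depth`/`cited_depth` parameters suggest a hop-limited expansion of the citation graph. A pure-Python sketch of that idea, with an in-memory adjacency map standing in for the live SPARQL lookups (this is an illustration of the concept, not the module's implementation):

```python
def expand_citations(start, neighbors, depth):
    """Breadth-first expansion of citation links up to `depth` hops,
    analogous to cites_depth/cited_depth. `neighbors` maps a CELEX
    number to the CELEX numbers it cites."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        frontier = {n for node in frontier
                    for n in neighbors.get(node, [])} - seen
        seen |= frontier
    return seen

graph = {"A": ["B", "C"], "B": ["D"], "C": []}
print(expand_citations("A", graph, 1))  # {'A', 'B', 'C'}
print(expand_citations("A", graph, 2))  # {'A', 'B', 'C', 'D'}
```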
cellar_extractor/cellar_sparql_queries.py
Advanced query helper class:
`CellarSparqlQuery`, with methods:
- `get_endorsements()`
- `get_subjects()`
- `get_parties()`
- `get_keywords()`
- `get_citations()`
- `get_grounds()`
cellar_extractor/operative_extractions.py
Classes:
- `FetchOperativePart`
- `Writing`
Use this path when you want operative-part extraction for individual documents rather than the full dataset pipeline.
Upstream Endpoints Used
These are the upstream systems the extractor relies on.
| Endpoint family | Used for |
|---|---|
| CELLAR SPARQL `https://publications.europa.eu/webapi/rdf/sparql` | corpus discovery, metadata, citation edges |
| InfoCuria `https://infocuriaws.curia.europa.eu/...` | sector 6 text and metadata |
| InfoCuria `https://infocuria.curia.europa.eu/document/...` | sector 6 document HTML |
| CELLAR resource/item URLs under `https://publications.europa.eu/resource/cellar/...` | sector 8 downloadable text / summary manifestations |
| EUR-Lex SOAP `https://eur-lex.europa.eu/EURLexWebService?wsdl` | legacy redundancy tests only |
Testing
Fast Local Suite
```
pytest -q
```
Live Integration Flags
- `RUN_INFOCURIA_INTEGRATION=1`
- `RUN_SECTOR8_INTEGRATION=1`
- `RUN_CITATION_INTEGRATION=1`
Examples:
```
RUN_INFOCURIA_INTEGRATION=1 pytest -q tests/test_infocuria_integration.py
RUN_SECTOR8_INTEGRATION=1 pytest -q tests/test_sector8_integration.py
RUN_CITATION_INTEGRATION=1 pytest -q tests/test_citation_graph_integration.py
```
Legacy Webservice Tests
Only needed if you want to re-check SOAP redundancy:
```
RUN_WEBSERVICE_INTEGRATION=1 pytest -q tests/test_webservice_credentials_integration.py tests/test_webservice_redundancy_integration.py
```
If used, credentials are read from .env:
```
EURLEX_WEBSERVICE_USERNAME=
EURLEX_WEBSERVICE_PASSWORD=
```
These credentials are not required for normal extraction.
Troubleshooting
`missing_reasons` is populated
That means the extractor could not find the requested upstream content. This is expected when upstream does not expose a summary or full text for the document.
Citation columns are empty
Check:
- that the document actually has graph relations upstream
- the live SPARQL endpoint availability
- whether you are looking at a very small or isolated sample
Sector 8 feels sparse
That is usually an upstream availability issue, not a silent extractor failure. Sector 8 is intentionally handled as best effort with explicit flags.
Releasing
This project uses setuptools_scm for automatic versioning based on git tags. Follow these steps to release a new version:
1. Create a git tag:

```
git tag v<major>.<minor>.<patch>
```

For example:

```
git tag v1.2.3
```

2. Push the tag to the remote:

```
git push origin v<major>.<minor>.<patch>
```
License