Scrapers and jurimetric coding engine for 83,596 comparative water law judicial decisions across Brazil, Canada, and the Netherlands (2016-2026)

Water Law Judicial Decisions Dataset

A collection of scrapers for building a comparative dataset of water law judicial decisions across Brazil (27 state courts targeted, 8 currently accessible), Canada (federal and provincial courts via CanLII), and the Netherlands (Raad van State and all 11 district courts via Rechtspraak.nl).

Scope: 2016–2026 | Cases collected: 83,596 decisions across Brazil, Canada, and the Netherlands

📊 Interactive Dashboard · 📄 Preliminary Research PDF

The Legal Last Mile: preliminary research examining administrative law, water access, and the limits of judicial inclusion across Brazil, the Netherlands, and Canada. The Global Water Law Dataset is its empirical backbone.


Repository Structure

water-law-dataset/
├── scrapers/
│   ├── brazil/            # One scraper per accessible TJ court
│   │   ├── tjac_scraper.py   (TJAC – Acre,        ESAJ POST)
│   │   ├── tjdft_scraper.py  (TJDFT – Brasília,   Elasticsearch REST)
│   │   ├── tjpi_scraper.py   (TJPI – Piauí,       Rails GET)
│   │   ├── tjrj_scraper.py   (TJRJ – Rio de Janeiro, ASP.NET WebForms)
│   │   ├── tjrr_scraper.py   (TJRR – Roraima,     ESAJ POST)
│   │   ├── tjsc_scraper.py   (TJSC – Santa Catarina, ESAJ AJAX)
│   │   └── tjto_scraper.py   (TJTO – Tocantins,   PHP+Solr GET)
│   ├── canada/
│   │   ├── canlii_scraper.py          (CanLII REST API — requires free API key)
│   │   ├── canada_canlii_extra.py     (113 extra CanLII databases not in main scraper)
│   │   └── canada_ldh_scraper.py      (Legal Data Hunter semantic search — requires API key)
│   └── netherlands/
│       ├── rechtspraak_scraper.py     (Rechtspraak Open Data — no auth)
│       └── rechtspraak_expanded.py   (RvS/CBb/GHARL/HR extended crawl)
├── utils/
│   ├── merge_national.py           # Merges per-court JSON files into national CSV/XLSX
│   ├── make_progress_charts.py
│   ├── jurimetric_coding.py        # Regex-based coding engine (21 categories, 4 languages)
│   ├── build_report.py             # Generate comparative DOCX report + 6 charts from coded CSV
│   └── integrate_dissertation.py   # Integrate dataset findings into a DOCX preliminary research
├── data/                      # Output directory (gitignored — add your JSON/CSV here)
├── .env.example
├── requirements.txt
└── README.md
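
The tree above describes jurimetric_coding.py as a regex-based coding engine. As a rough illustration of what regex-based coding means in practice, the sketch below tags a text with every category whose pattern matches. The two category names, the patterns, and the `code_text` helper are hypothetical and are not taken from the actual engine, which covers 21 categories.

```python
import re
from typing import Dict, List

# Hypothetical categories and patterns, for illustration only.
# Each pattern mixes Portuguese, English, Dutch, and French terms.
CATEGORY_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "service_cutoff": re.compile(
        r"\b(corte|suspens[aã]o)\b|\bdisconnect\w*|\bafsluit\w*|\bcoupure\b",
        re.IGNORECASE,
    ),
    "sanitation": re.compile(
        r"\bsaneamento\b|\bsanitation\b|\briolering\b|\bassainissement\b",
        re.IGNORECASE,
    ),
}


def code_text(text: str) -> List[str]:
    """Return the sorted names of all categories whose pattern matches text."""
    return sorted(
        name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(text)
    )
```

A call such as `code_text(record["ementa"])` would then yield the category labels for one decision summary.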

Quick Start

1. Clone and configure

git clone https://github.com/YOUR_USERNAME/water-law-dataset.git
cd water-law-dataset
cp .env.example .env
# Edit .env and set OUTPUT_DIR and any API keys you need

2. Run a scraper

All scrapers use only the Python standard library (Python 3.8+). No pip install needed for scraping.

# Set output directory (or edit .env)
export OUTPUT_DIR=./data        # Linux/Mac
set OUTPUT_DIR=.\data           # Windows

# Run any scraper
python scrapers/brazil/tjsc_scraper.py
python scrapers/brazil/tjdft_scraper.py
python scrapers/netherlands/rechtspraak_scraper.py

Output is written to $OUTPUT_DIR/<court>_cases_2016_2026.json.
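
A minimal sketch of resolving that output path and reading a file back, using only the standard library. The helper names `output_path` and `load_cases` are hypothetical, and the sketch assumes each output file holds a single JSON array of case records, which the scrapers may or may not do.

```python
import json
import os
from pathlib import Path
from typing import List, Optional


def output_path(court: str, output_dir: Optional[str] = None) -> Path:
    """Build <OUTPUT_DIR>/<court>_cases_2016_2026.json (hypothetical helper)."""
    base = output_dir or os.environ.get("OUTPUT_DIR", "./data")
    return Path(base) / f"{court}_cases_2016_2026.json"


def load_cases(path: Path) -> List[dict]:
    """Read one output file, assuming it contains a JSON array of records."""
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return data
```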

3. CanLII (Canada) — requires API key

# Register free at https://developer.canlii.org/
export CANLII_API_KEY=your_key_here
python scrapers/canada/canlii_scraper.py       # main CanLII keyword search
python scrapers/canada/canada_canlii_extra.py  # 113 additional CanLII databases

4. Legal Data Hunter (Canada, semantic search) — requires API key

The canada_ldh_scraper.py script uses Legal Data Hunter to run semantic and keyword searches across the full CanLII corpus (94,502+ Canadian legal documents). This supplements the title-based CanLII API search by surfacing cases where water law is the substance of the decision, not just a term in the title.

export LDH_API_KEY=your_key_here
python scrapers/canada/canada_ldh_scraper.py

5. Merge into national dataset

pip install pandas openpyxl          # only needed for merge + charts
export DATA_DIR=./data
python utils/merge_national.py
python utils/make_progress_charts.py
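
To illustrate what the merge step does conceptually, here is a standard-library sketch that combines per-court JSON files into one CSV. It is not the code in merge_national.py (which uses pandas). The function name `merge_court_files`, the assumption that each file is a JSON array, and the de-duplication key are all assumptions made for this example.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

# Column names follow the case record schema documented in this README.
FIELDS = ["tribunal", "estado", "num_processo", "data_julgamento", "ano",
          "classe", "camara_orgao", "relator", "ementa", "url"]


def merge_court_files(data_dir: str, out_csv: str) -> int:
    """Merge every *_cases_2016_2026.json in data_dir into a single CSV.

    Records are de-duplicated on (tribunal, num_processo); that key is an
    assumption for illustration. Returns the number of rows written.
    """
    seen = set()
    rows: List[Dict] = []
    for path in sorted(Path(data_dir).glob("*_cases_2016_2026.json")):
        with path.open(encoding="utf-8") as fh:
            for record in json.load(fh):
                key = (record.get("tribunal"), record.get("num_processo"))
                if key in seen:
                    continue  # skip a decision already collected
                seen.add(key)
                rows.append(record)
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)  # missing fields are written as empty cells
    return len(rows)
```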

Brazilian Courts — Access Status

| UF | Court | Cases | Method / Blocker | Status |
|----|-------|------:|------------------|--------|
| SP | TJSP | 574 | ESAJ POST | ✅ Done |
| SC | TJSC | 1,224 | ESAJ AJAX | ✅ Done |
| RR | TJRR | 21 | ESAJ POST | ✅ Done |
| AC | TJAC | 33 | ESAJ POST | ✅ Done |
| PI | TJPI | 15 | Rails GET | ✅ Done |
| TO | TJTO | 17 | PHP+Solr GET | ✅ Done |
| DF | TJDFT | 8,421 | Elasticsearch REST | ✅ Done |
| RJ | TJRJ | 1,219 | ASP.NET WebForms | ✅ Done |
| SP | TJSP (1997–2015) | 200 | Hand-coded historical | ✅ Done |
| MG | TJMG | | DWR + CAPTCHA | ❌ Blocked |
| BA | TJBA | | GraphQL (server 500) | ❌ Blocked |
| PR | TJPR | | Full-text too broad (334K results) | ❌ Blocked |
| CE | TJCE | | ESAJ TLS error | ❌ Blocked |
| SE | TJSE | | JSF + Turnstile CAPTCHA | ❌ Blocked |
| ES | TJES | | JSF + Turnstile CAPTCHA | ❌ Blocked |
| AM | TJAM | | CAS SSO required | ❌ Blocked |
| GO | TJGO | | React SPA, no public API | ❌ Blocked |
| RO | TJRO | | Angular SPA, no public API | ❌ Blocked |
| MT | TJMT | | SPA, requires JS | ❌ Blocked |
| MS | TJMS | | ESAJ timeout | ❌ Blocked |
| RS | TJRS | | DNS/timeout | ❌ Blocked |
| PB | TJPB | | Cloudflare 520 | ❌ Blocked |
| AP | TJAP | | HTTP 403 | ❌ Blocked |
| RN | TJRN | | HTTP 403 | ❌ Blocked |
| PE | TJPE | | Timeout/DNS | ❌ Blocked |
| PA | TJPA | | Timeout/DNS | ❌ Blocked |
| AL | TJAL | | Timeout/DNS | ❌ Blocked |
| MA | TJMA | | No jurisprudência endpoint | ❌ Blocked |

Total collected: 11,724 Brazilian cases — 11,524 from 8 courts (TJSP + TJSC + TJDFT + TJRJ + TJRR + TJAC + TJPI + TJTO) + 200 TJSP historical 1997–2015

Netherlands total: 68,654 cases — 50,871 appellate (RvS + CBb + GHARL, via rechtspraak_expanded.py) + 17,783 district courts (all 11 Rechtbanken, via rechtspraak_scraper.py)

Canada total: 3,218 cases — CanLII keyword search + CanLII extra databases + Legal Data Hunter semantic search + superior/appellate courts

Grand total: 83,596 decisions
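
The totals above can be re-derived from the figures reported in this README. The short check below only restates those numbers; the variable names are arbitrary.

```python
# Per-court counts for 2016-2026, copied from the access-status table.
brazil_courts = {
    "TJSP": 574, "TJSC": 1224, "TJRR": 21, "TJAC": 33,
    "TJPI": 15, "TJTO": 17, "TJDFT": 8421, "TJRJ": 1219,
}
brazil_recent = sum(brazil_courts.values())   # 8 accessible courts
brazil_total = brazil_recent + 200            # plus TJSP historical 1997-2015

netherlands_total = 50871 + 17783             # appellate + district courts
canada_total = 3218

grand_total = brazil_total + netherlands_total + canada_total

print(brazil_recent, brazil_total, netherlands_total, grand_total)
# prints: 11524 11724 68654 83596
```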


Search Queries Used

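The queries are in Portuguese; approximate English glosses are given in parentheses.

Primary: água abastecimento fornecimento saneamento (water, supply, provision, sanitation)
Secondary: corte suspensão fornecimento água (cutoff, suspension, supply, water)
Tertiary: proteção manancial recursos hídricos ambiental (protection, water source, water resources, environmental)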
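
Because these queries contain accented characters, they need to be URL-encoded as UTF-8 before being sent to a search portal. The sketch below shows one way to do that with the standard library. The parameter names `q`, `from`, and `to` are placeholders, since each court portal defines its own request format, and `build_query_string` is a hypothetical helper.

```python
from urllib.parse import urlencode

QUERIES = {
    "primary": "água abastecimento fornecimento saneamento",
    "secondary": "corte suspensão fornecimento água",
    "tertiary": "proteção manancial recursos hídricos ambiental",
}


def build_query_string(terms: str, year_from: int = 2016, year_to: int = 2026) -> str:
    """URL-encode a search query; parameter names are placeholders only."""
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into "+".
    return urlencode({"q": terms, "from": year_from, "to": year_to})
```

For GET-based portals the result would be appended to the search URL; for POST-based portals the same encoded string, converted to bytes, can serve as a form body.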


Output JSON Schema

Each case record contains:

{
  "tribunal": "TJSC",
  "estado": "SC",
  "num_processo": "0001234-56.2022.8.24.0001",
  "data_julgamento": "2022-03-15",
  "ano": 2022,
  "classe": "Apelação Cível",
  "camara_orgao": "1ª Câmara de Direito Público",
  "relator": "Des. João Silva",
  "ementa": "DIREITO À ÁGUA. Fornecimento. ...",
  "url": "https://..."
}
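
A small validation pass can catch incomplete or inconsistent records before merging. The sketch below checks that the fields shown above are present and that `ano` agrees with the year in `data_julgamento`. The `validate_record` helper and the assumption that dates are always ISO formatted (YYYY-MM-DD) are illustrative and not part of the scrapers.

```python
from datetime import date
from typing import List

REQUIRED_FIELDS = (
    "tribunal", "estado", "num_processo", "data_julgamento", "ano",
    "classe", "camara_orgao", "relator", "ementa", "url",
)


def validate_record(record: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    raw_date = record.get("data_julgamento")
    if raw_date is not None:
        try:
            judged = date.fromisoformat(raw_date)  # expects YYYY-MM-DD
        except (TypeError, ValueError):
            problems.append(f"bad data_julgamento: {raw_date!r}")
        else:
            if record.get("ano") != judged.year:
                problems.append("ano does not match data_julgamento")
    return problems
```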

Legal Note

These scrapers query publicly accessible case-law (jurisprudência) portals. All decisions are public court records. This dataset is intended for academic comparative law research.


Acknowledgements

Scholarly Inspiration

I would like to acknowledge that this research is inspired by the work of Professor LaDawn Haglund in the fields of comparative water law, water governance, and the judicialization of water and sanitation. Her scholarship, including Water Governance and Social Justice in São Paulo, Brazil and her broader work on human rights and urban water systems, has been central in shaping how I understand the relationship between law, policy, and access to essential resources.

I had the privilege of working with her as a research assistant some years ago, and that experience played an important role in directing my interest toward this area. I am especially grateful for her early encouragement to study water law in Brazil and to approach these challenges from a comparative and global perspective.

Her research and publications remain essential references for this project. I strongly recommend her work to anyone interested in water law, governance, and the role of legal institutions in addressing complex social and environmental challenges.

Selected publications:


Tools and Platforms

Legal Data Hunter — semantic legal search across 18M+ decisions from 110+ countries. An outstanding tool for comparative legal research that goes well beyond keyword matching. The Canadian component of this dataset was significantly enriched through LDH's semantic search over the full CanLII corpus. If you're building a legal dataset or doing cross-jurisdictional research, Legal Data Hunter is genuinely worth checking out.

CanLII — the Canadian Legal Information Institute, whose free public API made systematic collection of Canadian case law possible.

Rechtspraak.nl — the Dutch courts' open data portal, which provides structured XML access to published decisions.

If you use this dataset in your own work, please cite:

Klaus, C. (2026). Global Water Law Judicial Decisions Dataset (v1.0). Zenodo. https://doi.org/10.5281/zenodo.19836413

Also archived at:


AI Disclosure

This dataset was built with the assistance of Claude (Anthropic), an AI assistant. Specifically, Claude assisted with:

  • Scraper development — writing and debugging the web scrapers for Brazilian state courts (TJDFT, TJRJ, and others), CanLII (Canada), and Rechtspraak.nl (Netherlands), including handling authentication challenges, pagination logic, and anti-scraping measures
  • Data pipeline — developing the merge, deduplication, and normalization scripts (merge_national.py, merge_all_countries.py)
  • Jurimetric coding engine — designing and implementing the regex-based coding engine (jurimetric_coding.py) for all seven variables across four languages (Portuguese, English, Dutch, French)
  • Repository and deposit workflows — automating deposits to Zenodo, Harvard Dataverse, DANS, and OSF via their respective APIs
  • Quality control — identifying and filing issues with upstream legal data sources (worldwidelaw/legal-sources issues #74–77)

All research design decisions, methodological choices, variable definitions, and intellectual interpretations are those of the human researcher (Claudio Klaus Junior). Claude served as a technical research infrastructure tool throughout the data collection phase.


License

MIT
