Astraea

Open-justice RAG framework for building jurisdiction-specific legal Q&A tools over public court decisions.

Named after Astraea, the Greek goddess of justice who carried the scales.


What it is

A small runtime framework that provides the infrastructure for legal RAG tools: SSE streaming, a concurrent request queue, statute routing, live legislation anchors, citation verification, security hardening, and smoke tests. A new jurisdiction only needs to provide one Python module:

from jurisdictions.nz_tenancy import jurisdiction
from core.api import create_app

app = create_app(jurisdiction)

Design principles

  • One process = one jurisdiction. No multi-tenancy, no plugin registry. Simple deployment.
  • Four required things. A jurisdiction must provide: a name, a corpus config, a system prompt, and a route table. Everything else has a working default.
  • Security and queue are non-overridable. Input sanitization, request body limits, security headers, and queue concurrency are enforced by core regardless of jurisdiction config.
  • Scraper is offline. Ingestion runs separately from the API. Core only needs a populated Qdrant collection conforming to schemas/qdrant_payload.schema.json.
  • Tests are data-driven. Jurisdictions provide smoke test fixtures; core runs the test suite against them automatically.

Supported jurisdictions

Jurisdiction | Status | Corpus
NZ Tenancy (nz_tenancy) | Live - tenancy.localrun.ai | 31,000+ Tenancy Tribunal decisions, RTA 1986 + Healthy Homes Standards 2019
NZ Legal (nz_legal) | Live - nz-legal-rag.localrun.ai | All NZ courts, 3M+ chunks (NZHC, NZCA, NZSC, NZERA, NZEmpC, NZTT)
NZ Employment (nz_employment) | Ready | 300+ ERA + Employment Court decisions through May 2026, live ERA 2000
NSW Tenancy (nsw_tenancy) | PoC (framework demo) | Proves interface generalises - not actively developed

Adding a new jurisdiction

See CONTRIBUTING.md for the full fork-to-running walkthrough.

Quick version:

  1. Copy examples/minimal_jurisdiction/ to jurisdictions/your_name/
  2. Implement the 4 required properties in jurisdiction.py
  3. Run the contract tests: pytest tests/core/test_jurisdiction_contract.py --jurisdiction your_name
  4. Ingest your corpus into Qdrant (see ingest/ and schemas/qdrant_payload.schema.json)
  5. Add smoke fixtures and run: pytest tests/jurisdictions/test_smoke.py --jurisdiction your_name -m retrieval
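Step 2 can be pictured as a small class exposing the four required things. The sketch below is illustrative only: the class name, the property names (name, corpus_config, system_prompt, routes), and the stand-in CorpusConfig and Route types are assumptions made for this example, not the framework's real interface. examples/minimal_jurisdiction/ and the contract tests remain the authoritative reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusConfig:
    """Hypothetical stand-in for the framework's corpus config."""
    collection: str


@dataclass(frozen=True)
class Route:
    """Hypothetical stand-in for one route-table entry."""
    intent: str
    include_any: tuple[str, ...]
    forced_sections: tuple[str, ...]


class ExampleJurisdiction:
    """Illustrative jurisdiction exposing the four required things."""

    @property
    def name(self) -> str:
        return "example_tenancy"

    @property
    def corpus_config(self) -> CorpusConfig:
        return CorpusConfig(collection="example_decisions")

    @property
    def system_prompt(self) -> str:
        return "Answer only from the supplied decisions and legislation."

    @property
    def routes(self) -> list[Route]:
        return [Route("bond_refund", ("bond", "deposit"), ("EXAMPLE/ACT/s1",))]


# Assumed property names; the real contract test may check different ones.
REQUIRED = ("name", "corpus_config", "system_prompt", "routes")


def missing_properties(jurisdiction: object) -> list[str]:
    """Return the required properties the object does not provide."""
    return [prop for prop in REQUIRED if not hasattr(jurisdiction, prop)]
```

A check like missing_properties() is roughly what a contract test would assert to be empty before any corpus is ingested.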

Jurisdiction extension points

Beyond the 4 required properties, jurisdictions can opt into additional behaviour:

Extra routes (register_routes)

Add jurisdiction-specific endpoints (e.g. structured data trackers) on top of the core API:

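from fastapi import FastAPI

# Method on the jurisdiction class: attaches this jurisdiction's own endpoints to the core app.
def register_routes(self, app: FastAPI) -> None:
    from jurisdictions.nz_legal.routes import register
    register(app)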

create_app() calls this hook as its last step. Route handlers access the pipeline and store via request.app.state.

nz_legal uses this to expose /search, /notable, /sentencing-tracker, /pg-tracker, and /contrasting-cases.

Federated per-Act legislation retrieval (leg_sources)

By default, legislation retrieval does one vector search across the entire legislation collection. As a corpus grows (more Acts), smaller Acts get crowded out by larger ones on embedding similarity alone.

Override leg_sources to run one search per registered Act in parallel, each with its own top_k quota. The re-ranker phase (Phase 2) can then select the best sections across all sources without manual routes:

from core.jurisdiction import LegislationSource

@property
def leg_sources(self) -> list[LegislationSource]:
    return [
        LegislationSource("RTA",    "Residential Tenancies Act 1986",                         default_top_k=6, boost_top_k=10),
        LegislationSource("HHS2019","Residential Tenancies (Healthy Homes Standards) Regulations 2019", default_top_k=4, boost_top_k=8),
    ]

When a matched route targets a specific Act (e.g. healthy_homes route targets HHS2019), that Act's search uses boost_top_k instead of default_top_k, giving it more candidates before ranking.
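The per-Act fan-out and quota selection described above might look roughly like the following. This is a minimal sketch, not the framework's implementation: LegislationSource is redefined locally with assumed field names (act_id, title), and fake_search is a hypothetical stand-in for a real per-Act vector search.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class LegislationSource:
    """Local stand-in; act_id and title are assumed field names."""
    act_id: str
    title: str
    default_top_k: int
    boost_top_k: int


async def fake_search(act_id: str, query: str, top_k: int) -> list[str]:
    """Hypothetical per-Act search that returns top_k section ids."""
    await asyncio.sleep(0)  # yield to the event loop, as a real client call would
    return [f"{act_id}/s{i}" for i in range(1, top_k + 1)]


async def federated_search(
    sources: list[LegislationSource],
    query: str,
    boosted: frozenset[str] = frozenset(),
) -> list[str]:
    """Run one search per Act concurrently, each with its own quota."""
    # An Act targeted by the matched route gets boost_top_k, others default_top_k.
    quotas = [
        s.boost_top_k if s.act_id in boosted else s.default_top_k for s in sources
    ]
    per_act = await asyncio.gather(
        *(fake_search(s.act_id, query, k) for s, k in zip(sources, quotas))
    )
    # Flatten in source order; ranking across Acts is left to a later stage.
    return [section for hits in per_act for section in hits]
```

With the two sources from the example above, an unboosted query would gather 6 + 4 candidates, and a query whose route targets HHS2019 would gather 6 + 8.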

Routes remain as hard floor guarantees - forced sections are always included in the candidate pool regardless of federated search results. This means a cross-encoder re-ranker (Phase 2) can reorder freely without risking that a critical section is dropped.
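One way to picture the floor guarantee is a merge step that adds any forced section missing from the federated results. The function name and the section ids in the usage note are hypothetical; how core actually builds the candidate pool is not shown in this document.

```python
def with_forced_sections(candidates: list[str], forced: tuple[str, ...]) -> list[str]:
    """Return the candidates plus any forced sections they lack, without duplicates."""
    pool = list(dict.fromkeys(candidates))  # de-duplicate while keeping order
    seen = set(pool)
    for section in forced:
        if section not in seen:
            pool.append(section)
            seen.add(section)
    return pool
```

For example, if federated search returned only NZLEG/RTA/s13 and the matched route forces NZLEG/RTA/s5, the pool would contain both, so a later re-ranking step can reorder the pool but cannot lose the forced section at this stage.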

A CrossEncoderReranker (Phase 1: log-only) is available in core/reranker.py. It scores candidates after federated search and logs the scores for observability without affecting ranking. Promote it to production ranking only once benchmarking shows it matches route-based quality.
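The log-only pattern can be sketched as follows. This does not reproduce the API of core/reranker.py: score_log_only and overlap_scorer are hypothetical names, and the word-overlap scorer is a toy stand-in for a real cross-encoder model.

```python
import logging
from typing import Callable

logger = logging.getLogger("reranker_sketch")


def score_log_only(
    query: str,
    candidates: list[str],
    scorer: Callable[[str, str], float],
) -> list[str]:
    """Score candidates, log the order a re-ranker would pick, return input order."""
    scored = [(scorer(query, text), text) for text in candidates]
    would_be = [
        text for _, text in sorted(scored, key=lambda pair: pair[0], reverse=True)
    ]
    logger.info("rerank (log-only) query=%r would_be_order=%s", query, would_be)
    # Log-only: the caller receives the candidates in their original order.
    return list(candidates)


def overlap_scorer(query: str, text: str) -> float:
    """Toy scorer: number of lowercase words shared by query and text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))
```

Comparing the logged would-be order against the order actually served is one way to gather the benchmarking evidence before promotion.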

Case retrieval augmentation (case_synthetic_query on StatuteRoute)

When a matched route defines case_synthetic_query, a supplementary case retrieval pass runs with that query and unique results are merged into context (up to 8 total chunks).

This fixes cases where the query rewriter drops legally significant framing that is obvious in the original question:

StatuteRoute(
    intent="sham_flatmate_agreement",
    include_any=("flatmate agreement", "meant to be tenants", ...),
    forced_sections=("NZLEG/RTA/s5",),
    synthetic_query="...",
    case_synthetic_query=(
        "flatmate agreement landlord not living property sham tenancy RTA applies "
        "boarder licensee residential tenancy act tenant rights eviction notice"
    ),
)
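The merge of primary and supplementary case results might be sketched like this. It assumes uniqueness is judged by document_id and that chunks are plain dicts; the real merge key and chunk type are not stated in this document, and merge_case_chunks is a hypothetical name.

```python
def merge_case_chunks(
    primary: list[dict],
    supplementary: list[dict],
    limit: int = 8,
) -> list[dict]:
    """Append unseen supplementary chunks to the primary ones, capped at limit."""
    merged: list[dict] = []
    seen: set[str] = set()
    for chunk in [*primary, *supplementary]:
        if len(merged) >= limit:
            break
        key = chunk["document_id"]  # assumed uniqueness key
        if key in seen:
            continue
        seen.add(key)
        merged.append(chunk)
    return merged
```

Primary results come first, so the supplementary pass can only fill remaining slots and never displaces what the original query retrieved.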

Smoke fixture source count (min_sources on SmokeFixture)

Assert that supplementary retrieval ran and returned at least the expected number of case sources:

SmokeFixture(
    question="My landlord put us on a flatmate agreement...",
    expected_sections=[],
    min_sources=6,
    description="sham_flatmate_agreement route - case_synthetic_query augmentation",
)
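How a runner might evaluate such a fixture is sketched below. The SmokeFixture dataclass here is a local stand-in mirroring the fields used above, and check_fixture is a hypothetical helper; the real test suite in tests/jurisdictions/test_smoke.py may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class SmokeFixture:
    """Local stand-in with the fields shown in the example above."""
    question: str
    expected_sections: list[str] = field(default_factory=list)
    min_sources: int = 0
    description: str = ""


def check_fixture(
    fixture: SmokeFixture,
    returned_sections: list[str],
    source_count: int,
) -> list[str]:
    """Return human-readable failures; an empty list means the fixture passed."""
    failures: list[str] = []
    missing = [s for s in fixture.expected_sections if s not in returned_sections]
    if missing:
        failures.append(f"missing sections: {missing}")
    if source_count < fixture.min_sources:
        failures.append(
            f"expected at least {fixture.min_sources} sources, got {source_count}"
        )
    return failures
```

With min_sources=6, a response carrying only the primary retrieval pass would fail if that pass alone returns fewer than six case sources, which is what makes the fixture sensitive to the augmentation step.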

Qdrant payload schema

All jurisdictions must produce chunks conforming to schemas/qdrant_payload.schema.json.

Required fields: document_id, court, court_name, title, date, url, text, source_type.
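A quick pre-ingestion check for the required fields could look like the sketch below. It only tests that the keys are present; it is not a substitute for validating against schemas/qdrant_payload.schema.json, which may also constrain types and formats. The function name and the sample values in the usage note are hypothetical.

```python
REQUIRED_FIELDS = (
    "document_id",
    "court",
    "court_name",
    "title",
    "date",
    "url",
    "text",
    "source_type",
)


def missing_payload_fields(payload: dict) -> list[str]:
    """Return the required fields absent from a chunk payload."""
    return [name for name in REQUIRED_FIELDS if name not in payload]
```

For instance, a payload built as dict.fromkeys(REQUIRED_FIELDS, "placeholder") reports no missing fields, while one lacking url and text reports exactly those two.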


Stack

Component | Technology
Vector database | Qdrant
Embeddings | nomic-embed-text-v1.5 / Qwen3-Embedding-0.6B via sentence-transformers
LLM inference | llama.cpp (OpenAI-compatible)
API | FastAPI + SSE streaming
Cache | Redis (web verify results)
Queue | Semaphore-based, per-IP fairness
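A semaphore-based queue with per-IP fairness could be sketched as below. This is one possible interpretation, not the framework's queue: FairQueue, max_concurrent, and max_per_ip are hypothetical names, and "fairness" is modelled here simply as a per-IP cap layered under a global concurrency cap.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class FairQueue:
    """Global concurrency cap plus a per-IP cap, both built on asyncio.Semaphore."""

    def __init__(self, max_concurrent: int, max_per_ip: int) -> None:
        self._global = asyncio.Semaphore(max_concurrent)
        # One semaphore per client IP, created on first use. A long-running
        # service would also need to evict idle entries; omitted in this sketch.
        self._per_ip: dict[str, asyncio.Semaphore] = defaultdict(
            lambda: asyncio.Semaphore(max_per_ip)
        )

    async def run(self, ip: str, job: Callable[[], Awaitable[T]]) -> T:
        # Take the per-IP slot first so a busy client waits on its own
        # semaphore instead of holding global slots other clients could use.
        async with self._per_ip[ip]:
            async with self._global:
                return await job()
```

Construct the queue inside the running event loop (for example during application startup) so the semaphores are bound to the loop that serves requests.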

Milestones

  • Milestone 0 - core interface design, runtime modules, nz_tenancy jurisdiction
  • Milestone 1 - nsw_tenancy skeleton + nz_legal + nz_employment prove interface generalises
  • Milestone 2 - smoke test runner wired to pytest (Tier 1/2/3), Docker Compose
  • Milestone 3 - CONTRIBUTING.md, packaging, NSW NCAT scraper + corpus (225+ decisions)
  • Milestone 4 - nz_legal migration: tracker endpoints, contrasting cases, register_routes hook
  • Milestone 5 - federated per-Act legislation retrieval, Healthy Homes Standards 2019 corpus, cross-encoder reranker (Phase 1 log-only), Qdrant payload indexes for fast filtered search

Related project

The NZ tenancy tool running on this framework: https://tenancy.localrun.ai

Source: https://github.com/jwongso/nz-legal-rag


MIT License. Not legal advice.

