Skip to main content

Remove copyrighted material from legal case law PDFs

Project description

Blackletter

A reference to blackletter law, this tool removes potentially copyrighted material from legal case law PDFs. This fulfills our goal of respecting any intellectual property rights others may have while making it possible to digitize and publish case law. This is essential to our mission of making case law accessible to all — a prerequisite for meaningful participation in our democracy.

Proprietary annotations removed from judicial opinions include headnotes, captions, and key cites.

Installation

pip install blackletter

Or install from source:

git clone https://github.com/freelawproject/blackletter
cd blackletter
pip install -e .

Quick Start

Command line:

blackletter process path/to/volume.pdf --reporter f3d --volume 952 --first-page 1 --output output/

Python:

from blackletter import process

process(
    "path/to/volume.pdf",
    "output/",
    reporter="f3d",
    volume="952",
    first_page=1,
)

This runs the full pipeline: OCR (if needed), YOLO detection, page number extraction, opinion splitting, and redaction — all in one pass.

How It Works

The process command runs a single-pass pipeline:

  1. OCR (if needed): Detects image-only PDFs, downsamples pages, and adds a text layer via ocrmypdf/tesseract
  2. Detection: Runs a YOLO model to identify proprietary elements (headnotes, captions, key cites, brackets, etc.) and structural elements (page numbers, dividers, footnotes)
  3. Page Numbers: Extracts and validates page numbers using OCR on detected regions
  4. Opinion Pairing: Matches case captions to key icons to identify opinion boundaries
  5. Splitting & Redaction: Produces per-opinion variants and an optional per-page LLM split:
    • Unredacted (opt-in, via --unredacted): Raw opinion pages extracted from the source
    • Redacted: Per-opinion PDFs with potentially copyrighted content (headnotes, brackets, key icons) blacked out
    • LLM (opt-in, via --llm): One PDF per source page, sliced from the fully redacted document, with an invisible <--CASEEND--> marker stamped on every redacted Key-icon location so downstream LLM passes can detect opinion boundaries

Additionally produces:

  • A full redacted copy of the entire document
  • Extracted case law images (charts, photos, etc.) as PNGs
  • A detections.json export of all YOLO detections for review tooling

Models

Blackletter bundles three YOLO models, selected via CLI flags:

Flag File Classes Description
(default) small.pt 14 Bundled — fast, handles most cases
--medium medium.pt 17 Bundled — better structural detection
--large large.pt 21 Downloaded on first use from Hugging Face — highest accuracy, detects additional elements (editorial, judges, docket, court, citation, date)

The large model is hosted at flooie/blackletter-large and is downloaded automatically to blackletter/models/large.pt the first time --large is used.

Command Line Options

Process Command

blackletter process PDF [OPTIONS]

Positional Arguments:
  pdf                       Path to the source PDF

Options:
  --reporter STR            Reporter abbreviation (e.g. f3d, a3d)
  --volume STR              Volume number
  --first-page INT          Page number of the first page in the PDF (default: 1)
  -o, --output PATH         Base output directory (required)
  --model PATH              Path to custom YOLO model weights
  --medium                  Use the medium model (17 classes)
  --large                   Use the large model (21 classes, auto-downloaded)
  --footnotes               Extract footnotes into separate PDFs
  --unredacted              Also generate unredacted opinion PDFs
  --llm                     Also generate per-page LLM PDFs with <--CASEEND--> stamps
  --no-shrink               Skip downsampling (default: shrink to ~148 KB/page)
  --optimize {0,1,2,3}      ocrmypdf optimization level (default: 1)
  --bitonal                 Convert to 1-bit B&W before processing (for already-bitonal scans)
  --detect-only             Stop after detection and pairing — no PDFs written (Phase 1 only)

Validate Command

QA tool that checks a PDF's page number sequence for missing, duplicate, or misnumbered pages. Uses YOLO to locate page number regions, then PaddleOCR to read them, with Tesseract and GLM-OCR as fallbacks.

blackletter validate path/to/volume.pdf
blackletter validate path/to/volume.pdf --first-page 100 --last-page 500
blackletter validate path/to/volume.pdf --json

If the filename follows the convention reporter.volume.first.last.pdf (e.g. sct.143.1.888.pdf), the expected page range is inferred automatically.

Features:

  • Parallel OCR across multiple workers
  • Auto-correction of consistent OCR misreadings (e.g. systematic off-by-800 errors)
  • Detection of gaps, duplicates, backwards jumps, and page ranges (e.g. "31-32")
  • Structural checks for blank pages and orientation changes

Requires optional dependencies: pip install blackletter[analyze]

Draw Command

Visualize YOLO detections on a PDF — useful for debugging model output:

blackletter draw path/to/volume.pdf --output annotated.pdf
blackletter draw path/to/volume.pdf --output annotated.pdf --labels CASE_CAPTION KEY_ICON HEADNOTE
blackletter draw path/to/volume.pdf --output annotated.pdf --large

Output Structure

output/<reporter>/<volume>/<first-page>/
    <reporter>.<volume>.<first>.<last>.pdf   # OCR'd/processed source PDF
    <reporter>.<volume>.redacted.pdf         # Full redacted document

    detections.json       # All YOLO detections (label, bbox, confidence per page)
    pages_meta.json       # Column bounds and midpoints per page
    opinions.json         # Opinion pairs with outside-opinion rects
    redaction_rects.json  # Precomputed redaction rectangles (used by review UI)
    margin_rects.json     # Margin cleanup rectangles

    images/               # Extracted case law images (PNGs)
    unredacted/           # Individual opinion PDFs (raw, no redaction) — only if --unredacted
    redacted/             # Individual opinion PDFs (copyrighted content redacted)
    llm/                  # Per-page fully-redacted PDFs with invisible <--CASEEND-->
                          # stamps on each Key-icon location — only if --llm

The JSON files are designed for use with a review UI — they allow manual inspection and adjustment of detections and redaction boundaries before final output is committed.

Detection Labels

Labels detected across all models (availability depends on model size):

Label Models Description
KEY_ICON all West key cite icons
DIVIDER all Opinion section dividers
PAGE_HEADER all Running headers
CASE_CAPTION all Opinion title/parties
FOOTNOTES all Footnote sections
HEADNOTE_BRACKET all Bracketed headnote markers
CASE_METADATA all Court, date, counsel info
CASE_SEQUENCE all Docket/case sequence numbers
PAGE_NUMBER all Page numbers
STATE_ABBREVIATION all State abbreviation markers
IMAGE all Photos, charts, diagrams
HEADNOTE all Headnote text
BACKGROUND all Background/procedural history region
SYLLABUS all Supreme Court syllabus sections
EDITORIAL medium, large Editorial notes
JUDGES medium, large Judge name blocks
TEXT_COLUMN medium, large Column boundaries
DOCKET large Docket number regions
DATE large Decision date regions
COURT large Court name regions
CITATION large Reporter citation regions

Margin Cleanup

After redaction, Blackletter automatically white-outs scan artifacts in page margins using the PDF text layer to find content boundaries. Pages with narrow text spans (appendices, image pages) are skipped automatically.

Requirements

  • Python 3.12+
  • Tesseract OCR (for image-only PDFs)

Install tesseract:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

License

GNU Affero General Public License v3

Contributing

Contributions welcome!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

blackletter-0.0.10.tar.gz (68.9 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

blackletter-0.0.10-py3-none-any.whl (68.9 MB view details)

Uploaded Python 3

File details

Details for the file blackletter-0.0.10.tar.gz.

File metadata

  • Download URL: blackletter-0.0.10.tar.gz
  • Upload date:
  • Size: 68.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blackletter-0.0.10.tar.gz
Algorithm Hash digest
SHA256 def2c5a087b400fd7f3a193eeef9a3cd29e37057a7a4cf4c4fcc6af373d7ecc3
MD5 c09f7f7ae20c2a121bb778e4d888a1a2
BLAKE2b-256 fb1a95d4d78943d76932f1a82176792a16209e49da637b8563936972cc8c1457

See more details on using hashes here.

Provenance

The following attestation bundles were made for blackletter-0.0.10.tar.gz:

Publisher: pypi.yml on freelawproject/blackletter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file blackletter-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: blackletter-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 68.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blackletter-0.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 944570ae5f8d8ab1002186399242e9b1db54476d78dc7008cc3c440ec0091bcc
MD5 9813480b293d507dab1791280c8b2858
BLAKE2b-256 9af33aa5c5445d1a6d5d9eab5dea35e3dc87a47768ea51c4ee3a609a9d9dbf2b

See more details on using hashes here.

Provenance

The following attestation bundles were made for blackletter-0.0.10-py3-none-any.whl:

Publisher: pypi.yml on freelawproject/blackletter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page