Blackletter

Remove copyrighted material from legal case law PDFs.

Named after blackletter law, this tool removes proprietary annotations (specifically headnotes, captions, key cites, and other copyrighted materials) from judicial opinions while preserving the authentic opinion text.

Installation

pip install blackletter

Or install from source:

git clone https://github.com/freelawproject/blackletter
cd blackletter
pip install -e .

Quick Start

Command line:

blackletter process path/to/volume.pdf --reporter f3d --volume 952 --first-page 1 --output output/

Python:

from blackletter import process

process(
    "path/to/volume.pdf",
    "output/",
    reporter="f3d",
    volume="952",
    first_page=1,
)

This runs the full pipeline: OCR (if needed), YOLO detection, page number extraction, opinion splitting, and redaction — all in one pass.
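When several volumes need processing, the call above can be wrapped in a small loop. The following is only a sketch: `process_volumes` and the job dictionary layout are hypothetical names introduced here, and the callable is passed in as `run` so the helper makes no assumption beyond the `process(pdf, output, reporter=..., volume=..., first_page=...)` call shown above. Which exceptions `process` raises is not documented here, so the sketch catches `Exception` broadly.

```python
def process_volumes(jobs, output_dir, run):
    """Run `run` once per job and record whether each PDF succeeded.

    `jobs` is a list of dicts with the keys "pdf", "reporter", "volume"
    and "first_page" (a hypothetical layout chosen for this sketch).
    `run` is expected to accept the same arguments as the `process`
    call shown in Quick Start.
    """
    results = {}
    for job in jobs:
        try:
            run(
                job["pdf"],
                output_dir,
                reporter=job["reporter"],
                volume=job["volume"],
                first_page=job["first_page"],
            )
            results[job["pdf"]] = "ok"
        except Exception as exc:  # broad on purpose: the error types are unknown
            results[job["pdf"]] = f"failed: {exc}"
    return results
```

With the package installed, this would be called as `process_volumes(jobs, "output/", run=process)` after `from blackletter import process`. One failing volume then does not stop the rest of the batch.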

How It Works

The process command runs a single-pass pipeline:

  1. OCR (if needed): Detects image-only PDFs, downsamples pages, and adds a text layer via ocrmypdf/tesseract
  2. Detection: Runs a YOLO model to identify copyrighted elements (headnotes, captions, key cites, brackets, etc.) and structural elements (page numbers, dividers, footnotes)
  3. Page Numbers: Extracts and validates page numbers using OCR on detected regions
  4. Opinion Pairing: Matches case captions to key icons to identify opinion boundaries
  5. Splitting & Redaction: Produces three output variants per opinion:
    • Unredacted: Raw opinion pages extracted from the source
    • Redacted: Copyrighted content (headnotes, brackets, key icons) blacked out; non-opinion content whited out
    • Masked: Optimized for LLM ingestion — only the opinion text is visible

Additionally produces:

  • A full redacted copy of the entire document
  • A verification report with detection stats and page number mappings
  • Extracted case law images (charts, photos, etc.) as PNGs
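To make the opinion pairing step more concrete, here is a minimal sketch of one way captions and key icons could be matched. It is not the package's implementation. It assumes that each opinion starts at a CASE_CAPTION and ends at the next KEY_ICON in reading order; the `Detection` class, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    page: int   # zero-based PDF page index (hypothetical field)
    y: float    # top coordinate on the page; smaller means higher up
    label: str  # a detection label such as "CASE_CAPTION"


def pair_captions_with_key_icons(detections):
    """Pair each CASE_CAPTION with the first KEY_ICON that follows it."""
    # Reading order: by page first, then top to bottom within a page.
    ordered = sorted(detections, key=lambda d: (d.page, d.y))
    pairs = []
    open_caption = None
    for det in ordered:
        if det.label == "CASE_CAPTION":
            # A newer caption replaces one that never found a key icon.
            open_caption = det
        elif det.label == "KEY_ICON" and open_caption is not None:
            pairs.append((open_caption, det))
            open_caption = None
    return pairs
```

Each returned pair marks a candidate opinion boundary: the caption gives the start and the key icon gives the end. Captions left unmatched are simply dropped in this sketch; a real pipeline would probably want to report them.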

Command Line Options

blackletter process PDF [OPTIONS]

Positional Arguments:
  pdf                       Path to the source PDF

Options:
  --reporter STR            Reporter abbreviation (e.g. f3d, a3d)
  --volume STR              Volume number
  --first-page INT          Page number of the first page in the PDF (default: 1)
  -o, --output PATH         Base output directory (required)
  --model PATH              Path to YOLO model weights (default: bundled run_9.pt)
  --footnotes               Extract footnotes into separate PDFs
  --no-unredacted           Skip generating unredacted opinion PDFs
  --no-shrink               Skip downsampling (default: shrink to ~148 KB/page)
  --optimize {0,1,2,3}      ocrmypdf optimization level (default: 1)

Draw Command

For debugging YOLO detections:

blackletter draw path/to/volume.pdf --output annotated.pdf
blackletter draw path/to/volume.pdf --output annotated.pdf --labels CASE_CAPTION KEY_ICON HEADNOTE

Output Structure

output/<reporter>/<volume>/<first-page>/
    verify.txt                          # Detection stats and page number report
    <reporter>.<volume>.redacted.pdf    # Full redacted document
    images/                             # Extracted images (PNGs)
    unredacted/                         # Individual opinion PDFs (raw)
    redacted/                           # Individual opinion PDFs (copyrighted content redacted)
    masked/                             # Individual opinion PDFs (for LLM ingestion)
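Downstream scripts often need to locate these files after a run. The following sketch derives the paths from the layout above. It assumes the `<reporter>`, `<volume>`, and `<first-page>` placeholders are filled with the values passed on the command line exactly as given; the function name `expected_outputs` and the dictionary keys are hypothetical.

```python
from pathlib import Path


def expected_outputs(base, reporter, volume, first_page):
    """Map short names to the paths described in the output layout."""
    root = Path(base) / str(reporter) / str(volume) / str(first_page)
    return {
        "root": root,
        "verify": root / "verify.txt",
        "full_redacted": root / f"{reporter}.{volume}.redacted.pdf",
        "images": root / "images",
        "unredacted": root / "unredacted",
        "redacted": root / "redacted",
        "masked": root / "masked",
    }


paths = expected_outputs("output", "f3d", "952", 1)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Checking `paths["verify"].exists()` after a run is a simple way to confirm that the pipeline wrote its report where this sketch expects it.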

Detection Labels

The YOLO model detects 13 element types:

Label               Description
KEY_ICON            West key cite icons (copyrighted)
DIVIDER             Opinion section dividers
PAGE_HEADER         Running headers (copyrighted)
CASE_CAPTION        Opinion title/parties
FOOTNOTES           Footnote sections
HEADNOTE_BRACKET    Bracketed headnote markers (copyrighted)
CASE_METADATA       Court, date, counsel info
CASE_SEQUENCE       Docket/case sequence numbers
PAGE_NUMBER         Page numbers
STATE_ABBREVIATION  State abbreviation markers
IMAGE               Photos, charts, diagrams
HEADNOTE            Headnote text (copyrighted)
BACKGROUND          Background/procedural history
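For scripts that filter detections, the label table can be mirrored as a plain mapping. This is an illustrative structure only, not something the package is known to export; the name `DETECTION_LABELS` is hypothetical, and the copyrighted flags are copied from the descriptions above.

```python
# label -> (description, marked as copyrighted in the table above)
DETECTION_LABELS = {
    "KEY_ICON": ("West key cite icons", True),
    "DIVIDER": ("Opinion section dividers", False),
    "PAGE_HEADER": ("Running headers", True),
    "CASE_CAPTION": ("Opinion title/parties", False),
    "FOOTNOTES": ("Footnote sections", False),
    "HEADNOTE_BRACKET": ("Bracketed headnote markers", True),
    "CASE_METADATA": ("Court, date, counsel info", False),
    "CASE_SEQUENCE": ("Docket/case sequence numbers", False),
    "PAGE_NUMBER": ("Page numbers", False),
    "STATE_ABBREVIATION": ("State abbreviation markers", False),
    "IMAGE": ("Photos, charts, diagrams", False),
    "HEADNOTE": ("Headnote text", True),
    "BACKGROUND": ("Background/procedural history", False),
}

# Labels the table explicitly marks as copyrighted.
COPYRIGHTED_LABELS = sorted(
    name for name, (_, copyrighted) in DETECTION_LABELS.items() if copyrighted
)
print(COPYRIGHTED_LABELS)
```

The same label names appear as values for `--labels` in the draw command example, so a list like `COPYRIGHTED_LABELS` could be used to annotate only the copyrighted elements.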

Requirements

  • Python 3.12+
  • Tesseract OCR (for image-only PDFs)

Install tesseract:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr
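Before processing a large volume, it can help to confirm both requirements up front. This is a small standard-library sketch; `check_requirements` is a hypothetical helper, and it only checks that a `tesseract` executable is on the PATH, not which version it is.

```python
import shutil
import sys


def check_requirements(min_python=(3, 12)):
    """Report whether the interpreter and tesseract meet the listed requirements."""
    return {
        # True when the running interpreter is at least Python 3.12.
        "python_ok": sys.version_info >= min_python,
        # Full path to the tesseract executable, or None if it is not found.
        "tesseract_path": shutil.which("tesseract"),
    }


status = check_requirements()
if not status["python_ok"]:
    print("Python 3.12 or newer is required")
if status["tesseract_path"] is None:
    print("tesseract not found; image-only PDFs cannot be OCRed")
```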

License

GNU Affero General Public License v3

Contributing

Contributions welcome!
