Remove copyrighted material from legal case law PDFs

Blackletter

Named for blackletter law, this tool removes potentially copyrighted material from legal case law PDFs. Doing so respects any intellectual property rights others may hold while still making it possible to digitize and publish case law. That is essential to our mission of making case law accessible to all, a prerequisite for meaningful participation in our democracy.

Proprietary annotations removed from judicial opinions include headnotes, captions, and key cites.

Installation

pip install blackletter

Or install from source:

git clone https://github.com/freelawproject/blackletter
cd blackletter
pip install -e .

Quick Start

Command line:

blackletter process path/to/volume.pdf --reporter f3d --volume 952 --first-page 1 --output output/

Python:

from blackletter import process

process(
    "path/to/volume.pdf",
    "output/",
    reporter="f3d",
    volume="952",
    first_page=1,
)

This runs the full pipeline: OCR (if needed), YOLO detection, page number extraction, opinion splitting, and redaction — all in one pass.
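For batch jobs it can be convenient to drive the command-line form from a script. The following is a minimal sketch and not part of Blackletter: the helper names `build_process_command` and `run_process` are hypothetical, only the flags shown in the Quick Start command are used, and the `blackletter` executable is assumed to be on `PATH`.

```python
import subprocess


def build_process_command(pdf, output_dir, reporter, volume, first_page=1):
    """Build the argv list for `blackletter process` (hypothetical helper)."""
    return [
        "blackletter", "process", str(pdf),
        "--reporter", reporter,
        "--volume", str(volume),
        "--first-page", str(first_page),
        "--output", str(output_dir),
    ]


def run_process(pdf, output_dir, reporter, volume, first_page=1):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    cmd = build_process_command(pdf, output_dir, reporter, volume, first_page)
    subprocess.run(cmd, check=True)


# Example call (requires the `blackletter` executable to be installed):
# run_process("path/to/volume.pdf", "output/", "f3d", 952)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.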

How It Works

The process command runs a single-pass pipeline:

1. OCR (if needed): Detects image-only PDFs, downsamples pages, and adds a text layer via ocrmypdf/tesseract
2. Detection: Runs a YOLO model to identify proprietary elements (headnotes, captions, key cites, brackets, etc.) and structural elements (page numbers, dividers, footnotes)
3. Page Numbers: Extracts and validates page numbers using OCR on detected regions
4. Opinion Pairing: Matches case captions to key icons to identify opinion boundaries
5. Splitting & Redaction: Produces up to three output variants per opinion:
   - Unredacted (only with --unredacted): Raw opinion pages extracted from the source
   - Redacted: Potentially copyrighted content (headnotes, brackets, key icons) blacked out; non-opinion content whited out
   - Masked: Optimized for LLM ingestion; only the opinion text is visible

The pipeline additionally produces:

- A full redacted copy of the entire document
- Extracted case law images (charts, photos, etc.) as PNGs
- A detections.json export of all YOLO detections for review tooling
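To make the Opinion Pairing step more concrete, the sketch below pairs each case caption with the next key icon in reading order. This is an assumption about one plausible approach, not Blackletter's actual algorithm: the `Detection` record and `pair_captions_to_icons` function are hypothetical, and reading order is simplified to (page, vertical position), which ignores multi-column layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical, simplified detection record (not Blackletter's schema)."""
    label: str
    page: int
    y: float  # top edge of the box; smaller means higher on the page


def pair_captions_to_icons(detections):
    """Pair each CASE_CAPTION with the next KEY_ICON in reading order.

    Returns a list of (caption, icon) tuples. A caption that is followed by
    another caption before any icon is left unpaired.
    """
    ordered = sorted(detections, key=lambda d: (d.page, d.y))
    pairs = []
    open_caption = None
    for det in ordered:
        if det.label == "CASE_CAPTION":
            open_caption = det  # start (or restart) a candidate opinion
        elif det.label == "KEY_ICON" and open_caption is not None:
            pairs.append((open_caption, det))
            open_caption = None
    return pairs
```

Each resulting pair marks a candidate opinion span from the caption to the icon; other labels are ignored by this sketch.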

Models

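Blackletter supports three YOLO models, selected via CLI flags. The small and medium models are bundled with the package; the large model is downloaded on first use: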

| Flag | File | Classes | Description |
|---|---|---|---|
| (default) | `small.pt` | 14 | Bundled; fast, handles most cases |
| `--medium` | `medium.pt` | 17 | Bundled; better structural detection |
| `--large` | `large.pt` | 21 | Downloaded on first use from Hugging Face; highest accuracy, detects additional elements (editorial, judges, docket, court, citation, date) |

The large model is hosted at flooie/blackletter-large and is downloaded automatically to blackletter/models/large.pt the first time --large is used.
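The flag-to-model mapping in the table above can be sketched as a small lookup. This is illustrative only: `MODELS`, `choose_model`, and `fetch_large_model` are hypothetical names, the precedence between flags is an assumption, and the download helper assumes that the `huggingface_hub` package is installed and that the weights file in the flooie/blackletter-large repository is named `large.pt`.

```python
# Model name -> (weights file, class count), copied from the table above.
MODELS = {
    "small": ("small.pt", 14),
    "medium": ("medium.pt", 17),
    "large": ("large.pt", 21),
}


def choose_model(medium=False, large=False, custom=None):
    """Return the weights filename implied by the flags (hypothetical helper).

    ASSUMPTION: a custom path wins, then --large, then --medium.
    """
    if custom is not None:
        return str(custom)
    if large:
        return MODELS["large"][0]
    if medium:
        return MODELS["medium"][0]
    return MODELS["small"][0]


def fetch_large_model(target_dir):
    """Download the large weights from Hugging Face into target_dir.

    ASSUMPTION: the file in the repository is named large.pt.
    """
    from huggingface_hub import hf_hub_download  # imported lazily

    return hf_hub_download(
        repo_id="flooie/blackletter-large",
        filename="large.pt",
        local_dir=str(target_dir),
    )
```

In normal use none of this is needed, since the download happens automatically the first time --large is passed.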

Command Line Options

Process Command

blackletter process PDF [OPTIONS]

Positional Arguments:
  pdf                       Path to the source PDF

Options:
  --reporter STR            Reporter abbreviation (e.g. f3d, a3d)
  --volume STR              Volume number
  --first-page INT          Page number of the first page in the PDF (default: 1)
  -o, --output PATH         Base output directory (required)
  --model PATH              Path to custom YOLO model weights
  --medium                  Use the medium model (17 classes)
  --large                   Use the large model (21 classes, auto-downloaded)
  --footnotes               Extract footnotes into separate PDFs
  --unredacted              Also generate unredacted opinion PDFs
  --no-shrink               Skip downsampling (default: shrink to ~148 KB/page)
  --optimize {0,1,2,3}      ocrmypdf optimization level (default: 1)
  --bitonal                 Convert to 1-bit B&W before processing (for already-bitonal scans)
  --detect-only             Stop after detection and pairing — no PDFs written (Phase 1 only)

Validate Command

QA tool that checks a PDF's page number sequence for missing, duplicate, or misnumbered pages. Uses YOLO to locate page number regions, then PaddleOCR to read them, with Tesseract and GLM-OCR as fallbacks.

blackletter validate path/to/volume.pdf
blackletter validate path/to/volume.pdf --first-page 100 --last-page 500
blackletter validate path/to/volume.pdf --json

If the filename follows the convention reporter.volume.first.last.pdf (e.g. sct.143.1.888.pdf), the expected page range is inferred automatically.
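The naming convention can be parsed with a short regular expression. The sketch below is not Blackletter's own parser; `VolumeName` and `parse_volume_filename` are hypothetical names, and rejecting an inverted page range is an assumption.

```python
import re
from typing import NamedTuple, Optional


class VolumeName(NamedTuple):
    reporter: str
    volume: str
    first_page: int
    last_page: int


# reporter.volume.first.last.pdf, e.g. sct.143.1.888.pdf
_PATTERN = re.compile(
    r"^(?P<reporter>[^.]+)\.(?P<volume>[^.]+)\.(?P<first>\d+)\.(?P<last>\d+)\.pdf$",
    re.IGNORECASE,
)


def parse_volume_filename(filename: str) -> Optional[VolumeName]:
    """Return the parsed parts, or None if the name does not fit the convention."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    first, last = int(match["first"]), int(match["last"])
    if last < first:
        return None  # ASSUMPTION: an inverted range counts as non-matching
    return VolumeName(match["reporter"], match["volume"], first, last)
```

Pass only the base name (for example `pathlib.Path(path).name`), since the pattern is anchored to the whole string.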

Features:

- Parallel OCR across multiple workers
- Auto-correction of consistent OCR misreadings (e.g. systematic off-by-800 errors)
- Detection of gaps, duplicates, backwards jumps, and page ranges (e.g. "31-32")
- Structural checks for blank pages and orientation changes
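The sequence checks listed above can be illustrated with plain integers. This is a simplified sketch rather than the validate command's implementation: `check_page_sequence` and `most_common_offset` are hypothetical names, and treating the most frequent offset as the systematic error is an assumption.

```python
from collections import Counter


def check_page_sequence(pages):
    """Classify each step between consecutive page numbers.

    `pages` is a list of integers in PDF order. Returns a list of
    (index, kind, previous, current) tuples, where kind is one of
    "duplicate", "backwards", or "gap".
    """
    issues = []
    for i in range(1, len(pages)):
        prev, cur = pages[i - 1], pages[i]
        if cur == prev:
            issues.append((i, "duplicate", prev, cur))
        elif cur < prev:
            issues.append((i, "backwards", prev, cur))
        elif cur > prev + 1:
            issues.append((i, "gap", prev, cur))
    return issues


def most_common_offset(pages, first_page):
    """Return the most frequent difference between read and expected numbers.

    The expected number at position i is first_page + i. A large, consistent
    offset suggests a systematic OCR misreading rather than missing pages.
    """
    if not pages:
        return 0
    offsets = Counter(read - (first_page + i) for i, read in enumerate(pages))
    return offsets.most_common(1)[0][0]
```

A consistent non-zero offset could then be subtracted from the affected readings before the sequence is checked again.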

Requires optional dependencies: pip install "blackletter[analyze]" (quoted so that shells such as zsh do not try to expand the brackets)

Draw Command

Visualize YOLO detections on a PDF — useful for debugging model output:

blackletter draw path/to/volume.pdf --output annotated.pdf
blackletter draw path/to/volume.pdf --output annotated.pdf --labels CASE_CAPTION KEY_ICON HEADNOTE
blackletter draw path/to/volume.pdf --output annotated.pdf --large

Output Structure

output/<reporter>/<volume>/<first-page>/
    <reporter>.<volume>.<first>.<last>.pdf   # OCR'd/processed source PDF
    <reporter>.<volume>.redacted.pdf         # Full redacted document

    detections.json       # All YOLO detections (label, bbox, confidence per page)
    pages_meta.json       # Column bounds and midpoints per page
    opinions.json         # Opinion pairs with outside-opinion rects
    redaction_rects.json  # Precomputed redaction rectangles (used by review UI)
    margin_rects.json     # Margin cleanup rectangles

    images/               # Extracted case law images (PNGs)
    unredacted/           # Individual opinion PDFs (raw, no redaction)
    redacted/             # Individual opinion PDFs (copyrighted content redacted)
    masked/               # Individual opinion PDFs (for LLM ingestion)
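When post-processing results it helps to compute where each artifact should land. The sketch below only mirrors the layout shown above with `pathlib`; `volume_output_dir` and `expected_outputs` are hypothetical helpers, not functions exported by Blackletter.

```python
from pathlib import Path


def volume_output_dir(base, reporter, volume, first_page):
    """Return <base>/<reporter>/<volume>/<first-page>/ as a Path."""
    return Path(base) / reporter / str(volume) / str(first_page)


def expected_outputs(base, reporter, volume, first_page, last_page):
    """Map a short key to the path where each artifact is expected."""
    root = volume_output_dir(base, reporter, volume, first_page)
    stem = f"{reporter}.{volume}"
    return {
        "source": root / f"{stem}.{first_page}.{last_page}.pdf",
        "redacted_full": root / f"{stem}.redacted.pdf",
        "detections": root / "detections.json",
        "opinions": root / "opinions.json",
        "redacted_dir": root / "redacted",
        "masked_dir": root / "masked",
    }
```

Checking `path.exists()` on each value is a quick way to confirm that a run completed.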

The JSON files are designed for use with a review UI — they allow manual inspection and adjustment of detections and redaction boundaries before final output is committed.
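As an example of manual inspection, the sketch below counts low-confidence detections per label. The file schema is an assumption: the code treats detections.json as a flat list of objects with "label" and "confidence" keys, which may not match the real layout, and both function names are hypothetical.

```python
import json
from collections import Counter


def low_confidence_counts(detections, threshold=0.5):
    """Count detections below `threshold`, grouped by label.

    ASSUMPTION: `detections` is a flat list of dicts with "label" and
    "confidence" keys. The real detections.json layout may differ.
    """
    return Counter(
        d["label"] for d in detections if d["confidence"] < threshold
    )


def load_detections(path):
    """Read a JSON file; assumes the top level is the flat list described above."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Labels with many low-confidence hits are natural candidates for a closer look in the review UI.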

Detection Labels

Labels detected across all models (availability depends on model size):

| Label | Models | Description |
|---|---|---|
| KEY_ICON | all | West key cite icons |
| DIVIDER | all | Opinion section dividers |
| PAGE_HEADER | all | Running headers |
| CASE_CAPTION | all | Opinion title/parties |
| FOOTNOTES | all | Footnote sections |
| HEADNOTE_BRACKET | all | Bracketed headnote markers |
| CASE_METADATA | all | Court, date, counsel info |
| CASE_SEQUENCE | all | Docket/case sequence numbers |
| PAGE_NUMBER | all | Page numbers |
| STATE_ABBREVIATION | all | State abbreviation markers |
| IMAGE | all | Photos, charts, diagrams |
| HEADNOTE | all | Headnote text |
| BACKGROUND | all | Background/procedural history region |
| SYLLABUS | all | Supreme Court syllabus sections |
| EDITORIAL | medium, large | Editorial notes |
| JUDGES | medium, large | Judge name blocks |
| TEXT_COLUMN | medium, large | Column boundaries |
| DOCKET | large | Docket number regions |
| DATE | large | Decision date regions |
| COURT | large | Court name regions |
| CITATION | large | Reporter citation regions |

Margin Cleanup

After redaction, Blackletter automatically whites out scan artifacts in page margins, using the PDF text layer to find content boundaries. Pages with narrow text spans (appendices, image pages) are skipped.
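The idea can be sketched with plain rectangles: take the union of the word boxes from the text layer, then treat everything outside it as margin. This is a simplified illustration, not Blackletter's implementation; `margin_rects`, the `pad` allowance, and the `min_width_ratio` threshold for skipping narrow pages are all assumptions.

```python
def margin_rects(page_width, page_height, word_boxes,
                 min_width_ratio=0.5, pad=5.0):
    """Return rectangles covering the margins outside the text bounds.

    word_boxes: iterable of (x0, y0, x1, y1) tuples from the text layer.
    Returns [] when there is no text or the text span is narrower than
    min_width_ratio * page_width (e.g. appendices or image pages).
    """
    boxes = list(word_boxes)
    if not boxes:
        return []
    # Content bounds, padded and clamped to the page.
    left = max(0.0, min(b[0] for b in boxes) - pad)
    top = max(0.0, min(b[1] for b in boxes) - pad)
    right = min(page_width, max(b[2] for b in boxes) + pad)
    bottom = min(page_height, max(b[3] for b in boxes) + pad)
    if (right - left) < min_width_ratio * page_width:
        return []
    candidates = [
        (0.0, 0.0, left, page_height),          # left margin
        (right, 0.0, page_width, page_height),  # right margin
        (left, 0.0, right, top),                # top margin
        (left, bottom, right, page_height),     # bottom margin
    ]
    # Drop zero-area strips.
    return [r for r in candidates if r[2] > r[0] and r[3] > r[1]]
```

The returned rectangles are the regions that would be filled with white; the text area itself is never touched.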

Requirements

Python 3.12+
Tesseract OCR (for image-only PDFs)

Install tesseract:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

License

GNU Affero General Public License v3

Contributing

Contributions welcome!

Project details

Maintainers

floooie

Release history

- 0.0.8 (Apr 1, 2026)
- 0.0.7 (Apr 1, 2026)
- 0.0.6 (Apr 1, 2026)
- 0.0.5 (Apr 1, 2026), this version
- 0.0.4 (Mar 31, 2026)
- 0.0.3 (Mar 20, 2026)
- 0.0.2 (Mar 20, 2026)
- 0.0.1 (Feb 28, 2026)

Download files

Source Distribution

blackletter-0.0.5.tar.gz (68.9 MB)

Uploaded Apr 1, 2026 Source

Built Distribution

blackletter-0.0.5-py3-none-any.whl (68.8 MB)

Uploaded Apr 1, 2026 Python 3

File details

Details for the file blackletter-0.0.5.tar.gz.

File metadata

Download URL: blackletter-0.0.5.tar.gz
Upload date: Apr 1, 2026
Size: 68.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for blackletter-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`a97d75a502a0491c2167e88ba6d19473de07e55e741a30dd6a5ff363f0304a11`
MD5	`e4464a7822d0e6528774039b12fd8dc8`
BLAKE2b-256	`1414e83ca4a078f7f74121c7149e8ee90960cee49624df931979a8a87e6fbc2e`

Provenance

The following attestation bundles were made for blackletter-0.0.5.tar.gz:

Publisher: pypi.yml on freelawproject/blackletter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: blackletter-0.0.5.tar.gz
- Subject digest: a97d75a502a0491c2167e88ba6d19473de07e55e741a30dd6a5ff363f0304a11
- Sigstore transparency entry: 1206361161
- Sigstore integration time: Apr 1, 2026
Source repository:
- Permalink: freelawproject/blackletter@06f2175574e66d073bc3de8c9dbcf72a15843453
- Branch / Tag: refs/tags/v0.0.5
- Owner: https://github.com/freelawproject
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@06f2175574e66d073bc3de8c9dbcf72a15843453
- Trigger Event: push

File details

Details for the file blackletter-0.0.5-py3-none-any.whl.

File metadata

Download URL: blackletter-0.0.5-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 68.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for blackletter-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51cd197e8f5a739af61b014c9e82ba21c73f5afde3ee86b1650fc007b6cfaf2a`
MD5	`fac5d05e516f5f8aafc9b25527af42e4`
BLAKE2b-256	`d060d04abf85fc3085fff36f7201d9a729a7a6df60ee0084f2e2b7d4d4ecd98e`

Provenance

The following attestation bundles were made for blackletter-0.0.5-py3-none-any.whl:

Publisher: pypi.yml on freelawproject/blackletter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: blackletter-0.0.5-py3-none-any.whl
- Subject digest: 51cd197e8f5a739af61b014c9e82ba21c73f5afde3ee86b1650fc007b6cfaf2a
- Sigstore transparency entry: 1206361197
- Sigstore integration time: Apr 1, 2026
Source repository:
- Permalink: freelawproject/blackletter@06f2175574e66d073bc3de8c9dbcf72a15843453
- Branch / Tag: refs/tags/v0.0.5
- Owner: https://github.com/freelawproject
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@06f2175574e66d073bc3de8c9dbcf72a15843453
- Trigger Event: push
