Remove copyrighted material from legal case law PDFs
Blackletter
Named for blackletter law, this tool removes potentially copyrighted material from legal case law PDFs. Doing so respects any intellectual property rights others may hold while making it possible to digitize and publish case law. That is essential to our mission of making case law accessible to all, a prerequisite for meaningful participation in our democracy.
Proprietary annotations removed from judicial opinions include headnotes, captions, and key cites.
Installation
pip install blackletter
Or install from source:
git clone https://github.com/freelawproject/blackletter
cd blackletter
pip install -e .
Quick Start
Command line:
blackletter process path/to/volume.pdf --reporter f3d --volume 952 --first-page 1 --output output/
Python:
from blackletter import process
process(
    "path/to/volume.pdf",
    "output/",
    reporter="f3d",
    volume="952",
    first_page=1,
)
This runs the full pipeline: OCR (if needed), YOLO detection, page number extraction, opinion splitting, and redaction — all in one pass.
How It Works
The process command runs a single-pass pipeline:
- OCR (if needed): Detects image-only PDFs, downsamples pages, and adds a text layer via ocrmypdf/tesseract
- Detection: Runs a YOLO model to identify proprietary elements (headnotes, captions, key cites, brackets, etc.) and structural elements (page numbers, dividers, footnotes)
- Page Numbers: Extracts and validates page numbers using OCR on detected regions
- Opinion Pairing: Matches case captions to key icons to identify opinion boundaries
- Splitting & Redaction: Produces three output variants per opinion:
- Unredacted: Raw opinion pages extracted from the source
- Redacted: Potentially copyrighted content (headnotes, brackets, key icons) blacked out; non-opinion content whited out
- Masked: Optimized for LLM ingestion — only the opinion text is visible
Additionally produces:
- A full redacted copy of the entire document
- Extracted case law images (charts, photos, etc.) as PNGs
- A detections.json export of all YOLO detections for review tooling
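The Opinion Pairing step above can be illustrated with a minimal sketch. It assumes an opinion opens at a CASE_CAPTION and closes at the next KEY_ICON in reading order, and it sorts by page and vertical position only, ignoring columns. The `Detection` record and `pair_opinions` function are hypothetical names for illustration, not Blackletter's internal API, and the actual pairing logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; the real detections.json schema may differ."""
    page: int    # zero-based page index
    y: float     # top edge of the bounding box, used for reading order
    label: str   # e.g. "CASE_CAPTION" or "KEY_ICON"


def pair_opinions(detections):
    """Pair each CASE_CAPTION with the next KEY_ICON in reading order.

    Returns a list of (caption, key_icon) tuples. A caption that is
    followed by another caption before any key icon is dropped.
    """
    pairs = []
    open_caption = None
    # Assumption: single-column reading order (page, then top-to-bottom).
    for det in sorted(detections, key=lambda d: (d.page, d.y)):
        if det.label == "CASE_CAPTION":
            open_caption = det  # a new opinion starts here
        elif det.label == "KEY_ICON" and open_caption is not None:
            pairs.append((open_caption, det))  # the opinion ends here
            open_caption = None
    return pairs
```

Labels other than CASE_CAPTION and KEY_ICON are ignored, so the full detection list can be passed in unfiltered.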
Models
Blackletter supports three YOLO models, selected via CLI flags. The small and medium models are bundled; the large model is downloaded on first use:
| Flag | File | Classes | Description |
|---|---|---|---|
| (default) | small.pt | 14 | Bundled; fast, handles most cases |
| --medium | medium.pt | 17 | Bundled; better structural detection |
| --large | large.pt | 21 | Downloaded on first use from Hugging Face; highest accuracy, detects additional elements (editorial, judges, docket, court, citation, date) |
The large model is hosted at flooie/blackletter-large and is downloaded automatically to blackletter/models/large.pt the first time --large is used.
Command Line Options
Process Command
blackletter process PDF [OPTIONS]
Positional Arguments:
    pdf                   Path to the source PDF
Options:
    --reporter STR        Reporter abbreviation (e.g. f3d, a3d)
    --volume STR          Volume number
    --first-page INT      Page number of the first page in the PDF (default: 1)
    -o, --output PATH     Base output directory (required)
    --model PATH          Path to custom YOLO model weights
    --medium              Use the medium model (17 classes)
    --large               Use the large model (21 classes, auto-downloaded)
    --footnotes           Extract footnotes into separate PDFs
    --unredacted          Also generate unredacted opinion PDFs
    --no-shrink           Skip downsampling (default: shrink to ~148 KB/page)
    --optimize {0,1,2,3}  ocrmypdf optimization level (default: 1)
    --bitonal             Convert to 1-bit B&W before processing (for already-bitonal scans)
    --detect-only         Stop after detection and pairing; no PDFs written (Phase 1 only)
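When driving the process command from a script, it can help to build the argument list in one place. The sketch below uses only the flags listed above; the helper name `build_process_args` and its `model_size` parameter are illustrative assumptions, not part of Blackletter.

```python
def build_process_args(pdf, output, reporter, volume, first_page=1,
                       model_size=None, footnotes=False, unredacted=False):
    """Return an argv list for `blackletter process` using documented flags.

    model_size is None (default small model), "medium", or "large".
    """
    if model_size not in (None, "medium", "large"):
        raise ValueError(f"unknown model size: {model_size!r}")
    args = [
        "blackletter", "process", str(pdf),
        "--reporter", reporter,
        "--volume", str(volume),
        "--first-page", str(first_page),
        "--output", str(output),
    ]
    if model_size:
        args.append(f"--{model_size}")  # --medium or --large
    if footnotes:
        args.append("--footnotes")
    if unredacted:
        args.append("--unredacted")
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting problems with paths that contain spaces.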
Validate Command
QA tool that checks a PDF's page number sequence for missing, duplicate, or misnumbered pages. Uses YOLO to locate page number regions, then PaddleOCR to read them, with Tesseract and GLM-OCR as fallbacks.
blackletter validate path/to/volume.pdf
blackletter validate path/to/volume.pdf --first-page 100 --last-page 500
blackletter validate path/to/volume.pdf --json
If the filename follows the convention reporter.volume.first.last.pdf (e.g. sct.143.1.888.pdf), the expected page range is inferred automatically.
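As an illustration of the naming convention, the following sketch parses a filename of the form reporter.volume.first.last.pdf. The function and type names are hypothetical, and it assumes the reporter abbreviation itself contains no dots; Blackletter's own inference may handle more cases.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class VolumeName(NamedTuple):
    reporter: str
    volume: str
    first_page: int
    last_page: int


def parse_volume_filename(path) -> Optional[VolumeName]:
    """Parse reporter.volume.first.last.pdf, or return None if it does not match."""
    parts = Path(path).name.split(".")
    if len(parts) != 5 or parts[-1].lower() != "pdf":
        return None
    reporter, volume, first, last, _ext = parts
    if not (first.isdecimal() and last.isdecimal()):
        return None
    first_page, last_page = int(first), int(last)
    if first_page > last_page:
        return None  # an inverted range is treated as not following the convention
    return VolumeName(reporter, volume, first_page, last_page)
```

For sct.143.1.888.pdf this yields reporter "sct", volume "143", and an expected page range of 1 through 888.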
Features:
- Parallel OCR across multiple workers
- Auto-correction of consistent OCR misreadings (e.g. systematic off-by-800 errors)
- Detection of gaps, duplicates, backwards jumps, and page ranges (e.g. "31-32")
- Structural checks for blank pages and orientation changes
Requires optional dependencies: pip install blackletter[analyze]
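The sequence checks can be pictured with a small sketch. It classifies gaps, duplicates, and backwards jumps in a list of page numbers that have already been read from the PDF in document order. It does not attempt OCR auto-correction or page-range handling, and `check_sequence` is a hypothetical name rather than Blackletter's API.

```python
def check_sequence(pages):
    """Classify problems in page numbers listed in document order.

    Returns a list of (index, kind, detail) tuples, where index is the
    position of the offending page in the input list.
    """
    issues = []
    for i in range(1, len(pages)):
        prev, cur = pages[i - 1], pages[i]
        if cur == prev:
            issues.append((i, "duplicate", cur))
        elif cur < prev:
            issues.append((i, "backwards", (prev, cur)))
        elif cur > prev + 1:
            # detail is the inclusive range of missing page numbers
            issues.append((i, "gap", (prev + 1, cur - 1)))
    return issues
```

Each page is compared only with its predecessor, so a single misread number typically shows up as two adjacent issues (a jump away and a jump back).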
Draw Command
Visualize YOLO detections on a PDF — useful for debugging model output:
blackletter draw path/to/volume.pdf --output annotated.pdf
blackletter draw path/to/volume.pdf --output annotated.pdf --labels CASE_CAPTION KEY_ICON HEADNOTE
blackletter draw path/to/volume.pdf --output annotated.pdf --large
Output Structure
output/<reporter>/<volume>/<first-page>/
    <reporter>.<volume>.<first>.<last>.pdf   # OCR'd/processed source PDF
    <reporter>.<volume>.redacted.pdf         # Full redacted document
    detections.json                          # All YOLO detections (label, bbox, confidence per page)
    pages_meta.json                          # Column bounds and midpoints per page
    opinions.json                            # Opinion pairs with outside-opinion rects
    redaction_rects.json                     # Precomputed redaction rectangles (used by review UI)
    margin_rects.json                        # Margin cleanup rectangles
    images/                                  # Extracted case law images (PNGs)
    unredacted/                              # Individual opinion PDFs (raw, no redaction)
    redacted/                                # Individual opinion PDFs (copyrighted content redacted)
    masked/                                  # Individual opinion PDFs (for LLM ingestion)
The JSON files are designed for use with a review UI — they allow manual inspection and adjustment of detections and redaction boundaries before final output is committed.
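For scripts that post-process a run, the layout above can be turned into paths. This is a sketch that assumes the directory and file names are exactly as listed and that the <first-page> directory is the first page number; the helper names are hypothetical.

```python
from pathlib import Path


def expected_outputs(base, reporter, volume, first_page, last_page):
    """Map short names to the paths a finished run is expected to contain."""
    root = Path(base) / reporter / str(volume) / str(first_page)
    return {
        "source": root / f"{reporter}.{volume}.{first_page}.{last_page}.pdf",
        "redacted_full": root / f"{reporter}.{volume}.redacted.pdf",
        "detections": root / "detections.json",
        "pages_meta": root / "pages_meta.json",
        "opinions": root / "opinions.json",
        "redaction_rects": root / "redaction_rects.json",
        "margin_rects": root / "margin_rects.json",
        "images": root / "images",
        "redacted": root / "redacted",
        "masked": root / "masked",
    }


def missing_outputs(paths):
    """Return the sorted names whose paths do not exist on disk."""
    return sorted(name for name, path in paths.items() if not path.exists())
```

The unredacted/ directory is left out of the map because the options list describes it as generated only with --unredacted.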
Detection Labels
Labels detected across all models (availability depends on model size):
| Label | Models | Description |
|---|---|---|
| KEY_ICON | all | West key cite icons |
| DIVIDER | all | Opinion section dividers |
| PAGE_HEADER | all | Running headers |
| CASE_CAPTION | all | Opinion title/parties |
| FOOTNOTES | all | Footnote sections |
| HEADNOTE_BRACKET | all | Bracketed headnote markers |
| CASE_METADATA | all | Court, date, counsel info |
| CASE_SEQUENCE | all | Docket/case sequence numbers |
| PAGE_NUMBER | all | Page numbers |
| STATE_ABBREVIATION | all | State abbreviation markers |
| IMAGE | all | Photos, charts, diagrams |
| HEADNOTE | all | Headnote text |
| BACKGROUND | all | Background/procedural history region |
| SYLLABUS | all | Supreme Court syllabus sections |
| EDITORIAL | medium, large | Editorial notes |
| JUDGES | medium, large | Judge name blocks |
| TEXT_COLUMN | medium, large | Column boundaries |
| DOCKET | large | Docket number regions |
| DATE | large | Decision date regions |
| COURT | large | Court name regions |
| CITATION | large | Reporter citation regions |
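The table can be mirrored as a small lookup, which is handy when validating a list of labels before passing it to the draw command. The label names are taken from the table; the `labels_for` helper and the model names "small", "medium", and "large" as strings are assumptions for illustration.

```python
BASE_LABELS = (
    "KEY_ICON", "DIVIDER", "PAGE_HEADER", "CASE_CAPTION", "FOOTNOTES",
    "HEADNOTE_BRACKET", "CASE_METADATA", "CASE_SEQUENCE", "PAGE_NUMBER",
    "STATE_ABBREVIATION", "IMAGE", "HEADNOTE", "BACKGROUND", "SYLLABUS",
)
MEDIUM_EXTRA = ("EDITORIAL", "JUDGES", "TEXT_COLUMN")
LARGE_EXTRA = ("DOCKET", "DATE", "COURT", "CITATION")


def labels_for(model: str) -> tuple[str, ...]:
    """Return the labels available for a model size, per the table above."""
    if model == "small":
        return BASE_LABELS
    if model == "medium":
        return BASE_LABELS + MEDIUM_EXTRA
    if model == "large":
        return BASE_LABELS + MEDIUM_EXTRA + LARGE_EXTRA
    raise ValueError(f"unknown model: {model!r}")
```

The resulting counts (14, 17, and 21) match the class counts given for the small, medium, and large models.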
Margin Cleanup
After redaction, Blackletter whites out scan artifacts in the page margins, using the PDF text layer to find the content boundaries. Pages with narrow text spans (appendices, image pages) are skipped.
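A minimal sketch of the idea, under stated assumptions: word boxes from the text layer are given as (x0, y0, x1, y1) tuples in page coordinates, a page counts as "narrow" when its text spans less than half the page width, and a small padding is kept around the text. The `margin_rects` function, the 0.5 threshold, and the 5-point padding are all illustrative choices, not Blackletter's actual values.

```python
def margin_rects(page_w, page_h, word_boxes, min_span=0.5, pad=5.0):
    """Return rectangles covering the margins outside the text bounds.

    word_boxes is a list of (x0, y0, x1, y1) tuples. Returns an empty list
    when there is no text or when the text span is too narrow to trust.
    """
    if not word_boxes:
        return []
    x0 = min(b[0] for b in word_boxes)
    y0 = min(b[1] for b in word_boxes)
    x1 = max(b[2] for b in word_boxes)
    y1 = max(b[3] for b in word_boxes)
    if (x1 - x0) < min_span * page_w:
        return []  # narrow text span (e.g. an image page): skip cleanup
    left, top = max(x0 - pad, 0), max(y0 - pad, 0)
    right, bottom = min(x1 + pad, page_w), min(y1 + pad, page_h)
    rects = [
        (0, 0, left, page_h),            # left margin, full height
        (right, 0, page_w, page_h),      # right margin, full height
        (left, 0, right, top),           # top margin between the side strips
        (left, bottom, right, page_h),   # bottom margin between the side strips
    ]
    # Drop degenerate rectangles when the text reaches the page edge.
    return [r for r in rects if r[2] > r[0] and r[3] > r[1]]
```

The four rectangles do not overlap, so each can be filled with white independently.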
Requirements
- Python 3.12+
- Tesseract OCR (for image-only PDFs)
Install tesseract:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt install tesseract-ocr
License
GNU Affero General Public License v3
Contributing
Contributions welcome!