Remove copyrighted material from legal case law PDFs
Blackletter
Named after blackletter law, this tool removes proprietary annotations from judicial opinions (specifically headnotes, captions, key cites, and other copyrighted materials) while preserving the authentic opinion text.
Installation
pip install blackletter
Or install from source:
git clone https://github.com/freelawproject/blackletter
cd blackletter
pip install -e .
Quick Start
Command line:
blackletter process path/to/volume.pdf --reporter f3d --volume 952 --first-page 1 --output output/
Python:
from blackletter import process
process(
"path/to/volume.pdf",
"output/",
reporter="f3d",
volume="952",
first_page=1,
)
This runs the full pipeline in one pass: OCR (if needed), YOLO detection, page number extraction, opinion splitting, and redaction.
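To process several volumes in one run, the Python call above can be wrapped in a loop. The following is a minimal sketch, not part of blackletter: `Volume` and `process_volumes` are hypothetical names, and the `process_fn` parameter exists only so the loop can be exercised without the real pipeline. It assumes `process` accepts exactly the arguments shown in the Quick Start and makes no assumption about its return value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Volume:
    """One source PDF plus the metadata passed to process() (hypothetical helper)."""
    pdf: str
    reporter: str
    volume: str
    first_page: int = 1


def process_volumes(volumes, output_dir, process_fn: Optional[Callable] = None):
    """Run the pipeline on each volume; return a list of (volume, error) failures."""
    if process_fn is None:
        # Imported lazily so the sketch can be tried without blackletter installed.
        from blackletter import process as process_fn

    failures = []
    for vol in volumes:
        try:
            process_fn(
                vol.pdf,
                output_dir,
                reporter=vol.reporter,
                volume=vol.volume,
                first_page=vol.first_page,
            )
        except Exception as exc:  # keep going so one bad PDF does not stop the batch
            failures.append((vol, exc))
    return failures
```

Collecting failures instead of raising immediately is a design choice for long batch runs; re-raise inside the loop if a single failure should abort everything.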
How It Works
The process command runs a single-pass pipeline:
- OCR (if needed): Detects image-only PDFs, downsamples pages, and adds a text layer via ocrmypdf/tesseract
- Detection: Runs a YOLO model to identify copyrighted elements (headnotes, captions, key cites, brackets, etc.) and structural elements (page numbers, dividers, footnotes)
- Page Numbers: Extracts and validates page numbers using OCR on detected regions
- Opinion Pairing: Matches case captions to key icons to identify opinion boundaries
- Splitting & Redaction: Produces three output variants per opinion:
  - Unredacted: Raw opinion pages extracted from the source
  - Redacted: Copyrighted content (headnotes, brackets, key icons) blacked out; non-opinion content whited out
  - Masked: Optimized for LLM ingestion, with only the opinion text visible
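The opinion pairing step can be pictured with a small example. This is an illustrative sketch only, not blackletter's actual implementation: it assumes each opinion starts at a `CASE_CAPTION` and ends at the next `KEY_ICON` in reading order, and the `Detection` type and `pair_opinions` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str   # e.g. "CASE_CAPTION" or "KEY_ICON"
    page: int    # zero-based page index in the PDF
    y: float     # vertical position on the page, top = 0


def pair_opinions(detections):
    """Pair each caption with the next key icon in reading order.

    Returns a list of (caption, key_icon) tuples. A caption with no later
    key icon, or a key icon with no open caption, is left unpaired.
    """
    ordered = sorted(detections, key=lambda d: (d.page, d.y))
    pairs = []
    open_caption = None
    for det in ordered:
        if det.label == "CASE_CAPTION":
            # Assumption: a new caption replaces an unclosed one.
            open_caption = det
        elif det.label == "KEY_ICON" and open_caption is not None:
            pairs.append((open_caption, det))
            open_caption = None
    return pairs
```

Each returned pair gives the first and last page of one opinion, which is the information a splitter would need to cut the volume into per-opinion PDFs.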
The pipeline also produces:
- A full redacted copy of the entire document
- A verification report with detection stats and page number mappings
- Extracted case law images (charts, photos, etc.) as PNGs
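The page number mappings in the verification report can be thought of as a comparison between OCR'd page numbers and the numbers implied by the first page. The sketch below is an assumption about how such a check might look, not the tool's actual logic: `check_page_numbers` is a hypothetical name, and it assumes page numbers increase by one per PDF page.

```python
def check_page_numbers(detected, first_page=1):
    """Compare detected page numbers with the expected sequence.

    `detected` holds one entry per PDF page: the OCR'd number, or None
    when no number was read. Returns (pdf_index, expected, found) tuples
    for every page whose number is missing or wrong.
    """
    mismatches = []
    for index, found in enumerate(detected):
        expected = first_page + index
        if found != expected:
            mismatches.append((index, expected, found))
    return mismatches
```

An empty result means every page carried the expected number; any entries point at pages worth inspecting by hand.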
Command Line Options
blackletter process PDF [OPTIONS]

Positional Arguments:
  pdf                   Path to the source PDF

Options:
  --reporter STR        Reporter abbreviation (e.g. f3d, a3d)
  --volume STR          Volume number
  --first-page INT      Page number of the first page in the PDF (default: 1)
  -o, --output PATH     Base output directory (required)
  --model PATH          Path to YOLO model weights (default: bundled run_9.pt)
  --footnotes           Extract footnotes into separate PDFs
  --no-unredacted       Skip generating unredacted opinion PDFs
  --no-shrink           Skip downsampling (default: shrink to ~148 KB/page)
  --optimize {0,1,2,3}  ocrmypdf optimization level (default: 1)
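When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags documented above. `build_process_command` and `run_process` are hypothetical helpers, not part of blackletter, and they assume the `blackletter` executable is on `PATH`.

```python
import subprocess


def build_process_command(pdf, output, reporter, volume, first_page=1,
                          footnotes=False, unredacted=True, shrink=True,
                          optimize=None, model=None):
    """Return the argv list for `blackletter process` using documented flags."""
    cmd = [
        "blackletter", "process", str(pdf),
        "--reporter", reporter,
        "--volume", str(volume),
        "--first-page", str(first_page),
        "--output", str(output),
    ]
    if model is not None:
        cmd += ["--model", str(model)]
    if footnotes:
        cmd.append("--footnotes")
    if not unredacted:
        cmd.append("--no-unredacted")
    if not shrink:
        cmd.append("--no-shrink")
    if optimize is not None:
        if optimize not in (0, 1, 2, 3):
            raise ValueError("optimize must be 0, 1, 2, or 3")
        cmd += ["--optimize", str(optimize)]
    return cmd


def run_process(**kwargs):
    """Run the command; subprocess raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_process_command(**kwargs), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.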
Draw Command
For debugging YOLO detections:
blackletter draw path/to/volume.pdf --output annotated.pdf
blackletter draw path/to/volume.pdf --output annotated.pdf --labels CASE_CAPTION KEY_ICON HEADNOTE
Output Structure
output/<reporter>/<volume>/<first-page>/
    verify.txt                          # Detection stats and page number report
    <reporter>.<volume>.redacted.pdf    # Full redacted document
    images/                             # Extracted images (PNGs)
    unredacted/                         # Individual opinion PDFs (raw)
    redacted/                           # Individual opinion PDFs (copyrighted content redacted)
    masked/                             # Individual opinion PDFs (for LLM ingestion)
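Downstream scripts can compute these locations from the same values passed to the pipeline. This is a small sketch, assuming the layout shown above holds exactly and that `<first-page>` is rendered as a plain integer; `expected_layout` is a hypothetical helper, not part of blackletter.

```python
from pathlib import Path


def expected_layout(base, reporter, volume, first_page):
    """Map each documented output to its expected path under `base`."""
    root = Path(base) / reporter / str(volume) / str(first_page)
    return {
        "root": root,
        "verify": root / "verify.txt",
        "full_redacted": root / f"{reporter}.{volume}.redacted.pdf",
        "images": root / "images",
        "unredacted": root / "unredacted",
        "redacted": root / "redacted",
        "masked": root / "masked",
    }
```

For example, `expected_layout("output", "f3d", "952", 1)["masked"]` points at the directory of LLM-ready opinion PDFs for that volume.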
Detection Labels
The YOLO model detects 13 element types:
| Label | Description |
|---|---|
| KEY_ICON | West key cite icons (copyrighted) |
| DIVIDER | Opinion section dividers |
| PAGE_HEADER | Running headers (copyrighted) |
| CASE_CAPTION | Opinion title/parties |
| FOOTNOTES | Footnote sections |
| HEADNOTE_BRACKET | Bracketed headnote markers (copyrighted) |
| CASE_METADATA | Court, date, counsel info |
| CASE_SEQUENCE | Docket/case sequence numbers |
| PAGE_NUMBER | Page numbers |
| STATE_ABBREVIATION | State abbreviation markers |
| IMAGE | Photos, charts, diagrams |
| HEADNOTE | Headnote text (copyrighted) |
| BACKGROUND | Background/procedural history |
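For code that consumes detections, the table can be mirrored as a lookup. This sketch is hypothetical: the label names and the "(copyrighted)" markings are copied from the table above, while `LABELS`, `COPYRIGHTED_LABELS`, and `is_copyrighted` are illustrative names and are not taken from blackletter's source.

```python
LABELS = (
    "KEY_ICON", "DIVIDER", "PAGE_HEADER", "CASE_CAPTION", "FOOTNOTES",
    "HEADNOTE_BRACKET", "CASE_METADATA", "CASE_SEQUENCE", "PAGE_NUMBER",
    "STATE_ABBREVIATION", "IMAGE", "HEADNOTE", "BACKGROUND",
)

# Labels the table marks "(copyrighted)".
COPYRIGHTED_LABELS = frozenset(
    {"KEY_ICON", "PAGE_HEADER", "HEADNOTE_BRACKET", "HEADNOTE"}
)


def is_copyrighted(label):
    """Return True if the table marks this label as copyrighted."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return label in COPYRIGHTED_LABELS
```

The pipeline description also lists captions among the material it removes, so the set the tool actually redacts may be broader than the four labels marked in the table.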
Requirements
- Python 3.12+
- Tesseract OCR (for image-only PDFs)
Install tesseract:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt install tesseract-ocr
License
GNU Affero General Public License v3
Contributing
Contributions welcome!