PDF, DOCX, and HTML text extraction and normalization for academic papers
docpluck
Built from cross-project experience across 8,000+ PDFs spanning psychology, medicine, economics, physics, and biology. Achieves 100% accuracy on 29 manually verified ground-truth passages (see BENCHMARKS.md).
Supports three input formats:
- PDF via pdftotext default mode (with pdfplumber SMP recovery)
- DOCX via mammoth (DOCX → HTML → text, preserving Shift+Enter soft breaks)
- HTML via beautifulsoup4 + lxml (block/inline-aware tree-walk)
All three formats feed into the same 15-step normalization pipeline and quality scoring.
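Because every extractor takes raw bytes and returns a (text, method) tuple, choosing one by file suffix is straightforward. The sketch below is one way to do it; the names EXTRACTOR_BY_SUFFIX, pick_extractor_name, and extract_any are hypothetical helpers and not part of docpluck, and treating .htm as HTML is an assumption.

```python
from pathlib import Path

# Suffix -> name of the docpluck extractor documented in this README.
# The mapping itself is an illustration, not something docpluck ships.
EXTRACTOR_BY_SUFFIX = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx",
    ".html": "extract_html",
    ".htm": "extract_html",
}


def pick_extractor_name(path: str) -> str:
    """Return the extractor name for a path, based on its suffix alone."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTOR_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix or '(no suffix)'}")
    return EXTRACTOR_BY_SUFFIX[suffix]


def extract_any(path: str) -> tuple:
    """Read the file and call the matching docpluck extractor."""
    import docpluck  # imported lazily so the suffix helper works without it

    extractor = getattr(docpluck, pick_extractor_name(path))
    return extractor(Path(path).read_bytes())
```

The returned text can then go through normalize_text and compute_quality_score regardless of which format it came from.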
Install
# PDF only (pdfplumber)
pip install docpluck
# + DOCX support (adds mammoth)
pip install docpluck[docx]
# + HTML support (adds beautifulsoup4 + lxml)
pip install docpluck[html]
# Everything
pip install docpluck[all]
System requirement for extract_pdf(): poppler-utils (provides the pdftotext binary). DOCX and HTML are pure Python — no system dependencies.
# Linux / WSL
apt-get install poppler-utils
# macOS
brew install poppler
# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases
# Add bin/ to PATH
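To confirm the binary is reachable before processing a batch, a standard-library check is enough. This is a small sketch; find_pdftotext and require_pdftotext are hypothetical helper names, not docpluck functions.

```python
import shutil
from typing import Optional


def find_pdftotext() -> Optional[str]:
    """Return the full path of the pdftotext binary, or None if it is not on PATH."""
    return shutil.which("pdftotext")


def require_pdftotext() -> str:
    """Fail early with a readable message instead of on the first PDF."""
    path = find_pdftotext()
    if path is None:
        raise RuntimeError("pdftotext not found on PATH; install poppler-utils")
    return path
```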
Install from GitHub (like R's remotes::install_github()):
pip install git+https://github.com/giladfeldman/docpluck.git
# Minimum version from PyPI (use == to pin an exact release)
pip install "docpluck>=1.3.0"
Quick Start
from docpluck import (
    extract_pdf, extract_docx, extract_html,
    normalize_text, NormalizationLevel, compute_quality_score,
)

# 1. Extract text from any supported format
with open("paper.pdf", "rb") as f:
    text, method = extract_pdf(f.read())

# Or from DOCX:
# with open("paper.docx", "rb") as f:
#     text, method = extract_docx(f.read())

# Or from HTML:
# with open("paper.html", "rb") as f:
#     text, method = extract_html(f.read())

print(f"Extracted {len(text):,} chars via {method}")

# 2. Normalize for statistical pattern matching
normalized, report = normalize_text(text, NormalizationLevel.academic)
print(f"Steps applied: {report.steps_applied}")
print(f"Changes made: {report.changes_made}")

# 3. Check quality
quality = compute_quality_score(normalized)
print(f"Quality: {quality['score']}/100 ({quality['confidence']})")
if quality["garbled"]:
    print("Warning: text may be corrupted (column merge or encoding failure)")
Structured extraction (v2.0)
For consumers that need tables and figures as structured data — meta-analysis tooling, statistical-claim extraction, dashboards — call extract_pdf_structured():
from docpluck import extract_pdf_structured
with open("paper.pdf", "rb") as f:
    result = extract_pdf_structured(f.read())

print(f"{result['page_count']} pages")
print(f"{len(result['tables'])} tables, {len(result['figures'])} figures")
for t in result["tables"]:
    print(f" {t['label']} on page {t['page']} ({t['kind']}, confidence={t['confidence']})")
    if t["kind"] == "structured":
        print(f" {t['n_rows']} rows × {t['n_cols']} cols")
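Since the result is a plain dict, ordinary Python is enough to reshape it. The sketch below groups table labels by page using only the "tables", "page", and "label" keys shown above; tables_by_page is a hypothetical helper, and the sample dict is hand-written illustration data, not real docpluck output.

```python
from collections import defaultdict


def tables_by_page(result: dict) -> dict:
    """Group table labels by the page they were found on."""
    pages = defaultdict(list)
    for table in result["tables"]:
        pages[table["page"]].append(table["label"])
    return dict(pages)


# Hand-written sample shaped like the fields printed above (not real output).
sample = {
    "tables": [
        {"label": "Table 1", "page": 3},
        {"label": "Table 2", "page": 3},
        {"label": "Table 3", "page": 7},
    ]
}
grouped = tables_by_page(sample)
```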
Modes
# Default: caption-anchored fast path.
extract_pdf_structured(pdf_bytes)
# Thorough: scan every page for uncaptioned tables (slower).
extract_pdf_structured(pdf_bytes, thorough=True)
# Strip table/figure regions from `text` and replace with [Label: caption] markers.
extract_pdf_structured(pdf_bytes, table_text_mode="placeholder")
CLI
docpluck extract paper.pdf --structured > out.json
docpluck extract paper.pdf --structured --thorough --text-mode placeholder
docpluck extract paper.pdf --structured --html-tables-to ./out/
extract_pdf() (the v1 text-only path) is unchanged. New consumers opt in to the structured path; existing consumers see no behavioral change.
See docs/superpowers/specs/2026-05-06-table-extraction-design.md for the full schema and design rationale.
API Reference
extract_pdf(pdf_bytes: bytes) → tuple[str, str]
Extract text from PDF bytes.
Parameters:
- pdf_bytes: Raw PDF file content as bytes
Returns: (text, method) tuple where:
- text: Extracted plain text. Check text.startswith("ERROR:") for failure.
- method: Engine used:
  - "pdftotext_default": standard extraction (fast, ~400ms)
  - "pdftotext_default+pdfplumber_recovery": SMP fallback (~9s), triggered when pdftotext outputs U+FFFD replacement characters (common in Nature/Cell papers using Mathematical Italic fonts)
Requires: pdftotext binary on PATH.
with open("paper.pdf", "rb") as f:
    text, method = extract_pdf(f.read())
if text.startswith("ERROR:"):
    raise RuntimeError(f"Extraction failed: {text}")
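The recovery trigger described above depends on spotting U+FFFD in the pdftotext output. A minimal sketch of that check follows. It assumes any single occurrence counts; the threshold docpluck actually applies is not stated here, and both function names are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"  # emitted when a glyph could not be mapped to text


def count_replacement_chars(text: str) -> int:
    """Number of U+FFFD characters in the text."""
    return text.count(REPLACEMENT_CHAR)


def needs_recovery(text: str) -> bool:
    """True when the text contains at least one U+FFFD character (assumed rule)."""
    return count_replacement_chars(text) > 0
```

The same count is useful after extraction as a quick sanity check on text that came from another tool.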
extract_docx(docx_bytes: bytes) → tuple[str, str]
Extract text from DOCX (Word) file bytes via mammoth.
Parameters:
- docx_bytes: Raw DOCX file content as bytes
Returns: (text, method) tuple where method is always "mammoth".
How it works: DOCX is converted to HTML first (preserving Shift+Enter soft breaks as <br> tags), then passed through the same block/inline-aware tree-walk used by extract_html(). This preserves paragraph structure, headings, lists, and soft breaks — which mammoth.extract_raw_text() would lose.
Requires: pip install docpluck[docx] (adds mammoth>=1.8.0).
Known limitations:
- OMML equations (Office Math) are silently dropped. Inline stats written as plain text survive; stats inside equation objects do not.
- Tracked changes: only deleted paragraphs are handled minimally.
- Memory: peak usage is ~3–5× file size.
from docpluck import extract_docx
with open("paper.docx", "rb") as f:
    text, method = extract_docx(f.read())
extract_html(html_bytes: bytes) → tuple[str, str]
Extract text from HTML file bytes via beautifulsoup4 + lxml.
Parameters:
- html_bytes: Raw HTML file content as bytes (UTF-8 decoded with error replacement)
Returns: (text, method) tuple where method is always "beautifulsoup".
How it works: Custom tree-walk that distinguishes block from inline elements:
- Block elements (<p>, <div>, <h1>–<h6>, <li>, <td>, etc.) get newlines before and after.
- Inline elements (<a>, <span>, <em>, etc.) get spaces before and after, which is critical for preventing merged words like "ChanORCID" when adjacent inline elements have no whitespace between them.
- Ignored tags (<script>, <style>, <meta>, <svg>, <iframe>, etc.) are decomposed before walking.
Why not BeautifulSoup.get_text(): get_text() cannot distinguish block from inline elements — it applies a uniform separator everywhere, which either merges paragraphs or inserts spurious whitespace. The BeautifulSoup maintainer has confirmed this will not be fixed. A custom tree-walk is required.
Requires: pip install docpluck[html] (adds beautifulsoup4>=4.12.0 and lxml>=5.0.0).
from docpluck import extract_html, html_to_text
# From bytes
with open("article.html", "rb") as f:
    text, method = extract_html(f.read())
# From an already-decoded string
text = html_to_text("<p>Hello <a>world</a></p>")
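The block/inline distinction described above can be illustrated with the standard library alone. The sketch below uses html.parser instead of beautifulsoup4 + lxml, so it is not docpluck's implementation; the tag sets are abbreviated, and it ignores only <script> and <style>. All names in it are hypothetical.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td"}
IGNORED_TAGS = {"script", "style"}  # abbreviated; both always have end tags


class BlockInlineTextExtractor(HTMLParser):
    """Collect text, emitting newlines around block tags and spaces around inline tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self.ignored_depth = 0

    @staticmethod
    def _separator(tag: str) -> str:
        return "\n" if tag in BLOCK_TAGS else " "

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.ignored_depth += 1
        elif self.ignored_depth == 0:
            self.parts.append(self._separator(tag))

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self.ignored_depth = max(0, self.ignored_depth - 1)
        elif self.ignored_depth == 0:
            self.parts.append(self._separator(tag))

    def handle_data(self, data):
        if self.ignored_depth == 0:
            self.parts.append(data)


def sketch_html_to_text(html: str) -> str:
    parser = BlockInlineTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces
    text = re.sub(r" *\n[ \n]*", "\n", text)  # trim around newlines, drop blank lines
    return text.strip()
```

With this sketch, "<p>Chan<a>ORCID</a></p>" yields "Chan ORCID" rather than the merged "ChanORCID".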
count_pages(pdf_bytes: bytes) → int
Count pages in a PDF using byte pattern matching. No external binary required. PDF only: page counting is not applicable to DOCX or HTML.
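# Assumes count_pages is exported at the package top level, like the extractors.
from docpluck import count_pages

with open("paper.pdf", "rb") as f:
    content = f.read()
n = count_pages(content)
print(f"{n} pages")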
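To give a sense of what byte pattern matching can look like, here is a rough sketch that counts page objects with a regular expression. It is an assumption about the general technique, not docpluck's actual pattern, and it undercounts PDFs whose page dictionaries sit inside compressed object streams. sketch_count_pages is a hypothetical name.

```python
import re

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![a-zA-Z])")


def sketch_count_pages(pdf_bytes: bytes) -> int:
    """Count page dictionaries that appear uncompressed in the raw bytes."""
    return len(PAGE_OBJECT.findall(pdf_bytes))


# Synthetic fragment: one page tree with two page objects.
fragment = b"<< /Type /Pages /Count 2 >> << /Type /Page >> <</Type/Page/Parent 1 0 R>>"
page_total = sketch_count_pages(fragment)
```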
normalize_text(text: str, level: NormalizationLevel) → tuple[str, NormalizationReport]
Apply the normalization pipeline at the specified level.
Parameters:
- text: Raw extracted text
- level: NormalizationLevel.none | NormalizationLevel.standard | NormalizationLevel.academic
Returns: (normalized_text, report) tuple.
Normalization levels:
| Level | Steps | Use when |
|---|---|---|
| none | (no steps) | You want raw text, no modifications |
| standard | S0-S9 | General text processing (NLP, search indexing) |
| academic | S0-S9 + A1-A6 | Statistical pattern matching, meta-analysis |
from docpluck import normalize_text, NormalizationLevel
# Raw text
text, _ = normalize_text(raw, NormalizationLevel.none)
# General cleanup
text, report = normalize_text(raw, NormalizationLevel.standard)
# Full statistical repair (recommended for academic PDFs)
text, report = normalize_text(raw, NormalizationLevel.academic)
print(report.version) # "1.1.0"
print(report.steps_applied) # ["S0_smp_to_ascii", "S1_encoding_validation", ...]
print(report.changes_made) # {"ligatures_expanded": 27, "dashes_normalized": 3, ...}
NormalizationReport fields:
| Field | Type | Description |
|---|---|---|
| level | str | Level used: "none", "standard", or "academic" |
| version | str | Pipeline version (e.g. "1.1.0") |
| steps_applied | list[str] | Step codes in order (e.g. ["S1_encoding_validation", "S3_ligature_expansion"]) |
| changes_made | dict[str, int] | Character-level change counts per step |
compute_quality_score(text: str) → dict
Compute extraction quality metrics.
Returns:
{
    "score": 85,                  # 0–100 composite score
    "common_word_ratio": 0.142,   # fraction of first 2000 words that are common English words
    "garbled": False,             # True if common_word_ratio < 0.02 (column merge / encoding failure)
    "confidence": "high",         # "high" (≥80), "medium" (≥50), "low" (<50)
    "details": {
        "ligatures_remaining": 0,  # count of ff/fi/fl ligature chars not yet expanded
        "garbled_chars": 0,        # count of U+FFFD replacement characters
        "non_ascii_ratio": 0.031,  # fraction of non-ASCII characters
    }
}
Interpreting the score:
| Score | Meaning |
|---|---|
| ≥ 80 | High quality — proceed with analysis |
| 50–79 | Medium — check manually if precision matters |
| < 50 | Low — likely garbled (column merge, encoding failure, or non-English) |
quality = compute_quality_score(text)
if quality["garbled"]:
    print("Extraction likely failed; skipping this paper")
elif quality["score"] < 50:
    print(f"Low quality ({quality['score']}); verify manually")
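The thresholds documented above (first 2000 words, garbled below 0.02, confidence cut-offs at 80 and 50) can be sketched as follows. The tokenizer and the tiny COMMON_WORDS set are stand-in assumptions, and all function names are hypothetical. How docpluck tokenizes, which word list it uses, and how it combines the metrics into the composite score are not described here.

```python
import re

# A tiny stand-in list; a real word list would be much larger.
COMMON_WORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "for", "was", "with", "as"}


def common_word_ratio(text: str, limit: int = 2000) -> float:
    """Fraction of the first `limit` alphabetic tokens found in COMMON_WORDS."""
    words = re.findall(r"[A-Za-z]+", text.lower())[:limit]
    if not words:
        return 0.0
    return sum(word in COMMON_WORDS for word in words) / len(words)


def looks_garbled(text: str, threshold: float = 0.02) -> bool:
    """Apply the documented garbled cut-off to the sketch ratio."""
    return common_word_ratio(text) < threshold


def confidence_label(score: float) -> str:
    """Map a 0-100 score to the documented confidence bands."""
    if score >= 80:
        return "high"
    if score >= 50:
        return "medium"
    return "low"
```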
Integration Examples
ESCIcheck / effectcheck
from docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
import re
def extract_stats(pdf_path: str) -> list[str]:
    with open(pdf_path, "rb") as f:
        text, method = extract_pdf(f.read())
    if text.startswith("ERROR:"):
        return []  # Extraction failed
    normalized, report = normalize_text(text, NormalizationLevel.academic)
    quality = compute_quality_score(normalized)
    if quality["garbled"]:
        return []  # Skip garbled papers
    # Now apply your statistical patterns to `normalized`
    # e.g. find t-tests, F-tests, correlations, p-values
    # \d*\.?\d+ captures ".001" and "0.05" in full; \b avoids matching inside words
    p_values = re.findall(r'\bp\s*[<=>]\s*\d*\.?\d+', normalized)
    return p_values
Scimeto / MetaESCI (batch processing)
from docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
from pathlib import Path
def process_corpus(pdf_dir: str) -> list[dict]:
    results = []
    for pdf_path in Path(pdf_dir).glob("**/*.pdf"):
        with open(pdf_path, "rb") as f:
            text, method = extract_pdf(f.read())
        if text.startswith("ERROR:"):
            results.append({"file": pdf_path.name, "error": text})
            continue
        normalized, report = normalize_text(text, NormalizationLevel.academic)
        quality = compute_quality_score(normalized)
        results.append({
            "file": pdf_path.name,
            "chars": len(normalized),
            "method": method,
            "quality": quality["score"],
            "garbled": quality["garbled"],
        })
    return results
MetaMisCitations (URL-based)
import httpx
from docpluck import extract_pdf, normalize_text, NormalizationLevel
def extract_from_url(url: str) -> str:
    response = httpx.get(url, follow_redirects=True, timeout=30)
    response.raise_for_status()
    text, method = extract_pdf(response.content)
    normalized, _ = normalize_text(text, NormalizationLevel.academic)
    return normalized
What Gets Fixed
Standard normalization (NormalizationLevel.standard)
| Artifact | Example (before → after) |
|---|---|
| Null bytes | "Study\x00 results" → "Study results" |
| Ligatures | "signi\uFB01cant" → "significant" |
| Unicode minus | "r = −0.73" → "r = -0.73" |
| Soft hyphen (invisible) | "signifi\u00ADcant" → "significant" |
| Non-breaking spaces | "p\u00A0<\u00A0.001" → "p < .001" |
| Full-width digits | "p = ０.００１" → "p = 0.001" |
| Curly quotes | "the “effect”" → "the "effect"" |
| Hyphenation | "signi-\nficant" → "significant" |
| Repeated headers | Journal name repeated on every page → stripped |
| Page numbers | Standalone 12 on its own line → stripped |
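Several rows in the table above are character-level substitutions, which can be sketched with a translation table plus one regular expression. This is an illustration of the idea and not docpluck's pipeline; it skips repeated headers and page numbers, the hyphenation rule is deliberately naive, and sketch_standard_cleanup is a hypothetical name.

```python
import re

# Single-character replacements taken from the table above.
CHAR_MAP = {
    "\x00": "",      # null byte
    "\ufb00": "ff",  # ligatures
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\u2212": "-",   # Unicode minus
    "\u00ad": "",    # soft hyphen
    "\u00a0": " ",   # non-breaking space
    "\u201c": '"',   # curly double quotes
    "\u201d": '"',
}
# Full-width digits U+FF10..U+FF19 map onto ASCII 0..9.
CHAR_MAP.update({chr(0xFF10 + i): str(i) for i in range(10)})
TRANSLATION = str.maketrans(CHAR_MAP)

# A word split by a hyphen at a line end; naive, it also joins real compounds.
LINE_END_HYPHEN = re.compile(r"(\w)-\n(\w)")


def sketch_standard_cleanup(text: str) -> str:
    text = text.translate(TRANSLATION)
    return LINE_END_HYPHEN.sub(r"\1\2", text)
```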
Academic normalization adds (NormalizationLevel.academic)
| Artifact | Example (before → after) |
|---|---|
| Stat line breaks | "p =\n.001" → "p = .001" |
| Dropped decimals | "p = 484" → "p = .484" |
| European decimals | "p = 0,05" → "p = 0.05" |
| CI delimiters | "[0.81; 1.92]" → "[0.81, 1.92]" |
| Greek letters | "η² = 0.12" → "eta2 = 0.12" |
| Superscripts | "r² = 0.54" → "r2 = 0.54" |
| Footnote markers | "p < .001¹" → "p < .001" |
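Three of the academic repairs above (stat line breaks, European decimals, CI delimiters) lend themselves to short regular expressions. The patterns below are assumptions written for the table's examples only, not docpluck's own rules; for instance, the decimal-comma rule would misread a thousands separator such as "N = 1,234". The function name is hypothetical.

```python
import re

# "p =\n.001": a comparison operator followed by a line break.
STAT_LINE_BREAK = re.compile(r"([<=>])[ \t]*\n\s*")
# "= 0,05": a decimal comma directly after a comparison operator.
EUROPEAN_DECIMAL = re.compile(r"([<=>]\s*\d+),(\d+)")
# "[0.81; 1.92]": a semicolon between two numbers inside brackets.
CI_SEMICOLON = re.compile(r"\[\s*(-?\d*\.?\d+)\s*;\s*(-?\d*\.?\d+)\s*\]")


def sketch_academic_repairs(text: str) -> str:
    text = STAT_LINE_BREAK.sub(r"\1 ", text)
    text = EUROPEAN_DECIMAL.sub(r"\1.\2", text)
    return CI_SEMICOLON.sub(r"[\1, \2]", text)
```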
System Requirements
| Requirement | Version | Notes |
|---|---|---|
| Python | ≥ 3.10 | |
| pdfplumber | ≥ 0.11.0 | Core pip dependency — installed automatically |
| poppler-utils | any recent | System package — for extract_pdf() only |
| mammoth | ≥ 1.8.0 | Optional ([docx]) — pure Python, no system deps |
| beautifulsoup4 | ≥ 4.12.0 | Optional ([html]) — pure Python |
| lxml | ≥ 5.0.0 | Optional ([html]) — has prebuilt wheels |
The normalization and quality functions (normalize_text, compute_quality_score) have no system requirements — pure Python, no external binaries. DOCX and HTML extraction are pure Python too; only PDF needs a system binary.
License
MIT. See LICENSE.
Citation
If you use docpluck in research, please cite:
Feldman, G. (2026). docpluck: PDF text extraction and normalization for academic papers.
https://github.com/giladfeldman/docpluck