PDF, DOCX, and HTML text extraction and normalization for academic papers

Maintainers

giladfeldman

Project description

docpluck

PDF, DOCX, and HTML text extraction and normalization for academic papers.

Built from cross-project experience across 8,000+ PDFs spanning psychology, medicine, economics, physics, and biology. Achieves 100% accuracy on 29 manually verified ground-truth passages (see BENCHMARKS.md).

Supports three input formats:

PDF via pdftotext default mode (with pdfplumber SMP recovery)
DOCX via mammoth (DOCX → HTML → text, preserving Shift+Enter soft breaks)
HTML via beautifulsoup4 + lxml (block/inline-aware tree-walk)

All three formats feed into the same 15-step normalization pipeline and quality scoring.

Install

# PDF only (pdfplumber)
pip install docpluck

# + DOCX support (adds mammoth)
pip install docpluck[docx]

# + HTML support (adds beautifulsoup4 + lxml)
pip install docpluck[html]

# Everything
pip install docpluck[all]

System requirement for extract_pdf(): poppler-utils (provides the pdftotext binary). DOCX and HTML are pure Python — no system dependencies.

# Linux / WSL
apt-get install poppler-utils

# macOS
brew install poppler

# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases
# Add bin/ to PATH
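
Because extract_pdf() depends on the pdftotext binary, it can help to check for it before processing a batch. The snippet below is a minimal sketch using only the standard library; `pdftotext_available` is a hypothetical helper name, not part of docpluck.

```python
import shutil


def pdftotext_available() -> bool:
    """Return True if a `pdftotext` executable can be found on PATH.

    Hypothetical helper for illustration; docpluck does not export it.
    """
    return shutil.which("pdftotext") is not None


if __name__ == "__main__":
    if pdftotext_available():
        print("pdftotext found on PATH")
    else:
        print("pdftotext not found; install poppler-utils before calling extract_pdf()")
```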

Install from GitHub (like R's remotes::install_github()):

pip install git+https://github.com/giladfeldman/docpluck.git

# Minimum version from PyPI (">=" sets a lower bound; use "==" to pin exactly)
pip install "docpluck>=1.3.0"

Quick Start

from docpluck import (
    extract_pdf, extract_docx, extract_html,
    normalize_text, NormalizationLevel, compute_quality_score,
)

# 1. Extract text from any supported format
with open("paper.pdf", "rb") as f:
    text, method = extract_pdf(f.read())

# Or from DOCX:
# with open("paper.docx", "rb") as f:
#     text, method = extract_docx(f.read())

# Or from HTML:
# with open("paper.html", "rb") as f:
#     text, method = extract_html(f.read())

print(f"Extracted {len(text):,} chars via {method}")

# 2. Normalize for statistical pattern matching
normalized, report = normalize_text(text, NormalizationLevel.academic)

print(f"Steps applied: {report.steps_applied}")
print(f"Changes made: {report.changes_made}")

# 3. Check quality
quality = compute_quality_score(normalized)
print(f"Quality: {quality['score']}/100 ({quality['confidence']})")
if quality["garbled"]:
    print("Warning: text may be corrupted (column merge or encoding failure)")

Structured extraction (v2.0)

For consumers that need tables and figures as structured data — meta-analysis tooling, statistical-claim extraction, dashboards — call extract_pdf_structured():

from docpluck import extract_pdf_structured

with open("paper.pdf", "rb") as f:
    result = extract_pdf_structured(f.read())

print(f"{result['page_count']} pages")
print(f"{len(result['tables'])} tables, {len(result['figures'])} figures")

for t in result["tables"]:
    print(f"  {t['label']} on page {t['page']} ({t['kind']}, confidence={t['confidence']})")
    if t["kind"] == "structured":
        print(f"    {t['n_rows']} rows × {t['n_cols']} cols")

Modes

# Default: caption-anchored fast path.
extract_pdf_structured(pdf_bytes)

# Thorough: scan every page for uncaptioned tables (slower).
extract_pdf_structured(pdf_bytes, thorough=True)

# Strip table/figure regions from `text` and replace with [Label: caption] markers.
extract_pdf_structured(pdf_bytes, table_text_mode="placeholder")

CLI

docpluck extract paper.pdf --structured > out.json
docpluck extract paper.pdf --structured --thorough --text-mode placeholder
docpluck extract paper.pdf --structured --html-tables-to ./out/

extract_pdf() (the v1 text-only path) is unchanged. New consumers opt in to the structured path; existing consumers see no behavioral change.

See docs/superpowers/specs/2026-05-06-table-extraction-design.md for the full schema and design rationale.

API Reference

`extract_pdf(pdf_bytes: bytes) → tuple[str, str]`

Extract text from PDF bytes.

Parameters:

pdf_bytes — Raw PDF file content as bytes

Returns: (text, method) tuple where:

text — Extracted plain text. Check text.startswith("ERROR:") for failure.
method — Engine used:
- "pdftotext_default" — standard extraction (fast, ~400ms)
- "pdftotext_default+pdfplumber_recovery" — SMP fallback triggered (~9s), used when pdftotext outputs U+FFFD replacement characters (common in Nature/Cell papers using Mathematical Italic fonts)

Requires: pdftotext binary on PATH.
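
The recovery trigger described above can be illustrated with the standard library alone. This is a sketch under assumptions, not docpluck's code: `needs_recovery` and `fold_math_alphanumerics` are hypothetical names, and folding via Unicode NFKC is one plausible way to turn Mathematical Italic letters (Supplementary Multilingual Plane, U+1D400 to U+1D7FF) back into ASCII.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when a glyph cannot be decoded


def needs_recovery(text: str) -> bool:
    """Return True if the text contains U+FFFD replacement characters."""
    return REPLACEMENT_CHAR in text


def fold_math_alphanumerics(text: str) -> str:
    """Map Mathematical Alphanumeric Symbols to their plain equivalents.

    Only characters in U+1D400..U+1D7FF are folded, so other non-ASCII
    text (Greek letters, accents) is left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if 0x1D400 <= ord(ch) <= 0x1D7FF else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "\U0001D45D = .05"  # MATHEMATICAL ITALIC SMALL P
    print(needs_recovery(sample), fold_math_alphanumerics(sample))
```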

from docpluck import extract_pdf

with open("paper.pdf", "rb") as f:
    text, method = extract_pdf(f.read())

if text.startswith("ERROR:"):
    raise RuntimeError(f"Extraction failed: {text}")
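
Since failures are signalled through the returned text rather than an exception, callers may prefer to centralize the check. The wrapper below is a minimal sketch: `ExtractionError` and `extract_or_raise` are hypothetical names, and the extractor is passed in as a parameter so the same wrapper would work with any function that returns a (text, method) tuple.

```python
from typing import Callable


class ExtractionError(RuntimeError):
    """Hypothetical exception type for illustration; not part of docpluck."""


def extract_or_raise(
    extractor: Callable[[bytes], tuple[str, str]], data: bytes
) -> tuple[str, str]:
    """Call an extractor and convert an "ERROR:" result into an exception."""
    text, method = extractor(data)
    if text.startswith("ERROR:"):
        raise ExtractionError(text)
    return text, method
```

With docpluck installed, this would be called as extract_or_raise(extract_pdf, pdf_bytes).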

`extract_docx(docx_bytes: bytes) → tuple[str, str]`

Extract text from DOCX (Word) file bytes via mammoth.

Parameters:

docx_bytes — Raw DOCX file content as bytes

Returns: (text, method) tuple where method is always "mammoth".

How it works: DOCX is converted to HTML first (preserving Shift+Enter soft breaks as <br> tags), then passed through the same block/inline-aware tree-walk used by extract_html(). This preserves paragraph structure, headings, lists, and soft breaks — which mammoth.extract_raw_text() would lose.

Requires: pip install docpluck[docx] (adds mammoth>=1.8.0).

Known limitations:

OMML equations (Office Math) are silently dropped. Inline stats written as plain text survive; stats inside equation objects do not.
Tracked changes: support is minimal; only deleted paragraphs are handled.
Memory: peak usage is ~3–5× file size.

from docpluck import extract_docx

with open("paper.docx", "rb") as f:
    text, method = extract_docx(f.read())

`extract_html(html_bytes: bytes) → tuple[str, str]`

Extract text from HTML file bytes via beautifulsoup4 + lxml.

Parameters:

html_bytes — Raw HTML file content as bytes (UTF-8 decoded with error replacement)

Returns: (text, method) tuple where method is always "beautifulsoup".

How it works: Custom tree-walk that distinguishes block from inline elements:

Block elements (<p>, <div>, <h1>–<h6>, <li>, <td>, etc.) get newlines before and after.
Inline elements (<a>, <span>, <em>, etc.) get spaces before and after — critical for preventing merged words like "ChanORCID" when adjacent inline elements have no whitespace between them.
Ignored tags (<script>, <style>, <meta>, <svg>, <iframe>, etc.) are decomposed before walking.

Why not BeautifulSoup.get_text(): get_text() cannot distinguish block from inline elements — it applies a uniform separator everywhere, which either merges paragraphs or inserts spurious whitespace. The BeautifulSoup maintainer has confirmed this will not be fixed. A custom tree-walk is required.
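
The block/inline distinction can be sketched without third-party packages. The example below swaps in the standard library's html.parser in place of beautifulsoup4 + lxml, so it illustrates the idea rather than docpluck's implementation; the tag sets are abbreviated and `sketch_html_to_text` is a hypothetical name.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "tr", "br"}
# Void tags such as <meta> are left out: they have no end tag to close the skip.
IGNORED_TAGS = {"script", "style", "svg", "iframe"}


class BlockInlineExtractor(HTMLParser):
    """Collect text, separating block tags with newlines and inline tags with spaces."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.ignore_depth = 0

    def _boundary(self, tag: str) -> None:
        if tag in IGNORED_TAGS:
            return
        if self.ignore_depth == 0:
            self.parts.append("\n" if tag in BLOCK_TAGS else " ")

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.ignore_depth += 1
        self._boundary(tag)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self.ignore_depth = max(0, self.ignore_depth - 1)
        self._boundary(tag)

    def handle_data(self, data):
        if self.ignore_depth == 0:
            self.parts.append(data)


def sketch_html_to_text(html: str) -> str:
    parser = BlockInlineExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)  # collapse empty lines
    return text.strip()


if __name__ == "__main__":
    print(sketch_html_to_text("<p>Chan<a>ORCID</a></p><p>Next</p>"))
```

The trade-off is the one noted above: a space at every inline boundary keeps adjacent elements from merging, at the cost of splitting a word that is only partly wrapped in an inline tag.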

Requires: pip install docpluck[html] (adds beautifulsoup4>=4.12.0 and lxml>=5.0.0).

from docpluck import extract_html, html_to_text

# From bytes
with open("article.html", "rb") as f:
    text, method = extract_html(f.read())

# From an already-decoded string
text = html_to_text("<p>Hello <a>world</a></p>")

`count_pages(pdf_bytes: bytes) → int`

Count pages in a PDF using byte pattern matching. No external binary required. PDF only; not applicable to DOCX or HTML.

from docpluck import count_pages

with open("paper.pdf", "rb") as f:
    n = count_pages(f.read())
print(f"{n} pages")
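
As an illustration of what byte pattern matching can look like, the sketch below counts page objects by searching for `/Type /Page` entries while skipping the `/Type /Pages` tree nodes. The exact pattern docpluck uses is not documented here, so the regex and the name `sketch_count_pages` are assumptions. A scan like this would undercount PDFs whose page objects sit inside compressed object streams.

```python
import re

# "/Type /Page" not followed by a letter, so "/Type /Pages" is excluded.
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def sketch_count_pages(pdf_bytes: bytes) -> int:
    """Count page objects visible in the raw bytes of a PDF."""
    return len(PAGE_OBJECT.findall(pdf_bytes))


if __name__ == "__main__":
    fake_pdf = b"<< /Type /Pages /Count 2 >> << /Type /Page >> <</Type/Page/Parent 1 0 R>>"
    print(sketch_count_pages(fake_pdf))
```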

`normalize_text(text: str, level: NormalizationLevel) → tuple[str, NormalizationReport]`

Apply the normalization pipeline at the specified level.

Parameters:

text — Raw extracted text
level — NormalizationLevel.none | NormalizationLevel.standard | NormalizationLevel.academic

Returns: (normalized_text, report) tuple.

Normalization levels:

Level	Steps	Use when
`none`	—	You want raw text, no modifications
`standard`	S0-S9	General text processing (NLP, search indexing)
`academic`	S0-S9 + A1-A6	Statistical pattern matching, meta-analysis

from docpluck import normalize_text, NormalizationLevel

# Raw text
text, _ = normalize_text(raw, NormalizationLevel.none)

# General cleanup
text, report = normalize_text(raw, NormalizationLevel.standard)

# Full statistical repair (recommended for academic PDFs)
text, report = normalize_text(raw, NormalizationLevel.academic)

print(report.version)          # "1.1.0"
print(report.steps_applied)    # ["S0_smp_to_ascii", "S1_encoding_validation", ...]
print(report.changes_made)     # {"ligatures_expanded": 27, "dashes_normalized": 3, ...}

NormalizationReport fields:

Field	Type	Description
`level`	`str`	Level used: `"none"`, `"standard"`, or `"academic"`
`version`	`str`	Pipeline version (e.g. `"1.1.0"`)
`steps_applied`	`list[str]`	Step codes in order (e.g. `["S1_encoding_validation", "S3_ligature_expansion"]`)
`changes_made`	`dict[str, int]`	Character-level change counts per step

`compute_quality_score(text: str) → dict`

Compute extraction quality metrics.

Returns:

{
    "score": 85,                    # 0–100 composite score
    "common_word_ratio": 0.142,     # fraction of first 2000 words that are common English words
    "garbled": False,               # True if common_word_ratio < 0.02 (column merge / encoding failure)
    "confidence": "high",           # "high" (≥80), "medium" (≥50), "low" (<50)
    "details": {
        "ligatures_remaining": 0,   # count of ff/fi/fl ligature chars not yet expanded
        "garbled_chars": 0,         # count of U+FFFD replacement characters
        "non_ascii_ratio": 0.031,   # fraction of non-ASCII characters
    }
}

Interpreting the score:

Score	Meaning
≥ 80	High quality — proceed with analysis
50–79	Medium — check manually if precision matters
< 50	Low — likely garbled (column merge, encoding failure, or non-English)

from docpluck import compute_quality_score

quality = compute_quality_score(text)  # `text` comes from an extract_* call

if quality["garbled"]:
    print("Extraction likely failed — skipping this paper")
elif quality["score"] < 50:
    print(f"Low quality ({quality['score']}) — verify manually")
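
The thresholds listed above can be restated as a short sketch. It follows the documented cut-offs (garbled when the common-word ratio is below 0.02, confidence bands at 80 and 50, first 2000 words sampled), but the word list is a tiny hypothetical stand-in and the function names are assumptions; how docpluck combines the details into the composite score is not shown.

```python
# Hypothetical stand-in for a real common-word list.
COMMON_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "that", "for", "with"})


def common_word_ratio(text: str, limit: int = 2000) -> float:
    """Fraction of the first `limit` words that appear in COMMON_WORDS."""
    words = [w.strip(".,;:()[]\"'").lower() for w in text.split()[:limit]]
    if not words:
        return 0.0
    return sum(w in COMMON_WORDS for w in words) / len(words)


def is_garbled(ratio: float) -> bool:
    """Apply the documented garbled threshold."""
    return ratio < 0.02


def confidence_band(score: int) -> str:
    """Map a 0-100 score to the documented confidence labels."""
    if score >= 80:
        return "high"
    if score >= 50:
        return "medium"
    return "low"


if __name__ == "__main__":
    ratio = common_word_ratio("the effect of the drug")
    print(ratio, is_garbled(ratio), confidence_band(85))
```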

Integration Examples

ESCIcheck / effectcheck

from docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
import re

def extract_stats(pdf_path: str) -> list[str]:
    """Return the p-value strings found in one PDF (empty if extraction fails)."""
    with open(pdf_path, "rb") as f:
        text, method = extract_pdf(f.read())

    if text.startswith("ERROR:"):
        return []  # Extraction failed

    normalized, report = normalize_text(text, NormalizationLevel.academic)

    quality = compute_quality_score(normalized)
    if quality["garbled"]:
        return []  # Skip garbled papers

    # Now apply your statistical patterns to `normalized`
    # e.g. find t-tests, F-tests, correlations, p-values
    # \d*\.?\d+ captures both ".001" and "0.05" in full
    p_values = re.findall(r'\bp\s*[<=>]\s*\d*\.?\d+', normalized)
    return p_values

Scimeto / MetaESCI (batch processing)

from docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
from pathlib import Path

def process_corpus(pdf_dir: str) -> list[dict]:
    results = []
    for pdf_path in Path(pdf_dir).glob("**/*.pdf"):
        with open(pdf_path, "rb") as f:
            text, method = extract_pdf(f.read())

        if text.startswith("ERROR:"):
            results.append({"file": pdf_path.name, "error": text})
            continue

        normalized, report = normalize_text(text, NormalizationLevel.academic)
        quality = compute_quality_score(normalized)

        results.append({
            "file": pdf_path.name,
            "chars": len(normalized),
            "method": method,
            "quality": quality["score"],
            "garbled": quality["garbled"],
        })

    return results
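
A batch run like the one above usually ends with a tally of how many papers are usable. The sketch below assumes the row keys used in process_corpus ("error", "quality", "garbled"); `summarize` is a hypothetical helper and the cut-off of 50 mirrors the "Low" band of the quality score.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict[str, int]:
    """Tally batch rows into failed / garbled / low_quality / usable."""
    counts: Counter[str] = Counter()
    for row in results:
        if "error" in row:
            counts["failed"] += 1
        elif row["garbled"]:
            counts["garbled"] += 1
        elif row["quality"] < 50:
            counts["low_quality"] += 1
        else:
            counts["usable"] += 1
    return dict(counts)


if __name__ == "__main__":
    rows = [
        {"file": "a.pdf", "error": "ERROR: unreadable"},
        {"file": "b.pdf", "quality": 90, "garbled": False},
    ]
    print(summarize(rows))
```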

MetaMisCitations (URL-based)

import httpx
from docpluck import extract_pdf, normalize_text, NormalizationLevel

def extract_from_url(url: str) -> str:
    response = httpx.get(url, follow_redirects=True, timeout=30)
    response.raise_for_status()

    text, method = extract_pdf(response.content)
    normalized, _ = normalize_text(text, NormalizationLevel.academic)
    return normalized

What Gets Fixed

Standard normalization (`NormalizationLevel.standard`)

Artifact	Example (before → after)
Null bytes	`"Study\x00 results"` → `"Study results"`
Ligatures	`"signiﬁcant"` → `"significant"`
Unicode minus	`"r = −0.73"` → `"r = -0.73"`
Soft hyphen (invisible)	`"signifi\u00ADcant"` → `"significant"`
Non-breaking spaces	`"p\u00A0<\u00A0.001"` → `"p < .001"`
Full-width digits	`"ｐ＝０.００１"` → `"p = 0.001"`
Curly quotes	`"the “effect”"` → `"the "effect""`
Hyphenation	`"signi-\nficant"` → `"significant"`
Repeated headers	Journal name repeated on every page → stripped
Page numbers	Standalone `12` on its own line → stripped
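
Several of the character-level repairs in this table can be approximated with plain string operations. The function below is a minimal sketch for illustration, not the library's S0-S9 pipeline: `standard_cleanup` is a hypothetical name, it covers only some rows, and its de-hyphenation rule would also merge genuinely hyphenated words that happen to break across lines.

```python
import re

LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl"}
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def standard_cleanup(text: str) -> str:
    """Apply a handful of the character-level repairs listed in the table."""
    text = text.replace("\x00", "")         # null bytes
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    text = text.replace("\u2212", "-")      # Unicode minus to ASCII hyphen-minus
    text = text.replace("\u00ad", "")       # soft hyphen
    text = text.replace("\u00a0", " ")      # non-breaking space
    text = text.translate(QUOTES)           # curly quotes to straight quotes
    # Rejoin a word split by a hyphen at the end of a line.
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    return text


if __name__ == "__main__":
    print(standard_cleanup("r = \u22120.73, signi\ufb01cant"))
```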

Academic normalization adds (`NormalizationLevel.academic`)

Artifact	Example (before → after)
Stat line breaks	`"p =\n.001"` → `"p = .001"`
Dropped decimals	`"p = 484"` → `"p = .484"`
European decimals	`"p = 0,05"` → `"p = 0.05"`
CI delimiters	`"[0.81; 1.92]"` → `"[0.81, 1.92]"`
Greek letters	`"η² = 0.12"` → `"eta2 = 0.12"`
Superscripts	`"r² = 0.54"` → `"r2 = 0.54"`
Footnote markers	`"p < .001¹"` → `"p < .001"`
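
The statistical repairs can be sketched the same way with regular expressions. Everything here is an assumption made for illustration: `academic_cleanup` is a hypothetical name, the rules are restricted to p-values where that avoids obvious false positives, and the Greek mapping is abbreviated. The real A1-A6 steps may use different heuristics.

```python
import re

SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUPERSCRIPT_DIGITS = str.maketrans(SUPERSCRIPTS, "0123456789")
GREEK_NAMES = {"\u03b7": "eta", "\u03b1": "alpha", "\u03b2": "beta", "\u03c7": "chi"}


def academic_cleanup(text: str) -> str:
    """Apply simplified versions of the repairs listed in the table."""
    # Stat line breaks: rejoin a value that wrapped after =, < or >.
    text = re.sub(r"([=<>])[ \t]*\n[ \t]*(?=[-.\d])", r"\1 ", text)
    # European decimals, limited to p-values to avoid thousands separators.
    text = re.sub(r"\b(p\s*[=<>]\s*\d),(\d+)", r"\1.\2", text)
    # Dropped decimals: a p-value cannot exceed 1, so a bare 2-3 digit
    # integer after "p =" is assumed to have lost its leading point.
    text = re.sub(r"\b(p\s*[=<>]\s*)(\d{2,3})\b(?![.,]\d)", r"\1.\2", text)
    # CI delimiters: semicolon to comma inside bracketed intervals.
    text = re.sub(r"\[\s*(-?\d*\.?\d+)\s*;\s*(-?\d*\.?\d+)\s*\]", r"[\1, \2]", text)
    # Footnote markers: superscript digits directly after a digit are dropped.
    text = re.sub(rf"(?<=\d)[{SUPERSCRIPTS}]+", "", text)
    # Remaining superscripts (exponents after a symbol) become plain digits.
    text = text.translate(SUPERSCRIPT_DIGITS)
    # Greek letters are spelled out.
    for letter, name in GREEK_NAMES.items():
        text = text.replace(letter, name)
    return text


if __name__ == "__main__":
    print(academic_cleanup("p =\n.001, \u03b7\u00b2 = 0.12, [0.81; 1.92]"))
```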

System Requirements

Requirement	Version	Notes
Python	≥ 3.10
pdfplumber	≥ 0.11.0	Core pip dependency — installed automatically
poppler-utils	any recent	System package — for `extract_pdf()` only
mammoth	≥ 1.8.0	Optional (`[docx]`) — pure Python, no system deps
beautifulsoup4	≥ 4.12.0	Optional (`[html]`) — pure Python
lxml	≥ 5.0.0	Optional (`[html]`) — has prebuilt wheels

The normalization and quality functions (normalize_text, compute_quality_score) have no system requirements — pure Python, no external binaries. DOCX and HTML extraction are pure Python too; only PDF needs a system binary.

License

MIT. See LICENSE.

Citation

If you use docpluck in research, please cite:

Feldman, G. (2026). docpluck: PDF text extraction and normalization for academic papers.
https://github.com/giladfeldman/docpluck

Release history

This version

2.4.69

May 22, 2026

2.4.68

May 22, 2026

2.4.66

May 22, 2026

2.4.65

May 22, 2026

2.4.64

May 22, 2026

2.4.63

May 21, 2026

2.4.62

May 21, 2026

2.4.61

May 21, 2026

2.4.60

May 21, 2026

2.4.59

May 20, 2026

2.4.58

May 20, 2026

2.4.57

May 18, 2026

2.4.56

May 17, 2026

2.4.55

May 17, 2026

2.4.54

May 17, 2026

2.4.53

May 16, 2026

2.4.52

May 16, 2026

2.4.51

May 16, 2026

2.4.50

May 16, 2026

2.4.49

May 16, 2026

2.4.48

May 16, 2026

2.4.47

May 16, 2026

2.4.46

May 16, 2026

2.4.45

May 16, 2026

2.4.44

May 16, 2026

2.4.43

May 16, 2026

2.4.42

May 16, 2026

2.4.41

May 16, 2026

2.4.40

May 16, 2026

2.4.39

May 16, 2026

2.4.38

May 15, 2026

2.4.37

May 15, 2026

2.4.36

May 15, 2026

2.4.35

May 15, 2026

2.4.34

May 15, 2026

2.4.33

May 15, 2026

2.4.32

May 15, 2026

2.4.31

May 14, 2026

2.4.30

May 14, 2026

2.4.29

May 14, 2026

2.4.28

May 14, 2026

2.4.27

May 14, 2026

2.4.26

May 14, 2026

2.4.25

May 14, 2026

2.4.24

May 14, 2026

2.4.23

May 14, 2026

2.4.22

May 14, 2026

2.4.21

May 14, 2026

2.4.20

May 14, 2026

2.4.19

May 14, 2026

2.4.18

May 14, 2026

2.4.17

May 14, 2026

2.4.16

May 14, 2026

2.4.15

May 13, 2026

2.4.14

May 13, 2026

2.4.13

May 13, 2026

2.4.12

May 13, 2026

2.4.11

May 13, 2026

2.4.10

May 13, 2026

2.4.9

May 13, 2026

2.4.8

May 13, 2026

2.4.7

May 13, 2026

2.4.6

May 13, 2026

2.4.5

May 12, 2026

2.4.4

May 12, 2026

2.4.3

May 12, 2026

2.4.2

May 12, 2026

2.4.1

May 12, 2026

2.4.0

May 12, 2026

2.3.1

May 12, 2026

2.3.0

May 12, 2026

2.1.0

May 8, 2026

2.0.0

May 7, 2026

1.6.0

May 6, 2026

1.5.0

Apr 28, 2026

Download files

Source Distribution

docpluck-2.4.69.tar.gz (3.1 MB)

Uploaded May 22, 2026 Source

Built Distribution

docpluck-2.4.69-py3-none-any.whl (194.5 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file docpluck-2.4.69.tar.gz.

File metadata

Download URL: docpluck-2.4.69.tar.gz
Upload date: May 22, 2026
Size: 3.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docpluck-2.4.69.tar.gz
Algorithm	Hash digest
SHA256	`f29005f58dccc54d3e354dadc7d1bcb8a1629755e3fbb15ee8f6d703d141cfa6`
MD5	`0c18118f4cae3231df78ea80e17b1b10`
BLAKE2b-256	`e3ee9e0bb70fef15cbe0636113cc07d11ab215c45b094a6c244ed22e7bda6bb9`

Provenance

The following attestation bundles were made for docpluck-2.4.69.tar.gz:

Publisher: publish.yml on giladfeldman/docpluck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docpluck-2.4.69.tar.gz
- Subject digest: f29005f58dccc54d3e354dadc7d1bcb8a1629755e3fbb15ee8f6d703d141cfa6
- Sigstore transparency entry: 1603480231
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: giladfeldman/docpluck@72a4d4320eade77ccb8ee015673c60ab78f2d184
- Branch / Tag: refs/tags/v2.4.69
- Owner: https://github.com/giladfeldman
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@72a4d4320eade77ccb8ee015673c60ab78f2d184
- Trigger Event: push

File details

Details for the file docpluck-2.4.69-py3-none-any.whl.

File metadata

Download URL: docpluck-2.4.69-py3-none-any.whl
Upload date: May 22, 2026
Size: 194.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docpluck-2.4.69-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fdbb5466b68f998ade3e7d52118e74353ee91ec1c230bee8af5863a0c6ea777d`
MD5	`f0739d6e0f227f0ea5b945a927b2567d`
BLAKE2b-256	`743fd64f03d2ea1705485e8b5159c500aed5e05f97d4512c74d22351ddc6e450`

Provenance

The following attestation bundles were made for docpluck-2.4.69-py3-none-any.whl:

Publisher: publish.yml on giladfeldman/docpluck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docpluck-2.4.69-py3-none-any.whl
- Subject digest: fdbb5466b68f998ade3e7d52118e74353ee91ec1c230bee8af5863a0c6ea777d
- Sigstore transparency entry: 1603480458
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: giladfeldman/docpluck@72a4d4320eade77ccb8ee015673c60ab78f2d184
- Branch / Tag: refs/tags/v2.4.69
- Owner: https://github.com/giladfeldman
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@72a4d4320eade77ccb8ee015673c60ab78f2d184
- Trigger Event: push
