Extract sections and subsections from academic PDFs


███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗   ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗  ██║
███████╗█████╗  ██║        ██║   ██║██║   ██║██╔██╗ ██║
╚════██║██╔══╝  ██║        ██║   ██║██║   ██║██║╚██╗██║
███████║███████╗╚██████╗   ██║   ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝   ╚═╝   ╚═╝ ╚═════╝ ╚═╝  ╚═══╝
███╗   ███╗██╗███╗   ██╗███████╗██████╗
████╗ ████║██║████╗  ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗  ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝  ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║  ██║
╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝

Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.



Quickstart · Installation · CLI · API Reference · Web UI · Examples



Overview

SectionMiner is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.

PDF File  →  Text Extraction  →  Heading Detection  →  LLM Consolidation  →  Structured Tree
              (PyMuPDF / Gemini)   (font heuristics)    (OpenAI gpt-4o-mini)

Extraction Backends

Backend            Description                                   Best For
pymupdf (default)  Local text extraction using PDF layout spans  Clean, text-native PDFs
gemini             OCR and extraction via Google Gemini          Scanned docs, complex layouts

In both cases, LLM consolidation of the final section tree is handled by OpenAI.


✦ Quickstart

import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")

try:
    structure, usage = miner.extract_structure(return_tokens=True)

    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets
    start, end = miner.get_section_start_and_end_chars("introduction")
    print(miner.get_full_text()[start:end])
finally:
    miner.close()

⬇ Installation

From PyPI:

pip install sectionminer

From source:

git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Requirements

  • Python 3.10+
  • OPENAI_API_KEY — required for LLM consolidation
  • GEMINI_API_KEY — required only when using extraction_backend="gemini"

API Keys

Via environment variable:

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."      # optional, Gemini backend only

Or via .env in your project root:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
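
The two options above can be combined in your own script. The following is a minimal sketch, assuming a plain KEY=VALUE .env format; the helper names parse_env_file and resolve_api_key are hypothetical and are not part of SectionMiner, whose own .env handling is not described here.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in OPENAI_API_KEY="sk-..."
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(name, env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    value = os.environ.get(name)
    if value:
        return value
    if Path(env_path).is_file():
        return parse_env_file(env_path).get(name)
    return None
```

The resolved value can then be passed as api_key= when constructing SectionMiner.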

⌨ CLI

SectionMiner installs a sectionminer command.

sectionminer --help

Extract section structure

# Full extraction with LLM consolidation
sectionminer extract paper.pdf --tokens --pretty

# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty

# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty

# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty

Get text of a specific section

sectionminer section-text paper.pdf "introduction"

# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost

# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only

Note: --show-cost outputs cost info to stderr so it never pollutes JSON output.
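
The stream separation described in the note can be illustrated in a few lines. This is a sketch of the general pattern, not the CLI's actual code; emit_result and the shape of the usage dict are assumptions made for the example.

```python
import io
import json
import sys


def emit_result(result, usage, out=None, err=None):
    """Write the JSON payload to `out` and the cost summary to `err`."""
    out = out if out is not None else sys.stdout
    err = err if err is not None else sys.stderr
    json.dump(result, out, ensure_ascii=False)
    out.write("\n")
    err.write(f"tokens={usage['total_tokens']} cost_usd={usage['cost_usd']:.6f}\n")


# Capture both streams to show that the stdout side stays valid JSON.
out_buf, err_buf = io.StringIO(), io.StringIO()
emit_result(
    {"title": "Introduction"},
    {"total_tokens": 2250, "cost_usd": 0.00046},
    out=out_buf,
    err=err_buf,
)
parsed = json.loads(out_buf.getvalue())
```

Because the two streams are independent, a standard shell redirect such as 2> cost.log keeps the cost line out of anything that parses stdout.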


🌐 Web UI

SectionMiner includes a FastAPI-powered visual interface with real-time PDF rendering and section highlighting.

# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash

# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only

If the sectionminer runserver command is not available, update your local installation: run pip install -U . or pip install -U sectionminer inside your virtual environment.

Open in your browser: http://127.0.0.1:8000

Features:

  • Upload any PDF and view extracted sections in real time
  • Click a section to highlight its exact location in the PDF viewer
  • Dashboard shows: backend used, page count, section count, token usage, cost

API Endpoints

Method  Path                 Description
GET     /                    Visual UI
POST    /api/extract         Upload PDF, returns structured JSON
GET     /api/files/{job_id}  Stream the uploaded PDF for rendering

Sample POST /api/extract response

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
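
A client can turn this response into a readable outline using only the fields shown in the sample. The sketch below works on a trimmed copy of that sample; the outline helper is hypothetical, and treating page as a zero-based index is an assumption based on the sample value.

```python
import json

# Trimmed copy of the sample response above.
payload = json.loads("""
{
  "pages": 10,
  "metrics": {"prompt_tokens": 1800, "completion_tokens": 450,
              "total_tokens": 2250, "cost_usd": 0.00046},
  "sections": [
    {"title": "1. Introduction", "level": 1,
     "start_char": 0, "end_char": 1200,
     "locations": [{"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0]}]}
  ]
}
""")


def outline(sections):
    """Return one indented line per section with its length and pages."""
    lines = []
    for section in sections:
        indent = "  " * (section["level"] - 1)
        pages = sorted({loc["page"] for loc in section["locations"]})
        length = section["end_char"] - section["start_char"]
        lines.append(f"{indent}{section['title']} (chars: {length}, pages: {pages})")
    return lines


metrics = payload["metrics"]
# Sanity check: prompt and completion tokens should add up to the total.
tokens_consistent = (
    metrics["prompt_tokens"] + metrics["completion_tokens"] == metrics["total_tokens"]
)
```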

Frontend styles (Tailwind)

The web UI CSS is built with Tailwind. Install the Node dev dependencies once, then build or watch:

npm install
npm run build:css   # one-off build
npm run dev:css     # watch mode

The entry stylesheet lives at sectionminer/server/static/tailwind.css and compiles to sectionminer/server/static/styles.css (served by FastAPI).


📖 API Reference

SectionMiner(path, api_key, **kwargs)

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                     # OpenAI API key
    extraction_backend="pymupdf",         # "pymupdf" | "gemini"
    gemini_api_key="...",                 # required if backend="gemini"
    gemini_model="gemini-2.5-flash-lite", # optional, default model
)

Methods

Method                                  Returns                Description
extract_structure(return_tokens=False)  dict or (dict, usage)  Full extraction pipeline. Returns section tree.
get_section_text(title)                 str                    Retrieve text of a section by title (fuzzy match).
get_section_start_and_end_chars(title)  (int, int)             Character offsets for a section in the full text.
get_full_text()                         str                    Complete linearized text of the PDF.
get_sections()                          list[str]              List of all detected section titles.
close()                                 None                   Release the open PDF file handle.

Low-level pipeline methods

Method             Description
extract_blocks()   Extract raw text spans from PDF
build_full_text()  Assemble linearized full text
build_sections()   Run heading detection heuristics

Useful for debugging or custom pipelines.
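
When debugging, it often helps to flatten the section tree returned by extract_structure() into an indented list. The exact node shape is not documented here, so the sketch below assumes each node is a dict with a "title" string and a "children" list; adjust the keys to whatever the real structure contains.

```python
def flatten(node, depth=0):
    """Yield (depth, title) pairs in document order (pre-order traversal)."""
    yield depth, node.get("title", "")
    for child in node.get("children", []):
        yield from flatten(child, depth + 1)


# Hypothetical tree, shaped per the assumption stated above.
tree = {
    "title": "Paper",
    "children": [
        {"title": "1. Introduction", "children": []},
        {"title": "2. Methods", "children": [{"title": "2.1 Data", "children": []}]},
    ],
}

rows = list(flatten(tree))
for depth, title in rows:
    print("  " * depth + title)
```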


🔌 Backends

PyMuPDF (default)

miner = SectionMiner("paper.pdf", api_key="sk-...")

Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.
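
To make the font-size idea concrete, here is a minimal sketch of one common heuristic: treat a span as a heading candidate when its font size is clearly above the median size of the document. This is an illustration only, not SectionMiner's implementation; the span dicts, the pick_heading_spans name, and the 1.15 ratio are all assumptions.

```python
from statistics import median


def pick_heading_spans(spans, ratio=1.15):
    """Return texts of spans whose size is at least `ratio` x the median size."""
    body_size = median(span["size"] for span in spans)
    threshold = body_size * ratio
    return [span["text"] for span in spans if span["size"] >= threshold]


# Hypothetical spans: body text at 10 pt, headings at 14 pt.
spans = [
    {"text": "1. Introduction", "size": 14.0},
    {"text": "Deep learning has changed ...", "size": 10.0},
    {"text": "Prior work focuses on ...", "size": 10.0},
    {"text": "2. Methods", "size": 14.0},
    {"text": "We collect a corpus of ...", "size": 10.0},
]

headings = pick_heading_spans(spans)
```

Using the median rather than the mean keeps a handful of large title spans from shifting the estimate of the body font size.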

Gemini

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)

Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.


💡 Examples

Basic extraction

from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()

With Gemini backend

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(structure.get("title"))
finally:
    miner.close()

Slice text by character offsets

from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()

💰 Cost Reference

Measured locally on 2026-03-21 using gpt-4o-mini:

File          Size     Pages  Tokens  Cost
artigo_1.pdf  0.74 MB  21     2,297   $0.000475
artigo_2.pdf  0.04 MB  4      356     $0.000060

Section text retrieval after extraction is free — it uses local character offsets.

Reproduce with:

sectionminer extract paper.pdf --show-cost --pretty
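
Costs like those in the table come from token counts multiplied by per-token rates. The sketch below shows that arithmetic only; the rates and token counts are placeholder assumptions, not confirmed pricing or measured values, so check the provider's current price list before relying on the result.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_rate_per_million, output_rate_per_million):
    """Return the estimated cost in USD for one LLM call."""
    total = (prompt_tokens * input_rate_per_million
             + completion_tokens * output_rate_per_million)
    return total / 1_000_000


# Placeholder rates (USD per million tokens) and token counts, for illustration.
cost = estimate_cost_usd(
    prompt_tokens=2000,
    completion_tokens=300,
    input_rate_per_million=0.15,
    output_rate_per_million=0.60,
)
print(f"${cost:.6f}")
```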

🗂 Project Structure

SectionMiner/
├── sectionminer/
│   ├── __init__.py        # Public API
│   ├── miner.py           # SectionMiner class
│   ├── client.py          # LLM client + tree merge
│   ├── prompts.py         # Consolidation prompt
│   └── server/            # FastAPI + UI (routes, static, templates)
├── examples/
│   ├── basic_usage.py
│   └── api_smoke_test.py
├── files/                 # Sample PDFs
├── test.py                # PyMuPDF pipeline example
├── test_gemini.py         # Gemini pipeline example
└── requirements.txt

🐛 Troubleshooting

"Invalid control character" when processing PDF

The PDF contains invalid control characters that break JSON serialization. The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.

Sections are fragmented or broken

  • Review _is_noise_heading and _looks_like_heading in sectionminer/miner.py
  • Adjust the threshold in _detect_threshold for your PDF's font pattern
  • Two-column layouts, intrusive footers, and poor OCR quality increase detection errors
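
For orientation before editing those functions, this is roughly what a noise filter for heading candidates can look like. It is a hypothetical sketch, not the body of _is_noise_heading; the rules and the max_words default are assumptions you would tune per document.

```python
import re


def is_probably_noise(text, max_words=12):
    """Heuristically reject candidates that are rarely real section headings."""
    stripped = text.strip()
    if len(stripped) < 3:
        return True  # too short to be a title
    if re.fullmatch(r"[\d\s.\-/]+", stripped):
        return True  # page numbers, dates, bare numbering
    if len(stripped.split()) > max_words:
        return True  # long lines are usually body text
    if stripped.endswith((".", ",", ";")):
        return True  # headings seldom end with sentence punctuation
    return False


candidates = ["12", "2024-05-01", "1. Introduction", "See Table 3 for details."]
kept = [c for c in candidates if not is_probably_noise(c)]
```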

Section not found by title

  • Try a variation without accents or in lowercase (search normalizes text)
  • Inspect available titles with miner.get_sections()
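
If a lookup still fails, it can help to reproduce the normalization yourself and compare against the titles from miner.get_sections(). The sketch below shows one standard way to strip accents and case with the standard library; it is an assumption about how such matching can work, not the library's own matching code, and both helper names are hypothetical.

```python
import unicodedata


def normalize_title(title):
    """Strip accents, fold case, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def find_section(query, titles):
    """Return the first title containing the normalized query, else None."""
    needle = normalize_title(query)
    for title in titles:
        if needle in normalize_title(title):
            return title
    return None


# Hypothetical titles, as they might come back from get_sections().
titles = ["1. Introdução", "2. Métodos", "3. Conclusão"]
match = find_section("introducao", titles)
```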

OpenAI key error

  • Confirm OPENAI_API_KEY is set in the same environment as your script
  • If using .env, ensure it's in the project root

🗺 Roadmap

  • Automated tests for detect_headings, build_sections, get_section_text
  • Expose heuristic parameters via config (threshold, noise filters)
  • CLI: sectionminer extract file.pdf --output out.json
  • Heuristic-only mode (no LLM, fully offline)
  • Improved merge — keeps only valid sections/subsections without broken fragments
  • Web UI with PDF viewer and section highlighting

📄 License

MIT © ehodiogo


Made with ♥ for researchers who'd rather spend time reading papers than parsing them.
