
Extract sections and subsections from academic PDFs



███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗   ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗  ██║
███████╗█████╗  ██║        ██║   ██║██║   ██║██╔██╗ ██║
╚════██║██╔══╝  ██║        ██║   ██║██║   ██║██║╚██╗██║
███████║███████╗╚██████╗   ██║   ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝   ╚═╝   ╚═╝ ╚═════╝ ╚═╝  ╚═══╝
███╗   ███╗██╗███╗   ██╗███████╗██████╗
████╗ ████║██║████╗  ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗  ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝  ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║  ██║
╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝

Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.




Quickstart · Installation · Preset Sections · CLI · API Reference · Web UI · Examples



Overview

SectionMiner is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.

PDF File  →  Text Extraction  →  Heading Detection  →  LLM Consolidation  →  Structured Tree
              (PyMuPDF / Gemini)   (font heuristics)    (OpenAI gpt-4o-mini)

Extraction Backends

| Backend | Description | Best For |
|---|---|---|
| pymupdf (default) | Local text extraction using PDF layout spans | Clean, text-native PDFs |
| gemini | OCR and extraction via Google Gemini | Scanned docs, complex layouts |

In both cases, LLM consolidation of the final section tree is handled by OpenAI.


✦ Quickstart

import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")

try:
    structure, usage = miner.extract_structure(return_tokens=True)

    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets
    start, end = miner.get_section_start_and_end_chars("introduction")
    print(miner.get_full_text()[start:end])
finally:
    miner.close()
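
The try/finally pattern above can also be written with `contextlib.closing` from the standard library, which calls `close()` on any object when the block exits. The sketch below uses a hypothetical `StandInMiner` class (not part of SectionMiner) so it runs without a PDF or an API key; with the real library you would presumably pass a `SectionMiner` instance instead, since it exposes `close()`.

```python
from contextlib import closing


class StandInMiner:
    """Hypothetical stand-in for SectionMiner, used only to make this sketch runnable."""

    def __init__(self):
        self.closed = False

    def get_section_text(self, title):
        return f"text of {title}"

    def close(self):
        self.closed = True


miner = StandInMiner()

# closing() guarantees miner.close() runs, even if the body raises.
with closing(miner) as m:
    text = m.get_section_text("introduction")

print(text, miner.closed)
```

This is only a convenience; it does the same cleanup as the explicit `finally: miner.close()` shown in the quickstart.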

⬇ Installation

From PyPI:

pip install sectionminer

From source:

git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Requirements

  • Python 3.10+
  • OPENAI_API_KEY — required for LLM consolidation
  • GEMINI_API_KEY — required only when using extraction_backend="gemini"

API Keys

Via environment variable:

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."      # optional, Gemini backend only

Or via .env in your project root:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...

🎯 Preset Sections

By default, SectionMiner extracts all sections it detects in the PDF. When you only need specific sections, use preset_sections to activate filter mode — the library will return only the sections whose titles match your list, ignoring everything else.

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

How matching works

Matching is flexible and normalised — it strips leading numbering, folds casing, removes diacritics, and collapses whitespace before comparing. This means a preset of "Introdução" will match headings like "-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", etc.
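
The normalisation steps described above can be sketched roughly as follows. This is an illustration, not the library's actual code: `normalise_title` and `matches_preset` are hypothetical names, and the prefix rule (a preset matches a heading that equals it or starts with it) is an assumption inferred from the example headings.

```python
import re
import unicodedata

# Optional leading dash or bullet, then optional numbering such as "1." or "2.1"
# followed by whitespace. Assumed pattern, for illustration only.
_LEADING_NUMBERING = re.compile(r"^[\s\-•]*(?:\d+(?:\.\d+)*[.)]?\s+)?")


def normalise_title(title: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold casing, strip leading numbering, collapse whitespace.
    folded = without_marks.casefold()
    unnumbered = _LEADING_NUMBERING.sub("", folded, count=1)
    return " ".join(unnumbered.split())


def matches_preset(preset: str, heading: str) -> bool:
    p, h = normalise_title(preset), normalise_title(heading)
    return bool(p) and (h == p or h.startswith(p + " "))


for heading in ["-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", "4. Resultados"]:
    print(heading, matches_preset("Introdução", heading))
```

Requiring whitespace after the numbering keeps titles such as "3D Models" intact, at the cost of not stripping numbering written without a space (for example "1.Intro").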

| Preset | Matches in PDF |
|---|---|
| "Introdução" | "-Introdução", "1. INTRODUÇÃO", "Introdução Geral" |
| "Metodologia" | "3. Metodologia", "METODOLOGIA", "2.3 Metodologia de Pesquisa" |
| "Conclusão" | "-CONCLUSÃO", "Conclusão e Trabalhos Futuros" |

Key behaviours

  • No fabrication — if a preset name has no match in the document, it is silently omitted. SectionMiner never invents sections.
  • Subsections follow their parent — subsections are included only when their parent section was matched.
  • Document order preserved — matched sections appear in the order they occur in the PDF, not in preset list order.
  • Double-filtered — the LLM is instructed to filter, and a Python post-processing step removes any hallucinated nodes before results are returned.
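
A post-processing filter with these behaviours might look like the sketch below. It is an assumption-laden illustration rather than SectionMiner's implementation: `filter_sections`, `_norm`, and `_matches` are hypothetical helpers, and the input is assumed to be a flat, document-ordered list of dicts with `title` and `level` keys (the same keys that appear in the web API response).

```python
import re
import unicodedata


def _norm(title: str) -> str:
    # Simplified normaliser: drop diacritics, fold case, strip leading numbering.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).casefold()
    text = re.sub(r"^[\s\-•]*(?:\d+(?:\.\d+)*[.)]?\s+)?", "", text)
    return " ".join(text.split())


def _matches(preset: str, title: str) -> bool:
    p, t = _norm(preset), _norm(title)
    return bool(p) and (t == p or t.startswith(p + " "))


def filter_sections(sections, presets):
    kept = []
    ancestors = []  # stack of (level, was_kept) for the current branch
    for section in sections:
        # Pop siblings and deeper nodes so the top of the stack is the parent.
        while ancestors and ancestors[-1][0] >= section["level"]:
            ancestors.pop()
        parent_kept = bool(ancestors) and ancestors[-1][1]
        keep = parent_kept or any(_matches(p, section["title"]) for p in presets)
        if keep:
            kept.append(section)
        ancestors.append((section["level"], keep))
    return kept


sections = [
    {"title": "1. Introdução", "level": 1},
    {"title": "1.1 Motivação", "level": 2},
    {"title": "2. Trabalhos Relacionados", "level": 1},
    {"title": "2.1 Abordagens", "level": 2},
    {"title": "3. Conclusão", "level": 1},
]
# "Bibliografia" has no match, so it is simply absent from the result.
result = filter_sections(sections, ["Conclusão", "Introdução", "Bibliografia"])
print([s["title"] for s in result])
```

Because the loop walks the sections in their original order and never creates new nodes, the output keeps document order and cannot contain a section that was not in the input.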

With Gemini backend

preset_sections works identically with both backends:

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)

try:
    miner.extract_structure()
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

⌨ CLI

SectionMiner installs a sectionminer command.

sectionminer --help

Extract section structure

# Full extraction with LLM consolidation
sectionminer extract paper.pdf --tokens --pretty

# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty

# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty

# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty

Get text of a specific section

sectionminer section-text paper.pdf "introduction"

# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost

# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only

Note: --show-cost outputs cost info to stderr so it never pollutes JSON output.


🌐 Web UI

SectionMiner includes a FastAPI-powered visual interface with real-time PDF rendering and section highlighting.

# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash

# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only

If the sectionminer runserver command is not available, update your local installation: run pip install -U . or pip install -U sectionminer inside your virtual environment.

Open in your browser: http://127.0.0.1:8000

Features:

  • Upload any PDF and view extracted sections in real time
  • Click a section to highlight its exact location in the PDF viewer
  • Dashboard shows: backend used, page count, section count, token usage, cost

API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | / | Visual UI |
| POST | /api/extract | Upload PDF, returns structured JSON |
| GET | /api/files/{job_id} | Stream the uploaded PDF for rendering |

Sample POST /api/extract response

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
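
A client could consume this response along the following lines. The sketch embeds a trimmed copy of the sample payload so it runs offline; `summarise` is a hypothetical helper, and treating `page` as a zero-based index is an assumption based on the `"page": 0` value in the sample.

```python
import json

# Trimmed copy of the sample response above.
payload = json.loads("""
{
  "metrics": {"prompt_tokens": 1800, "completion_tokens": 450,
              "total_tokens": 2250, "cost_usd": 0.00046},
  "sections": [
    {"title": "1. Introduction", "level": 1, "start_char": 0, "end_char": 1200,
     "text": "...",
     "locations": [{"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..."}]}
  ]
}
""")


def summarise(payload):
    rows = []
    for section in payload["sections"]:
        locations = section.get("locations") or []
        rows.append({
            "title": section["title"],
            "level": section["level"],
            # Section length in characters, from the offsets into the full text.
            "chars": section["end_char"] - section["start_char"],
            "first_page": locations[0]["page"] if locations else None,
        })
    return rows


rows = summarise(payload)
for row in rows:
    indent = "  " * (row["level"] - 1)
    print(f'{indent}{row["title"]} ({row["chars"]} chars, page index {row["first_page"]})')
```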

Frontend styles (Tailwind)

The web UI CSS is built with Tailwind. Install the Node dev dependencies once, then build or watch:

npm install
npm run build:css   # one-off build
npm run dev:css     # watch mode

The entry stylesheet lives at sectionminer/server/static/tailwind.css and compiles to sectionminer/server/static/styles.css (served by FastAPI).


📖 API Reference

SectionMiner(path, api_key, **kwargs)

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                     # OpenAI API key
    extraction_backend="pymupdf",         # "pymupdf" | "gemini"
    gemini_api_key="...",                 # required if backend="gemini"
    gemini_model="gemini-2.5-flash-lite", # optional; overrides the default Gemini model
    preset_sections=["Introdução", "Metodologia"],  # optional filter
)

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | | Path to the PDF file |
| api_key | str | | OpenAI API key for LLM consolidation |
| model | str | "gpt-4o-mini" | OpenAI model to use |
| extraction_backend | str | "pymupdf" | "pymupdf" or "gemini" |
| gemini_api_key | str | None | Google Gemini API key |
| gemini_model | str | "gemini-2.0-flash" | Gemini model name |
| preset_sections | list[str] | None | If provided, return only sections matching these names |

Methods

| Method | Returns | Description |
|---|---|---|
| extract_structure(return_tokens=False) | dict or (dict, usage) | Full extraction pipeline. Returns section tree. |
| get_section_text(title) | str | Retrieve text of a section by title (fuzzy match). |
| get_section_start_and_end_chars(title) | (int, int) | Character offsets for a section in the full text. |
| get_full_text() | str | Complete linearized text of the PDF. |
| get_sections() | list[str] | List of all detected section titles. |
| close() | None | Release the open PDF file handle. |

Low-level pipeline methods

| Method | Description |
|---|---|
| extract_blocks() | Extract raw text spans from PDF |
| build_full_text() | Assemble linearized full text |
| build_sections() | Run heading detection heuristics |

Useful for debugging or custom pipelines.


🔌 Backends

PyMuPDF (default)

miner = SectionMiner("paper.pdf", api_key="sk-...")

Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.
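
To give a feel for what a font-size heuristic can look like, here is a minimal sketch. It is not SectionMiner's detection logic: `detect_heading_candidates` and the `ratio` parameter are hypothetical, and the input mimics the span dicts PyMuPDF produces (each span has `text` and `size` keys). The idea is to treat the font size that covers the most characters as body text and flag noticeably larger spans.

```python
from collections import Counter


def detect_heading_candidates(spans, ratio=1.15):
    # Weight each font size by how many characters use it.
    weight_by_size = Counter()
    for span in spans:
        weight_by_size[round(span["size"], 1)] += len(span["text"])
    if not weight_by_size:
        return []
    body_size = weight_by_size.most_common(1)[0][0]
    threshold = body_size * ratio  # assumed cut-off, tune per document
    return [
        span["text"].strip()
        for span in spans
        if span["size"] >= threshold and span["text"].strip()
    ]


spans = [
    {"text": "1. Introduction", "size": 14.0},
    {"text": "Section extraction from PDFs is hard because layout varies widely.", "size": 10.0},
    {"text": "1.1 Background", "size": 12.0},
    {"text": "Prior work relies on font size and position cues.", "size": 10.0},
    {"text": "Page 3", "size": 8.0},
]
candidates = detect_heading_candidates(spans)
print(candidates)
```

A fixed ratio like this is fragile on PDFs where headings share the body font size and differ only in weight, which is one reason a consolidation step after the heuristics is useful.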

Gemini

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)

Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.


💡 Examples

Basic extraction

from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()

Extract only specific sections (preset filter)

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    miner.extract_structure()
    # Only matched sections are returned — no hallucination, no extras
    print(miner.get_section_text("Introdução"))
    print(miner.get_section_text("Metodologia"))
finally:
    miner.close()

Preset sections with Gemini backend

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

With Gemini backend (full extraction)

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(structure.get("title"))
finally:
    miner.close()

Slice text by character offsets

from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()

💰 Cost Reference

Measured locally on 2026-03-21 using gpt-4o-mini:

| File | Size | Pages | Tokens | Cost |
|---|---|---|---|---|
| artigo_1.pdf | 0.74 MB | 21 | 2,297 | $0.000475 |
| artigo_2.pdf | 0.04 MB | 4 | 356 | $0.000060 |

Section text retrieval after extraction is free — it uses local character offsets. Using preset_sections reduces token usage further by limiting LLM output to matched sections only.

Reproduce with:

sectionminer extract paper.pdf --show-cost --pretty

🗂 Project Structure

SectionMiner/
├── sectionminer/
│   ├── __init__.py        # Public API
│   ├── miner.py           # SectionMiner class
│   ├── client.py          # LLM client + tree merge
│   ├── prompts.py         # Consolidation prompt
│   └── server/            # FastAPI + UI (routes, static, templates)
├── examples/
│   ├── basic_usage.py
│   └── api_smoke_test.py
├── files/                 # Sample PDFs
├── test.py                # PyMuPDF pipeline example
├── test_gemini.py         # Gemini pipeline example
└── requirements.txt

🐛 Troubleshooting

"Invalid control character" when processing PDF

The PDF contains invalid control characters that break JSON serialization. The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.
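
For context, Python's `json.loads` rejects raw control characters inside string values by default, which is where this kind of error message typically comes from. If you need to clean text yourself, a sanitising step could look like the sketch below; `strip_control_chars` is a hypothetical helper and not necessarily how the library does it.

```python
import json
import re

# C0 control characters and DEL, except tab, newline, and carriage return,
# which are legal whitespace between JSON tokens.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    return _CONTROL_CHARS.sub("", text)


# A JSON document with a form feed and a stray \x01 inside a string value.
raw = '{"text": "Intro\x0cduction\x01"}'

try:
    json.loads(raw)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

clean = json.loads(strip_control_chars(raw))
print(parsed_raw, clean)
```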

Sections are fragmented or broken

  • Review _is_noise_heading and _looks_like_heading in sectionminer/miner.py
  • Adjust the threshold in _detect_threshold for your PDF's font pattern
  • Two-column layouts, intrusive footers, and poor OCR quality increase detection errors

Section not found by title

  • Try a variation without accents or in lowercase (search normalizes text)
  • Inspect available titles with miner.get_sections()
  • If using preset_sections, confirm the section actually exists in the PDF: presets with no match are silently omitted, never fabricated

Preset section returns None text

The section was matched by the LLM but start_char is null, meaning the title in section_structures differs from what the LLM returned. Debug with:

miner.extract_structure()
for s in miner.section_structures:
    print(repr(s["title"]), s["start"])

Use the exact title shown there (or a close variation) in preset_sections.

OpenAI key error

  • Confirm OPENAI_API_KEY is set in the same environment as your script
  • If using .env, ensure it's in the project root
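
A quick way to confirm which keys are visible to the current process is a small check like the one below, run from the same environment as your script. `missing_keys` is a hypothetical helper; only the variable names `OPENAI_API_KEY` and `GEMINI_API_KEY` come from this README.

```python
import os


def missing_keys(required=("OPENAI_API_KEY",), env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Add "GEMINI_API_KEY" to the tuple when using extraction_backend="gemini".
missing = missing_keys(("OPENAI_API_KEY",))
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required keys are set.")
```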

🗺 Roadmap

  • Automated tests for detect_headings, build_sections, get_section_text
  • Expose heuristic parameters via config (threshold, noise filters)
  • CLI: sectionminer extract file.pdf --output out.json
  • Heuristic-only mode (no LLM, fully offline)
  • Improved merge — keeps only valid sections/subsections without broken fragments
  • Web UI with PDF viewer and section highlighting
  • Preset sections filter — extract only named sections with flexible normalised matching

📄 License

MIT © ehodiogo


Made with ♥ for researchers who'd rather spend time reading papers than parsing them.


