sectionminer

Extract sections and subsections from academic PDFs

███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗   ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗  ██║
███████╗█████╗  ██║        ██║   ██║██║   ██║██╔██╗ ██║
╚════██║██╔══╝  ██║        ██║   ██║██║   ██║██║╚██╗██║
███████║███████╗╚██████╗   ██║   ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝   ╚═╝   ╚═╝ ╚═════╝ ╚═╝  ╚═══╝
███╗   ███╗██╗███╗   ██╗███████╗██████╗
████╗ ████║██║████╗  ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗  ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝  ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║  ██║
╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝

Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.

Overview

SectionMiner is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.

PDF File  →  Text Extraction  →  Heading Detection  →  LLM Consolidation  →  Structured Tree
              (PyMuPDF / Gemini)   (font heuristics)    (OpenAI gpt-4o-mini)

Extraction Backends

Backend	Description	Best For
`pymupdf` (default)	Local text extraction using PDF layout spans	Clean, text-native PDFs
`gemini`	OCR and extraction via Google Gemini	Scanned docs, complex layouts

In both cases, LLM consolidation of the final section tree is handled by OpenAI.

✦ Quickstart

import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")

try:
    structure, usage = miner.extract_structure(return_tokens=True)

    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets
    start, end = miner.get_section_start_and_end_chars("introduction")
    print(miner.get_full_text()[start:end])
finally:
    miner.close()

⬇ Installation

From PyPI:

pip install sectionminer

From source:

git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Requirements

Python 3.10+
OPENAI_API_KEY — required for LLM consolidation
GEMINI_API_KEY — required only when using extraction_backend="gemini"

API Keys

Via environment variable:

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."      # optional, Gemini backend only

Or via .env in your project root:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
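
If you would rather resolve the keys yourself and pass them explicitly through api_key and gemini_api_key, something like the sketch below could work. The helpers read_dotenv and resolve_key are hypothetical names introduced here for illustration; they are not part of SectionMiner, and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path


def read_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    file = Path(path)
    if not file.is_file():
        return values
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate an optional "export " prefix and surrounding quotes.
        key = key.strip().removeprefix("export ").strip()
        values[key] = value.strip().strip("\"'")
    return values


def resolve_key(name, dotenv_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get(name) or read_dotenv(dotenv_path).get(name)


# Usage (hypothetical): SectionMiner("paper.pdf", api_key=resolve_key("OPENAI_API_KEY"))
```

Checking the environment first means a key exported in the shell overrides whatever is stored in the file.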

🎯 Preset Sections

By default, SectionMiner extracts all sections it detects in the PDF. When you only need specific sections, use preset_sections to activate filter mode — the library will return only the sections whose titles match your list, ignoring everything else.

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

How matching works

Matching is flexible and normalised — it strips leading numbering, folds casing, removes diacritics, and collapses whitespace before comparing. This means a preset of "Introdução" will match headings like "-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", etc.

Preset	Matches in PDF
`"Introdução"`	`"-Introdução"`, `"1. INTRODUÇÃO"`, `"Introdução Geral"`
`"Metodologia"`	`"3. Metodologia"`, `"METODOLOGIA"`, `"2.3 Metodologia de Pesquisa"`
`"Conclusão"`	`"-CONCLUSÃO"`, `"Conclusão e Trabalhos Futuros"`
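
The normalisation steps described above can be approximated in a few lines. This is a minimal sketch, not SectionMiner's actual implementation: the function names normalise and matches are hypothetical, and treating a match as "normalised preset is a substring of the normalised heading" is an assumption inferred from the table.

```python
import re
import unicodedata


def normalise(title):
    """Fold case, drop diacritics and leading numbering, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.casefold()
    # Strip leading bullets or numbering such as "-", "1.", "2.3" or "3)".
    # The lookahead keeps titles like "3D Reconstruction" intact.
    text = re.sub(r"^[\s\-–•*]*(\d+(\.\d+)*[.)]?(?=\s|$))?[\s\-–:]*", "", text)
    return re.sub(r"\s+", " ", text).strip()


def matches(preset, heading):
    """True when the normalised preset occurs inside the normalised heading."""
    wanted = normalise(preset)
    return bool(wanted) and wanted in normalise(heading)


for heading in ["-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", "Resultados"]:
    print(f"{heading!r}: {matches('Introdução', heading)}")
```

Substring matching is permissive by design: it lets "Metodologia" match "2.3 Metodologia de Pesquisa", at the price of also matching longer headings that merely contain the preset.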

Key behaviours

No fabrication — if a preset name has no match in the document, it is silently omitted. SectionMiner never invents sections.
Subsections follow their parent — subsections are included only when their parent section was matched.
Document order preserved — matched sections appear in the order they occur in the PDF, not in preset list order.
Double-filtered — the LLM is instructed to filter, and a Python post-processing step removes any hallucinated nodes before results are returned.
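
To make these behaviours concrete, here is a rough sketch of what a post-processing filter of this kind might look like. It is illustrative only: the node shape (dicts with "title" and "children" keys) and the names _key and filter_sections are assumptions, not the library's internals, and the comparison key is simplified (it does not strip numbering).

```python
import unicodedata


def _key(title):
    """Accent- and case-insensitive comparison key (simplified)."""
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())


def filter_sections(sections, presets):
    """Keep top-level nodes whose title contains a preset name."""
    wanted = [_key(p) for p in presets]
    kept = []
    # Iterate over the document, not the preset list, so document order is kept.
    for node in sections:
        title = _key(node["title"])
        if any(w and w in title for w in wanted):
            # Children ride along with their matched parent; children of
            # unmatched parents are never visited, so they are dropped.
            kept.append(node)
    return kept


tree = [
    {"title": "1. Introdução", "children": [{"title": "1.1 Contexto", "children": []}]},
    {"title": "2. Trabalhos Relacionados", "children": []},
    {"title": "3. CONCLUSÃO", "children": []},
]
# "Apêndice" has no match, so it is simply absent from the result.
print([n["title"] for n in filter_sections(tree, ["Conclusão", "Introdução", "Apêndice"])])
```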

With Gemini backend

preset_sections works identically with both backends:

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)

try:
    miner.extract_structure()
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

⌨ CLI

SectionMiner installs a sectionminer command.

sectionminer --help

Extract section structure

# Full extraction with LLM consolidation
sectionminer extract paper.pdf --tokens --pretty

# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty

# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty

# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty

Get text of a specific section

sectionminer section-text paper.pdf "introduction"

# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost

# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only

Note: --show-cost outputs cost info to stderr so it never pollutes JSON output.
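
Because cost information goes to stderr, the CLI can be driven from a script and its stdout parsed directly. The wrapper below is a sketch under the assumption that sectionminer extract writes a single JSON document to stdout; build_command and run_extract are hypothetical helper names, and only flags documented above are used.

```python
import json
import subprocess


def build_command(pdf_path, heuristic_only=False, show_cost=False):
    """Assemble the argument list for `sectionminer extract`."""
    cmd = ["sectionminer", "extract", str(pdf_path)]
    if heuristic_only:
        cmd.append("--heuristic-only")
    if show_cost:
        cmd.append("--show-cost")
    return cmd


def run_extract(pdf_path, **flags):
    """Run the CLI and return (parsed stdout JSON, raw stderr text)."""
    result = subprocess.run(
        build_command(pdf_path, **flags),
        capture_output=True,  # keep stdout and stderr as separate streams
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout), result.stderr


# Usage (requires the sectionminer command on PATH):
# structure, cost_report = run_extract("paper.pdf", show_cost=True)
```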

🌐 Web UI

SectionMiner includes a FastAPI-powered visual interface with real-time PDF rendering and section highlighting.

# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash

# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only

If the sectionminer runserver command is not available, update your local installation with pip install -U . or pip install -U sectionminer inside your virtual environment.

Open in your browser: http://127.0.0.1:8000

Features:

Upload any PDF and view extracted sections in real time
Click a section to highlight its exact location in the PDF viewer
Dashboard shows: backend used, page count, section count, token usage, cost

API Endpoints

Method	Path	Description
`GET`	`/`	Visual UI
`POST`	`/api/extract`	Upload PDF, returns structured JSON
`GET`	`/api/files/{job_id}`	Stream the uploaded PDF for rendering

Sample POST /api/extract response

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
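
A client might consume this response roughly as follows. The sketch works on a trimmed copy of the sample above and relies only on fields that appear in it; summarise is a hypothetical helper, and page numbers are printed exactly as returned.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "metrics": {"pages": 10, "sections": 24, "total_tokens": 2250, "cost_usd": 0.00046},
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [{"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..."}]
    }
  ]
}
"""


def summarise(response):
    """Return one line per section: indented title, length, and pages hit."""
    lines = []
    for section in response["sections"]:
        indent = "  " * (section["level"] - 1)  # indent subsections by level
        length = section["end_char"] - section["start_char"]
        pages = sorted({loc["page"] for loc in section["locations"]})
        lines.append(f"{indent}{section['title']} ({length} chars, pages {pages})")
    return lines


for line in summarise(json.loads(SAMPLE)):
    print(line)
```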

Frontend styles (Tailwind)

The web UI CSS is built with Tailwind. Install the Node dev dependencies once, then build or watch:

npm install
npm run build:css   # one-off build
npm run dev:css     # watch mode

The entry stylesheet lives at sectionminer/server/static/tailwind.css and compiles to sectionminer/server/static/styles.css (served by FastAPI).

📖 API Reference

`SectionMiner(path, api_key, **kwargs)`

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                     # OpenAI API key
    extraction_backend="pymupdf",         # "pymupdf" | "gemini"
    gemini_api_key="...",                 # required if backend="gemini"
    gemini_model="gemini-2.5-flash-lite", # optional; overrides the default model
    preset_sections=["Introdução", "Metodologia"],  # optional filter
)

Parameters

Parameter	Type	Default	Description
`path`	`str`	—	Path to the PDF file
`api_key`	`str`	—	OpenAI API key for LLM consolidation
`model`	`str`	`"gpt-4o-mini"`	OpenAI model to use
`extraction_backend`	`str`	`"pymupdf"`	`"pymupdf"` or `"gemini"`
`gemini_api_key`	`str`	`None`	Google Gemini API key
`gemini_model`	`str`	`"gemini-2.0-flash"`	Gemini model name
`preset_sections`	`list[str]`	`None`	If provided, return only sections matching these names

Methods

Method	Returns	Description
`extract_structure(return_tokens=False)`	`dict` or `(dict, usage)`	Full extraction pipeline. Returns section tree.
`get_section_text(title)`	`str`	Retrieve text of a section by title (fuzzy match).
`get_section_start_and_end_chars(title)`	`(int, int)`	Character offsets for a section in the full text.
`get_full_text()`	`str`	Complete linearized text of the PDF.
`get_sections()`	`list[str]`	List of all detected section titles.
`close()`	`None`	Release the open PDF file handle.
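
Since the class exposes close(), the try/finally pattern used throughout this README can also be written with contextlib.closing from the standard library. The helper below is a sketch: preview_sections is a hypothetical name, and it accepts any object offering the methods listed in the table above.

```python
from contextlib import closing


def preview_sections(miner, n_chars=200):
    """Return {title: leading text} and close the miner even on errors."""
    with closing(miner):  # calls miner.close() when the block exits
        miner.extract_structure()
        return {
            # get_section_text may return None for an unresolved title.
            title: (miner.get_section_text(title) or "")[:n_chars]
            for title in miner.get_sections()
        }


# Usage (hypothetical):
# from sectionminer import SectionMiner
# print(preview_sections(SectionMiner("paper.pdf", api_key="sk-...")))
```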

Low-level pipeline methods

Method	Description
`extract_blocks()`	Extract raw text spans from PDF
`build_full_text()`	Assemble linearized full text
`build_sections()`	Run heading detection heuristics

Useful for debugging or custom pipelines.

🔌 Backends

PyMuPDF (default)

miner = SectionMiner("paper.pdf", api_key="sk-...")

Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.

Gemini

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)

Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.

💡 Examples

Basic extraction

from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()

Extract only specific sections (preset filter)

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    miner.extract_structure()
    # Only matched sections are returned — no hallucination, no extras
    print(miner.get_section_text("Introdução"))
    print(miner.get_section_text("Metodologia"))
finally:
    miner.close()

Preset sections with Gemini backend

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

With Gemini backend (full extraction)

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(structure.get("title"))
finally:
    miner.close()

Slice text by character offsets

from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()

💰 Cost Reference

Measured locally on 2026-03-21 using gpt-4o-mini:

File	Size	Pages	Tokens	Cost
`artigo_1.pdf`	0.74 MB	21	2,297	`$0.000475`
`artigo_2.pdf`	0.04 MB	4	356	`$0.000060`

Section text retrieval after extraction is free — it uses local character offsets. Using preset_sections reduces token usage further by limiting LLM output to matched sections only.

Reproduce with:

sectionminer extract paper.pdf --show-cost --pretty
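
For budgeting, the two measurements in the table above can be turned into rough per-unit figures. The arithmetic below is a back-of-the-envelope sketch: it assumes cost scales linearly with the number of papers, which two data points cannot confirm, and the function names are hypothetical.

```python
# (file, pages, tokens, cost in USD), copied from the table above.
MEASURED = [
    ("artigo_1.pdf", 21, 2297, 0.000475),
    ("artigo_2.pdf", 4, 356, 0.000060),
]


def cost_per_1k_tokens(tokens, cost_usd):
    """Observed cost normalised to 1,000 tokens."""
    return cost_usd / tokens * 1000


def estimate_batch_cost(n_papers, cost_per_paper):
    """Linear extrapolation: every paper is assumed to cost the same."""
    return n_papers * cost_per_paper


for name, pages, tokens, cost in MEASURED:
    print(f"{name}: ${cost_per_1k_tokens(tokens, cost):.6f} per 1k tokens, "
          f"${cost / pages:.7f} per page")

# Upper-bound guess for 1,000 papers, using the more expensive measurement.
worst = max(cost for _, _, _, cost in MEASURED)
print(f"1,000 papers: about ${estimate_batch_cost(1000, worst):.3f}")
```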

🗂 Project Structure

SectionMiner/
├── sectionminer/
│   ├── __init__.py        # Public API
│   ├── miner.py           # SectionMiner class
│   ├── client.py          # LLM client + tree merge
│   ├── prompts.py         # Consolidation prompt
│   └── server/            # FastAPI + UI (routes, static, templates)
├── examples/
│   ├── basic_usage.py
│   └── api_smoke_test.py
├── files/                 # Sample PDFs
├── test.py                # PyMuPDF pipeline example
├── test_gemini.py         # Gemini pipeline example
└── requirements.txt

🐛 Troubleshooting

"Invalid control character" when processing PDF

The PDF contains invalid control characters that break JSON serialization. The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.

Sections are fragmented or broken

Review _is_noise_heading and _looks_like_heading in sectionminer/miner.py
Adjust the threshold in _detect_threshold for your PDF's font pattern
Two-column layouts, intrusive footers, and poor OCR quality increase detection errors
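
To see why the threshold matters, consider a generic font-size heuristic. The sketch below is not the code in sectionminer/miner.py; it only illustrates one common approach, assuming the most frequent font size is body text and anything at least margin points larger is a heading candidate. The names guess_heading_threshold and find_headings, and the (text, font_size) input shape, are hypothetical.

```python
from collections import Counter


def guess_heading_threshold(sizes, margin=1.0):
    """Treat the most common (rounded) font size as body text."""
    body_size, _ = Counter(round(s, 1) for s in sizes).most_common(1)[0]
    return body_size + margin


def find_headings(spans, margin=1.0):
    """spans: iterable of (text, font_size). Return texts at or above the threshold."""
    spans = list(spans)
    threshold = guess_heading_threshold([size for _, size in spans], margin)
    return [text for text, size in spans if size >= threshold and text.strip()]


spans = [
    ("1. Introduction", 14.0),
    ("Body text a", 10.0),
    ("Body text b", 10.0),
    ("Body text c", 10.0),
    ("7", 8.0),  # page number in a smaller font
    ("2. Methods", 14.0),
]
print(find_headings(spans))
```

With a heuristic like this, a margin that is too small promotes emphasised body text to headings, while one that is too large misses subsections set only slightly larger than the body.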

Section not found by title

Try a variation without accents or in lowercase (search normalizes text)
Inspect available titles with miner.get_sections()
If using preset_sections, confirm the section actually exists in the PDF — presets with no match are silently omitted, never fabricated
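
When a lookup fails, it can help to rank the detected titles by similarity to what you typed. The helper below is a sketch built on difflib from the standard library; suggest_titles and fold are hypothetical names, and the titles list stands in for the output of miner.get_sections().

```python
import difflib
import unicodedata


def fold(text):
    """Lowercase and strip diacritics so accents do not affect ranking."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def suggest_titles(query, titles, limit=3, cutoff=0.5):
    """Return the detected titles that most resemble `query`, best first."""
    folded = {fold(t): t for t in titles}
    close = difflib.get_close_matches(fold(query), list(folded), n=limit, cutoff=cutoff)
    return [folded[key] for key in close]


# Stand-in for miner.get_sections().
titles = ["1. Introdução", "2. Metodologia", "3. Resultados", "4. Conclusão"]
print(suggest_titles("introducao", titles))
```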

Preset section returns None text

The section was matched by the LLM but start_char is null, meaning the title in section_structures differs from what the LLM returned. Debug with:

miner.extract_structure()
for s in miner.section_structures:
    print(repr(s["title"]), s["start"])

Use the exact title shown there (or a close variation) in preset_sections.

OpenAI key error

Confirm OPENAI_API_KEY is set in the same environment as your script
If using .env, ensure it's in the project root

🗺 Roadmap

Automated tests for detect_headings, build_sections, get_section_text
Expose heuristic parameters via config (threshold, noise filters)
CLI: sectionminer extract file.pdf --output out.json
Heuristic-only mode (no LLM, fully offline)
Improved merge — keeps only valid sections/subsections without broken fragments
Web UI with PDF viewer and section highlighting
Preset sections filter — extract only named sections with flexible normalised matching

📄 License

MIT © ehodiogo

Made with ♥ for researchers who'd rather spend time reading papers than parsing them.

Release history

Version	Released
0.1.13	May 15, 2026
0.1.12	Apr 28, 2026
0.1.11	Apr 20, 2026
0.1.10	Apr 17, 2026
0.1.9	Mar 31, 2026
0.1.8 (this version)	Mar 30, 2026
0.1.7	Mar 27, 2026
0.1.6	Mar 24, 2026
0.1.5	Mar 24, 2026
0.1.4	Mar 23, 2026
0.1.3	Mar 23, 2026
0.1.2	Mar 23, 2026
0.1.1	Mar 23, 2026

Download files

Distribution	File	Size	Uploaded
Source	sectionminer-0.1.8.tar.gz	44.4 kB	Mar 30, 2026
Built (Python 3)	sectionminer-0.1.8-py3-none-any.whl	41.2 kB	Mar 30, 2026

File details

Details for the file sectionminer-0.1.8.tar.gz.

File metadata

Download URL: sectionminer-0.1.8.tar.gz
Upload date: Mar 30, 2026
Size: 44.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for sectionminer-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`d05215c56c641e4f28bf65173fcb5524d7c2e1829036b428ec038019de2e0f3f`
MD5	`c296bfe69113e78bb13e9c3a365fc2d2`
BLAKE2b-256	`1a0cb56748c2af028d78f182f266840507859b64442bf793c574b9b772ef5c07`

File details

Details for the file sectionminer-0.1.8-py3-none-any.whl.

File metadata

Download URL: sectionminer-0.1.8-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 41.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for sectionminer-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d9e78e9ea92e15b6c0e92a9bcd7dd05daa8ca5af4d87ef76a4667f29ff79f544`
MD5	`30d7f0ebc2f4fc00881802ed1a5c1095`
BLAKE2b-256	`1ad25746a9b48dd5d4cc8b59037ffaeda1db00b45140193ef3c3f4dae3bbdf86`
