

███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗   ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗  ██║
███████╗█████╗  ██║        ██║   ██║██║   ██║██╔██╗ ██║
╚════██║██╔══╝  ██║        ██║   ██║██║   ██║██║╚██╗██║
███████║███████╗╚██████╗   ██║   ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝   ╚═╝   ╚═╝ ╚═════╝ ╚═╝  ╚═══╝
███╗   ███╗██╗███╗   ██╗███████╗██████╗
████╗ ████║██║████╗  ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗  ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝  ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║  ██║
╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝

Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.




Quickstart · Installation · Preset Sections · LiteLLM · CLI · API Reference · Web UI · Examples



Overview

SectionMiner is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.

PDF File  →  Text Extraction  →  Heading Detection  →  LLM Consolidation  →  Structured Tree
              (PyMuPDF / Gemini)   (font heuristics)    (OpenAI / LiteLLM)

Extraction Backends

Backend Description Best For
pymupdf (default) Local text extraction using PDF layout spans Clean, text-native PDFs
gemini OCR and extraction via Google Gemini Scanned docs, complex layouts

LLM Consolidation Backends

Backend Description
OpenAI (default) Uses ChatOpenAI with any OpenAI model
LiteLLM Uses ChatLiteLLM — supports OpenAI, Anthropic, Groq, Azure, Gemini, and more via a unified interface

✦ Quickstart

import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")

try:
    structure, usage = miner.extract_structure(return_tokens=True)

    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets
    start, end = miner.get_section_start_and_end_chars("introduction")
    print(miner.get_full_text()[start:end])
finally:
    miner.close()
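Because SectionMiner exposes a close() method, the try/finally pattern above can also be written with contextlib.closing from the standard library, which works with any object that has close(). A minimal sketch (DemoMiner is an illustrative stand-in, not part of the library):

```python
from contextlib import closing

class DemoMiner:
    """Stand-in with a close() method, mirroring SectionMiner's interface."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

# The quickstart's try/finally is equivalent to:
#     with closing(SectionMiner("paper.pdf", api_key="sk-...")) as miner:
#         structure, usage = miner.extract_structure(return_tokens=True)
miner = DemoMiner()
with closing(miner):
    pass  # work with the miner here
print(miner.closed)  # close() ran on exit
```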

⬇ Installation

From PyPI:

pip install sectionminer

From source:

git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

With LiteLLM support:

pip install sectionminer litellm langchain-community

Requirements

  • Python 3.10+
  • OPENAI_API_KEY — required for LLM consolidation (unless using LiteLLM with a different provider)
  • GEMINI_API_KEY — required only when using extraction_backend="gemini"

API Keys

Via environment variable:

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."      # optional, Gemini backend only
export LITELLM_API_KEY="..."     # optional, LiteLLM with non-OpenAI providers
export LITELLM_MODEL="openai/gpt-4o-mini"  # optional, LiteLLM model with provider prefix

Or via .env in your project root:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
LITELLM_API_KEY=...
LITELLM_MODEL=openai/gpt-4o-mini

🎯 Preset Sections

By default, SectionMiner extracts all sections it detects in the PDF. When you only need specific sections, use preset_sections to activate filter mode — the library will return only the sections whose titles match your list, ignoring everything else.

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

How matching works

Matching is flexible and normalised — it strips leading numbering, folds casing, removes diacritics, and collapses whitespace before comparing. This means a preset of "Introdução" will match headings like "-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", etc.

Preset Matches in PDF
"Introdução" "-Introdução", "1. INTRODUÇÃO", "Introdução Geral"
"Metodologia" "3. Metodologia", "METODOLOGIA", "2.3 Metodologia de Pesquisa"
"Conclusão" "-CONCLUSÃO", "Conclusão e Trabalhos Futuros"

Key behaviours

  • No fabrication — if a preset name has no match in the document, it is silently omitted. SectionMiner never invents sections.
  • Subsections follow their parent — subsections are included only when their parent section was matched.
  • Document order preserved — matched sections appear in the order they occur in the PDF, not in preset list order.
  • Double-filtered — the LLM is instructed to filter, and a Python post-processing step removes any hallucinated nodes before results are returned.
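The post-processing step can be pictured as a whitelist-and-reorder pass over the LLM's output. A sketch under assumed data shapes (the {"title": ..., "subsections": [...]} dicts are illustrative, not the library's internal format):

```python
def postfilter(llm_sections, document_titles):
    """Drop any node the LLM returned that has no matching heading in
    the document (no fabrication), and restore document order."""
    order = {t: i for i, t in enumerate(document_titles)}
    kept = [s for s in llm_sections if s["title"] in order]
    kept.sort(key=lambda s: order[s["title"]])  # document order, not preset order
    for s in kept:
        # subsections survive only on surviving parents
        s["subsections"] = [
            sub for sub in s.get("subsections", []) if sub["title"] in order
        ]
    return kept
```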

🔀 LiteLLM Support

LiteLLM lets you swap the LLM consolidation provider without changing your code — just set a model name with the appropriate provider prefix.

Supported providers (examples)

Provider model value
OpenAI openai/gpt-4o-mini
Anthropic anthropic/claude-3-haiku-20240307
Groq groq/llama3-8b-8192
Azure OpenAI azure/your-deployment-name
Google Gemini gemini/gemini-2.0-flash

Python

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="your-provider-api-key",
    model="anthropic/claude-3-haiku-20240307",
    use_litellm=True,
    preset_sections=["Introdução"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

LiteLLM + Gemini extraction backend

Use Gemini for PDF text extraction and LiteLLM for tree consolidation simultaneously:

miner = SectionMiner(
    "paper.pdf",
    api_key="your-litellm-provider-key",
    model="openai/gpt-4o-mini",         # LiteLLM: merge consolidation
    extraction_backend="gemini",         # Gemini: PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)

Via environment variables

LITELLM_MODEL=groq/llama3-8b-8192
LITELLM_API_KEY=gsk_...

Note: get_openai_callback in _run tracks token usage via OpenAI's SDK internals. When using LiteLLM with non-OpenAI providers, token counts may be reported as zero. Cost tracking works reliably only with OpenAI-compatible backends.


⌨ CLI

SectionMiner installs a sectionminer command.

sectionminer --help

Extract section structure

# Full extraction with LLM consolidation (OpenAI)
sectionminer extract paper.pdf --tokens --pretty

# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty

# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty

# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty

# Use Gemini for extraction
sectionminer extract paper.pdf --extraction-backend gemini --gemini-api-key AIza... --pretty

# Use LiteLLM for consolidation
sectionminer extract paper.pdf --use-litellm --litellm-model groq/llama3-8b-8192 --litellm-api-key gsk_...

# Gemini extraction + LiteLLM consolidation
sectionminer extract paper.pdf \
  --extraction-backend gemini --gemini-api-key AIza... \
  --use-litellm --litellm-model openai/gpt-4o-mini

Get text of a specific section

sectionminer section-text paper.pdf "introduction"

# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost

# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only

# With LiteLLM
sectionminer section-text paper.pdf "Introdução" \
  --use-litellm --litellm-model anthropic/claude-3-haiku-20240307

Note: --show-cost outputs cost info to stderr so it never pollutes JSON output.

LiteLLM CLI flags (available in all subcommands)

Flag Description
--use-litellm Enable LiteLLM backend (replaces OpenAI)
--litellm-model Model with provider prefix (e.g. groq/llama3-8b-8192). Fallback: LITELLM_MODEL env var
--litellm-api-key Provider API key. Fallback: LITELLM_API_KEY, then OPENAI_API_KEY
--preset-section / --preset-sections Filter output to the named section titles; may be passed multiple times

🌐 Web UI

SectionMiner includes a FastAPI-powered dashboard with real-time PDF rendering, section cards, a detail modal, and social links.

# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Or, if you prefer the module entrypoint
python -m sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash

# Use LiteLLM for consolidation
sectionminer runserver --use-litellm --litellm-model groq/llama3-8b-8192

# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only

If the sectionminer runserver command is missing, your local install is out of date. Run pip install -e . in the project, or pip install -U sectionminer in the active virtual environment.

Open in your browser: http://127.0.0.1:8000

Features:

  • Upload any PDF and view extracted sections in real time
  • Click a section to highlight its exact location in the PDF viewer
  • Open "Ver detalhe" to read the full section text in a modal
  • Dashboard shows: backend used, page count, section count, and extraction mode
  • Preset sections can be passed from the UI, CLI, API, or Python code

API Endpoints

Method Path Description
GET / Visual UI
POST /api/extract Upload PDF, returns structured JSON
GET /api/files/{job_id} Stream the uploaded PDF for rendering

Sample POST /api/extract response

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
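The locations list makes it easy to map a section back onto PDF pages for highlighting. A small helper over the payload shape shown above (pages_for_section is illustrative, not part of the API):

```python
sample = {
    "sections": [
        {"title": "1. Introduction", "level": 1,
         "start_char": 0, "end_char": 1200,
         "locations": [{"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0]}]},
    ]
}

def pages_for_section(payload, title):
    """Return the sorted set of pages a section's boxes appear on."""
    for sec in payload["sections"]:
        if sec["title"] == title:
            return sorted({loc["page"] for loc in sec["locations"]})
    return []
```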

Frontend styles (Tailwind)

The dashboard uses Tailwind utilities. If you want to customize the stylesheet build pipeline, install the Node dev dependencies and run:

npm install
npm run build:css   # one-off build
npm run dev:css     # watch mode

The entry stylesheet lives at sectionminer/server/static/tailwind.css and compiles to sectionminer/server/static/styles.css (served by FastAPI).


📖 API Reference

SectionMiner(path, api_key, **kwargs)

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                     # API key for LLM consolidation
    model="gpt-4o-mini",                  # Model name (OpenAI) or provider/model (LiteLLM)
    extraction_backend="pymupdf",         # "pymupdf" | "gemini"
    gemini_api_key="...",                 # required if backend="gemini"
    gemini_model="gemini-2.5-flash-lite", # optional, default model
    preset_sections=["Introdução", "Metodologia"],  # optional filter
    use_litellm=False,                    # set True to use LiteLLM instead of OpenAI
)

Parameters

Parameter Type Default Description
path str required Path to the PDF file
api_key str required API key for LLM consolidation
model str "gpt-4o-mini" Model name. For LiteLLM, include provider prefix (e.g. "groq/llama3-8b-8192")
extraction_backend str "pymupdf" "pymupdf" or "gemini"
gemini_api_key str None Google Gemini API key
gemini_model str "gemini-2.0-flash" Gemini model name
preset_sections list[str] None If provided, return only sections matching these names
use_litellm bool False Use LiteLLM instead of direct OpenAI for LLM consolidation

Methods

Method Returns Description
extract_structure(return_tokens=False) dict or (dict, usage) Full extraction pipeline. Returns section tree.
get_section_text(title) str Retrieve text of a section by title (fuzzy match).
get_section_start_and_end_chars(title) (int, int) Character offsets for a section in the full text.
get_full_text() str Complete linearized text of the PDF.
get_sections() list[str] List of all detected section titles.
close() None Release the open PDF file handle.

Low-level pipeline methods
Method Description
extract_blocks() Extract raw text spans from PDF
build_full_text() Assemble linearized full text
build_sections() Run heading detection heuristics

Useful for debugging or custom pipelines.
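Conceptually, the heading-detection stage is a font-size threshold over text spans. A toy sketch of the idea (not the library's actual _detect_threshold logic; the 1.15 margin and (text, font_size) tuples are assumptions):

```python
def detect_headings(spans, body_size=None):
    """Toy heading detector: spans noticeably larger than the body
    font are treated as headings. Each span is (text, font_size)."""
    sizes = sorted(size for _, size in spans)
    if body_size is None:
        body_size = sizes[len(sizes) // 2]  # median as body-font estimate
    threshold = body_size * 1.15            # assumed margin above body text
    return [text for text, size in spans if size > threshold]
```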


🔌 Backends

PyMuPDF (default)

miner = SectionMiner("paper.pdf", api_key="sk-...")

Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.

Gemini

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)

Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.


💡 Examples

Basic extraction
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()
Extract only specific sections (preset filter)
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    miner.extract_structure()
    print(miner.get_section_text("Introdução"))
    print(miner.get_section_text("Metodologia"))
finally:
    miner.close()
LiteLLM — swap provider without changing code
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="gsk_...",   # Groq API key
    model="groq/llama3-8b-8192",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Gemini extraction + LiteLLM consolidation
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                    # LiteLLM provider key
    model="openai/gpt-4o-mini",          # LiteLLM model
    extraction_backend="gemini",          # Gemini for PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Preset sections with Gemini backend
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Slice text by character offsets
miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()

💰 Cost Reference

Measured locally on 2026-03-21 using gpt-4o-mini:

File Size Pages Tokens Cost
artigo_1.pdf 0.74 MB 21 2,297 $0.000475
artigo_2.pdf 0.04 MB 4 356 $0.000060

Section text retrieval after extraction is free — it uses local character offsets. Using preset_sections reduces token usage further by limiting LLM output to matched sections only.
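Cost scales linearly with token counts. A back-of-the-envelope helper, using illustrative gpt-4o-mini-style per-million-token rates (check your provider's current pricing; these numbers are assumptions):

```python
def llm_cost_usd(prompt_tokens, completion_tokens,
                 input_per_million=0.15, output_per_million=0.60):
    """Estimate USD cost from token counts and per-million-token rates."""
    return (prompt_tokens * input_per_million
            + completion_tokens * output_per_million) / 1_000_000

# e.g. 1800 prompt + 450 completion tokens at the assumed rates
print(llm_cost_usd(1800, 450))
```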

Reproduce with:

sectionminer extract paper.pdf --show-cost --pretty

🗂 Project Structure

SectionMiner/
├── sectionminer/
│   ├── __init__.py        # Public API
│   ├── miner.py           # SectionMiner class
│   ├── client.py          # LLM client + tree merge (OpenAI / LiteLLM)
│   ├── prompts.py         # Consolidation prompt
│   └── server/            # FastAPI + UI (routes, static, templates)
├── examples/
│   ├── basic_usage.py
│   └── api_smoke_test.py
├── files/                 # Sample PDFs
├── test.py                # PyMuPDF + OpenAI pipeline example
├── test_litellm.py        # LiteLLM pipeline example
├── test_gemini_litellm.py # Gemini extraction + LiteLLM consolidation example
└── requirements.txt

🐛 Troubleshooting

"Invalid control character" when processing PDF

The PDF contains invalid control characters that break JSON serialization. The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.

Sections are fragmented or broken
  • Review _is_noise_heading and _looks_like_heading in sectionminer/miner.py
  • Adjust the threshold in _detect_threshold for your PDF's font pattern
  • Two-column layouts, intrusive footers, and poor OCR quality increase detection errors
Section not found by title
  • Try a variation without accents or in lowercase (search normalizes text)
  • Inspect available titles with miner.get_sections()
  • If using preset_sections, confirm the section actually exists in the PDF — presets with no match are silently omitted, never fabricated
Preset section returns None text

The section was matched by the LLM but start_char is null, meaning the title in section_structures differs from what the LLM returned. Debug with:

miner.extract_structure()
for s in miner.section_structures:
    print(repr(s["title"]), s["start"])

Use the exact title shown there (or a close variation) in preset_sections.

LiteLLM: "LLM Provider NOT provided"

You passed a model name without the provider prefix (e.g. "gpt-4o-mini" instead of "openai/gpt-4o-mini"). LiteLLM requires the prefix to identify the provider. Always use provider/model-name format.
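A fail-fast check for this is easy to add in your own code before constructing the miner. A sketch (split_model is a hypothetical helper, not a SectionMiner API):

```python
def split_model(model: str):
    """Validate LiteLLM's 'provider/model-name' format and split it."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(
            f"expected 'provider/model-name', got {model!r} "
            "(e.g. 'openai/gpt-4o-mini')"
        )
    return provider, name
```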

LiteLLM: token usage shows zeros

get_openai_callback only captures usage from OpenAI-compatible calls. With non-OpenAI providers via LiteLLM, token counts will report as zero. This is a known limitation — the extraction itself works correctly.

OpenAI key error
  • Confirm OPENAI_API_KEY is set in the same environment as your script
  • If using .env, ensure it's in the project root

🗺 Roadmap

Planned

  • Automated tests for detect_headings, build_sections, get_section_text
  • Expose heuristic parameters via config (threshold, noise filters)
  • Native LiteLLM token/cost tracking (replace get_openai_callback)

Done

  • LiteLLM support — use any provider for LLM consolidation
  • CLI: sectionminer extract file.pdf --output out.json
  • Heuristic-only mode (no LLM, fully offline)
  • Improved merge — keeps only valid sections/subsections without broken fragments
  • Web UI with PDF viewer and section highlighting
  • Preset sections filter — extract only named sections with flexible normalised matching

📄 License

MIT © ehodiogo


Made with ♥ for researchers who'd rather spend time reading papers than parsing them.

⬆ back to top

Download files

Source distribution

sectionminer-0.1.13.tar.gz (58.3 kB)

  • Tags: Source
  • Uploaded via: twine/6.2.0 CPython/3.13.5
  • Uploaded using Trusted Publishing? No

Hashes for sectionminer-0.1.13.tar.gz
Algorithm Hash digest
SHA256 d9f79100ee60a38710cf0b26d64cab8a5493d0d49ae3836efa79325f073308e4
MD5 1e15ccc0fe78083193a2380f6a4996da
BLAKE2b-256 370afdd319e75928af6f332e6a6b6521a3a7f46309ad68c63f2bcf201519b370

Built distribution

sectionminer-0.1.13-py3-none-any.whl (56.9 kB)

  • Tags: Python 3
  • Uploaded via: twine/6.2.0 CPython/3.13.5
  • Uploaded using Trusted Publishing? No

Hashes for sectionminer-0.1.13-py3-none-any.whl
Algorithm Hash digest
SHA256 8d7c70a7049c1ee202fa6d6cb90d4034b8c2f0c59675190d56038ddf14a8596c
MD5 767ed84f235190cd3ab0d31fff458451
BLAKE2b-256 0b0795fa94140c4fd467bb17e53cada161bca17a6cbbafe633ad3886c7b02122
