

███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗   ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗  ██║
███████╗█████╗  ██║        ██║   ██║██║   ██║██╔██╗ ██║
╚════██║██╔══╝  ██║        ██║   ██║██║   ██║██║╚██╗██║
███████║███████╗╚██████╗   ██║   ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝   ╚═╝   ╚═╝ ╚═════╝ ╚═╝  ╚═══╝
███╗   ███╗██╗███╗   ██╗███████╗██████╗
████╗ ████║██║████╗  ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗  ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝  ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║  ██║
╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝

Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.




Quickstart · Installation · Preset Sections · LiteLLM · CLI · API Reference · Web UI · Examples



Overview

SectionMiner is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.

PDF File  →  Text Extraction  →  Heading Detection  →  LLM Consolidation  →  Structured Tree
              (PyMuPDF / Gemini)   (font heuristics)    (OpenAI / LiteLLM)

Extraction Backends

Backend Description Best For
pymupdf (default) Local text extraction using PDF layout spans Clean, text-native PDFs
gemini OCR and extraction via Google Gemini Scanned docs, complex layouts

LLM Consolidation Backends

Backend Description
OpenAI (default) Uses ChatOpenAI with any OpenAI model
LiteLLM Uses ChatLiteLLM — supports OpenAI, Anthropic, Groq, Azure, Gemini, and more via a unified interface

✦ Quickstart

import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")

try:
    structure, usage = miner.extract_structure(return_tokens=True)

    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets
    start, end = miner.get_section_start_and_end_chars("introduction")
    print(miner.get_full_text()[start:end])
finally:
    miner.close()
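Because SectionMiner exposes a close() method, the try/finally pattern above can also be written with contextlib.closing from the standard library, which works with any object that has close(). A minimal sketch (DemoMiner is an illustrative stand-in, not part of the library):

```python
from contextlib import closing

class DemoMiner:
    """Stand-in with a close() method, mirroring SectionMiner's interface."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

# The quickstart's try/finally is equivalent to:
#     with closing(SectionMiner("paper.pdf", api_key="sk-...")) as miner:
#         structure, usage = miner.extract_structure(return_tokens=True)
miner = DemoMiner()
with closing(miner):
    pass  # work with the miner here
print(miner.closed)  # close() ran on exit
```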

⬇ Installation

From PyPI:

pip install sectionminer

From source:

git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

With LiteLLM support:

pip install sectionminer litellm langchain-community

Requirements

  • Python 3.10+
  • OPENAI_API_KEY — required for LLM consolidation (unless using LiteLLM with a different provider)
  • GEMINI_API_KEY — required only when using extraction_backend="gemini"

API Keys

Via environment variable:

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."      # optional, Gemini backend only
export LITELLM_API_KEY="..."     # optional, LiteLLM with non-OpenAI providers
export LITELLM_MODEL="openai/gpt-4o-mini"  # optional, LiteLLM model with provider prefix

Or via .env in your project root:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
LITELLM_API_KEY=...
LITELLM_MODEL=openai/gpt-4o-mini

🎯 Preset Sections

By default, SectionMiner extracts all sections it detects in the PDF. When you only need specific sections, use preset_sections to activate filter mode — the library will return only the sections whose titles match your list, ignoring everything else.

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

How matching works

Matching is flexible and normalised — it strips leading numbering, folds casing, removes diacritics, and collapses whitespace before comparing. This means a preset of "Introdução" will match headings like "-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", etc.

Preset Matches in PDF
"Introdução" "-Introdução", "1. INTRODUÇÃO", "Introdução Geral"
"Metodologia" "3. Metodologia", "METODOLOGIA", "2.3 Metodologia de Pesquisa"
"Conclusão" "-CONCLUSÃO", "Conclusão e Trabalhos Futuros"

Key behaviours

  • No fabrication — if a preset name has no match in the document, it is silently omitted. SectionMiner never invents sections.
  • Subsections follow their parent — subsections are included only when their parent section was matched.
  • Document order preserved — matched sections appear in the order they occur in the PDF, not in preset list order.
  • Double-filtered — the LLM is instructed to filter, and a Python post-processing step removes any hallucinated nodes before results are returned.
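The post-processing step can be pictured as a whitelist-and-reorder pass over the LLM's output. A sketch under assumed data shapes (the {"title": ..., "subsections": [...]} dicts are illustrative, not the library's internal format):

```python
def postfilter(llm_sections, document_titles):
    """Drop any node the LLM returned that has no matching heading in
    the document (no fabrication), and restore document order."""
    order = {t: i for i, t in enumerate(document_titles)}
    kept = [s for s in llm_sections if s["title"] in order]
    kept.sort(key=lambda s: order[s["title"]])  # document order, not preset order
    for s in kept:
        # subsections survive only on surviving parents
        s["subsections"] = [
            sub for sub in s.get("subsections", []) if sub["title"] in order
        ]
    return kept
```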

🔀 LiteLLM Support

LiteLLM lets you swap the LLM consolidation provider without changing your code — just set a model name with the appropriate provider prefix.

Supported providers (examples)

Provider model value
OpenAI openai/gpt-4o-mini
Anthropic anthropic/claude-3-haiku-20240307
Groq groq/llama3-8b-8192
Azure OpenAI azure/your-deployment-name
Google Gemini gemini/gemini-2.0-flash

Python

from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="your-provider-api-key",
    model="anthropic/claude-3-haiku-20240307",
    use_litellm=True,
    preset_sections=["Introdução"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()

LiteLLM + Gemini extraction backend

Use Gemini for PDF text extraction and LiteLLM for tree consolidation simultaneously:

miner = SectionMiner(
    "paper.pdf",
    api_key="your-litellm-provider-key",
    model="openai/gpt-4o-mini",         # LiteLLM: merge consolidation
    extraction_backend="gemini",         # Gemini: PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)

Via environment variables

LITELLM_MODEL=groq/llama3-8b-8192
LITELLM_API_KEY=gsk_...

Note: get_openai_callback in _run tracks token usage via OpenAI's SDK internals. When using LiteLLM with non-OpenAI providers, token counts may be reported as zero. Cost tracking works reliably only with OpenAI-compatible backends.


⌨ CLI

SectionMiner installs a sectionminer command.

sectionminer --help

Extract section structure

# Full extraction with LLM consolidation (OpenAI)
sectionminer extract paper.pdf --tokens --pretty

# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty

# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty

# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty

# Use Gemini for extraction
sectionminer extract paper.pdf --extraction-backend gemini --gemini-api-key AIza... --pretty

# Use LiteLLM for consolidation
sectionminer extract paper.pdf --use-litellm --litellm-model groq/llama3-8b-8192 --litellm-api-key gsk_...

# Gemini extraction + LiteLLM consolidation
sectionminer extract paper.pdf \
  --extraction-backend gemini --gemini-api-key AIza... \
  --use-litellm --litellm-model openai/gpt-4o-mini

Get text of a specific section

sectionminer section-text paper.pdf "introduction"

# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost

# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only

# With LiteLLM
sectionminer section-text paper.pdf "Introdução" \
  --use-litellm --litellm-model anthropic/claude-3-haiku-20240307

Note: --show-cost outputs cost info to stderr so it never pollutes JSON output.

LiteLLM CLI flags (available in all subcommands)

Flag Description
--use-litellm Enable LiteLLM backend (replaces OpenAI)
--litellm-model Model with provider prefix (e.g. groq/llama3-8b-8192). Fallback: LITELLM_MODEL env var
--litellm-api-key Provider API key. Fallback: LITELLM_API_KEY, then OPENAI_API_KEY
--preset-section / --preset-sections Filter output to the named section titles; may be passed multiple times

🌐 Web UI

SectionMiner includes a FastAPI-powered dashboard with real-time PDF rendering, section cards, a detail modal, and social links.

# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Or, if you prefer the module entrypoint
python -m sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash

# Use LiteLLM for consolidation
sectionminer runserver --use-litellm --litellm-model groq/llama3-8b-8192

# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only

If the sectionminer runserver command is missing, your local install is out of date. Run pip install -e . in the project, or pip install -U sectionminer in the active virtual environment.

Open in your browser: http://127.0.0.1:8000

Features:

  • Upload any PDF and view extracted sections in real time
  • Click a section to highlight its exact location in the PDF viewer
  • Open "Ver detalhe" to read the full section text in a modal
  • Dashboard shows: backend used, page count, section count, and extraction mode
  • Preset sections can be passed from the UI, CLI, API, or Python code

API Endpoints

Method Path Description
GET / Visual UI
POST /api/extract Upload PDF, returns structured JSON
GET /api/files/{job_id} Stream the uploaded PDF for rendering

Sample POST /api/extract response

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
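The locations list makes it easy to map a section back onto PDF pages for highlighting. A small helper over the payload shape shown above (pages_for_section is illustrative, not part of the API):

```python
sample = {
    "sections": [
        {"title": "1. Introduction", "level": 1,
         "start_char": 0, "end_char": 1200,
         "locations": [{"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0]}]},
    ]
}

def pages_for_section(payload, title):
    """Return the sorted set of pages a section's boxes appear on."""
    for sec in payload["sections"]:
        if sec["title"] == title:
            return sorted({loc["page"] for loc in sec["locations"]})
    return []
```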

Frontend styles (Tailwind)

The dashboard uses Tailwind utilities. If you want to customize the stylesheet build pipeline, install the Node dev dependencies and run:

npm install
npm run build:css   # one-off build
npm run dev:css     # watch mode

The entry stylesheet lives at sectionminer/server/static/tailwind.css and compiles to sectionminer/server/static/styles.css (served by FastAPI).


📖 API Reference

SectionMiner(path, api_key, **kwargs)

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                     # API key for LLM consolidation
    model="gpt-4o-mini",                  # Model name (OpenAI) or provider/model (LiteLLM)
    extraction_backend="pymupdf",         # "pymupdf" | "gemini"
    gemini_api_key="...",                 # required if backend="gemini"
    gemini_model="gemini-2.5-flash-lite", # optional, default model
    preset_sections=["Introdução", "Metodologia"],  # optional filter
    use_litellm=False,                    # set True to use LiteLLM instead of OpenAI
)

Parameters

Parameter Type Default Description
path str required Path to the PDF file
api_key str required API key for LLM consolidation
model str "gpt-4o-mini" Model name. For LiteLLM, include provider prefix (e.g. "groq/llama3-8b-8192")
extraction_backend str "pymupdf" "pymupdf" or "gemini"
gemini_api_key str None Google Gemini API key
gemini_model str "gemini-2.0-flash" Gemini model name
preset_sections list[str] None If provided, return only sections matching these names
use_litellm bool False Use LiteLLM instead of direct OpenAI for LLM consolidation

Methods

Method Returns Description
extract_structure(return_tokens=False) dict or (dict, usage) Full extraction pipeline. Returns section tree.
get_section_text(title) str Retrieve text of a section by title (fuzzy match).
get_section_start_and_end_chars(title) (int, int) Character offsets for a section in the full text.
get_full_text() str Complete linearized text of the PDF.
get_sections() list[str] List of all detected section titles.
close() None Release the open PDF file handle.

Low-level pipeline methods
Method Description
extract_blocks() Extract raw text spans from PDF
build_full_text() Assemble linearized full text
build_sections() Run heading detection heuristics

Useful for debugging or custom pipelines.
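Conceptually, the heading-detection stage is a font-size threshold over text spans. A toy sketch of the idea (not the library's actual _detect_threshold logic; the 1.15 margin and (text, font_size) tuples are assumptions):

```python
def detect_headings(spans, body_size=None):
    """Toy heading detector: spans noticeably larger than the body
    font are treated as headings. Each span is (text, font_size)."""
    sizes = sorted(size for _, size in spans)
    if body_size is None:
        body_size = sizes[len(sizes) // 2]  # median as body-font estimate
    threshold = body_size * 1.15            # assumed margin above body text
    return [text for text, size in spans if size > threshold]
```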


🔌 Backends

PyMuPDF (default)

miner = SectionMiner("paper.pdf", api_key="sk-...")

Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.

Gemini

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)

Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.


💡 Examples

Basic extraction
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()
Extract only specific sections (preset filter)
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    miner.extract_structure()
    print(miner.get_section_text("Introdução"))
    print(miner.get_section_text("Metodologia"))
finally:
    miner.close()
LiteLLM — swap provider without changing code
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="gsk_...",   # Groq API key
    model="groq/llama3-8b-8192",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Gemini extraction + LiteLLM consolidation
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                    # LiteLLM provider key
    model="openai/gpt-4o-mini",          # LiteLLM model
    extraction_backend="gemini",          # Gemini for PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Preset sections with Gemini backend
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Slice text by character offsets
miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()

💰 Cost Reference

Measured locally on 2026-03-21 using gpt-4o-mini:

File Size Pages Tokens Cost
artigo_1.pdf 0.74 MB 21 2,297 $0.000475
artigo_2.pdf 0.04 MB 4 356 $0.000060

Section text retrieval after extraction is free — it uses local character offsets. Using preset_sections reduces token usage further by limiting LLM output to matched sections only.
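Cost scales linearly with token counts. A back-of-the-envelope helper, using illustrative gpt-4o-mini-style per-million-token rates (check your provider's current pricing; these numbers are assumptions):

```python
def llm_cost_usd(prompt_tokens, completion_tokens,
                 input_per_million=0.15, output_per_million=0.60):
    """Estimate USD cost from token counts and per-million-token rates."""
    return (prompt_tokens * input_per_million
            + completion_tokens * output_per_million) / 1_000_000

# e.g. 1800 prompt + 450 completion tokens at the assumed rates
print(llm_cost_usd(1800, 450))
```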

Reproduce with:

sectionminer extract paper.pdf --show-cost --pretty

🗂 Project Structure

SectionMiner/
├── sectionminer/
│   ├── __init__.py        # Public API
│   ├── miner.py           # SectionMiner class
│   ├── client.py          # LLM client + tree merge (OpenAI / LiteLLM)
│   ├── prompts.py         # Consolidation prompt
│   └── server/            # FastAPI + UI (routes, static, templates)
├── examples/
│   ├── basic_usage.py
│   └── api_smoke_test.py
├── files/                 # Sample PDFs
├── test.py                # PyMuPDF + OpenAI pipeline example
├── test_litellm.py        # LiteLLM pipeline example
├── test_gemini_litellm.py # Gemini extraction + LiteLLM consolidation example
└── requirements.txt

🐛 Troubleshooting

"Invalid control character" when processing PDF

The PDF contains invalid control characters that break JSON serialization. The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.

Sections are fragmented or broken
  • Review _is_noise_heading and _looks_like_heading in sectionminer/miner.py
  • Adjust the threshold in _detect_threshold for your PDF's font pattern
  • Two-column layouts, intrusive footers, and poor OCR quality increase detection errors
Section not found by title
  • Try a variation without accents or in lowercase (search normalizes text)
  • Inspect available titles with miner.get_sections()
  • If using preset_sections, confirm the section actually exists in the PDF — presets with no match are silently omitted, never fabricated
Preset section returns None text

The section was matched by the LLM but start_char is null, meaning the title in section_structures differs from what the LLM returned. Debug with:

miner.extract_structure()
for s in miner.section_structures:
    print(repr(s["title"]), s["start"])

Use the exact title shown there (or a close variation) in preset_sections.

LiteLLM: "LLM Provider NOT provided"

You passed a model name without the provider prefix (e.g. "gpt-4o-mini" instead of "openai/gpt-4o-mini"). LiteLLM requires the prefix to identify the provider. Always use provider/model-name format.
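A fail-fast check for this is easy to add in your own code before constructing the miner. A sketch (split_model is a hypothetical helper, not a SectionMiner API):

```python
def split_model(model: str):
    """Validate LiteLLM's 'provider/model-name' format and split it."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(
            f"expected 'provider/model-name', got {model!r} "
            "(e.g. 'openai/gpt-4o-mini')"
        )
    return provider, name
```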

LiteLLM: token usage shows zeros

get_openai_callback only captures usage from OpenAI-compatible calls. With non-OpenAI providers via LiteLLM, token counts will report as zero. This is a known limitation — the extraction itself works correctly.

OpenAI key error
  • Confirm OPENAI_API_KEY is set in the same environment as your script
  • If using .env, ensure it's in the project root

🗺 Roadmap

Planned

  • Automated tests for detect_headings, build_sections, get_section_text
  • Expose heuristic parameters via config (threshold, noise filters)
  • Native LiteLLM token/cost tracking (replace get_openai_callback)

Done

  • LiteLLM support — use any provider for LLM consolidation
  • CLI: sectionminer extract file.pdf --output out.json
  • Heuristic-only mode (no LLM, fully offline)
  • Improved merge — keeps only valid sections/subsections without broken fragments
  • Web UI with PDF viewer and section highlighting
  • Preset sections filter — extract only named sections with flexible normalised matching

📄 License

MIT © ehodiogo


Made with ♥ for researchers who'd rather spend time reading papers than parsing them.

⬆ back to top

Download files

Source distribution

sectionminer-0.1.13.tar.gz (58.3 kB)

  • Tags: Source
  • Uploaded via: twine/6.2.0 CPython/3.13.5
  • Uploaded using Trusted Publishing? No

Hashes for sectionminer-0.1.13.tar.gz
Algorithm Hash digest
SHA256 d9f79100ee60a38710cf0b26d64cab8a5493d0d49ae3836efa79325f073308e4
MD5 1e15ccc0fe78083193a2380f6a4996da
BLAKE2b-256 370afdd319e75928af6f332e6a6b6521a3a7f46309ad68c63f2bcf201519b370

Built distribution

sectionminer-0.1.13-py3-none-any.whl (56.9 kB)

  • Tags: Python 3
  • Uploaded via: twine/6.2.0 CPython/3.13.5
  • Uploaded using Trusted Publishing? No

Hashes for sectionminer-0.1.13-py3-none-any.whl
Algorithm Hash digest
SHA256 8d7c70a7049c1ee202fa6d6cb90d4034b8c2f0c59675190d56038ddf14a8596c
MD5 767ed84f235190cd3ab0d31fff458451
BLAKE2b-256 0b0795fa94140c4fd467bb17e53cada161bca17a6cbbafe633ad3886c7b02122
