Extract sections and subsections from academic PDFs
███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗ ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗ ██║
███████╗█████╗ ██║ ██║ ██║██║ ██║██╔██╗ ██║
╚════██║██╔══╝ ██║ ██║ ██║██║ ██║██║╚██╗██║
███████║███████╗╚██████╗ ██║ ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝ ╚═╝ ╚═╝ ╚═════╝ ╚═╝ ╚═══╝
███╗ ███╗██╗███╗ ██╗███████╗██████╗
████╗ ████║██║████╗ ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗ ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝ ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║ ██║
╚═╝ ╚═╝╚═╝╚═╝ ╚═══╝╚══════╝╚═╝ ╚═╝
Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.
Quickstart · Installation · Preset Sections · LiteLLM · CLI · API Reference · Web UI · Examples
Overview
SectionMiner is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.
PDF File → Text Extraction → Heading Detection → LLM Consolidation → Structured Tree
(PyMuPDF / Gemini) (font heuristics) (OpenAI / LiteLLM)
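As a rough illustration of the heading-detection stage, the sketch below flags text spans whose font size clearly exceeds the dominant body size. It is not SectionMiner's implementation: the span shape (`text`, `size`), the function name `guess_headings`, and the 1.15 ratio are assumptions made for this example.

```python
from collections import Counter

def guess_headings(spans, ratio=1.15):
    """Return the text of spans whose font size clearly exceeds the body size.

    `spans` is a list of {"text": str, "size": float} dicts (hypothetical shape).
    """
    if not spans:
        return []
    # Weight each font size by the number of characters set in it, so the
    # body text (which has the most characters) defines the baseline size.
    weight = Counter()
    for span in spans:
        weight[round(span["size"], 1)] += len(span["text"])
    body_size = weight.most_common(1)[0][0]
    threshold = body_size * ratio  # assumed margin above the body size
    return [
        s["text"].strip()
        for s in spans
        if s["size"] >= threshold and s["text"].strip()
    ]

spans = [
    {"text": "1. Introduction", "size": 14.0},
    {"text": "Academic PDFs rarely expose their logical structure directly.", "size": 10.0},
    {"text": "1.1 Motivation", "size": 12.0},
    {"text": "Downstream tools need section boundaries to work with.", "size": 10.0},
]
print(guess_headings(spans))
```

A purely size-based rule like this tends to over-detect on title pages and captions, which is one reason a consolidation step after the heuristics is useful.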
Extraction Backends
| Backend | Description | Best For |
|---|---|---|
| pymupdf (default) | Local text extraction using PDF layout spans | Clean, text-native PDFs |
| gemini | OCR and extraction via Google Gemini | Scanned docs, complex layouts |
LLM Consolidation Backends
| Backend | Description |
|---|---|
| OpenAI (default) | Uses ChatOpenAI with any OpenAI model |
| LiteLLM | Uses ChatLiteLLM — supports OpenAI, Anthropic, Groq, Azure, Gemini, and more via a unified interface |
✦ Quickstart
import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets
    start, end = miner.get_section_start_and_end_chars("introduction")
    print(miner.get_full_text()[start:end])
finally:
    miner.close()
⬇ Installation
From PyPI:
pip install sectionminer
From source:
git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
With LiteLLM support:
pip install sectionminer litellm langchain-community
Requirements
- Python 3.10+
- OPENAI_API_KEY: required for LLM consolidation (unless using LiteLLM with a different provider)
- GEMINI_API_KEY: required only when using extraction_backend="gemini"
API Keys
Via environment variable:
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..." # optional, Gemini backend only
export LITELLM_API_KEY="..." # optional, LiteLLM with non-OpenAI providers
export LITELLM_MODEL="openai/gpt-4o-mini" # optional, LiteLLM model with provider prefix
Or via .env in your project root:
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
LITELLM_API_KEY=...
LITELLM_MODEL=openai/gpt-4o-mini
🎯 Preset Sections
By default, SectionMiner extracts all sections it detects in the PDF. When you only need specific sections, use preset_sections to activate filter mode — the library will return only the sections whose titles match your list, ignoring everything else.
miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
How matching works
Matching is flexible and normalised — it strips leading numbering, folds casing, removes diacritics, and collapses whitespace before comparing. This means a preset of "Introdução" will match headings like "-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", etc.
| Preset | Matches in PDF |
|---|---|
| "Introdução" | "-Introdução", "1. INTRODUÇÃO", "Introdução Geral" |
| "Metodologia" | "3. Metodologia", "METODOLOGIA", "2.3 Metodologia de Pesquisa" |
| "Conclusão" | "-CONCLUSÃO", "Conclusão e Trabalhos Futuros" |
Key behaviours
- No fabrication — if a preset name has no match in the document, it is silently omitted. SectionMiner never invents sections.
- Subsections follow their parent — subsections are included only when their parent section was matched.
- Document order preserved — matched sections appear in the order they occur in the PDF, not in preset list order.
- Double-filtered — the LLM is instructed to filter, and a Python post-processing step removes any hallucinated nodes before results are returned.
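The matching and filtering behaviours described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: the helper names (`normalize_title`, `preset_matches`, `filter_sections`), the flat `title`/`level` section shape, and the prefix-based comparison are all assumptions.

```python
import re
import unicodedata

# Leading bullets/dashes, then an optional "1." or "2.3" style number
# (only when followed by whitespace), then any trailing separators.
_LEADING_MARKERS = re.compile(r"^[\s\-–•*]*(?:\d+(?:\.\d+)*\.?(?=\s))?[\s\-–:)]*")

def normalize_title(title):
    """Hypothetical normaliser, not SectionMiner's actual code."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fold case, then strip leading numbering and bullets.
    stripped = _LEADING_MARKERS.sub("", without_accents.casefold())
    return " ".join(stripped.split())  # collapse runs of whitespace

def preset_matches(preset, heading):
    """True when the normalised heading equals or starts with the preset."""
    p, h = normalize_title(preset), normalize_title(heading)
    return bool(p) and (h == p or h.startswith(p + " "))

def filter_sections(sections, presets):
    """Keep matched level-1 sections and their subsections, in document order."""
    kept, parent_kept = [], False
    for section in sections:  # assumed to be in document order already
        if section["level"] == 1:
            parent_kept = any(preset_matches(p, section["title"]) for p in presets)
        if parent_kept:
            kept.append(section)
    return kept

sections = [
    {"title": "1. INTRODUÇÃO", "level": 1},
    {"title": "1.1 Motivação", "level": 2},
    {"title": "2. Trabalhos Relacionados", "level": 1},
    {"title": "2.1 Metodologia", "level": 2},
    {"title": "3. Conclusão e Trabalhos Futuros", "level": 1},
]
presets = ["Conclusão", "Introdução", "Apêndice"]
for s in filter_sections(sections, presets):
    print(s["title"])
```

In this toy run the unmatched preset "Apêndice" produces nothing, "2.1 Metodologia" is dropped because its parent was not matched, and the results follow document order rather than preset order.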
🔀 LiteLLM Support
LiteLLM lets you swap the LLM consolidation provider without changing your code — just set a model name with the appropriate provider prefix.
Supported providers (examples)
| Provider | model value |
|---|---|
| OpenAI | openai/gpt-4o-mini |
| Anthropic | anthropic/claude-3-haiku-20240307 |
| Groq | groq/llama3-8b-8192 |
| Azure OpenAI | azure/your-deployment-name |
| Google Gemini | gemini/gemini-2.0-flash |
Python
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="your-provider-api-key",
    model="anthropic/claude-3-haiku-20240307",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
LiteLLM + Gemini extraction backend
Use Gemini for PDF text extraction and LiteLLM for tree consolidation simultaneously:
miner = SectionMiner(
    "paper.pdf",
    api_key="your-litellm-provider-key",
    model="openai/gpt-4o-mini",       # LiteLLM: merge consolidation
    extraction_backend="gemini",      # Gemini: PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)
Via environment variables
LITELLM_MODEL=groq/llama3-8b-8192
LITELLM_API_KEY=gsk_...
Note: get_openai_callback in _run tracks token usage via OpenAI's SDK internals. When using LiteLLM with non-OpenAI providers, token counts may be reported as zero. Cost tracking works reliably only with OpenAI-compatible backends.
⌨ CLI
SectionMiner installs a sectionminer command.
sectionminer --help
Extract section structure
# Full extraction with LLM consolidation (OpenAI)
sectionminer extract paper.pdf --tokens --pretty
# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty
# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty
# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty
# Use Gemini for extraction
sectionminer extract paper.pdf --extraction-backend gemini --gemini-api-key AIza... --pretty
# Use LiteLLM for consolidation
sectionminer extract paper.pdf --use-litellm --litellm-model groq/llama3-8b-8192 --litellm-api-key gsk_...
# Gemini extraction + LiteLLM consolidation
sectionminer extract paper.pdf \
--extraction-backend gemini --gemini-api-key AIza... \
--use-litellm --litellm-model openai/gpt-4o-mini
Get text of a specific section
sectionminer section-text paper.pdf "introduction"
# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost
# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only
# With LiteLLM
sectionminer section-text paper.pdf "Introdução" \
--use-litellm --litellm-model anthropic/claude-3-haiku-20240307
Note: --show-cost writes cost info to stderr, so it never pollutes the JSON output on stdout.
LiteLLM CLI flags (available in all subcommands)
| Flag | Description |
|---|---|
| --use-litellm | Enable LiteLLM backend (replaces OpenAI) |
| --litellm-model | Model with provider prefix (e.g. groq/llama3-8b-8192). Fallback: LITELLM_MODEL env var |
| --litellm-api-key | Provider API key. Fallback: LITELLM_API_KEY → OPENAI_API_KEY |
| --preset-section / --preset-sections | Optional section title filter; can be repeated |
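The key fallback order in the table (explicit flag, then LITELLM_API_KEY, then OPENAI_API_KEY) can be pictured with the following sketch. The function name `resolve_litellm_api_key` and its `env` parameter are hypothetical and exist only for illustration; the CLI's real resolution code may differ.

```python
import os

def resolve_litellm_api_key(cli_value=None, env=None):
    """Return the first non-empty key: CLI flag, LITELLM_API_KEY, OPENAI_API_KEY."""
    env = os.environ if env is None else env
    candidates = (
        cli_value,
        env.get("LITELLM_API_KEY"),
        env.get("OPENAI_API_KEY"),
    )
    for candidate in candidates:
        if candidate:  # skip both None and empty strings
            return candidate
    return None

# Passing a plain dict keeps the example independent of the real environment.
print(resolve_litellm_api_key(None, {"OPENAI_API_KEY": "sk-placeholder"}))
```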
🌐 Web UI
SectionMiner includes a FastAPI-powered dashboard with real-time PDF rendering, section cards, a detail modal, and social links.
# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload
# Or, if you prefer the module entrypoint
python -m sectionminer runserver --host 127.0.0.1 --port 8000 --reload
# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash
# Use LiteLLM for consolidation
sectionminer runserver --use-litellm --litellm-model groq/llama3-8b-8192
# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only
If the sectionminer runserver command is not available, your local installation is out of date. Run pip install -e . in the project, or pip install -U sectionminer in the active virtual environment.
Open in your browser: http://127.0.0.1:8000
Features:
- Upload any PDF and view extracted sections in real time
- Click a section to highlight its exact location in the PDF viewer
- Open "Ver detalhe" to read the full section text in a modal
- Dashboard shows: backend used, page count, section count, and extraction mode
- Preset sections can be passed from the UI, CLI, API, or Python code
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | / | Visual UI |
| POST | /api/extract | Upload PDF, returns structured JSON |
| GET | /api/files/{job_id} | Stream the uploaded PDF for rendering |
Sample POST /api/extract response
{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
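A response of this shape can be consumed with the standard library alone. The sketch below builds a small indented outline from an abbreviated copy of the sample; the `outline` helper is a hypothetical name, and the field names are taken from the sample rather than from a formal schema.

```python
import json

# Abbreviated copy of the sample response, keeping only the fields used below.
response = json.loads("""
{
  "filename": "paper.pdf",
  "metrics": {"pages": 10, "sections": 24, "total_tokens": 2250, "cost_usd": 0.00046},
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        {"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..."}
      ]
    }
  ]
}
""")

def outline(resp):
    """Return one line per section: indented title, length, and page indices."""
    lines = []
    for section in resp["sections"]:
        indent = "  " * (section["level"] - 1)
        length = section["end_char"] - section["start_char"]
        pages = sorted({loc["page"] for loc in section["locations"]})
        lines.append(f'{indent}{section["title"]} ({length} chars, page indices {pages})')
    return lines

print("\n".join(outline(response)))
```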
Frontend styles (Tailwind)
The dashboard uses Tailwind utilities. If you want to customize the stylesheet build pipeline, install the Node dev dependencies and run:
npm install
npm run build:css # one-off build
npm run dev:css # watch mode
The entry stylesheet lives at sectionminer/server/static/tailwind.css and compiles to sectionminer/server/static/styles.css (served by FastAPI).
📖 API Reference
SectionMiner(path, api_key, **kwargs)
miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                               # API key for LLM consolidation
    model="gpt-4o-mini",                            # Model name (OpenAI) or provider/model (LiteLLM)
    extraction_backend="pymupdf",                   # "pymupdf" | "gemini"
    gemini_api_key="...",                           # required if backend="gemini"
    gemini_model="gemini-2.5-flash-lite",           # optional, default model
    preset_sections=["Introdução", "Metodologia"],  # optional filter
    use_litellm=False,                              # set True to use LiteLLM instead of OpenAI
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | (none) | Path to the PDF file |
| api_key | str | (none) | API key for LLM consolidation |
| model | str | "gpt-4o-mini" | Model name. For LiteLLM, include provider prefix (e.g. "groq/llama3-8b-8192") |
| extraction_backend | str | "pymupdf" | "pymupdf" or "gemini" |
| gemini_api_key | str | None | Google Gemini API key |
| gemini_model | str | "gemini-2.0-flash" | Gemini model name |
| preset_sections | list[str] | None | If provided, return only sections matching these names |
| use_litellm | bool | False | Use LiteLLM instead of direct OpenAI for LLM consolidation |
Methods
| Method | Returns | Description |
|---|---|---|
| extract_structure(return_tokens=False) | dict or (dict, usage) | Full extraction pipeline. Returns section tree. |
| get_section_text(title) | str | Retrieve text of a section by title (fuzzy match). |
| get_section_start_and_end_chars(title) | (int, int) | Character offsets for a section in the full text. |
| get_full_text() | str | Complete linearized text of the PDF. |
| get_sections() | list[str] | List of all detected section titles. |
| close() | None | Release the open PDF file handle. |
Low-level pipeline methods
| Method | Description |
|---|---|
| extract_blocks() | Extract raw text spans from PDF |
| build_full_text() | Assemble linearized full text |
| build_sections() | Run heading detection heuristics |
Useful for debugging or custom pipelines.
🔌 Backends
PyMuPDF (default)
miner = SectionMiner("paper.pdf", api_key="sk-...")
Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.
Gemini
miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)
Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.
💡 Examples
Basic extraction
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()
Extract only specific sections (preset filter)
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    miner.extract_structure()
    print(miner.get_section_text("Introdução"))
    print(miner.get_section_text("Metodologia"))
finally:
    miner.close()
LiteLLM — swap provider without changing code
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="gsk_...",  # Groq API key
    model="groq/llama3-8b-8192",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Gemini extraction + LiteLLM consolidation
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",              # LiteLLM provider key
    model="openai/gpt-4o-mini",    # LiteLLM model
    extraction_backend="gemini",   # Gemini for PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Preset sections with Gemini backend
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
Slice text by character offsets
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    # Offsets may be None when the section could not be located
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()
💰 Cost Reference
Measured locally on 2026-03-21 using gpt-4o-mini:
| File | Size | Pages | Tokens | Cost |
|---|---|---|---|---|
| artigo_1.pdf | 0.74 MB | 21 | 2,297 | $0.000475 |
| artigo_2.pdf | 0.04 MB | 4 | 356 | $0.000060 |
Section text retrieval after extraction incurs no additional LLM cost because it uses local character offsets. Using preset_sections reduces token usage further by limiting LLM output to matched sections only.
Reproduce with:
sectionminer extract paper.pdf --show-cost --pretty
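For a rough budget estimate before running a batch, cost can be approximated from token counts and per-million-token prices. The sketch below is generic arithmetic, not SectionMiner's cost logic: `estimate_cost_usd` is a hypothetical helper, and the prices in the example call are placeholders that should be replaced with your provider's current rates.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_price_per_million, output_price_per_million):
    """Approximate cost in USD from token counts and per-million-token prices."""
    total = (prompt_tokens * input_price_per_million
             + completion_tokens * output_price_per_million)
    return total / 1_000_000

# Placeholder prices (USD per million tokens); check your provider's pricing.
estimate = estimate_cost_usd(1000, 1000,
                             input_price_per_million=0.15,
                             output_price_per_million=0.60)
print(f"${estimate:.6f}")
```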
🗂 Project Structure
SectionMiner/
├── sectionminer/
│ ├── __init__.py # Public API
│ ├── miner.py # SectionMiner class
│ ├── client.py # LLM client + tree merge (OpenAI / LiteLLM)
│ ├── prompts.py # Consolidation prompt
│ └── server/ # FastAPI + UI (routes, static, templates)
├── examples/
│ ├── basic_usage.py
│ └── api_smoke_test.py
├── files/ # Sample PDFs
├── test.py # PyMuPDF + OpenAI pipeline example
├── test_litellm.py # LiteLLM pipeline example
├── test_gemini_litellm.py # Gemini extraction + LiteLLM consolidation example
└── requirements.txt
🐛 Troubleshooting
"Invalid control character" when processing PDF
The PDF contains invalid control characters that break JSON serialization. The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.
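If you need to pre-clean text yourself, for instance in a custom pipeline, one common approach is to strip ASCII control characters other than tab, newline, and carriage return before parsing or serializing. The sketch below shows that idea with the standard library; `strip_control_chars` is a hypothetical helper and is not necessarily how SectionMiner sanitizes its input.

```python
import json
import re

# Control characters except tab (\x09), newline (\x0a) and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def strip_control_chars(text):
    """Remove control characters that strict JSON parsers reject inside strings."""
    return _CONTROL_CHARS.sub("", text)

# A JSON document with a raw vertical-tab character inside a string value.
raw = '{"title": "Intro\x0bduction"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("before cleaning:", exc.msg)

print("after cleaning:", json.loads(strip_control_chars(raw)))
```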
Sections are fragmented or broken
- Review _is_noise_heading and _looks_like_heading in sectionminer/miner.py
- Adjust the threshold in _detect_threshold for your PDF's font pattern
- Two-column layouts, intrusive footers, and poor OCR quality increase detection errors
Section not found by title
- Try a variation without accents or in lowercase (search normalizes text)
- Inspect available titles with miner.get_sections()
- If using preset_sections, confirm the section actually exists in the PDF. Presets with no match are silently omitted, never fabricated
Preset section returns None text
The section was matched by the LLM but start_char is null, meaning the title in section_structures differs from what the LLM returned. Debug with:
miner.extract_structure()
for s in miner.section_structures:
    print(repr(s["title"]), s["start"])
Use the exact title shown there (or a close variation) in preset_sections.
LiteLLM: "LLM Provider NOT provided"
You passed a model name without the provider prefix (e.g. "gpt-4o-mini" instead of "openai/gpt-4o-mini"). LiteLLM requires the prefix to identify the provider. Always use provider/model-name format.
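A small guard in your own code can catch a missing prefix before the request reaches LiteLLM. Both helpers below are hypothetical and not part of SectionMiner; defaulting to the "openai" provider is an assumption for this example.

```python
def has_provider_prefix(model):
    """True when the name looks like 'provider/model-name'."""
    provider, sep, name = model.partition("/")
    return bool(sep and provider and name)

def ensure_provider_prefix(model, default_provider="openai"):
    """Prepend a provider when none is given (the default is an assumption)."""
    return model if has_provider_prefix(model) else f"{default_provider}/{model}"

print(ensure_provider_prefix("gpt-4o-mini"))
print(ensure_provider_prefix("groq/llama3-8b-8192"))
```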
LiteLLM: token usage shows zeros
get_openai_callback only captures usage from OpenAI-compatible calls. With non-OpenAI providers via LiteLLM, token counts will report as zero. This is a known limitation — the extraction itself works correctly.
OpenAI key error
- Confirm OPENAI_API_KEY is set in the same environment as your script
- If using .env, ensure it's in the project root
🗺 Roadmap
- Automated tests for detect_headings, build_sections, get_section_text
- Expose heuristic parameters via config (threshold, noise filters)
- Native LiteLLM token/cost tracking (replace get_openai_callback)
- LiteLLM support: use any provider for LLM consolidation
- CLI: sectionminer extract file.pdf --output out.json
- Heuristic-only mode (no LLM, fully offline)
- Improved merge: keeps only valid sections/subsections without broken fragments
- Web UI with PDF viewer and section highlighting
- Preset sections filter: extract only named sections with flexible normalised matching
📄 License
Made with ♥ for researchers who'd rather spend time reading papers than parsing them.