PDF processing pipeline for academic theses.
ytcc-pipeline
A synchronous Python library that converts an academic-thesis PDF into a structured JSON document plus a tar bundle of cropped figures, tables, and formulas. Ships an optional FastAPI wrapper for service deployments and Docker images for four deployment profiles.
What it does
Given a PDF, the pipeline runs eight stages in fixed order:
render -> metadata -> layout -> blocks -> table -> formula -> reference -> bundle
- render -- decode every page to an image (pdf-oxide, cfg.render_workers processes).
- metadata -- sha256, byte size, XMP fields. Cheap; runs before model load so I/O failures surface early.
- layout -- PP-DocLayoutV3 emits one LayoutDetection per detected block with label, bbox, confidence, reading_order.
- blocks -- per-page dispatcher routes each detection by Route. Text comes from the PDF text layer (digital-born) or RapidOCR over the rendered crop (scanned). Figures, tables, and formulas are cropped and saved.
- table (opt-in) -- RapidTable SLANet+ recovers cell grids for TABLE blocks.
- formula (default on) -- PP-FormulaNet-L recovers LaTeX from every FORMULA block's crop.
- reference (opt-in) -- batched POST to an externally-managed GROBID server enriches REFERENCE blocks with parsed Reference records.
- bundle -- pack document.json and every saved crop into a single uncompressed tar.
Disabled stages never load their model. Each stage emits one INFO log line on completion; skips short-circuit with skipped reason=....
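The fixed ordering and the skip behaviour can be pictured with a small sketch. Only the stage names come from the list above; `run_stages`, the `enabled` mapping, the handler callables, and the exact log wording are illustrative assumptions, not the library's internals.

```python
import logging

log = logging.getLogger("pipeline_sketch")

# Stage names in the fixed order listed above.
STAGES = ("render", "metadata", "layout", "blocks",
          "table", "formula", "reference", "bundle")


def run_stages(enabled, handlers):
    """Run stages in fixed order and return the names that actually ran.

    `enabled` maps stage name -> bool (a missing key means enabled) and
    `handlers` maps stage name -> zero-argument callable. Both are
    hypothetical stand-ins for the real orchestrator's state.
    """
    ran = []
    for name in STAGES:
        if not enabled.get(name, True):
            # A disabled stage is skipped before its handler (and therefore
            # its model) is ever touched.
            log.info("%s skipped reason=disabled", name)
            continue
        handlers[name]()
        log.info("%s done", name)
        ran.append(name)
    return ran


calls = []
handlers = {name: (lambda n=name: calls.append(n)) for name in STAGES}
ran = run_stages({"table": False, "reference": False}, handlers)
```

With table and reference switched off, `ran` holds the remaining six stages in their original order.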
Features
- One public entry point. process_pdf(pdf_path, language=..., ...) -- everything else is implementation detail.
- Auto-detected digital-born vs scanned. The orchestrator probes the PDF's text layer once and picks the path. Override with digital_born=True/False.
- Two text-extraction paths, asymmetric workers. digital_born_workers (cheap pdf_oxide processes) and ocr_workers (RapidOCR + CUDA, ~2 GiB VRAM each) are tuned independently.
- Bucketed formula batching. Sorts crops by bbox area into small/medium/large buckets with per-bucket max_new_tokens caps. Measured 1.63x speedup over flat batching.
- fp16 + cv2 fast preproc on layout. ~1.7x and ~2.3x isolated layout-stage speedups on the SafeTensors backend.
- Auto-DPI on the digital-born path. Renders at 150 DPI for digital-born (layout downsamples to 800x800 anyway, pdf_oxide is resolution-independent), 300 DPI for scanned. Halves render wall.
- Injectable resident models. Pass pre-loaded LayoutAnalyzer, FormulaRecognizer, and TableEngine to skip the ~5s + ~3s + ~1s reload between calls. The FastAPI service does this in lifespan.
- Frozen-dataclass schema. Document / Page / Block / Cell / Reference are immutable; dataclasses.replace is the only rewrite path. JSON serialisation is stable across runs (modulo uuid4 crop filenames).
- Streaming-first tar bundle. document.json is the first archive member; consumers parse the index before the image bytes arrive. The FastAPI service ships it directly via FileResponse.
- Three-layer config. config.toml (service-wide), YTCC_* env vars (per-field overrides), PipelineConfig(...) keyword args (per-call). Unknown TOML keys raise ValueError.
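To make the bucketed formula batching concrete, here is a minimal sketch of grouping crops by bbox area. The `Crop` type, the area thresholds, and the token caps are all made-up values chosen for illustration; the library's real cut-offs are not stated in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    crop_id: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels


# Hypothetical (name, upper area bound in px^2, max_new_tokens cap) triples.
BUCKETS = (
    ("small", 20_000, 64),
    ("medium", 100_000, 256),
    ("large", float("inf"), 1024),
)


def bbox_area(bbox):
    x0, y0, x1, y1 = bbox
    return max(0, x1 - x0) * max(0, y1 - y0)


def bucket_crops(crops):
    """Group crops by bbox area; return {bucket: (max_new_tokens, [crops])}."""
    out = {name: (cap, []) for name, _, cap in BUCKETS}
    for crop in sorted(crops, key=lambda c: bbox_area(c.bbox)):
        area = bbox_area(crop.bbox)
        for name, upper, _ in BUCKETS:
            if area <= upper:
                out[name][1].append(crop)
                break
    return out


crops = [
    Crop("big", (0, 0, 800, 400)),   # 320000 px^2
    Crop("tiny", (0, 0, 100, 50)),   # 5000 px^2
    Crop("mid", (0, 0, 300, 200)),   # 60000 px^2
]
buckets = bucket_crops(crops)
```

The point of the per-bucket cap is that small crops stop decoding early instead of waiting on the largest formula in a mixed batch.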
Requirements
- Python 3.14+.
- A CUDA GPU for any non-trivial throughput. CPU works but is neither tested nor recommended.
- An externally-managed GROBID server when references_enabled=true (not bundled).
The project pins CUDA-enabled torch and onnxruntime-gpu. Replace them with CPU wheels if you target CPU-only environments; nothing else in the codebase assumes CUDA at import time.
Installation
pip install ytcc-pipeline # library only
pip install ytcc-pipeline[api] # + FastAPI service
pip install ytcc-pipeline[dev] # + tests + lint + typing + benchmarks (includes [api])
Library quickstart
from ytcc_pipeline import process_pdf
# paper.tar: contains document.json + images/*
bundle_path = process_pdf("paper.pdf", language="en")
process_pdf is synchronous and blocking -- internally it uses multiprocessing pools with the spawn start method, not asyncio. Call it from a thread (asyncio.to_thread(process_pdf, ...)) if you need to integrate with an event loop.
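A minimal sketch of the event-loop integration mentioned above. `blocking_process` is a stand-in for process_pdf so the snippet runs without the library installed; in real code you would pass process_pdf itself to `asyncio.to_thread`.

```python
import asyncio
import time


def blocking_process(pdf_path: str) -> str:
    """Stand-in for process_pdf: blocks, then returns the bundle path."""
    time.sleep(0.05)
    return pdf_path.removesuffix(".pdf") + ".tar"


async def main() -> str:
    # The blocking call runs in a worker thread, so the event loop stays
    # free to service other tasks while it waits.
    return await asyncio.to_thread(blocking_process, "paper.pdf")


bundle_path = asyncio.run(main())
```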
Service quickstart
pip install ytcc-pipeline[api]
uvicorn ytcc_pipeline.api.app:app --host 0.0.0.0 --port 8000
The service reads config.toml at startup -- override the path with YTCC_CONFIG=/path/to/config.toml. One process, one GPU, one PDF at a time; concurrency is serialised on an asyncio.Lock.
curl -X POST http://localhost:8000/process \
-F "pdf=@paper.pdf" \
-F "language=en" \
-o paper.tar
GET /health reports liveness + readiness ({"status":"ok","model_loaded":true}). POST /process accepts pdf (file), language (ISO 639-1), and optional digital_born (true/false). The response body is the tar bundle; X-Processing-Time carries the server-side wall in seconds.
[!CAUTION] Run one uvicorn worker per GPU. --workers N>1 multi-loads every resident model and contends for VRAM.
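The one-PDF-at-a-time behaviour can be sketched with a lock around the blocking call. This is an illustration of the pattern only: `handle_request` and `blocking_job` are hypothetical names, and the counters exist purely to show that no two jobs overlap.

```python
import asyncio
import time

active = 0
max_active = 0


def blocking_job(name: str) -> str:
    """Stand-in for the synchronous pipeline call."""
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    time.sleep(0.02)
    active -= 1
    return name + ".tar"


async def handle_request(lock: asyncio.Lock, name: str) -> str:
    # Requests queue on the lock; only the holder runs the pipeline.
    async with lock:
        return await asyncio.to_thread(blocking_job, name)


async def main():
    lock = asyncio.Lock()  # one lock per process, i.e. per GPU
    return await asyncio.gather(
        *(handle_request(lock, n) for n in ("a", "b", "c"))
    )


results = asyncio.run(main())
```

Because the lock lives inside one process, extra uvicorn workers would each get their own lock and their own copy of the models, which is why scaling is done with replicas instead.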
Docker
Pre-built images for four deployment profiles are published to GitHub Container Registry, each available as a slim variant (~6 GB, models fetched on first request) or a baked variant (~11 GB, models pre-downloaded). Pin to a versioned tag in production.
| Tag | Profile | Use case |
|---|---|---|
| :scanned[-baked] | Mixed (digital-born + scanned) | Default; loads OCR engines |
| :digital-born[-baked] | scanned_enabled=false | Rejects scanned PDFs with HTTP 415; saves ~12 GiB VRAM |
| :digital-born-a100[-baked] | A100-tuned | Larger batches, formula_torch_compile=true |
| :text-extract[-baked] | 48 GB VRAM-tuned, formula + table OFF | Text + image + references only; ~10x faster on math-heavy theses |
Each profile ships a compose file with a GROBID sidecar:
docker compose -f docker/compose.scanned.yml up -d
curl -X POST http://localhost:8000/process \
-F "pdf=@paper.pdf" \
-F "language=en" \
-o paper.tar
Override config via env vars (every PipelineConfig field is reachable via YTCC_<UPPERCASE_FIELD>) or by mounting your own TOML over /app/config.toml. See docker/README.md for the full image matrix, build instructions, and troubleshooting.
Output format
The output is one uncompressed tar:
paper.tar
├── document.json                    # the schema document
└── images/                          # cropped block images
    ├── 0001-image-{uuid}.png
    ├── 0014-formula-{uuid}.png
    ├── 0014-table-{uuid}.png
    └── 0027-formula-MISS-{uuid}.png
document.json is the first archive member -- consumers can stream-parse it before image bytes arrive. Image filenames sort by page (1-based, zero-padded), then by layout label, then by random UUID. The -MISS- marker identifies fallback crops written when primary extraction failed.
import json
import tarfile

with tarfile.open("paper.tar") as tf:
    doc = json.loads(tf.extractfile("document.json").read())

for page in doc["pages"]:
    for block in page["blocks"]:
        print(block["reading_order"], block["type"], (block["text"] or "")[:60])
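Because document.json is the first archive member, a consumer reading from a pipe or HTTP body can stop after one member. The sketch below uses the standard library's stream mode (`"r|"`); `read_index` and `make_demo_bundle` are hypothetical helpers, and the demo bundle is built in memory so the snippet runs without a real PDF.

```python
import io
import json
import tarfile


def read_index(stream) -> dict:
    """Return document.json from a non-seekable tar stream.

    Mode "r|" reads the archive strictly front to back, so only the first
    member is consumed before this function returns.
    """
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        first = tf.next()
        if first is None or first.name != "document.json":
            raise ValueError("document.json is not the first archive member")
        return json.loads(tf.extractfile(first).read())


def make_demo_bundle() -> bytes:
    """Build a tiny in-memory bundle with the index written first."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in (
            ("document.json", json.dumps({"pages": []}).encode()),
            ("images/0001-image-demo.png", b"not a real png"),
        ):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


doc = read_index(io.BytesIO(make_demo_bundle()))
```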
The schema mirrors the Document / Page / Block / Cell / Reference dataclasses. bbox floats are rounded to two decimals; pixel coordinates are in the effective render DPI (150 for digital-born, 300 for scanned), origin top-left.
Per-block invariants:
| Block kind | text | image_path | miss |
|---|---|---|---|
| TEXT, success | extracted text | null | false |
| TEXT, MISS | null | crop (if bundled) or null | true |
| REFERENCE, success | extracted text | null | false |
| REFERENCE, MISS | null | crop (if bundled) or null | true |
| IMAGE | null | crop path | false |
| FORMULA, success | LaTeX | null (crop deleted) | false |
| FORMULA, MISS | null | -MISS- marked crop | true |
| TABLE, structured | null | crop path + cells / n_rows / n_cols set | false |
| TABLE, image-only fallback | null | crop path, cells=null | false |
miss=True always means the primary representation is unavailable. TABLE blocks never carry miss=True -- a structure failure degrades silently to image-only.
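A short sketch of how a consumer might list MISS blocks for follow-up. The keys (pages, blocks, reading_order, type, text, image_path, miss) are the ones used elsewhere in this README; the uppercase type values and the images/ path prefix in the sample data are assumptions, and `collect_misses` is a hypothetical helper.

```python
def collect_misses(doc: dict) -> list:
    """Return (page_index, reading_order, type, image_path) for MISS blocks."""
    misses = []
    for page_index, page in enumerate(doc["pages"]):
        for block in page["blocks"]:
            if block.get("miss"):
                misses.append((page_index, block["reading_order"],
                               block["type"], block.get("image_path")))
    return misses


# Hand-written sample following the invariants table; values are made up.
doc = {"pages": [{"blocks": [
    {"reading_order": 0, "type": "TEXT", "text": "Abstract",
     "image_path": None, "miss": False},
    {"reading_order": 1, "type": "FORMULA", "text": None,
     "image_path": "images/0001-formula-MISS-demo.png", "miss": True},
    {"reading_order": 2, "type": "TABLE", "text": None,
     "image_path": "images/0001-table-demo.png", "miss": False},
]}]}
misses = collect_misses(doc)
```

Note that the image-only TABLE block is not reported, matching the rule that TABLE blocks never carry the miss flag.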
Full schema reference: docs/output-format.md.
Configuration
Three layers, broadest to narrowest:
- config.toml at the project root -- single source of truth for the service and benchmarks. Loaded by load_service_config().
- YTCC_* environment variables -- per-field overrides via PipelineConfig.from_env().
- PipelineConfig(...) keyword arguments -- explicit, per-call.
[!IMPORTANT] The three layers don't compose automatically. The TOML and env vars are read only by load_service_config() and PipelineConfig.from_env(). Use dataclasses.replace(loaded.pipeline, ...) to layer overrides on top of a TOML-loaded config.
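The layering step can be illustrated with a stand-in dataclass. `PipelineConfigSketch` is not the real PipelineConfig: it borrows three field names mentioned in this README, and its default values are invented for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfigSketch:
    """Stand-in with a few field names mentioned in this README."""
    render_workers: int = 4        # default values here are made up
    table_enabled: bool = False
    references_enabled: bool = False


# Pretend this instance came from the TOML loader.
loaded = PipelineConfigSketch(render_workers=8)

# replace() builds a new frozen instance; `loaded` itself is untouched.
per_call = replace(loaded, table_enabled=True)
```

Fields not named in the `replace` call keep the loaded values, which is what makes it suitable for per-call overrides.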
TOML resolution walks: path arg -> YTCC_CONFIG env -> ./config.toml -> installed-package config.toml -> dataclass defaults. Unknown TOML keys raise ValueError -- typos don't pass silently.
Every PipelineConfig field has a matching YTCC_<UPPERCASE_NAME> env var. Bool parser accepts 1/true/yes/on (case-insensitive). Comma-list fields strip whitespace and drop empties.
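The env-var conventions above can be sketched as three small helpers. These are illustrative re-implementations, not the library's parser: in particular, treating every unrecognised bool value as false is an assumption (the real parser may reject it instead).

```python
TRUE_TOKENS = {"1", "true", "yes", "on"}


def env_name(field: str) -> str:
    """Map a config field name to its YTCC_<UPPERCASE_NAME> variable."""
    return "YTCC_" + field.upper()


def parse_bool(raw: str) -> bool:
    """True for 1/true/yes/on in any case; anything else is False here."""
    return raw.strip().lower() in TRUE_TOKENS


def parse_list(raw: str) -> list:
    """Split on commas, strip whitespace, and drop empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]
```

For example, `parse_list(" en, tr ,,ar ")` yields three clean items, and `env_name("table_enabled")` gives the variable to export.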
Full knob reference and tuning guidance: docs/configuration.md and docs/performance.md.
Digital-born vs scanned
The two paths share rendering, layout, formula recognition, table extraction, and bundling. They diverge inside the blocks stage and in the default render DPI:
| | Digital-born | Scanned |
|---|---|---|
| Text source | pdf_oxide text layer | RapidOCR over rendered crops |
| Render DPI | 150 (default) | 300 (default) |
| Per-worker cost | ~10 MiB RSS (one PdfDocument handle) | ~2 GiB VRAM (one RapidOCR engine + CUDA context) |
| Typical wall on RTX 3090 | ~15s for 150 pages | ~5-10x that |
Auto-detect samples 5 pages and checks per-page non-whitespace character counts; override with digital_born=True/False. Set scanned_enabled=false to reject scanned PDFs entirely (saves ~12 GiB VRAM; FastAPI returns HTTP 415).
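A rough sketch of what such a probe could look like. Only the "sample 5 pages, count non-whitespace characters" idea comes from the text above; the even spacing of samples, the `min_chars` threshold, and the majority vote are assumptions made for illustration.

```python
def sample_indices(n_pages: int, k: int = 5) -> list:
    """Pick up to k page indices spread evenly across the document."""
    if n_pages <= k:
        return list(range(n_pages))
    step = (n_pages - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def looks_digital_born(page_texts, min_chars: int = 50) -> bool:
    """Majority vote over sampled pages on non-whitespace character count.

    `page_texts` holds the text-layer string of every page; `min_chars`
    and the majority rule are assumptions made for this sketch.
    """
    picked = [page_texts[i] for i in sample_indices(len(page_texts))]
    if not picked:
        return False
    hits = sum(
        1 for text in picked
        if sum(not ch.isspace() for ch in text) >= min_chars
    )
    return hits * 2 > len(picked)
```

A scanned PDF typically has an empty or near-empty text layer, so every sampled page falls under the threshold and the scanned path is chosen.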
References (GROBID)
The reference stage requires an externally-managed GROBID server -- the pipeline never spawns the JVM. Start it separately:
docker run --rm -p 8070:8070 grobid/grobid:0.9.0
Or use the bundled helper which generates a citation-only config (drops startup from ~10s to ~3s, saves ~1 GiB RSS):
scripts/grobid_start.sh
GROBID_PORT=9090 scripts/grobid_start.sh
scripts/grobid_stop.sh
When enabled, run_reference_stage does one batched POST to /api/processCitationList per PDF. Failures (server unreachable, timeout, HTTP error, malformed XML) are logged at WARNING and the page list flows through unchanged -- references are an enrichment, not a hard requirement. The raw reference string always survives on Block.text.
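The degrade-on-failure behaviour can be sketched without a running server by injecting the network call. `enrich_references`, the `post_citations` callable, and the `reference` key are hypothetical names for this example; the real stage works on Block dataclasses rather than dicts.

```python
import logging

log = logging.getLogger("reference_sketch")


def enrich_references(blocks, post_citations):
    """Attach parsed records to reference blocks, or return them unchanged.

    `post_citations` is a hypothetical callable that sends every raw
    citation string in one request and returns one parsed record per input.
    """
    raw = [b["text"] for b in blocks]
    try:
        parsed = post_citations(raw)
        if len(parsed) != len(raw):
            raise ValueError("record count does not match citation count")
    except Exception as exc:  # unreachable server, timeout, HTTP error, bad XML
        log.warning("reference stage degraded: %s", exc)
        return blocks
    # The raw string stays on each block; the parsed record is added next to it.
    return [{**b, "reference": rec} for b, rec in zip(blocks, parsed)]


def offline(_citations):
    raise ConnectionError("GROBID unreachable")


blocks = [{"text": "Doe, J. (2020). A thesis."}]
unchanged = enrich_references(blocks, offline)
enriched = enrich_references(blocks, lambda cs: [{"raw": c} for c in cs])
```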
Design principles
- Synchronous by default. process_pdf is blocking. Async is layered on top in the FastAPI service via asyncio.to_thread. No async leakage into the core pipeline.
- multiprocessing.spawn, never fork. The parent process may hold a CUDA context; fork corrupts it. Workers re-import their module from scratch, which is why pdf_oxide is imported inside the digital-born worker entry rather than at module top.
- One process, one GPU, one PDF at a time. Concurrency is serialised at the asyncio.Lock in the service layer; the pipeline itself is sequential.
- Resident models are injectable, not global. LayoutAnalyzer, FormulaRecognizer, and TableEngine are constructor-injected with idempotent close(). The FastAPI lifespan loads them once; library callers either inject manually or let the orchestrator own the per-call lifecycle.
- Opt-in opt-in opt-in. Heavy stages (table_enabled, references_enabled) and slow tradeoffs (formula_torch_compile, layout_fp16) are off by default. Defaults are library-safe; production callers flip them on explicitly.
- Streaming-first output. Tar over zip because tar writes sequentially without seeking back for a central directory -- the bundle can be a pipe, socket, or HTTP response body. document.json is written first so consumers parse the index before image bytes arrive.
- No silent failures. Unknown TOML keys raise. MISS extractions are flagged on the block (miss=true) and preserve reading order + bbox. GROBID failures degrade the reference stage but never fail the pipeline.
Limitations
- Single-GPU, single-PDF concurrency. The service serialises on a lock. Throughput scales with replicas, not workers.
- Python 3.14+ only. The project uses PEP 649/749 deferred-annotation semantics and modern stdlib features. No backport path.
- CPU is not a supported path. It runs but is untested and unoptimised. Production deployments need CUDA.
- fork not supported. Mixing this library with the multiprocessing fork start method corrupts CUDA contexts.
- Tuned for academic theses. The PP-DocLayoutV3 label set and routing rules target academic documents (abstracts, references, formulas, tables). General-purpose PDFs may produce surprising layouts.
- GROBID is external. The reference stage requires a separately-managed GROBID server. The pipeline never bundles or starts the JVM.
- Metadata output may have unexpected XMP keys. pdf_info comes directly from the PDF's XMP block, cleaned of UTF-16 BOMs and null bytes -- but anything else (encrypted PDFs, IPTC, RDF) is out of scope.
- Bundle filenames don't sort by reading order. document.json is the authoritative reading order; image filenames sort by page + label + UUID.
Benchmarks
Knob-sweep benchmarks live in the benchmarks/ package. Each sweep varies one PipelineConfig field across a range of values and records per-stage wall, process-tree CPU/RSS, device VRAM, and quality metrics (block counts, MISS counts, formulas recovered, references parsed). Standalone scripts cover cold-start, sustained load, API concurrency, GROBID payload scaling, and torch.compile amortisation.
python benchmarks/run_all.py # every sweep (cached CSVs skipped)
python benchmarks/run_all.py --only formula # just the sweeps matching "formula"
python -m benchmarks.plot # generate plots from existing CSVs
Committed reference results (benchmarks/results/summary.md, sweeps/*.{csv,md}, plots/*.png) live in git. Full catalogue: benchmarks/README.md.
Documentation
| File | Topic |
|---|---|
| docs/quickstart.md | Install, first run, library + service modes |
| docs/architecture.md | Stage-by-stage pipeline, module layout, resource lifecycle |
| docs/output-format.md | Tar layout, document.json schema, MISS semantics |
| docs/configuration.md | PipelineConfig knobs, TOML, env-var overrides |
| docs/stages.md | Per-stage behaviour, knobs, skip / no-op semantics |
| docs/performance.md | Recommended config, per-knob impact, VRAM budget, tuning checklist |
| docs/digital-born-vs-scanned.md | Auto-detect heuristic, when to override, scanned-only deployments |
| docs/api-service.md | FastAPI contract, lifespan, concurrency model |
| docs/references.md | GROBID setup, parsed Reference shape, failure modes |
| docs/gotchas.md | Common pitfalls, MISS handling, OOM recovery, log conventions |
Samples
Six PDFs under samples/ cover English / Turkish / Arabic, digital-born and scanned, good and bad quality. Use 904599.pdf (English, digital-born, good) as the first sanity check -- it exercises every stage except OCR.
License
MIT. See LICENSE.