
PDF processing pipeline for academic theses.

ytcc-pipeline

A synchronous Python library that converts an academic-thesis PDF into a structured JSON document plus a tar bundle of cropped figures, tables, and formulas. Ships an optional FastAPI wrapper for service deployments and Docker images for four deployment profiles.

What it does

Given a PDF, the pipeline runs eight stages in fixed order:

render -> metadata -> layout -> blocks -> table -> formula -> reference -> bundle
  1. render -- decode every page to an image (pdf-oxide, cfg.render_workers processes).
  2. metadata -- sha256, byte size, XMP fields. Cheap; runs before model load so I/O failures surface early.
  3. layout -- PP-DocLayoutV3 emits one LayoutDetection per detected block with label, bbox, confidence, reading_order.
  4. blocks -- per-page dispatcher routes each detection by Route. Text comes from the PDF text layer (digital-born) or RapidOCR over the rendered crop (scanned). Figures, tables, and formulas are cropped and saved.
  5. table (opt-in) -- RapidTable SLANet+ recovers cell grids for TABLE blocks.
  6. formula (default on) -- PP-FormulaNet-L recovers LaTeX from every FORMULA block's crop.
  7. reference (opt-in) -- batched POST to an externally-managed GROBID server enriches REFERENCE blocks with parsed Reference records.
  8. bundle -- pack document.json and every saved crop into a single uncompressed tar.

Disabled stages never load their model. Each stage emits one INFO log line on completion; a skipped stage short-circuits and logs skipped reason=... .
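
The fixed order and the skip behaviour described above can be pictured with a small runner. This is a sketch under stated assumptions, not the library's orchestrator: `Stage`, `run_stages`, `mark`, and the log wording are hypothetical, and only the stage names come from the list above.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("pipeline_sketch")


@dataclass(frozen=True)
class Stage:
    """Hypothetical stage record: a name, an on/off switch, and the work to do."""
    name: str
    enabled: bool
    run: Callable[[dict], dict]


def run_stages(stages: list[Stage], state: dict) -> dict:
    """Run stages in list order; a disabled stage never calls its function."""
    for stage in stages:
        if not stage.enabled:
            # Assumed log wording; the real message format is not shown here.
            log.info("%s skipped reason=disabled", stage.name)
            continue
        state = stage.run(state)
        log.info("%s done", stage.name)
    return state


def mark(name: str) -> Callable[[dict], dict]:
    """Build a stand-in stage function that records that it ran."""
    return lambda state: {**state, "ran": [*state["ran"], name]}


stages = [
    Stage("render", True, mark("render")),
    Stage("table", False, mark("table")),    # opt-in stage left off
    Stage("formula", True, mark("formula")),
    Stage("bundle", True, mark("bundle")),
]
result = run_stages(stages, {"ran": []})
```

Because the disabled stage's function is never called, any model it would load is never touched, which is the property the paragraph above describes.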

Features

  • One public entry point. process_pdf(pdf_path, language=..., ...) -- everything else is implementation detail.
  • Auto-detected digital-born vs scanned. The orchestrator probes the PDF's text layer once and picks the path. Override with digital_born=True/False.
  • Two text-extraction paths, asymmetric workers. digital_born_workers (cheap pdf_oxide processes) and ocr_workers (RapidOCR + CUDA, ~2 GiB VRAM each) are tuned independently.
  • Bucketed formula batching. Sorts crops by bbox area into small/medium/large buckets with per-bucket max_new_tokens caps. Measured 1.63x speedup over flat batching.
  • fp16 + cv2 fast preproc on layout. ~1.7x and ~2.3x isolated layout-stage speedups on the SafeTensors backend.
  • Auto-DPI on the digital-born path. Renders at 150 DPI for digital-born (layout downsamples to 800x800 anyway, pdf_oxide is resolution-independent), 300 DPI for scanned. Halves render wall.
  • Injectable resident models. Pass pre-loaded LayoutAnalyzer, FormulaRecognizer, and TableEngine to skip the ~5s + ~3s + ~1s reload between calls. The FastAPI service does this in lifespan.
  • Frozen-dataclass schema. Document / Page / Block / Cell / Reference are immutable; dataclasses.replace is the only rewrite path. JSON serialisation is stable across runs (modulo uuid4 crop filenames).
  • Streaming-first tar bundle. document.json is the first archive member; consumers parse the index before the image bytes arrive. The FastAPI service ships it directly via FileResponse.
  • Three-layer config. config.toml (service-wide), YTCC_* env vars (per-field overrides), PipelineConfig(...) keyword args (per-call). Unknown TOML keys raise ValueError.
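
To illustrate the bucketed formula batching mentioned in the list above, here is a minimal sketch of grouping crops by bbox area so that small crops are decoded with a small token budget. The area thresholds, the token caps, the batch size, and the function names are all hypothetical; the library's real values and internals are not shown in this README.

```python
from collections import defaultdict

# Hypothetical (name, max area in px^2, max_new_tokens) triples.
BUCKETS = [
    ("small", 20_000, 64),
    ("medium", 100_000, 256),
    ("large", float("inf"), 1024),
]


def bbox_area(bbox):
    """Area of an (x0, y0, x1, y1) box; the coordinate order is an assumption."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def bucket_crops(bboxes, batch_size=2):
    """Group crop indices by area bucket, then split each bucket into batches.

    Returns a list of (bucket_name, max_new_tokens, [crop indices]).
    """
    grouped = defaultdict(list)
    # Visit crops from smallest to largest so each bucket stays area-sorted.
    for idx in sorted(range(len(bboxes)), key=lambda i: bbox_area(bboxes[i])):
        area = bbox_area(bboxes[idx])
        name = next(n for n, limit, _ in BUCKETS if area <= limit)
        grouped[name].append(idx)

    batches = []
    for name, _, cap in BUCKETS:
        ids = grouped.get(name, [])
        for start in range(0, len(ids), batch_size):
            batches.append((name, cap, ids[start:start + batch_size]))
    return batches


boxes = [(0, 0, 400, 300), (0, 0, 100, 50), (0, 0, 200, 200), (0, 0, 120, 60)]
batches = bucket_crops(boxes)
```

With flat batching, one large formula in a batch forces the whole batch to decode up to the largest token budget; bucketing keeps similar-sized crops together so the cap can be set per batch.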

Requirements

  • Python 3.14+.
  • A CUDA GPU for any non-trivial throughput. CPU works but is neither tested nor recommended.
  • An externally-managed GROBID server when references_enabled=true (not bundled).

The project pins CUDA-enabled torch and onnxruntime-gpu. Replace them with CPU wheels if you target CPU-only environments; nothing else in the codebase assumes CUDA at import time.

Installation

pip install ytcc-pipeline         # library only
pip install "ytcc-pipeline[api]"  # + FastAPI service (quoted so shells such as zsh do not glob the brackets)
pip install "ytcc-pipeline[dev]"  # + tests + lint + typing + benchmarks (includes [api])

Library quickstart

from ytcc_pipeline import process_pdf

# paper.tar: contains document.json + images/*
bundle_path = process_pdf("paper.pdf", language="en")

process_pdf is synchronous and blocking -- internally it uses multiprocessing pools with the spawn start method, not asyncio. To integrate with an event loop, call it from a thread: asyncio.to_thread(process_pdf, ...).
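
The sketch below shows the `asyncio.to_thread` pattern with a stand-in blocking function, so it runs without the library installed. `blocking_job` is hypothetical and only imitates a slow call; swap in `process_pdf` in real code. The heartbeat task is there to show that the event loop stays responsive while the blocking call runs.

```python
import asyncio
import time


def blocking_job(path: str) -> str:
    """Stand-in for a blocking call such as process_pdf (hypothetical)."""
    time.sleep(0.05)
    return path.removesuffix(".pdf") + ".tar"


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    # The blocking call runs in a worker thread; the loop keeps ticking.
    bundle = await asyncio.to_thread(blocking_job, "paper.pdf")
    beat.cancel()
    return bundle, ticks


bundle, ticks = asyncio.run(main())
```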

Service quickstart

pip install "ytcc-pipeline[api]"
uvicorn ytcc_pipeline.api.app:app --host 0.0.0.0 --port 8000

The service reads config.toml at startup -- override the path with YTCC_CONFIG=/path/to/config.toml. One process, one GPU, one PDF at a time; concurrency is serialised on an asyncio.Lock.

curl -X POST http://localhost:8000/process \
  -F "pdf=@paper.pdf" \
  -F "language=en" \
  -o paper.tar

GET /health reports liveness + readiness ({"status":"ok","model_loaded":true}). POST /process accepts pdf (file), language (ISO 639-1), and optional digital_born (true/false). The response body is the tar bundle; the X-Processing-Time header carries the server-side wall-clock time in seconds.

[!CAUTION] Run one uvicorn worker per GPU. With --workers N and N > 1, every worker process loads its own copy of each resident model, and the copies contend for VRAM.
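
The one-PDF-at-a-time behaviour can be sketched with plain asyncio, without FastAPI. `Gate`, `handle_request`, and `fake_pipeline` are hypothetical names; the sketch only assumes what the text above states, namely that requests are serialised on an `asyncio.Lock` and the blocking work runs off the event loop.

```python
import asyncio


class Gate:
    """Hypothetical service state: one lock plus counters to observe overlap."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.active = 0
        self.peak = 0


def fake_pipeline(name: str) -> str:
    """Stand-in for the blocking pipeline call."""
    return f"{name}.tar"


async def handle_request(gate: Gate, name: str) -> str:
    """Wait for the lock, then run exactly one job in a worker thread."""
    async with gate.lock:
        gate.active += 1
        gate.peak = max(gate.peak, gate.active)
        try:
            return await asyncio.to_thread(fake_pipeline, name)
        finally:
            gate.active -= 1


async def main() -> tuple[list[str], int]:
    gate = Gate()  # created inside the running loop
    results = await asyncio.gather(
        *(handle_request(gate, f"doc{i}") for i in range(3))
    )
    return results, gate.peak


results, peak = asyncio.run(main())
```

Three requests arrive together, but `peak` never exceeds one: the later requests wait on the lock rather than running concurrently.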

Docker

Pre-built images for four deployment profiles are published to GitHub Container Registry, each available as a slim variant (~6 GB, models fetched on first request) or a baked variant (~11 GB, models pre-downloaded). Pin to a versioned tag in production.

| Tag | Profile | Use case |
| --- | --- | --- |
| :scanned[-baked] | Mixed (digital-born + scanned) | Default; loads OCR engines |
| :digital-born[-baked] | scanned_enabled=false | Rejects scanned PDFs with HTTP 415; saves ~12 GiB VRAM |
| :digital-born-a100[-baked] | A100-tuned | Larger batches, formula_torch_compile=true |
| :text-extract[-baked] | 48 GB VRAM-tuned, formula + table OFF | Text + image + references only; ~10x faster on math-heavy theses |

Each profile ships a compose file with a GROBID sidecar:

docker compose -f docker/compose.scanned.yml up -d
curl -X POST http://localhost:8000/process \
    -F "pdf=@paper.pdf" \
    -F "language=en" \
    -o paper.tar

Override config via env vars (every PipelineConfig field is reachable via YTCC_<UPPERCASE_FIELD>) or by mounting your own TOML over /app/config.toml. See docker/README.md for the full image matrix, build instructions, and troubleshooting.

Output format

The output is one uncompressed tar:

paper.tar
├── document.json              # the schema document
└── images/                    # cropped block images
    ├── 0001-image-{uuid}.png
    ├── 0014-formula-{uuid}.png
    ├── 0014-table-{uuid}.png
    └── 0027-formula-MISS-{uuid}.png

document.json is the first archive member -- consumers can stream-parse it before image bytes arrive. Image filenames sort by page (1-based, zero-padded), then by layout label, then by random UUID. The -MISS- marker identifies fallback crops written when primary extraction failed.

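import json
import tarfile

# Load the schema document from the bundle.
with tarfile.open("paper.tar") as tf:
    doc = json.loads(tf.extractfile("document.json").read())

# Print each block's reading order, type, and the first 60 characters of its text.
for page in doc["pages"]:
    for block in page["blocks"]:
        # text may be null (e.g. IMAGE or MISS blocks), so fall back to "".
        print(block["reading_order"], block["type"], (block["text"] or "")[:60])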

The schema mirrors the Document / Page / Block / Cell / Reference dataclasses. bbox floats are rounded to two decimals; pixel coordinates are in the effective render DPI (150 for digital-born, 300 for scanned), origin top-left.
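
Because bbox coordinates are pixels at the effective render DPI, comparing boxes across digital-born and scanned documents needs a common unit. A minimal sketch, assuming the bbox is ordered (x0, y0, x1, y1) (the order is not stated here) and using a hypothetical helper name, converts pixels to PDF points at 72 points per inch:

```python
def bbox_px_to_points(bbox, dpi):
    """Convert an (x0, y0, x1, y1) pixel box at `dpi` to points (72 per inch)."""
    scale = 72.0 / dpi
    return tuple(round(v * scale, 2) for v in bbox)


# The same physical region rendered at 150 DPI and at 300 DPI.
digital = bbox_px_to_points((150.0, 300.0, 450.0, 600.0), dpi=150)
scanned = bbox_px_to_points((300.0, 600.0, 900.0, 1200.0), dpi=300)
```

The result keeps the top-left origin. PDF's default user space has its origin at the bottom-left, so mapping back onto the page additionally needs the page height.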

Per-block invariants:

| Block kind | text | image_path | miss |
| --- | --- | --- | --- |
| TEXT, success | extracted text | null | false |
| TEXT, MISS | null | crop (if bundled) or null | true |
| REFERENCE, success | extracted text | null | false |
| REFERENCE, MISS | null | crop (if bundled) or null | true |
| IMAGE | null | crop path | false |
| FORMULA, success | LaTeX | null (crop deleted) | false |
| FORMULA, MISS | null | -MISS- marked crop | true |
| TABLE, structured | null | crop path + cells / n_rows / n_cols set | false |
| TABLE, image-only fallback | null | crop path, cells=null | false |

miss=True always means the primary representation is unavailable. TABLE blocks never carry miss=True -- a structure failure degrades silently to image-only.
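
The invariants above translate into a small consumer-side check. This is a sketch, not part of the library: `check_block` is a hypothetical helper, and it assumes the JSON uses the keys `type`, `text`, `image_path`, and `miss` with upper-case type names as written in the table.

```python
def check_block(block: dict) -> list[str]:
    """Return a list of invariant violations for one block dict (empty = OK)."""
    kind = block["type"]
    text = block.get("text")
    image = block.get("image_path")
    miss = block.get("miss", False)
    problems = []

    if kind in ("TEXT", "REFERENCE"):
        if miss and text is not None:
            problems.append("MISS text block must have text=null")
        if not miss and (text is None or image is not None):
            problems.append("successful text block needs text and no image_path")
    elif kind == "IMAGE":
        if text is not None or image is None or miss:
            problems.append("IMAGE needs a crop path, no text, miss=false")
    elif kind == "FORMULA":
        if miss and (text is not None or image is None):
            problems.append("MISS formula needs a crop and text=null")
        if not miss and (text is None or image is not None):
            problems.append("successful formula needs LaTeX and no crop")
    elif kind == "TABLE":
        if miss:
            problems.append("TABLE never carries miss=true")
        if text is not None or image is None:
            problems.append("TABLE needs a crop path and text=null")
    return problems


good = check_block(
    {"type": "FORMULA", "text": "x^2", "image_path": None, "miss": False}
)
bad = check_block(
    {"type": "TABLE", "text": None, "image_path": "images/0014-table-demo.png", "miss": True}
)
```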

Full schema reference: docs/output-format.md.

Configuration

Three layers, broadest to narrowest:

  1. config.toml at the project root -- single source of truth for the service and benchmarks. Loaded by load_service_config().
  2. YTCC_* environment variables -- per-field overrides via PipelineConfig.from_env().
  3. PipelineConfig(...) keyword arguments -- explicit, per-call.

[!IMPORTANT] The three layers don't compose automatically. The TOML and env vars are read only by load_service_config() and PipelineConfig.from_env(). Use dataclasses.replace(loaded.pipeline, ...) to layer overrides on top of a TOML-loaded config.
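
The layering step with `dataclasses.replace` looks roughly like the sketch below. `PipelineConfigSketch` is a stand-in so the example runs without the library; its field names appear in this README, but the defaults and the frozen setting are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class PipelineConfigSketch:
    """Stand-in config; defaults here are guesses, not the library's."""
    render_workers: int = 4
    table_enabled: bool = False
    references_enabled: bool = False


# Pretend this object came from a TOML-loaded service config.
loaded = PipelineConfigSketch(render_workers=8)

# Per-call override: a new object is returned, the loaded one is untouched.
per_call = replace(loaded, table_enabled=True)

try:
    loaded.table_enabled = True  # frozen dataclass: assignment is rejected
except FrozenInstanceError:
    mutated = False
else:
    mutated = True
```

In real code the first argument would be the loaded config (`loaded.pipeline` in the note above) and the keyword arguments the per-call overrides.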

TOML resolution walks: path arg -> YTCC_CONFIG env -> ./config.toml -> installed-package config.toml -> dataclass defaults. Unknown TOML keys raise ValueError -- typos don't pass silently.
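
Rejecting unknown keys can be done by comparing the parsed TOML mapping against the dataclass fields. The sketch below assumes that approach; `ConfigSketch` and `config_from_mapping` are hypothetical, the error wording is invented, and the input dict stands for what a TOML parser such as `tomllib` would return.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in config with two field names taken from this README."""
    render_workers: int = 4
    table_enabled: bool = False


def config_from_mapping(raw: dict) -> ConfigSketch:
    """Build a config from parsed TOML, rejecting keys that match no field."""
    known = {f.name for f in fields(ConfigSketch)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"unknown config keys: {', '.join(unknown)}")
    return ConfigSketch(**raw)


ok = config_from_mapping({"table_enabled": True})

try:
    config_from_mapping({"table_enbled": True})  # typo in the key
except ValueError as exc:
    error = str(exc)
```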

Every PipelineConfig field has a matching YTCC_<UPPERCASE_NAME> env var. Bool parser accepts 1/true/yes/on (case-insensitive). Comma-list fields strip whitespace and drop empties.
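
The parsing rules above are simple enough to sketch directly. The helper names are hypothetical, and two details are assumptions: that surrounding whitespace is ignored for booleans, and that any value outside the accepted set is treated as false rather than raising.

```python
TRUTHY = {"1", "true", "yes", "on"}


def env_name(field_name: str) -> str:
    """Map a config field name to its YTCC_<UPPERCASE_NAME> variable."""
    return f"YTCC_{field_name.upper()}"


def parse_bool(raw: str) -> bool:
    """True for 1/true/yes/on in any letter case; anything else is False here."""
    return raw.strip().lower() in TRUTHY


def parse_comma_list(raw: str) -> list[str]:
    """Split on commas, strip whitespace, and drop empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


name = env_name("table_enabled")
flag = parse_bool("YES")
langs = parse_comma_list(" en, tr ,,ar ")
```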

Full knob reference and tuning guidance: docs/configuration.md and docs/performance.md.

Digital-born vs scanned

The two paths share rendering, layout, formula recognition, table extraction, and bundling. They diverge only inside the block stage:

| | Digital-born | Scanned |
| --- | --- | --- |
| Text source | pdf_oxide text layer | RapidOCR over rendered crops |
| Render DPI | 150 (default) | 300 (default) |
| Per-worker cost | ~10 MiB RSS (one PdfDocument handle) | ~2 GiB VRAM (one RapidOCR engine + CUDA context) |
| Typical wall on RTX 3090 | ~15s for 150 pages | ~5-10x that |

Auto-detect samples 5 pages and checks per-page non-whitespace character counts; override with digital_born=True/False. Set scanned_enabled=false to reject scanned PDFs entirely (saves ~12 GiB VRAM; FastAPI returns HTTP 415).
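
A heuristic of this shape can be sketched as follows. Only the sample size of 5 and the use of per-page non-whitespace character counts come from the text above; the function name, the evenly spaced sampling, the `min_chars` threshold, and the majority rule are assumptions made for illustration.

```python
def looks_digital_born(page_texts, sample_size=5, min_chars=50, min_fraction=0.5):
    """Guess whether a PDF has a usable text layer.

    page_texts holds one text-layer string per page. The thresholds are
    hypothetical, not the library's.
    """
    if not page_texts:
        return False
    # Spread the sample evenly across the document.
    step = max(1, len(page_texts) // sample_size)
    sample = page_texts[::step][:sample_size]
    # Count non-whitespace characters on each sampled page.
    counts = [sum(1 for ch in text if not ch.isspace()) for text in sample]
    rich_pages = sum(1 for n in counts if n >= min_chars)
    return rich_pages / len(sample) >= min_fraction


digital = looks_digital_born(["Lorem ipsum dolor sit amet " * 10] * 20)
scanned = looks_digital_born(["  \n "] * 20)
```

Counting non-whitespace characters rather than raw length avoids treating a text layer that contains only spaces and line breaks as real text.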

References (GROBID)

The reference stage requires an externally-managed GROBID server -- the pipeline never spawns the JVM. Start it separately:

docker run --rm -p 8070:8070 grobid/grobid:0.9.0

Or use the bundled helper which generates a citation-only config (drops startup from ~10s to ~3s, saves ~1 GiB RSS):

scripts/grobid_start.sh
GROBID_PORT=9090 scripts/grobid_start.sh
scripts/grobid_stop.sh

When enabled, run_reference_stage does one batched POST to /api/processCitationList per PDF. Failures (server unreachable, timeout, HTTP error, malformed XML) are logged at WARNING and the page list flows through unchanged -- references are an enrichment, not a hard requirement. The raw reference string always survives on Block.text.
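
The degrade-on-failure contract can be sketched independently of HTTP. In the sketch below, `enrich_references` and the injected `post` callable are hypothetical, as is the `reference` key; `post` stands for one batched request that returns one parsed record per raw string. The reference string in the demo is a placeholder.

```python
import logging
from typing import Callable

log = logging.getLogger("reference_sketch")


def enrich_references(
    blocks: list[dict], post: Callable[[list[str]], list[dict]]
) -> list[dict]:
    """Attach parsed records to REFERENCE blocks; on failure return blocks as-is."""
    refs = [b for b in blocks if b["type"] == "REFERENCE" and b.get("text")]
    if not refs:
        return blocks
    try:
        parsed = post([b["text"] for b in refs])  # one batched call
        if len(parsed) != len(refs):
            raise ValueError("response length mismatch")
    except Exception as exc:  # unreachable server, timeout, HTTP error, bad XML
        log.warning("reference stage degraded: %s", exc)
        return blocks
    by_id = {id(b): record for b, record in zip(refs, parsed)}
    # Build new dicts; the raw text always stays on the block.
    return [
        {**b, "reference": by_id[id(b)]} if id(b) in by_id else b for b in blocks
    ]


def server_down(texts: list[str]) -> list[dict]:
    raise TimeoutError("no server")


blocks = [
    {"type": "TEXT", "text": "Intro"},
    {"type": "REFERENCE", "text": "Doe, J. (2020). Title."},
]
degraded = enrich_references(blocks, server_down)
enriched = enrich_references(blocks, lambda texts: [{"raw": t} for t in texts])
```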

Design principles

  • Synchronous by default. process_pdf is blocking. Async is layered on top in the FastAPI service via asyncio.to_thread. No async leakage into the core pipeline.
  • multiprocessing.spawn, never fork. The parent process may hold a CUDA context; fork corrupts it. Workers re-import their module from scratch, which is why pdf_oxide is imported inside the digital-born worker entry rather than at module top.
  • One process, one GPU, one PDF at a time. Concurrency is serialised at the asyncio.Lock in the service layer; the pipeline itself is sequential.
  • Resident models are injectable, not global. LayoutAnalyzer, FormulaRecognizer, and TableEngine are constructor-injected with idempotent close(). The FastAPI lifespan loads them once; library callers either inject manually or let the orchestrator own the per-call lifecycle.
  • Opt-in for anything heavy. Heavy stages (table_enabled, references_enabled) and slow tradeoffs (formula_torch_compile, layout_fp16) are off by default. Defaults are library-safe; production callers flip them on explicitly.
  • Streaming-first output. Tar over zip because tar writes sequentially without seeking back for a central directory -- the bundle can be a pipe, socket, or HTTP response body. document.json is written first so consumers parse the index before image bytes arrive.
  • No silent failures. Unknown TOML keys raise. MISS extractions are flagged on the block (miss=true) and preserve reading order + bbox. GROBID failures degrade the reference stage but never fail the pipeline.
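
The streaming-first principle above can be demonstrated with the standard library's tarfile stream modes. This is a sketch under assumptions, not the library's bundler: `write_bundle` is a hypothetical helper and the image bytes are dummies. Mode `"w|"` writes an uncompressed tar strictly sequentially, and mode `"r|"` reads it forward-only, as a consumer on a pipe or socket would.

```python
import io
import json
import tarfile


def write_bundle(stream, document: dict, images: dict[str, bytes]) -> None:
    """Write document.json first, then the images, without ever seeking."""
    with tarfile.open(fileobj=stream, mode="w|") as tf:
        members = [("document.json", json.dumps(document).encode("utf-8"))]
        members += [(f"images/{name}", data) for name, data in sorted(images.items())]
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))


buf = io.BytesIO()
write_bundle(buf, {"pages": []}, {"0001-image-demo.png": b"\x89PNG"})

buf.seek(0)
names = []
doc = None
with tarfile.open(fileobj=buf, mode="r|") as tf:
    for member in tf:
        if doc is None:
            # The first member is the index; parse it before any image arrives.
            doc = json.loads(tf.extractfile(member).read())
        names.append(member.name)
```

A zip archive keeps its central directory at the end of the file, so a reader normally needs the whole file before it can list members; with tar, each member header precedes its data.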

Limitations

  • Single-GPU, single-PDF concurrency. The service serialises on a lock. Throughput scales with replicas, not workers.
  • Python 3.14+ only. The project uses PEP 649/749 deferred-annotation semantics and modern stdlib features. No backport path.
  • CPU is unsupported. The pipeline runs on CPU, but that path is untested and unoptimised. Production deployments need CUDA.
  • fork is not supported. Using this library with the multiprocessing fork start method corrupts CUDA contexts.
  • Tuned for academic theses. The PP-DocLayoutV3 label set and routing rules target academic documents (abstracts, references, formulas, tables). General-purpose PDFs may produce surprising layouts.
  • GROBID is external. The reference stage requires a separately-managed GROBID server. The pipeline never bundles or starts the JVM.
  • Metadata output may contain unexpected XMP keys. pdf_info comes directly from the PDF's XMP block, cleaned of UTF-16 BOMs and null bytes -- anything else (encrypted PDFs, IPTC, RDF) is out of scope.
  • Bundle filenames don't sort by reading order. document.json is the authoritative reading order; image filenames sort by page + label + UUID.

Benchmarks

Knob-sweep benchmarks live in the benchmarks/ package. Each sweep varies one PipelineConfig field across a range of values and records per-stage wall, process-tree CPU/RSS, device VRAM, and quality metrics (block counts, MISS counts, formulas recovered, references parsed). Standalone scripts cover cold-start, sustained load, API concurrency, GROBID payload scaling, and torch.compile amortisation.

python benchmarks/run_all.py                # every sweep (cached CSVs skipped)
python benchmarks/run_all.py --only formula # just the sweeps matching "formula"
python -m benchmarks.plot                   # generate plots from existing CSVs

Committed reference results (benchmarks/results/summary.md, sweeps/*.{csv,md}, plots/*.png) live in git. Full catalogue: benchmarks/README.md.

Documentation

| File | Topic |
| --- | --- |
| docs/quickstart.md | Install, first run, library + service modes |
| docs/architecture.md | Stage-by-stage pipeline, module layout, resource lifecycle |
| docs/output-format.md | Tar layout, document.json schema, MISS semantics |
| docs/configuration.md | PipelineConfig knobs, TOML, env-var overrides |
| docs/stages.md | Per-stage behaviour, knobs, skip / no-op semantics |
| docs/performance.md | Recommended config, per-knob impact, VRAM budget, tuning checklist |
| docs/digital-born-vs-scanned.md | Auto-detect heuristic, when to override, scanned-only deployments |
| docs/api-service.md | FastAPI contract, lifespan, concurrency model |
| docs/references.md | GROBID setup, parsed Reference shape, failure modes |
| docs/gotchas.md | Common pitfalls, MISS handling, OOM recovery, log conventions |

Samples

Six PDFs under samples/ cover English / Turkish / Arabic, digital-born and scanned, good and bad quality. Use 904599.pdf (English, digital-born, good) as the first sanity check -- it exercises every stage except OCR.

License

MIT. See LICENSE.
