PDF processing pipeline for academic theses.

Maintainers

0zefe

Project description

ytcc-pipeline

A synchronous Python library that converts an academic-thesis PDF into a structured JSON document plus a tar bundle of cropped figures, tables, and formulas. Ships an optional FastAPI wrapper for service deployments and Docker images for four deployment profiles.

What it does

Given a PDF, the pipeline runs eight stages in fixed order:

render -> metadata -> layout -> blocks -> table -> formula -> reference -> bundle

render -- decode every page to an image (pdf_oxide, cfg.render_workers processes).
metadata -- sha256, byte size, XMP fields. Cheap; runs before model load so I/O failures surface early.
layout -- PP-DocLayoutV3 emits one LayoutDetection per detected block with label, bbox, confidence, reading_order.
blocks -- per-page dispatcher routes each detection by Route. Text comes from the PDF text layer (digital-born) or RapidOCR over the rendered crop (scanned). Figures, tables, and formulas are cropped and saved.
table (opt-in) -- RapidTable SLANet+ recovers cell grids for TABLE blocks.
formula (default on) -- PP-FormulaNet-L recovers LaTeX from every FORMULA block's crop.
reference (opt-in) -- batched POST to an externally-managed GROBID server enriches REFERENCE blocks with parsed Reference records.
bundle -- pack document.json and every saved crop into a single uncompressed tar.

Disabled stages never load their model. Each stage emits one INFO log line on completion; skips short-circuit with skipped reason=....
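
The bundle stage can be pictured with the standard-library tarfile module. This is a minimal sketch, not the pipeline's own code: write_bundle and its arguments are hypothetical names, and the only behaviour taken from this README is that document.json is written first into an uncompressed tar.

```python
import io
import json
import tarfile


def write_bundle(fileobj, document: dict, crops: dict[str, bytes]) -> None:
    """Write document.json first, then each crop, as an uncompressed tar stream."""
    # Mode "w|" writes sequentially and never seeks, so fileobj may be a pipe or socket.
    with tarfile.open(fileobj=fileobj, mode="w|") as tf:
        payload = json.dumps(document, sort_keys=True).encode("utf-8")
        info = tarfile.TarInfo(name="document.json")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
        # Crops follow the index; sorting keeps the member order deterministic.
        for name in sorted(crops):
            data = crops[name]
            info = tarfile.TarInfo(name=f"images/{name}")
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))


# Hypothetical inputs: an empty document and one placeholder crop.
buffer = io.BytesIO()
write_bundle(buffer, {"pages": []}, {"0001-image-example.png": b"\x89PNG"})
```

Because the index is the first member, a consumer reading the same stream can stop after parsing it.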

Features

One public entry point. process_pdf(pdf_path, language=..., ...) -- everything else is implementation detail.
Auto-detected digital-born vs scanned. The orchestrator probes the PDF's text layer once and picks the path. Override with digital_born=True/False.
Two text-extraction paths, asymmetric workers. digital_born_workers (cheap pdf_oxide processes) and ocr_workers (RapidOCR + CUDA, ~2 GiB VRAM each) are tuned independently.
Bucketed formula batching. Sorts crops by bbox area into small/medium/large buckets with per-bucket max_new_tokens caps. Measured 1.63x speedup over flat batching.
fp16 + cv2 fast preproc on layout. ~1.7x and ~2.3x isolated layout-stage speedups on the SafeTensors backend.
Auto-DPI on the digital-born path. Renders at 150 DPI for digital-born (layout downsamples to 800x800 anyway, pdf_oxide is resolution-independent), 300 DPI for scanned. Halves render wall-clock time.
Injectable resident models. Pass pre-loaded LayoutAnalyzer, FormulaRecognizer, and TableEngine to skip the ~5s + ~3s + ~1s reload between calls. The FastAPI service does this in lifespan.
Frozen-dataclass schema. Document / Page / Block / Cell / Reference are immutable; dataclasses.replace is the only rewrite path. JSON serialisation is stable across runs (modulo uuid4 crop filenames).
Streaming-first tar bundle. document.json is the first archive member; consumers parse the index before the image bytes arrive. The FastAPI service ships it directly via FileResponse.
Three-layer config. config.toml (service-wide), YTCC_* env vars (per-field overrides), PipelineConfig(...) keyword args (per-call). Unknown TOML keys raise ValueError.
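
The bucketed formula batching listed above can be sketched as follows. The area thresholds, the max_new_tokens caps, and the Crop type are all hypothetical; this README states only that crops are sorted by bbox area into small/medium/large buckets with a per-bucket cap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Hypothetical stand-in for a formula crop."""
    crop_id: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


# (bucket name, inclusive upper area bound in px^2, max_new_tokens cap).
# These numbers are illustrative assumptions, not the library's values.
BUCKETS = (
    ("small", 20_000, 64),
    ("medium", 100_000, 256),
    ("large", float("inf"), 1024),
)


def bucket_crops(crops):
    """Return {bucket: (max_new_tokens, [crops sorted by ascending area])}."""
    grouped = {name: (cap, []) for name, _upper, cap in BUCKETS}
    for crop in sorted(crops, key=lambda c: c.area):
        for name, upper, _cap in BUCKETS:
            if crop.area <= upper:
                grouped[name][1].append(crop)
                break
    return grouped


crops = [Crop("a", 400, 300), Crop("b", 100, 50), Crop("c", 300, 200)]
buckets = bucket_crops(crops)
```

Grouping similar-sized crops lets small formulas stop decoding early instead of waiting on the largest item in a flat batch.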

Requirements

Python 3.14+.
A CUDA GPU for any non-trivial throughput. CPU works but is neither tested nor recommended.
An externally-managed GROBID server when references_enabled=true (not bundled).

The project pins CUDA-enabled torch and onnxruntime-gpu. Substitute these for CPU wheels if you target CPU-only environments; nothing else in the codebase assumes CUDA at import time.

Installation

pip install ytcc-pipeline          # library only
pip install "ytcc-pipeline[api]"   # + FastAPI service (quoted so shells such as zsh do not expand the brackets)
pip install "ytcc-pipeline[dev]"   # + tests + lint + typing + benchmarks (includes [api])

Library quickstart

from ytcc_pipeline import process_pdf

# paper.tar: contains document.json + images/*
bundle_path = process_pdf("paper.pdf", language="en")

process_pdf is synchronous and blocking -- internally it uses multiprocessing pools with the spawn start method, not asyncio. Call it from a thread (asyncio.to_thread(process_pdf, ...)) if you need to integrate with an event loop.
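
A minimal sketch of the event-loop integration mentioned above. blocking_job is a hypothetical stand-in so the example runs without the library or a GPU; in real use the call would be asyncio.to_thread(process_pdf, "paper.pdf", language="en").

```python
import asyncio
import time


def blocking_job(path: str) -> str:
    """Stand-in for a blocking call such as process_pdf; it only sleeps briefly."""
    time.sleep(0.05)
    return path.removesuffix(".pdf") + ".tar"


async def main() -> str:
    # to_thread runs the blocking call in a worker thread, keeping the event loop free.
    return await asyncio.to_thread(blocking_job, "paper.pdf")


result = asyncio.run(main())
```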

Service quickstart

pip install "ytcc-pipeline[api]"
uvicorn ytcc_pipeline.api.app:app --host 0.0.0.0 --port 8000

The service reads config.toml at startup -- override the path with YTCC_CONFIG=/path/to/config.toml. One process, one GPU, one PDF at a time; concurrency is serialised on an asyncio.Lock.

curl -X POST http://localhost:8000/process \
  -F "pdf=@paper.pdf" \
  -F "language=en" \
  -o paper.tar

GET /health reports liveness + readiness ({"status":"ok","model_loaded":true}). POST /process accepts pdf (file), language (ISO 639-1), and optional digital_born (true/false). The response body is the tar bundle; the X-Processing-Time header carries the server-side wall-clock time in seconds.

[!CAUTION] Run one uvicorn worker per GPU. With --workers greater than 1, every worker loads its own copy of each resident model and the workers contend for VRAM.
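
The one-PDF-at-a-time behaviour can be illustrated with an asyncio.Lock wrapped around a threaded blocking call. This is a sketch of the pattern only: handle, blocking_job, and the counters are hypothetical, and the service's actual handler is not shown in this README.

```python
import asyncio
import time

active = 0  # jobs currently inside the blocking section
peak = 0    # highest value of `active` ever observed


def blocking_job(name: str) -> str:
    """Stand-in for the blocking pipeline call; records how many jobs overlap."""
    global active, peak
    active += 1
    peak = max(peak, active)
    time.sleep(0.02)
    active -= 1
    return name


async def handle(lock: asyncio.Lock, name: str) -> str:
    # Requests queue on the lock, so only one blocking job runs at a time.
    async with lock:
        return await asyncio.to_thread(blocking_job, name)


async def main() -> list[str]:
    lock = asyncio.Lock()
    return await asyncio.gather(*(handle(lock, f"job-{i}") for i in range(3)))


results = asyncio.run(main())
```

All three requests are accepted concurrently, but the lock keeps the overlap count at one.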

Docker

Pre-built images for four deployment profiles are published to GitHub Container Registry, each available as a slim variant (~6 GB, models fetched on first request) or a baked variant (~11 GB, models pre-downloaded). Pin to a versioned tag in production.

Tag	Profile	Use case
`:scanned[-baked]`	Mixed (digital-born + scanned)	Default; loads OCR engines
`:digital-born[-baked]`	`scanned_enabled=false`	Rejects scanned PDFs with HTTP 415; saves ~12 GiB VRAM
`:digital-born-a100[-baked]`	A100-tuned	Larger batches, `formula_torch_compile=true`
`:text-extract[-baked]`	Tuned for 48 GB VRAM, formula + table OFF	Text + image + references only; ~10x faster on math-heavy theses

Each profile ships a compose file with a GROBID sidecar:

docker compose -f docker/compose.scanned.yml up -d
curl -X POST http://localhost:8000/process \
    -F "pdf=@paper.pdf" \
    -F "language=en" \
    -o paper.tar

Override config via env vars (every PipelineConfig field is reachable via YTCC_<UPPERCASE_FIELD>) or by mounting your own TOML over /app/config.toml. See docker/README.md for the full image matrix, build instructions, and troubleshooting.

Output format

The output is one uncompressed tar:

paper.tar
├── document.json              # the schema document
└── images/                    # cropped block images
    ├── 0001-image-{uuid}.png
    ├── 0014-formula-{uuid}.png
    ├── 0014-table-{uuid}.png
    └── 0027-formula-MISS-{uuid}.png

document.json is the first archive member -- consumers can stream-parse it before image bytes arrive. Image filenames sort by page (1-based, zero-padded), then by layout label, then by random UUID. The -MISS- marker identifies fallback crops written when primary extraction failed.

import json, tarfile

with tarfile.open("paper.tar") as tf:
    doc = json.loads(tf.extractfile("document.json").read())

for page in doc["pages"]:
    for block in page["blocks"]:
        print(block["reading_order"], block["type"], (block["text"] or "")[:60])
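
The snippet above opens the tar from a seekable file. When the bundle arrives over a pipe or HTTP body, tarfile's stream mode ("r|") reads members strictly in order, which is where writing document.json first pays off. A minimal sketch, with read_index as a hypothetical helper and an in-memory archive standing in for a real bundle:

```python
import io
import json
import tarfile


def read_index(stream) -> dict:
    """Read document.json from a non-seekable tar stream without reading later members."""
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        first = tf.next()
        if first is None or first.name != "document.json":
            raise ValueError("expected document.json as the first archive member")
        return json.load(tf.extractfile(first))


# Build a tiny stand-in bundle in memory so the example is self-contained.
demo = io.BytesIO()
with tarfile.open(fileobj=demo, mode="w") as tf:
    body = json.dumps({"pages": []}).encode("utf-8")
    info = tarfile.TarInfo("document.json")
    info.size = len(body)
    tf.addfile(info, io.BytesIO(body))
demo.seek(0)

index = read_index(demo)
```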

The schema mirrors the Document / Page / Block / Cell / Reference dataclasses. bbox floats are rounded to two decimals; pixel coordinates are in the effective render DPI (150 for digital-born, 300 for scanned), origin top-left.

Per-block invariants:

Block kind	`text`	`image_path`	`miss`
TEXT, success	extracted text	`null`	`false`
TEXT, MISS	`null`	crop (if bundled) or `null`	`true`
REFERENCE, success	extracted text	`null`	`false`
REFERENCE, MISS	`null`	crop (if bundled) or `null`	`true`
IMAGE	`null`	crop path	`false`
FORMULA, success	LaTeX	`null` (crop deleted)	`false`
FORMULA, MISS	`null`	`-MISS-` marked crop	`true`
TABLE, structured	`null`	crop path + `cells` / `n_rows` / `n_cols` set	`false`
TABLE, image-only fallback	`null`	crop path, `cells=null`	`false`

miss=True always means the primary representation is unavailable. TABLE blocks never carry miss=True -- a structure failure degrades silently to image-only.
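
The invariants can be checked mechanically on a parsed document.json. The sketch below assumes the keys shown in the table and the reading example above (type, text, image_path, miss, reading_order) and assumes that type holds upper-case names such as "FORMULA"; both helpers are hypothetical.

```python
def collect_misses(doc: dict) -> list[tuple[int, int, str]]:
    """Return (0-based page index, reading_order, type) for every block flagged miss."""
    misses = []
    for page_index, page in enumerate(doc["pages"]):
        for block in page["blocks"]:
            if block.get("miss"):
                misses.append((page_index, block["reading_order"], block["type"]))
    return misses


def check_block(block: dict) -> list[str]:
    """Report violations of two invariants from the table: miss implies no text; tables never miss."""
    problems = []
    if block.get("miss") and block.get("text") is not None:
        problems.append("miss block carries text")
    if block.get("type") == "TABLE" and block.get("miss"):
        problems.append("TABLE block flagged miss")
    return problems


# Hand-written sample, not real pipeline output.
doc = {"pages": [{"blocks": [
    {"reading_order": 0, "type": "TEXT", "text": "Abstract",
     "image_path": None, "miss": False},
    {"reading_order": 1, "type": "FORMULA", "text": None,
     "image_path": "images/0001-formula-MISS-x.png", "miss": True},
]}]}

misses = collect_misses(doc)
problems = [p for page in doc["pages"] for b in page["blocks"] for p in check_block(b)]
```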

Full schema reference: docs/output-format.md.

Configuration

Three layers, broadest to narrowest:

config.toml at the project root -- single source of truth for the service and benchmarks. Loaded by load_service_config().
YTCC_* environment variables -- per-field overrides via PipelineConfig.from_env().
PipelineConfig(...) keyword arguments -- explicit, per-call.

[!IMPORTANT] The three layers don't compose automatically. The TOML file is read only by load_service_config() and the env vars only by PipelineConfig.from_env(). Use dataclasses.replace(loaded.pipeline, ...) to layer overrides on top of a TOML-loaded config.
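
Layering per-call overrides with dataclasses.replace looks like this in miniature. DemoConfig is a hypothetical stand-in for PipelineConfig so the example runs without the library; the three field names are borrowed from this README, while their defaults here are illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for PipelineConfig."""
    table_enabled: bool = False
    references_enabled: bool = False
    ocr_workers: int = 1


# Pretend this instance came from a TOML file.
loaded = DemoConfig(references_enabled=True)

# replace() returns a new frozen instance; `loaded` itself is left untouched.
per_call = replace(loaded, table_enabled=True, ocr_workers=2)
```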

TOML resolution walks: path arg -> YTCC_CONFIG env -> ./config.toml -> installed-package config.toml -> dataclass defaults. Unknown TOML keys raise ValueError -- typos don't pass silently.

Every PipelineConfig field has a matching YTCC_<UPPERCASE_NAME> env var. Bool parser accepts 1/true/yes/on (case-insensitive). Comma-list fields strip whitespace and drop empties.
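
The parsing rules just described can be sketched in a few lines. These helpers are hypothetical, not the library's implementation; in particular, treating every unrecognised bool string as False is an assumption, since this README only lists the accepted truthy values.

```python
_TRUTHY = {"1", "true", "yes", "on"}


def env_name(field: str) -> str:
    """Map a config field name to its env var, e.g. table_enabled -> YTCC_TABLE_ENABLED."""
    return f"YTCC_{field.upper()}"


def parse_bool(raw: str) -> bool:
    """Case-insensitive truthy check; anything else is treated as False (assumption)."""
    return raw.strip().lower() in _TRUTHY


def parse_list(raw: str) -> list[str]:
    """Split on commas, strip whitespace, and drop empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]
```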

Full knob reference and tuning guidance: docs/configuration.md and docs/performance.md.

Digital-born vs scanned

The two paths share rendering, layout, formula recognition, table extraction, and bundling. They diverge only inside the block stage:

	Digital-born	Scanned
Text source	`pdf_oxide` text layer	`RapidOCR` over rendered crops
Render DPI	150 (default)	300 (default)
Per-worker cost	~10 MiB RSS (one `PdfDocument` handle)	~2 GiB VRAM (one RapidOCR engine + CUDA context)
Typical wall-clock time on RTX 3090	~15s for 150 pages	~5-10x longer

Auto-detect samples 5 pages and checks per-page non-whitespace character counts; override with digital_born=True/False. Set scanned_enabled=false to reject scanned PDFs entirely (saves ~12 GiB VRAM; FastAPI returns HTTP 415).
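
A rough sketch of such a heuristic, operating on already-extracted page texts. Only the sample size of 5 and the per-page non-whitespace count come from this README; the even spacing of samples, the min_chars threshold, and the min_fraction vote are assumptions, and both function names are hypothetical.

```python
def sample_indices(n_pages: int, n_samples: int = 5) -> list[int]:
    """Pick up to n_samples page indices spread evenly across the document."""
    if n_pages <= n_samples:
        return list(range(n_pages))
    step = (n_pages - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


def looks_digital_born(
    page_texts: list[str], min_chars: int = 50, min_fraction: float = 0.6
) -> bool:
    """True when enough sampled pages carry a usable amount of text-layer content."""
    indices = sample_indices(len(page_texts))
    if not indices:
        return False
    hits = 0
    for i in indices:
        non_ws = sum(1 for ch in page_texts[i] if not ch.isspace())
        if non_ws >= min_chars:
            hits += 1
    return hits / len(indices) >= min_fraction
```

A scanned PDF usually has an empty or near-empty text layer, so its sampled pages fall below the character threshold.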

References (GROBID)

The reference stage requires an externally-managed GROBID server -- the pipeline never spawns the JVM. Start it separately:

docker run --rm -p 8070:8070 grobid/grobid:0.9.0

Or use the bundled helper which generates a citation-only config (drops startup from ~10s to ~3s, saves ~1 GiB RSS):

scripts/grobid_start.sh
GROBID_PORT=9090 scripts/grobid_start.sh
scripts/grobid_stop.sh

When enabled, run_reference_stage does one batched POST to /api/processCitationList per PDF. Failures (server unreachable, timeout, HTTP error, malformed XML) are logged at WARNING and the page list flows through unchanged -- references are an enrichment, not a hard requirement. The raw reference string always survives on Block.text.

Design principles

Synchronous by default. process_pdf is blocking. Async is layered on top in the FastAPI service via asyncio.to_thread. No async leakage into the core pipeline.
Spawn, never fork. Worker pools use the multiprocessing spawn start method. The parent process may hold a CUDA context; fork corrupts it. Workers re-import their module from scratch, which is why pdf_oxide is imported inside the digital-born worker entry rather than at module top.
One process, one GPU, one PDF at a time. Concurrency is serialised at the asyncio.Lock in the service layer; the pipeline itself is sequential.
Resident models are injectable, not global. LayoutAnalyzer, FormulaRecognizer, and TableEngine are constructor-injected with idempotent close(). The FastAPI lifespan loads them once; library callers either inject manually or let the orchestrator own the per-call lifecycle.
Opt-in opt-in opt-in. Heavy stages (table_enabled, references_enabled) and slow tradeoffs (formula_torch_compile, layout_fp16) are off by default. Defaults are library-safe; production callers flip them on explicitly.
Streaming-first output. Tar over zip because tar writes sequentially without seeking back for a central directory -- the bundle can be a pipe, socket, or HTTP response body. document.json is written first so consumers parse the index before image bytes arrive.
No silent failures. Unknown TOML keys raise. MISS extractions are flagged on the block (miss=true) and preserve reading order + bbox. GROBID failures degrade the reference stage but never fail the pipeline.
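
The spawn principle above maps onto the standard multiprocessing API as follows. This is a generic sketch with hypothetical helper names, not the pipeline's worker code.

```python
import multiprocessing as mp


def square(x: int) -> int:
    """Trivial worker; it must live at module top level so spawned children can import it."""
    return x * x


def map_in_spawn_pool(func, items, workers: int = 2):
    """Run func over items in a pool whose workers start from a fresh interpreter."""
    # get_context("spawn") scopes the start method to this pool
    # instead of changing the process-wide default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=workers) as pool:
        return pool.map(func, items)


START_METHOD = mp.get_context("spawn").get_start_method()
```

Call map_in_spawn_pool(square, [1, 2, 3]) from under an if __name__ == "__main__": guard, because spawned workers re-import the main module on startup.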

Limitations

Single-GPU, single-PDF concurrency. The service serialises on a lock. Throughput scales with replicas, not workers.
Python 3.14+ only. The project uses PEP 649/749 deferred-annotation semantics and modern stdlib features. No backport path.
CPU is unsupported. The code may run on CPU, but that path is untested and unoptimised. Production deployments need CUDA.
fork is not supported. Mixing this library with the multiprocessing fork start method corrupts CUDA contexts.
Tuned for academic theses. The PP-DocLayoutV3 label set and routing rules target academic documents (abstracts, references, formulas, tables). General-purpose PDFs may produce surprising layouts.
GROBID is external. The reference stage requires a separately-managed GROBID server. The pipeline never bundles or starts the JVM.
Metadata output may contain unexpected XMP keys. pdf_info comes directly from the PDF's XMP block, cleaned of UTF-16 BOMs and null bytes -- anything else (encrypted PDFs, IPTC, RDF) is out of scope.
Bundle filenames don't sort by reading order. document.json is the authoritative reading order; image filenames sort by page + label + UUID.

Benchmarks

Knob-sweep benchmarks live in the benchmarks/ package. Each sweep varies one PipelineConfig field across a range of values and records per-stage wall, process-tree CPU/RSS, device VRAM, and quality metrics (block counts, MISS counts, formulas recovered, references parsed). Standalone scripts cover cold-start, sustained load, API concurrency, GROBID payload scaling, and torch.compile amortisation.

python benchmarks/run_all.py                # every sweep (cached CSVs skipped)
python benchmarks/run_all.py --only formula # just the sweeps matching "formula"
python -m benchmarks.plot                   # generate plots from existing CSVs

Committed reference results (benchmarks/results/summary.md, sweeps/*.{csv,md}, plots/*.png) live in git. Full catalogue: benchmarks/README.md.

Documentation

File	Topic
`docs/quickstart.md`	Install, first run, library + service modes
`docs/architecture.md`	Stage-by-stage pipeline, module layout, resource lifecycle
`docs/output-format.md`	Tar layout, `document.json` schema, MISS semantics
`docs/configuration.md`	`PipelineConfig` knobs, TOML, env-var overrides
`docs/stages.md`	Per-stage behaviour, knobs, skip / no-op semantics
`docs/performance.md`	Recommended config, per-knob impact, VRAM budget, tuning checklist
`docs/digital-born-vs-scanned.md`	Auto-detect heuristic, when to override, scanned-only deployments
`docs/api-service.md`	FastAPI contract, lifespan, concurrency model
`docs/references.md`	GROBID setup, parsed `Reference` shape, failure modes
`docs/gotchas.md`	Common pitfalls, MISS handling, OOM recovery, log conventions

Samples

Six PDFs under samples/ cover English / Turkish / Arabic, digital-born and scanned, good and bad quality. Use 904599.pdf (English, digital-born, good) as the first sanity check -- it exercises every stage except OCR.

License

MIT. See LICENSE.

Release history

This version

0.2.0

May 21, 2026

Download files

Source Distribution

ytcc_pipeline-0.2.0.tar.gz (68.2 MB)

Uploaded May 21, 2026 Source

Built Distribution

ytcc_pipeline-0.2.0-py3-none-any.whl (93.9 kB)

Uploaded May 21, 2026 Python 3

File details

Details for the file ytcc_pipeline-0.2.0.tar.gz.

File metadata

Download URL: ytcc_pipeline-0.2.0.tar.gz
Upload date: May 21, 2026
Size: 68.2 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ytcc_pipeline-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`0912d24ebb423f1b5b4d3f159b0080a0b0c2459fc285164d4b8615f0086c600c`
MD5	`cff616082ed84d09cd296dd6d1e4feac`
BLAKE2b-256	`accf094f1c88943cc088c46fdb83f9fd2ef4a52119ba84aa2c4562c6454bcaf0`

Provenance

The following attestation bundles were made for ytcc_pipeline-0.2.0.tar.gz:

Publisher: release.yml on ozefe/ytcc-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ytcc_pipeline-0.2.0.tar.gz
- Subject digest: 0912d24ebb423f1b5b4d3f159b0080a0b0c2459fc285164d4b8615f0086c600c
- Sigstore transparency entry: 1593284919
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: ozefe/ytcc-pipeline@d4c23f82c6f8925dd520a7bdd50824f44674db93
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ozefe
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d4c23f82c6f8925dd520a7bdd50824f44674db93
- Trigger Event: push

File details

Details for the file ytcc_pipeline-0.2.0-py3-none-any.whl.

File metadata

Download URL: ytcc_pipeline-0.2.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 93.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ytcc_pipeline-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ac94e5e0e738f7b1f42e8334d8019b3721f50b1ebaa248c7ed78d73c9d408727`
MD5	`2c1d772cca49e74377502f7dfc4ff387`
BLAKE2b-256	`8565a500ace0529ebb0560ffcfe72354b67cdf1d2c9ab64eabec3b39ff4ff68a`

Provenance

The following attestation bundles were made for ytcc_pipeline-0.2.0-py3-none-any.whl:

Publisher: release.yml on ozefe/ytcc-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ytcc_pipeline-0.2.0-py3-none-any.whl
- Subject digest: ac94e5e0e738f7b1f42e8334d8019b3721f50b1ebaa248c7ed78d73c9d408727
- Sigstore transparency entry: 1593284984
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: ozefe/ytcc-pipeline@d4c23f82c6f8925dd520a7bdd50824f44674db93
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ozefe
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d4c23f82c6f8925dd520a7bdd50824f44674db93
- Trigger Event: push
