abstract_pdfs — Document Processing & SEO Pipeline for PDF-Based Content
A structured pipeline for transforming PDFs into **searchable, metadata-rich, web-ready content**, combining OCR, page-level analysis, metadata generation, and static site scaffolding.
Designed for:
- large PDF collections
- SEO-driven content indexing
- document-to-web publishing pipelines
- structured ingestion of unstructured media
🔹 What This System Is
abstract_pdfs is not a PDF utility — it is a full document processing pipeline:
- ingests raw PDFs
- decomposes them into pages, images, and text
- extracts and generates metadata
- enriches content via NLP APIs
- builds structured outputs (JSON + HTML)
- generates navigable web content (galleries + viewers)
The result is a fully browsable, searchable document corpus.
🔹 Pipeline Overview
```text
PDF Input
  ↓
Slice / Decompose (images + text per page)
  ↓
OCR + Text Extraction (layout-aware engines)
  ↓
Metadata Generation
  ├─ summaries
  ├─ keywords
  └─ descriptions
  ↓
Manifest Creation (per-page + per-document)
  ↓
HTML Generation
  ├─ PDF viewer pages
  └─ gallery index pages
  ↓
Static Site Output (SEO-ready)
```
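The stages above can be sketched as a chain of small, composable functions. This is an illustrative skeleton only; the type and function names are assumptions, not the package's actual API:

```python
# Illustrative pipeline skeleton: slice -> enrich -> collect.
# Names and data shapes are assumptions, not the real abstract_pdfs API.
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int
    image: str                    # path to the rendered page image
    text: str = ""                # OCR / extracted text
    metadata: dict = field(default_factory=dict)


def slice_pdf(pdf_path: str) -> list[Page]:
    # Decompose the PDF into one Page per physical page (stubbed here).
    return [Page(number=1, image=f"{pdf_path}/page_0001.png")]


def enrich(page: Page) -> Page:
    # Metadata generation would call OCR / NLP services; stubbed fields here.
    page.metadata = {"summary": "", "keywords": [], "description": ""}
    return page


def run_pipeline(pdf_path: str) -> list[Page]:
    return [enrich(p) for p in slice_pdf(pdf_path)]
```

Because each stage takes and returns plain page objects, any stage can be swapped out (e.g. a different OCR engine) without touching the others.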
🔹 Core Capabilities
Document Decomposition
- Splits PDFs into:
  - page images
  - extracted text
  - structured page directories
- Maintains a consistent directory structure for downstream processing
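One way to picture that consistent structure is a zero-padded directory per page; the exact layout below is an assumption for illustration, not the package's documented convention:

```python
# Hypothetical per-page directory layout: <doc_root>/page_NNNN/
# Zero-padding keeps lexical sort order equal to page order.
from pathlib import Path


def page_dirs(doc_root: str, page_count: int) -> list[Path]:
    root = Path(doc_root)
    return [root / f"page_{n:04d}" for n in range(1, page_count + 1)]


dirs = page_dirs("/srv/media/report", 3)
```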
Metadata & SEO Enrichment
- Generates:
  - summaries
  - keywords
  - descriptions
- Integrates with NLP endpoints for:
  - text analysis
  - keyword refinement
  - summarization
Example: page-level analysis via API calls
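The NLP calls themselves are service-specific, but the kind of per-page keyword extraction involved can be sketched locally with a naive frequency count (illustrative stand-in only, not the package's actual analysis):

```python
# Naive frequency-based keyword extraction, as a stand-in for an NLP endpoint.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```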
Manifest Generation
- Produces structured JSON per page:
  - metadata
  - text
  - image references
  - SEO fields
- Aggregates pages into document-level manifests
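A per-page manifest might look like the following; the field names here are assumptions chosen to match the list above, not the package's exact schema:

```python
# Hypothetical per-page manifest builder; field names are illustrative.
import json


def page_manifest(number: int, text: str, image: str,
                  keywords: list[str]) -> dict:
    return {
        "page": number,
        "text": text,
        "image": image,                       # image reference for this page
        "seo": {
            "keywords": keywords,
            "description": text[:160],        # truncated for meta description
        },
    }


manifest = page_manifest(1, "Introduction to the corpus.", "page_0001.png",
                         ["pdf", "ocr"])
print(json.dumps(manifest, indent=2))
```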
Static Site Generation
- Generates:
  - PDF viewer pages (page-by-page navigation)
  - gallery index pages (directory browsing)
- Automatically builds:
  - thumbnails
  - descriptions
  - keyword tags
Example: dynamic card generation for directories
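A gallery card for a directory entry could be rendered like this; the markup and function name are illustrative assumptions, since the real templates live in the package's HTML generators:

```python
# Hypothetical gallery card renderer; markup shape is an assumption.
from html import escape


def gallery_card(title: str, thumbnail: str, description: str,
                 url: str) -> str:
    # Escape all user-derived strings so PDF-extracted text can't break markup.
    return (
        f'<a class="card" href="{escape(url)}">'
        f'<img src="{escape(thumbnail)}" alt="{escape(title)}">'
        f"<h3>{escape(title)}</h3>"
        f"<p>{escape(description)}</p>"
        f"</a>"
    )
```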
Path ↔ URL Mapping
- Converts filesystem structure into web-accessible URLs
- Maintains consistency between:
  - local storage (`/srv/media/...`)
  - public endpoints (`/pdfs/...`)
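The mapping between those two roots can be sketched as a pure path rewrite; the roots below come from the examples above, but the function itself is an illustrative sketch, not the package's API:

```python
# Sketch of filesystem-path -> public-URL mapping.
# MEDIA_ROOT and URL_PREFIX mirror the example paths above.
from pathlib import PurePosixPath

MEDIA_ROOT = PurePosixPath("/srv/media")
URL_PREFIX = "/pdfs"


def to_url(local_path: str) -> str:
    # Raises ValueError if local_path is not under MEDIA_ROOT,
    # so broken mappings fail loudly instead of producing bad links.
    rel = PurePosixPath(local_path).relative_to(MEDIA_ROOT)
    return f"{URL_PREFIX}/{rel}"
```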
Content Structuring
- Page-level:
  - text
  - summary
  - keywords
- Document-level:
  - aggregated metadata
  - full-text indexing
🔹 Architecture
The system is composed of modular components:
- DocumentPipeline: orchestrates ingestion → processing → output
- SliceManager: handles PDF decomposition and OCR
- Manifest Generators: build structured JSON representations
- HTML Generators: render viewer and gallery pages
- Metadata Utilities: enrich content via external NLP services
Each stage is:
- independent
- composable
- replaceable
🔹 Key Design Decisions
Page-Level First
All processing happens per-page, enabling:
- granular indexing
- targeted metadata
- scalable processing
Structured Over Raw
Outputs are always:
- JSON manifests
- structured metadata
- normalized fields
Not just raw text dumps.
SEO as a First-Class Concern
Every page includes:
- meta tags
- OpenGraph / social metadata
- keyword tagging
- canonical URLs
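A generated `<head>` fragment covering those fields might look like the following; this is a minimal sketch of the idea, not the package's actual template:

```python
# Illustrative SEO <head> fragment generator covering the fields above.
from html import escape


def seo_head(title: str, description: str, keywords: list[str],
             canonical: str) -> str:
    return "\n".join([
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{escape(description)}">',
        f'<meta name="keywords" content="{escape(", ".join(keywords))}">',
        f'<meta property="og:title" content="{escape(title)}">',
        f'<meta property="og:description" content="{escape(description)}">',
        f'<link rel="canonical" href="{escape(canonical)}">',
    ])
```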
Filesystem as Source of Truth
- directory structure = content hierarchy
- no database required
- easily deployable as static site
🔹 Why This Exists
Traditional PDF workflows:
- store documents as opaque blobs
- lack searchability
- lack metadata
- are not web-native
abstract_pdfs transforms PDFs into:
- structured, indexable content
- web-ready assets
- searchable knowledge bases
🔹 Example Use Cases
- PDF → website publishing pipelines
- document archives (research, legal, media)
- SEO-driven content platforms
- knowledge base generation
- preprocessing for LLM / search systems
🔹 Integration Context
This system integrates with:
- OCR pipelines (layout_ocr / abstract_ocr)
- NLP systems (abstract_hugpy)
- static hosting (Nginx / CDN)
- search indexing systems
🔹 Design Philosophy
- Documents are data, not files
- Structure before presentation
- Metadata is as important as content
- Static outputs scale better than dynamic systems
File details
Details for the file abstract_pdfs-0.0.33.tar.gz.
File metadata
- Download URL: abstract_pdfs-0.0.33.tar.gz
- Upload date:
- Size: 54.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | a7dacaef02ca581d5bb8c8288a07aad417d59c2683de3c02b75669920de37500 |
| MD5 | 27027b2c11d725df181064e836c58332 |
| BLAKE2b-256 | 387304fd07ddf341ac246b4e07d65ffdeb48a2f41ef0313d171e76f8ac0a57a0 |
File details
Details for the file abstract_pdfs-0.0.33-py3-none-any.whl.
File metadata
- Download URL: abstract_pdfs-0.0.33-py3-none-any.whl
- Upload date:
- Size: 73.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 42338afd3446a7ee1af51871fe1040520f686e6f07de7a23d42025b8c933075b |
| MD5 | 75a7a88c3e20df388b17b41ac3f84ea9 |
| BLAKE2b-256 | ebf8340e9f6ebb4cbefdaf6a03404d94c21bf1727a2c59ec5c342c6628272ea1 |