A structured pipeline for transforming PDFs into **searchable, metadata-rich, web-ready content**, combining OCR, page-level analysis, metadata generation, and static site scaffolding.

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.11

Project description

abstract_pdfs — Document Processing & SEO Pipeline for PDF-Based Content

Designed for:

- large PDF collections
- SEO-driven content indexing
- document-to-web publishing pipelines
- structured ingestion of unstructured media

🔹 What This System Is

abstract_pdfs is not a PDF utility — it is a full document processing pipeline:

- ingests raw PDFs
- decomposes them into pages, images, and text
- extracts and generates metadata
- enriches content via NLP APIs
- builds structured outputs (JSON + HTML)
- generates navigable web content (galleries + viewers)

The result is a fully browsable, searchable document corpus.

🔹 Pipeline Overview

PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    ├─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    ├─ gallery index pages
    ↓
Static Site Output (SEO-ready)

`abstract_pdfs` diagram

flowchart TD
    A[PDF Input]
    B[DocumentPipeline]
    C[SliceManager\nPage Images + Text + OCR]
    D[Per-Page Assets\nThumbnails / Text / Info JSON]
    E[Manifest Generation\nPage + Document Metadata]
    F[NLP Enrichment\nSummaries + Keywords + Descriptions]
    G[HTML Generation\nViewer Pages + Gallery Indexes]
    H[Static Output\nSearchable / SEO-ready PDF Corpus]

    A --> B --> C --> D --> E --> F --> G --> H

🔹 Core Capabilities

Document Decomposition

- Splits PDFs into:
  - page images
  - extracted text
  - structured page directories
- Maintains a consistent directory structure for downstream processing
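
A consistent per-page layout can be sketched as follows. This is only an illustration: the function `page_paths`, the `page_0001`-style directory names, and the `page.png` / `page.txt` / `info.json` file names are assumptions, not names confirmed by the package.

```python
from pathlib import Path


def page_paths(output_root: Path, pdf_stem: str, page_count: int) -> list[dict[str, Path]]:
    """Return hypothetical per-page asset paths for one PDF."""
    pages = []
    for number in range(1, page_count + 1):
        # Zero-padded names keep directory listings in page order.
        page_dir = output_root / pdf_stem / f"page_{number:04d}"
        pages.append(
            {
                "dir": page_dir,
                "image": page_dir / "page.png",   # rendered page image
                "text": page_dir / "page.txt",    # extracted or OCR text
                "info": page_dir / "info.json",   # per-page metadata
            }
        )
    return pages


layout = page_paths(Path("out"), "report", page_count=3)
for page in layout:
    print(page["dir"])
```

Because every page resolves to the same set of file names, later stages can locate assets without any lookup table.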

Metadata & SEO Enrichment

- Generates:
  - summaries
  - keywords
  - descriptions
- Integrates with NLP endpoints for:
  - text analysis
  - keyword refinement
  - summarization

Example: page-level analysis via API calls
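
A minimal sketch of what a page-level analysis call might look like. The endpoint URL, the request fields (`text`, `tasks`), and the response fields (`summary`, `keywords`) are all hypothetical, since the source does not document the NLP service. The transport is passed in as a function so the sketch runs without a network.

```python
import json
from typing import Callable

# Hypothetical endpoint; the real NLP service URL is not given in the source.
ANALYZE_URL = "https://nlp.example.invalid/analyze"


def analyze_page(text: str, post: Callable[[str, bytes], bytes]) -> dict:
    """Send one page of text for analysis and normalize the reply."""
    payload = json.dumps({"text": text, "tasks": ["summary", "keywords"]}).encode("utf-8")
    reply = json.loads(post(ANALYZE_URL, payload))
    return {
        "summary": reply.get("summary", ""),
        # De-duplicate and sort so repeated runs produce stable metadata.
        "keywords": sorted(set(reply.get("keywords", []))),
    }


def fake_post(url: str, body: bytes) -> bytes:
    """Stand-in transport: echoes the first words back as summary and keywords."""
    words = json.loads(body)["text"].split()
    return json.dumps({"summary": " ".join(words[:5]), "keywords": words[:3]}).encode("utf-8")


analysis = analyze_page("Quarterly revenue grew across all regions this year", fake_post)
print(analysis)
```

In a real deployment, `fake_post` would be replaced by an HTTP client call; keeping it injectable also makes the enrichment step easy to test.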

Manifest Generation

- Produces structured JSON per page:
  - metadata
  - text
  - image references
  - SEO fields
- Aggregates into document-level manifests
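
One possible shape for these manifests is sketched below. The field names (`page`, `text`, `image`, `seo`, `page_count`) are assumptions chosen for illustration; the actual schema is not specified in the source.

```python
import json


def page_manifest(number: int, text: str, image: str, summary: str, keywords: list[str]) -> dict:
    """Build a hypothetical per-page manifest."""
    return {
        "page": number,
        "text": text,
        "image": image,
        "seo": {"description": summary, "keywords": keywords},
    }


def document_manifest(title: str, pages: list[dict]) -> dict:
    """Aggregate page manifests into a document-level manifest."""
    # Collect keywords across pages, keeping first-seen order and dropping repeats.
    keywords = list(dict.fromkeys(k for p in pages for k in p["seo"]["keywords"]))
    return {"title": title, "page_count": len(pages), "keywords": keywords, "pages": pages}


page_entries = [
    page_manifest(1, "Intro text", "page_0001/page.png", "Introduction", ["intro", "scope"]),
    page_manifest(2, "Method text", "page_0002/page.png", "Methods", ["scope", "method"]),
]
manifest = document_manifest("Example report", page_entries)
print(json.dumps(manifest, indent=2))
```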

Static Site Generation

- Generates:
  - PDF viewer pages (page-by-page navigation)
  - gallery index pages (directory browsing)
- Automatically builds:
  - thumbnails
  - descriptions
  - keyword tags

Example: dynamic card generation for directories
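
Card generation for a gallery index could look roughly like this. The function names, CSS class names, and markup are illustrative assumptions, not the package's actual templates.

```python
from html import escape


def render_card(title: str, href: str, thumbnail: str, description: str, keywords: list[str]) -> str:
    """Render one gallery card; all user-supplied text is HTML-escaped."""
    tags = "".join(f'<span class="tag">{escape(k)}</span>' for k in keywords)
    return (
        f'<a class="card" href="{escape(href, quote=True)}">'
        f'<img src="{escape(thumbnail, quote=True)}" alt="{escape(title, quote=True)}">'
        f"<h3>{escape(title)}</h3>"
        f"<p>{escape(description)}</p>"
        f"{tags}</a>"
    )


def render_gallery(cards: list[dict]) -> str:
    """Wrap the rendered cards in a single gallery container."""
    return '<div class="gallery">' + "".join(render_card(**c) for c in cards) + "</div>"


cards = [
    {
        "title": "Q&A report",
        "href": "/pdfs/qa-report/",
        "thumbnail": "/pdfs/qa-report/thumb.png",
        "description": "Questions <and> answers",
        "keywords": ["faq", "support"],
    },
    {
        "title": "Annual review",
        "href": "/pdfs/annual-review/",
        "thumbnail": "/pdfs/annual-review/thumb.png",
        "description": "Year in summary",
        "keywords": ["finance"],
    },
]
gallery_html = render_gallery(cards)
print(gallery_html)
```

Escaping matters here because titles and descriptions come from extracted or generated text, which may contain characters such as `&` or `<`.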

Path ↔ URL Mapping

- Converts filesystem structure into web-accessible URLs
- Maintains consistency between:
  - local storage (/srv/media/...)
  - public endpoints (/pdfs/...)
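
The mapping can be sketched as a pair of inverse functions. This assumes that everything under `/srv/media` is served under `/pdfs` with the same relative path; the source elides the full paths, so the exact correspondence, like the function names, is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, unquote

MEDIA_ROOT = PurePosixPath("/srv/media")  # local storage root (assumed mapping)
URL_PREFIX = "/pdfs"                      # public URL prefix (assumed mapping)


def path_to_url(path: str) -> str:
    """Map a file under MEDIA_ROOT to its public URL path."""
    # relative_to raises ValueError for paths outside MEDIA_ROOT.
    relative = PurePosixPath(path).relative_to(MEDIA_ROOT)
    return URL_PREFIX + "/" + quote(relative.as_posix())


def url_to_path(url: str) -> str:
    """Map a public URL path back to the file under MEDIA_ROOT."""
    if not url.startswith(URL_PREFIX + "/"):
        raise ValueError(f"{url!r} is not under {URL_PREFIX}")
    relative = unquote(url[len(URL_PREFIX) + 1:])
    return str(MEDIA_ROOT / relative)


local = "/srv/media/annual report/page_0001/page.png"
public = path_to_url(local)
print(public)
print(url_to_path(public))
```

Percent-encoding the relative path keeps file names with spaces usable in links, and decoding on the way back makes the two functions round-trip.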

Content Structuring

- Page-level:
  - text
  - summary
  - keywords
- Document-level:
  - aggregated metadata
  - full-text indexing
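
As an illustration of how page-level text can feed document-level full-text indexing, here is a small inverted index. The source does not say how indexing is implemented, so this is a generic technique shown under that assumption, with hypothetical names.

```python
import re
from collections import defaultdict


def build_index(page_texts: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercase token to the sorted page numbers that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for number, text in page_texts.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(number)
    return {token: sorted(numbers) for token, numbers in index.items()}


page_texts = {
    1: "Scanned invoices and receipts",
    2: "Receipts for Q3",
    3: "Invoices: summary",
}
text_index = build_index(page_texts)
print(text_index["receipts"])
```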

🔹 Architecture

The system is composed of modular components:

- DocumentPipeline
  - orchestrates ingestion → processing → output
- SliceManager
  - handles PDF decomposition and OCR
- Manifest Generators
  - build structured JSON representations
- HTML Generators
  - render viewer and gallery pages
- Metadata Utilities
  - enrich content via external NLP services

Each stage is:

- independent
- composable
- replaceable
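
The idea of independent, composable, replaceable stages can be sketched with plain functions. This does not reflect the real `DocumentPipeline` interface, which the source does not document; `run_pipeline` and the stage functions below are hypothetical stand-ins.

```python
from typing import Callable

# A stage takes the document state and returns a new, extended state.
Stage = Callable[[dict], dict]


def run_pipeline(document: dict, stages: list[Stage]) -> dict:
    """Apply each stage in order; stages only communicate through the dict."""
    for stage in stages:
        document = stage(document)
    return document


def slice_pages(doc: dict) -> dict:
    # Placeholder for decomposition: split raw text on form feeds.
    return {**doc, "pages": doc["raw"].split("\f")}


def add_keywords(doc: dict) -> dict:
    # Placeholder for enrichment: use each page's first word as a keyword.
    return {**doc, "keywords": [p.split()[0].lower() for p in doc["pages"] if p.split()]}


def count_pages(doc: dict) -> dict:
    return {**doc, "page_count": len(doc["pages"])}


source = {"raw": "Alpha one\fBeta two"}
processed = run_pipeline(source, [slice_pages, add_keywords, count_pages])
without_nlp = run_pipeline(source, [slice_pages, count_pages])  # enrichment stage dropped
print(processed)
```

Dropping or swapping a stage only changes the list passed to `run_pipeline`, which is the property the architecture aims for.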

🔹 Key Design Decisions

Page-Level First

All processing happens per-page, enabling:

- granular indexing
- targeted metadata
- scalable processing

Structured Over Raw

Outputs are always:

- JSON manifests
- structured metadata
- normalized fields

Not just raw text dumps.

SEO as a First-Class Concern

Every page includes:

- meta tags
- OpenGraph / social metadata
- keyword tagging
- canonical URLs
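
A minimal sketch of the head tags such a page might carry. The function name and the exact set of tags are assumptions; the `og:*` properties and `rel="canonical"` are standard, but the package's real templates are not shown in the source.

```python
from html import escape


def seo_head(title: str, description: str, keywords: list[str], canonical_url: str) -> str:
    """Render title, meta, OpenGraph, and canonical tags for one page."""

    def attr(value: str) -> str:
        # Escape quotes as well, since the value lands inside an attribute.
        return escape(value, quote=True)

    keyword_list = ", ".join(keywords)
    lines = [
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{attr(description)}">',
        f'<meta name="keywords" content="{attr(keyword_list)}">',
        f'<meta property="og:title" content="{attr(title)}">',
        f'<meta property="og:description" content="{attr(description)}">',
        f'<meta property="og:url" content="{attr(canonical_url)}">',
        f'<link rel="canonical" href="{attr(canonical_url)}">',
    ]
    return "\n".join(lines)


head = seo_head(
    title='Page 1 of "Report"',
    description="Opening summary of the report",
    keywords=["report", "summary"],
    canonical_url="https://example.invalid/pdfs/report/page_0001/",
)
print(head)
```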

Filesystem as Source of Truth

- directory structure = content hierarchy
- no database required
- easily deployable as a static site
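
Treating the directory tree as the content hierarchy means the site structure can be derived by walking the filesystem. The sketch below assumes a simple layout of nested folders containing `.pdf` files; `build_tree` and its output fields are hypothetical.

```python
import tempfile
from pathlib import Path


def build_tree(root: Path) -> dict:
    """Describe a directory as a nested hierarchy of folders and PDF names."""
    entries = sorted(root.iterdir())
    return {
        "name": root.name,
        "pdfs": [p.name for p in entries if p.is_file() and p.suffix.lower() == ".pdf"],
        "children": [build_tree(p) for p in entries if p.is_dir()],
    }


# Build a throwaway directory tree so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp) / "library"
    (library / "legal").mkdir(parents=True)
    (library / "research").mkdir()
    (library / "legal" / "contract.pdf").touch()
    (library / "overview.pdf").touch()
    tree = build_tree(library)

print(tree)
```

No database is involved: adding a folder or a PDF changes the hierarchy the next time the tree is built.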

🔹 Why This Exists

Traditional PDF workflows:

- store documents as opaque blobs
- lack searchability
- lack metadata
- are not web-native

abstract_pdfs transforms PDFs into:

- structured, indexable content
- web-ready assets
- searchable knowledge bases

🔹 Example Use Cases

- PDF → website publishing pipelines
- document archives (research, legal, media)
- SEO-driven content platforms
- knowledge base generation
- preprocessing for LLM / search systems

🔹 Integration Context

This system integrates with:

- OCR pipelines (layout_ocr / abstract_ocr)
- NLP systems (abstract_hugpy)
- static hosting (Nginx / CDN)
- search indexing systems

🔹 Design Philosophy

- Documents are data, not files
- Structure before presentation
- Metadata is as important as content
- Static outputs scale better than dynamic systems

Release history

- 0.0.39: Jun 25, 2026
- 0.0.38: Jun 25, 2026
- 0.0.37: Jun 6, 2026
- 0.0.36: Jun 6, 2026
- 0.0.35: Jun 6, 2026
- 0.0.34: Jun 6, 2026 (this version)
- 0.0.33: Apr 6, 2026
- 0.0.32: Mar 28, 2026
- 0.0.31: Mar 28, 2026
- 0.0.30: Mar 28, 2026
- 0.0.29: Mar 28, 2026
- 0.0.28: Mar 28, 2026
- 0.0.27: Mar 28, 2026
- 0.0.26: Mar 28, 2026
- 0.0.25: Mar 28, 2026
- 0.0.24: Mar 28, 2026
- 0.0.23: Mar 28, 2026
- 0.0.22: Mar 28, 2026
- 0.0.21: Mar 17, 2026
- 0.0.20: Mar 17, 2026
- 0.0.19: Mar 17, 2026
- 0.0.18: Mar 16, 2026
- 0.0.17: Mar 15, 2026
- 0.0.16: Mar 15, 2026
- 0.0.15: Mar 15, 2026
- 0.0.14: Mar 15, 2026
- 0.0.13: Mar 15, 2026
- 0.0.12: Mar 15, 2026
- 0.0.11: Mar 15, 2026
- 0.0.10: Mar 15, 2026
- 0.0.9: Mar 15, 2026
- 0.0.8: Mar 12, 2026
- 0.0.7: Mar 11, 2026
- 0.0.6: Mar 11, 2026
- 0.0.5: Mar 10, 2026
- 0.0.4: Oct 21, 2025
- 0.0.3: Oct 21, 2025
- 0.0.2: Oct 21, 2025
- 0.0.1: Oct 21, 2025

Download files

Source Distribution

abstract_pdfs-0.0.34.tar.gz (4.6 kB)

Uploaded Jun 6, 2026 Source

Built Distribution

abstract_pdfs-0.0.34-py3-none-any.whl (3.4 kB)

Uploaded Jun 6, 2026 Python 3

File details

Details for the file abstract_pdfs-0.0.34.tar.gz.

File metadata

Download URL: abstract_pdfs-0.0.34.tar.gz
Upload date: Jun 6, 2026
Size: 4.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for abstract_pdfs-0.0.34.tar.gz
Algorithm	Hash digest
SHA256	`f3dada4319dddcc89f7a0e10267729b90c4b3c4e7b3d918b83898b6b61f27937`
MD5	`5d38ea62b0f2cb2d6e59032aaac1d2a1`
BLAKE2b-256	`bd6fcc4fc854446befa677d2b97d3e7488f76b9e99bdc606dec606830479fb9f`

File details

Details for the file abstract_pdfs-0.0.34-py3-none-any.whl.

File metadata

Download URL: abstract_pdfs-0.0.34-py3-none-any.whl
Upload date: Jun 6, 2026
Size: 3.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for abstract_pdfs-0.0.34-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c14d64b35c67c6881ad7827767c10a8545ebb52ad9e392ab0e80fc437e8b5ef6`
MD5	`2626ddcc3ef492296bdf33e73ea1962f`
BLAKE2b-256	`2de0332e5bd60e8ebba783be744994d77e01ee71f7f1cc93e7d660b0eb280964`
