Part of the Abstract Intelligence Platform

This module is part of a unified system for transforming raw media into structured, searchable, and SEO-optimized data.

abstract_pdfs handles document ingestion and publishing:

  • PDF → structured pages (text + images)
  • metadata + manifest generation
  • static HTML output (viewer + gallery)

Full system: https://github.com/AbstractEndeavors/abstract-intelligence


abstract_pdfs — Document Processing & SEO Pipeline for PDF-Based Content

A structured pipeline for transforming PDFs into searchable, metadata-rich, web-ready content, combining OCR, page-level analysis, metadata generation, and static site scaffolding.

Designed for:

  • large PDF collections
  • SEO-driven content indexing
  • document-to-web publishing pipelines
  • structured ingestion of unstructured media

🔹 What This System Is

abstract_pdfs is not just a PDF utility; it is a full document processing pipeline that:

  • ingests raw PDFs
  • decomposes them into pages, images, and text
  • extracts and generates metadata
  • enriches content via NLP APIs
  • builds structured outputs (JSON + HTML)
  • generates navigable web content (galleries + viewers)

The result is a fully browsable, searchable document corpus.


🔹 Pipeline Overview

PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    └─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    └─ gallery index pages
    ↓
Static Site Output (SEO-ready)

Component view (Mermaid flowchart source):

flowchart TD
    A[PDF Input]
    B[DocumentPipeline]
    C[SliceManager\nPage Images + Text + OCR]
    D[Per-Page Assets\nThumbnails / Text / Info JSON]
    E[Manifest Generation\nPage + Document Metadata]
    F[NLP Enrichment\nSummaries + Keywords + Descriptions]
    G[HTML Generation\nViewer Pages + Gallery Indexes]
    H[Static Output\nSearchable / SEO-ready PDF Corpus]

    A --> B --> C --> D --> E --> F --> G --> H

🔹 Core Capabilities

Document Decomposition

  • Splits PDFs into:

    • page images
    • extracted text
    • structured page directories
  • Maintains consistent directory structure for downstream processing
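
One way to picture the "structured page directories" above is a fixed per-page layout. The sketch below is illustrative only: `PageLayout`, `page_layout`, the `page_NNNN` directory naming, and the file names (`page.png`, `text.txt`, `info.json`) are assumptions made for this example, not the layout abstract_pdfs actually writes.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PageLayout:
    """Hypothetical on-disk layout for one page of one document."""
    root: Path

    @property
    def image(self) -> Path:
        return self.root / "page.png"

    @property
    def text(self) -> Path:
        return self.root / "text.txt"

    @property
    def info(self) -> Path:
        return self.root / "info.json"


def page_layout(doc_dir: Path, page_number: int) -> PageLayout:
    """Return the layout for a 1-based page number.

    The number is zero-padded so directory listings sort in page order.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return PageLayout(doc_dir / f"page_{page_number:04d}")
```

Because every page resolves to the same set of paths, downstream stages can locate a page's image, text, and info file without scanning the directory.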


Metadata & SEO Enrichment

  • Generates:

    • summaries
    • keywords
    • descriptions
  • Integrates with NLP endpoints for:

    • text analysis
    • keyword refinement
    • summarization

Example: page-level analysis via API calls
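
The package's actual NLP client is not shown in this README, so the following is only a sketch of what page-level analysis might look like. `analyze_page`, the `Analyzer` type, and the local `naive_keywords` stand-in are hypothetical names; in a real deployment the analyzer callable would presumably wrap a request to whichever NLP endpoint is configured.

```python
import re
from collections import Counter
from typing import Callable, List

# An analyzer takes page text and returns keywords. In practice this
# would wrap an HTTP call to an NLP service (assumption, not shown here).
Analyzer = Callable[[str], List[str]]

STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}


def naive_keywords(text: str, limit: int = 5) -> List[str]:
    """Local stand-in: the most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def analyze_page(text: str, analyzer: Analyzer = naive_keywords) -> dict:
    """Build the metadata record for a single page."""
    # Crude summary: the first sentence, truncated to 160 characters.
    summary = text.strip().split(". ")[0][:160]
    return {"summary": summary, "keywords": analyzer(text)}
```

Passing the analyzer as a callable keeps the page loop independent of any particular service, so a remote client can be swapped in without touching the rest of the pipeline.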


Manifest Generation

  • Produces structured JSON per page:

    • metadata
    • text
    • image references
    • SEO fields
  • Aggregates into document-level manifests
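
A minimal sketch of the page-to-document aggregation described above. The field names (`page`, `text`, `image`, `seo`) and the helper names are assumptions chosen for illustration; the real manifest schema may differ.

```python
import json
from pathlib import Path
from typing import List, Optional


def page_manifest(page_number: int, text: str, image: str,
                  summary: str = "",
                  keywords: Optional[List[str]] = None) -> dict:
    """Structured record for one page (field names are illustrative)."""
    return {
        "page": page_number,
        "text": text,
        "image": image,
        "seo": {"description": summary, "keywords": keywords or []},
    }


def document_manifest(title: str, pages: List[dict]) -> dict:
    """Aggregate page records into a document-level manifest."""
    ordered = sorted(pages, key=lambda p: p["page"])
    # Document keywords are the de-duplicated union of page keywords.
    keywords = sorted({kw for p in ordered for kw in p["seo"]["keywords"]})
    return {
        "title": title,
        "page_count": len(ordered),
        "keywords": keywords,
        "pages": ordered,
    }


def write_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest as UTF-8 JSON."""
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False),
                    encoding="utf-8")
```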


Static Site Generation

  • Generates:

    • PDF viewer pages (page-by-page navigation)
    • gallery index pages (directory browsing)
  • Automatically builds:

    • thumbnails
    • descriptions
    • keyword tags

Example: dynamic card generation for directories
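
The generator itself is not reproduced here; the sketch below shows one plausible shape for it. `render_card`, `render_gallery`, and the CSS class names are hypothetical. The one point worth keeping from it is that every dynamic value is HTML-escaped before it reaches the page.

```python
from html import escape
from typing import List, Sequence


def render_card(title: str, href: str, thumbnail: str,
                description: str = "",
                keywords: Sequence[str] = ()) -> str:
    """Render one gallery card; all dynamic values are HTML-escaped."""
    tags = "".join(f'<span class="tag">{escape(k)}</span>' for k in keywords)
    return (
        f'<a class="card" href="{escape(href)}">'
        f'<img src="{escape(thumbnail)}" alt="{escape(title)}" loading="lazy">'
        f"<h3>{escape(title)}</h3><p>{escape(description)}</p>{tags}</a>"
    )


def render_gallery(cards: List[dict]) -> str:
    """Join cards into a single gallery container."""
    body = "".join(render_card(**card) for card in cards)
    return f'<div class="gallery">{body}</div>'
```

Each card dictionary would typically be filled from a directory's manifest, so the gallery stays in sync with whatever is on disk.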


Path ↔ URL Mapping

  • Converts filesystem structure into web-accessible URLs

  • Maintains consistency between:

    • local storage (/srv/media/...)
    • public endpoints (/pdfs/...)
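
A sketch of the two-way mapping, assuming the media root is `/srv/media/pdfs` and the public prefix is `/pdfs`. Both values are guesses based on the truncated examples above, and `path_to_url` / `url_to_path` are hypothetical helper names.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, unquote

# Assumed roots; adjust to the real deployment.
MEDIA_ROOT = PurePosixPath("/srv/media/pdfs")
URL_PREFIX = "/pdfs"


def path_to_url(path: str) -> str:
    """Map a file under MEDIA_ROOT to its public URL path."""
    # relative_to raises ValueError if the path is outside the root.
    rel = PurePosixPath(path).relative_to(MEDIA_ROOT)
    return f"{URL_PREFIX}/{quote(rel.as_posix())}"


def url_to_path(url: str) -> str:
    """Map a public URL path back to the file under MEDIA_ROOT."""
    prefix = URL_PREFIX + "/"
    if not url.startswith(prefix):
        raise ValueError(f"{url!r} is not under {prefix}")
    rel = PurePosixPath(unquote(url[len(prefix):]))
    # Refuse anything that would escape the media root.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"{url!r} escapes the media root")
    return str(MEDIA_ROOT / rel)
```

Percent-encoding on the way out and decoding on the way back keeps the mapping reversible for file names that contain spaces or other reserved characters.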

Content Structuring

  • Page-level:

    • text
    • summary
    • keywords
  • Document-level:

    • aggregated metadata
    • full-text indexing
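
To make "full-text indexing" concrete, here is a small inverted index built from per-page text. It is a generic technique shown under stated assumptions, not the indexing abstract_pdfs performs; `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_index(pages: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each lowercase token to the sorted page numbers containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for number, text in pages.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(number)
    return {token: sorted(numbers) for token, numbers in index.items()}


def search(index: Dict[str, List[int]], query: str) -> List[int]:
    """Return pages that contain every query term (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set(index.get(terms[0], []))
    for term in terms[1:]:
        hits &= set(index.get(term, []))
    return sorted(hits)
```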

🔹 Architecture

The system is composed of modular components:

  • DocumentPipeline

    • orchestrates ingestion → processing → output
  • SliceManager

    • handles PDF decomposition and OCR
  • Manifest Generators

    • build structured JSON representations
  • HTML Generators

    • render viewer and gallery pages
  • Metadata Utilities

    • enrich content via external NLP services

Each stage is:

  • independent
  • composable
  • replaceable
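
The "independent, composable, replaceable" property can be sketched as stages that share one call signature. Everything below is a stand-in: the function bodies are toy placeholders, and `Stage`, `run_pipeline`, `slice_pdf`, `enrich`, and `build_manifest` are hypothetical names rather than the package's classes.

```python
from typing import Callable, Iterable

# A stage takes a context dict and returns an updated copy.
Stage = Callable[[dict], dict]


def run_pipeline(context: dict, stages: Iterable[Stage]) -> dict:
    """Apply each stage in order, threading the context through."""
    for stage in stages:
        context = stage(context)
    return context


def slice_pdf(ctx: dict) -> dict:
    # Placeholder for decomposition + OCR.
    return {**ctx, "pages": [{"page": 1, "text": "hello world"}]}


def enrich(ctx: dict) -> dict:
    # Placeholder for NLP enrichment.
    pages = [{**p, "keywords": p["text"].split()} for p in ctx["pages"]]
    return {**ctx, "pages": pages}


def build_manifest(ctx: dict) -> dict:
    # Placeholder for manifest generation.
    manifest = {"source": ctx["source"], "page_count": len(ctx["pages"])}
    return {**ctx, "manifest": manifest}
```

Dropping or swapping a stage is then a change to the list passed to `run_pipeline`, for example `run_pipeline({"source": "a.pdf"}, [slice_pdf, build_manifest])` to skip enrichment.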

🔹 Key Design Decisions

Page-Level First

All processing happens per-page, enabling:

  • granular indexing
  • targeted metadata
  • scalable processing

Structured Over Raw

Outputs are always structured, never just raw text dumps:

  • JSON manifests
  • structured metadata
  • normalized fields


SEO as a First-Class Concern

Every page includes:

  • meta tags
  • OpenGraph / social metadata
  • keyword tagging
  • canonical URLs
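
As an illustration of the fields listed above, the sketch below renders a page's head tags from its metadata. `render_head` is a hypothetical helper; the tag set (description, keywords, canonical link, and the `og:` properties) follows common HTML and Open Graph usage and is not taken from the package's templates.

```python
from html import escape
from typing import List


def render_head(title: str, description: str, canonical_url: str,
                keywords: List[str], image_url: str = "") -> str:
    """Return head tags for one page; every value is escaped."""
    joined = ", ".join(keywords)
    tags = [
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{escape(description)}">',
        f'<meta name="keywords" content="{escape(joined)}">',
        f'<link rel="canonical" href="{escape(canonical_url)}">',
        f'<meta property="og:title" content="{escape(title)}">',
        f'<meta property="og:description" content="{escape(description)}">',
        f'<meta property="og:url" content="{escape(canonical_url)}">',
    ]
    # og:image is optional; emit it only when a thumbnail URL is known.
    if image_url:
        tags.append(f'<meta property="og:image" content="{escape(image_url)}">')
    return "\n".join(tags)
```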

Filesystem as Source of Truth

  • directory structure = content hierarchy
  • no database required
  • easily deployable as static site
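
Treating the directory tree as the content hierarchy means the site structure can be derived by walking the filesystem. A minimal sketch, assuming PDFs are identified by a `.pdf` suffix and that `build_tree` (a hypothetical name) is pointed at the collection root:

```python
from pathlib import Path


def build_tree(root: Path) -> dict:
    """Describe a directory recursively.

    Sub-directories become child nodes and PDF files become documents;
    entries are sorted by name so the output is deterministic.
    """
    entries = sorted(root.iterdir(), key=lambda p: p.name)
    return {
        "name": root.name,
        "documents": [p.name for p in entries
                      if p.is_file() and p.suffix.lower() == ".pdf"],
        "children": [build_tree(p) for p in entries if p.is_dir()],
    }
```

Since the tree is recomputed from disk, adding or moving a PDF changes the hierarchy on the next build with no database migration involved.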

🔹 Why This Exists

Traditional PDF workflows:

  • store documents as opaque blobs
  • lack searchability
  • lack metadata
  • are not web-native

abstract_pdfs transforms PDFs into:

  • structured, indexable content
  • web-ready assets
  • searchable knowledge bases

🔹 Example Use Cases

  • PDF → website publishing pipelines
  • document archives (research, legal, media)
  • SEO-driven content platforms
  • knowledge base generation
  • preprocessing for LLM / search systems

🔹 Integration Context

This system integrates with:

  • OCR pipelines (layout_ocr / abstract_ocr)
  • NLP systems (abstract_hugpy)
  • static hosting (Nginx / CDN)
  • search indexing systems

🔹 Design Philosophy

  • Documents are data, not files
  • Structure before presentation
  • Metadata is as important as content
  • Static outputs scale better than dynamic systems
