Abstract Intelligence Platform — a unified, layered pipeline that turns raw media (PDFs, images, video) into structured, searchable, SEO-ready data.

Project description

media_intelligence — Abstract Intelligence Platform

A unified, layered facade that turns raw media (PDFs, images, and video) into structured, searchable, SEO-ready data. It does not reimplement any engine. Instead, it selects the best function from each sibling package and exposes it behind one clean, lazily loaded API, plus an orchestrated pipeline.

Raw Media (PDF / Image / Video / URL)
   │
   ▼
ingest     → extract → structure → enrich  → persist   → publish
(webtools)   (ocr/     (typed      (hugpy)   (FS / DB)   (react/
              pdfs/     metadata)                         nginx)
              videos)

Layers → canonical owners

| Layer      | Owner package                    | What it does                                  |
| ingest     | abstract_webtools                | scrape pages, download video (yt-dlp/ffmpeg)  |
| ocr        | abstract_ocr                     | layout-aware, multi-engine OCR                |
| documents  | abstract_pdfs                    | PDF decomposition + manifests + HTML          |
| video      | abstract_videos                  | registry pipeline: download/frames/transcribe |
| transcribe | hugpy (→ abstract_ocr fallback)  | Whisper speech-to-text                        |
| enrich     | hugpy                            | summaries, keywords, vision captioning, SEO   |
| persist    | filesystem (DB-pluggable)        | typed JSON/JSONB manifests                    |
| publish    | abstract_react + abstract_nginx  | SEO/OG metadata + static HTML                 |

Overlapping capabilities are resolved to one owner (Whisper → hugpy; video download → webtools; summarize/keywords → hugpy).
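
One way to picture "one owner per capability" is a small priority table, sketched below. This is an illustration only: the OWNERS dictionary, the capability names, and resolve_owner are hypothetical and are not part of the media_intelligence API. The package names are taken from the table above.

```python
# Hypothetical priority table: the first entry is the canonical owner,
# later entries are fallbacks (only "transcribe" lists one above).
OWNERS = {
    "video_download": ["abstract_webtools"],
    "transcribe": ["hugpy", "abstract_ocr"],
    "summarize": ["hugpy"],
    "keywords": ["hugpy"],
}


def resolve_owner(capability, installed):
    """Return the first owner of `capability` that appears in `installed`."""
    for package in OWNERS[capability]:
        if package in installed:
            return package
    raise LookupError(f"no installed owner for {capability!r}")


print(resolve_owner("transcribe", {"hugpy", "abstract_ocr"}))  # hugpy
print(resolve_owner("transcribe", {"abstract_ocr"}))           # abstract_ocr
```

Keeping the resolution in one table means a caller never has to choose between two packages that both offer the same function.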

Install

media_intelligence is only the facade in src/; it contains none of the engines. Each layer's owner is its own PyPI package, declared as an optional extra, so you install only what you use:

pip install media_intelligence              # zero third-party deps — facade only
pip install "media_intelligence[ocr,enrich]"  # just those layers
pip install "media_intelligence[all]"       # the full platform

The package has no required third-party dependencies: importing it is cheap (~20 ms) and pulls none of the backing packages. Each sibling is imported lazily, only when its layer is actually called; a missing one raises a clear MissingDependency naming the extra to install.
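
The lazy-import behaviour described above can be sketched with importlib. This is a minimal illustration under stated assumptions, not the package's actual code: LazyLayer is a hypothetical helper, and MissingDependency here is a local stand-in for the error the package raises (its real base class and message are not documented in this README).

```python
import importlib


class MissingDependency(ImportError):
    """Local stand-in for the error described above (assumed shape)."""


class LazyLayer:
    """Defers importing a backing module until a layer function is used."""

    def __init__(self, layer, module_name, extra):
        self._layer = layer
        self._module_name = module_name
        self._extra = extra
        self._module = None  # nothing imported yet

    def _load(self):
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ModuleNotFoundError as exc:
                raise MissingDependency(
                    f"layer '{self._layer}' needs {self._module_name}; install "
                    f'it with: pip install "media_intelligence[{self._extra}]"'
                ) from exc
        return self._module

    def __getattr__(self, name):
        # Reached only for names not set in __init__, i.e. layer functions.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._load(), name)


# The stdlib json module stands in for a sibling package here.
demo = LazyLayer("demo", "json", "demo")
print(demo.dumps({"ok": True}))  # {"ok": true}
```

Constructing a LazyLayer costs nothing; the import happens on the first attribute access, which is also where a missing package surfaces as MissingDependency.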

Check which layers are usable in the current environment without importing any of the backing packages:

import media_intelligence as mi
mi.available()            # {'ingest': True, 'ocr': True, 'publish': False, ...}
mi.available("enrich")    # True / False
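
A check like this can be done without importing anything by asking the import system whether a module can be located. The sketch below shows that idea with importlib.util.find_spec. The LAYER_MODULES mapping is an assumption (it presumes each sibling's import name matches its package name), and this available() is a standalone illustration, not the library's implementation.

```python
import importlib.util

# Assumed layer -> import-name mapping; the real package may differ.
LAYER_MODULES = {
    "ingest": "abstract_webtools",
    "ocr": "abstract_ocr",
    "documents": "abstract_pdfs",
    "video": "abstract_videos",
    "enrich": "hugpy",
}


def available(layer=None, modules=LAYER_MODULES):
    """Report which layers could be imported, without importing them."""

    def found(module_name):
        # For a top-level name, find_spec locates the module but does not run it.
        return importlib.util.find_spec(module_name) is not None

    if layer is not None:
        return found(modules[layer])
    return {name: found(module) for name, module in modules.items()}


# Demo with a stdlib module and a name that should not exist.
print(available(modules={"present": "json", "absent": "no_such_package_xyz_123"}))
```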

Usage

Direct namespace access

import media_intelligence as mi

text = mi.ocr.image_to_text("page.png")
kw   = mi.enrich.keywords(text)
mi.documents.process_pdf("doc.pdf")
mi.ingest.download_video("https://site.com/v.mp4", download_directory="/data")

Orchestrated pipeline (idempotent + resumable)

from media_intelligence import MediaPipeline

pipe = MediaPipeline("https://site.com/video.mp4", out_root="/data")
pipe.ingest().extract().structure().enrich().persist().publish()
print(pipe.report.summary)
#   ... or simply:
pipe.run()

The pipeline autodetects media kind, dispatches each stage accordingly, skips stages already satisfied (idempotent), and rehydrates from a prior manifest on re-run (resumable). Results land in out_root/<media_id>/manifest.json.
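
To make "idempotent" and "resumable" concrete, here is a toy pipeline that records finished stages in manifest.json and skips them on the next run. It is a sketch under assumptions: SketchPipeline, its "done" list, and the stage() method are hypothetical and say nothing about how MediaPipeline stores its state internally.

```python
import json
import tempfile
from pathlib import Path


class SketchPipeline:
    """Toy pipeline: each finished stage is recorded in manifest.json."""

    STAGES = ("ingest", "extract", "structure", "enrich", "persist", "publish")

    def __init__(self, media_id, out_root):
        self.path = Path(out_root) / media_id / "manifest.json"
        # Resumable: rehydrate state from a prior manifest if one exists.
        if self.path.exists():
            self.manifest = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.manifest = {"media_id": media_id, "done": []}
        self.ran = []  # stages actually executed by this object

    def stage(self, name):
        # Idempotent: a stage already recorded as done is skipped.
        if name not in self.manifest["done"]:
            self.ran.append(name)  # real work would happen here
            self.manifest["done"].append(name)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.manifest), encoding="utf-8")
        return self  # returning self allows chaining

    def run(self, stages=None):
        for name in stages or self.STAGES:
            self.stage(name)
        return self


with tempfile.TemporaryDirectory() as root:
    first = SketchPipeline("vid1", root).run(["ingest", "extract"])
    second = SketchPipeline("vid1", root).run()  # new object, same folder
    print(first.ran)   # ['ingest', 'extract']
    print(second.ran)  # ['structure', 'enrich', 'persist', 'publish']
```

Writing the manifest after every stage is what lets an interrupted run pick up where it stopped instead of starting over.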

Persistence (DB-pluggable, two records)

Each item is persisted as two records so indexing stays cheap while aggregation stays simple:

  • manifest.json: the lean index, holding ids, counts, text_chars, summary, keywords, SEO, and asset pointers. (The JSONB metadata row.)
  • document.json: the canonical content, holding full text, pages/segments, and transcript. The single source of truth for search / aggregation / LLM datasets: one read per item, no re-stitching of per-owner on-disk files.

store = mi.persist.FileStore("/data")
store.save_manifest(item.media_id, manifest)   # lean index
store.save_document(item.media_id, document)   # full body
doc = store.load_document(item.media_id)        # aggregation reads this

# later, identical interface, JSONB backend:
# store = mi.persist.PgStore(dsn=...)   # planned (abstract_database)
#   -> metadata in JSONB, body text in a full-text-indexed column

MediaPipeline.persist() writes both records. On re-run, the body is rehydrated from document.json, so the extract and enrich stages are skipped and nothing is re-OCRed or re-transcribed.
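
The two-record layout can be sketched with plain JSON files. Everything below is illustrative: SketchFileStore and split_records are hypothetical names, the field set is a small subset of what the manifest is described as holding, and the on-disk layout (root/media_id/manifest.json and document.json) is assumed from the paths mentioned in this README.

```python
import json
import tempfile
from pathlib import Path


class SketchFileStore:
    """Toy store: <root>/<media_id>/manifest.json and document.json."""

    def __init__(self, root):
        self.root = Path(root)

    def _write(self, media_id, name, payload):
        target = self.root / media_id / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")

    def _read(self, media_id, name):
        return json.loads((self.root / media_id / name).read_text(encoding="utf-8"))

    def save_manifest(self, media_id, manifest):
        self._write(media_id, "manifest.json", manifest)

    def save_document(self, media_id, document):
        self._write(media_id, "document.json", document)

    def load_manifest(self, media_id):
        return self._read(media_id, "manifest.json")

    def load_document(self, media_id):
        return self._read(media_id, "document.json")


def split_records(media_id, text, summary, keywords):
    """Build the lean index and the full body from one extraction result."""
    manifest = {"media_id": media_id, "text_chars": len(text),
                "summary": summary, "keywords": keywords}
    document = {"media_id": media_id, "text": text}
    return manifest, document


with tempfile.TemporaryDirectory() as root:
    store = SketchFileStore(root)
    manifest, document = split_records("a1", "hello world", "greeting", ["hello"])
    store.save_manifest("a1", manifest)
    store.save_document("a1", document)
    print(store.load_manifest("a1")["text_chars"])  # 11
```

The manifest carries only the character count, so listing or filtering many items never loads the full text; the body is read only when a consumer asks for the document.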

Download files

Source Distribution

media_intelligence-0.1.0.tar.gz (22.7 kB view details)

Uploaded Source

Built Distribution

media_intelligence-0.1.0-py3-none-any.whl (26.9 kB view details)

Uploaded Python 3

File details

Details for the file media_intelligence-0.1.0.tar.gz.

File metadata

  • Download URL: media_intelligence-0.1.0.tar.gz
  • Size: 22.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for media_intelligence-0.1.0.tar.gz
| Algorithm   | Hash digest                                                      |
| SHA256      | e704f532f1a6f17dac758a7cb7754be822f44a919f6910f5f6b637a82dadac82 |
| MD5         | c0ec221a68e54f79ae8b96d29a5bf96a                                 |
| BLAKE2b-256 | d1213d8f9726babfae5dd8ea66e55c5c9c8b48a7d7509f02c7bcb282722d064e |

File details

Details for the file media_intelligence-0.1.0-py3-none-any.whl.

File hashes

Hashes for media_intelligence-0.1.0-py3-none-any.whl
| Algorithm   | Hash digest                                                      |
| SHA256      | 47334e10963aa29d7ece891bbd24d61ef3981ecaa7c21d048984ae8d37b7f6b6 |
| MD5         | 74f89b31ca3d5d2510c6ab3218872e12                                 |
| BLAKE2b-256 | 789db9b27df2d3c9ff058dc344ff8fa6d6a4816d94d5e3b9d24e666d415e21d0 |
