Abstract Intelligence Platform — a unified, layered pipeline that turns raw media (PDFs, images, video) into structured, searchable, SEO-ready data.

media_intelligence — Abstract Intelligence Platform

A unified, layered facade that turns raw media — PDFs, images, and video — into structured, searchable, SEO-ready data. It does not reimplement any engine: it selects the best function of each sibling package and exposes it behind one clean, lazy API, plus an orchestrated pipeline.

Raw Media (PDF / Image / Video / URL)
   │
   ▼
ingest  → extract → structure → enrich → persist → publish
(webtools) (ocr/    (typed     (hugpy)  (FS / DB) (react/
            pdfs/    metadata)                      nginx)
            videos)

Layers → canonical owners

Layer       Owner package                     What it does
ingest      abstract_webtools                 scrape pages, download video (yt-dlp/ffmpeg)
ocr         abstract_ocr                      layout-aware, multi-engine OCR
documents   abstract_pdfs                     PDF decomposition + manifests + HTML
video       abstract_videos                   registry pipeline: download/frames/transcribe
transcribe  hugpy (→ abstract_ocr fallback)   Whisper speech-to-text
enrich      hugpy                             summaries, keywords, vision captioning, SEO
persist     filesystem (DB-pluggable)         typed JSON/JSONB manifests
publish     abstract_react + abstract_nginx   SEO/OG metadata + static HTML

Overlapping capabilities are resolved to one owner (Whisper → hugpy; video download → webtools; summarize/keywords → hugpy).
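
One way to picture this resolution is a table that lists the candidate owners of each capability with the canonical one first. The sketch below is illustrative only: CAPABILITY_OWNERS and resolve_owner are hypothetical names, and the candidate lists are assumptions drawn from the table above, not the package's actual registry.

```python
# Hypothetical capability -> candidate owners, canonical owner listed first.
CAPABILITY_OWNERS = {
    "whisper": ["hugpy", "abstract_ocr"],            # abstract_ocr is the fallback
    "video_download": ["abstract_webtools", "abstract_videos"],
    "summarize": ["hugpy"],
    "keywords": ["hugpy"],
}


def resolve_owner(capability, installed):
    """Return the first candidate owner that is installed, or None."""
    for owner in CAPABILITY_OWNERS[capability]:
        if owner in installed:
            return owner
    return None
```

With both packages installed, resolve_owner("whisper", {"hugpy", "abstract_ocr"}) picks hugpy; with only abstract_ocr present it falls back to abstract_ocr.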

Install

media_intelligence is only the facade under src/; it contains none of the engines. Each layer's owner is a separate PyPI package, declared as an optional extra, so you install only what you use:

pip install media_intelligence              # zero third-party deps — facade only
pip install "media_intelligence[ocr,enrich]"  # just those layers
pip install "media_intelligence[all]"       # the full platform

The package has no required third-party dependencies: importing it is cheap (~20 ms) and pulls none of the backing packages. Each sibling is imported lazily, only when its layer is actually called; a missing one raises a clear MissingDependency naming the extra to install.

Check which layers are usable in the current environment without importing any backing package:

import media_intelligence as mi
mi.available()            # {'ingest': True, 'ocr': True, 'publish': False, ...}
mi.available("enrich")    # True / False

Usage

Direct namespace access

import media_intelligence as mi

text = mi.ocr.image_to_text("page.png")
kw   = mi.enrich.keywords(text)
mi.documents.process_pdf("doc.pdf")
mi.ingest.download_video("https://site.com/v.mp4", download_directory="/data")

Orchestrated pipeline (idempotent + resumable)

from media_intelligence import MediaPipeline

pipe = MediaPipeline("https://site.com/video.mp4", out_root="/data")

# Run the stages one by one; each call returns the pipeline, so they chain.
pipe.ingest().extract().structure().enrich().persist().publish()
print(pipe.report.summary)

# ... or simply run every stage in order:
pipe.run()

The pipeline autodetects the media kind, dispatches each stage accordingly, skips stages that are already satisfied (idempotent), and rehydrates from a prior manifest on re-run (resumable). Results land in out_root/<media_id>/manifest.json.
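
To make "idempotent" and "resumable" concrete, here is a toy model of the idea, not MediaPipeline itself. TinyPipeline, its "done" manifest key, and the until parameter are all hypothetical; the stage names are the ones from the diagram.

```python
import json
from pathlib import Path

STAGES = ("ingest", "extract", "structure", "enrich", "persist", "publish")


class TinyPipeline:
    """Toy stand-in for a staged pipeline: it only records finished stages."""

    def __init__(self, media_id, out_root):
        self.path = Path(out_root) / media_id / "manifest.json"
        # Resumable: reload the list of finished stages from a prior run.
        if self.path.exists():
            self.done = json.loads(self.path.read_text())["done"]
        else:
            self.done = []
        self.executed = []  # stages that actually ran in this process

    def _stage(self, name):
        if name not in self.done:  # idempotent: skip satisfied stages
            self.executed.append(name)  # real work would happen here
            self.done.append(name)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps({"done": self.done}))
        return self

    def run(self, until=None):
        """Run stages in order, optionally stopping after `until`."""
        for name in STAGES:
            self._stage(name)
            if name == until:
                break
        return self
```

A first run stopped after "extract" executes two stages; a second TinyPipeline on the same directory executes only the remaining four, and a third executes nothing.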

Persistence (DB-pluggable, two records)

Each item is persisted as two records so indexing stays cheap while aggregation stays simple:

  • manifest.json: lean index with ids, counts, text_chars, summary, keywords, SEO, asset pointers. (The JSONB metadata row.)
  • document.json: canonical content with full text, pages/segments, transcript. The single source of truth for search / aggregation / LLM datasets: one read per item, no re-stitching of per-owner on-disk files.

store = mi.persist.FileStore("/data")
store.save_manifest(item.media_id, manifest)   # lean index
store.save_document(item.media_id, document)   # full body
doc = store.load_document(item.media_id)        # aggregation reads this

# later, identical interface, JSONB backend:
# store = mi.persist.PgStore(dsn=...)   # planned (abstract_database)
#   -> metadata in JSONB, body text in a full-text-indexed column

MediaPipeline.persist() writes both records. On re-run, the body is rehydrated from document.json, so the extract and enrich stages are skipped (no re-OCR or re-transcription).
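
The two-record layout can be illustrated with a minimal file-backed store. This is a sketch under assumptions, not the real mi.persist.FileStore: TinyFileStore, split_records, and the exact manifest fields are hypothetical, chosen to mirror the description above.

```python
import json
from pathlib import Path


class TinyFileStore:
    """Illustrative two-record store: a lean manifest plus a full document."""

    def __init__(self, root):
        self.root = Path(root)

    def _write(self, media_id, name, payload):
        target = self.root / media_id / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")

    def _read(self, media_id, name):
        return json.loads((self.root / media_id / name).read_text(encoding="utf-8"))

    def save_manifest(self, media_id, manifest):
        self._write(media_id, "manifest.json", manifest)

    def save_document(self, media_id, document):
        self._write(media_id, "document.json", document)

    def load_manifest(self, media_id):
        return self._read(media_id, "manifest.json")

    def load_document(self, media_id):
        return self._read(media_id, "document.json")


def split_records(media_id, text, summary, keywords):
    """Split one item into a small index record and a full-body record."""
    manifest = {
        "media_id": media_id,
        "text_chars": len(text),      # size only; the text itself stays out
        "summary": summary,
        "keywords": keywords,
        "document": "document.json",  # pointer to the body record
    }
    document = {"media_id": media_id, "text": text}
    return manifest, document
```

Listing or filtering items only needs the small manifest, while anything that needs the full text reads document.json once.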
