Abstract Intelligence Platform — a unified, layered pipeline that turns raw media (PDFs, images, video) into structured, searchable, SEO-ready data.

These details have not been verified by PyPI

Project description

media_intelligence — Abstract Intelligence Platform

A unified, layered facade that turns raw media (PDFs, images, and video) into structured, searchable, SEO-ready data. It does not reimplement any engine: for each capability it picks one implementation from a sibling package and exposes it behind a single, lazily loaded API, alongside an orchestrated pipeline.

Raw Media (PDF / Image / Video / URL)
   │
   ▼
ingest  → extract → structure → enrich → persist → publish
(webtools) (ocr/    (typed     (hugpy)  (FS / DB) (react/
            pdfs/    metadata)                      nginx)
            videos)

Layers → canonical owners

Layer	Owner package	What it does
`ingest`	`abstract_webtools`	scrape pages, download video (yt-dlp/ffmpeg)
`ocr`	`abstract_ocr`	layout-aware, multi-engine OCR
`documents`	`abstract_pdfs`	PDF decomposition + manifests + HTML
`video`	`abstract_videos`	registry pipeline: download/frames/transcribe
`transcribe`	`hugpy` (→ `abstract_ocr` fallback)	Whisper speech-to-text
`enrich`	`hugpy`	summaries, keywords, vision captioning, SEO
`persist`	filesystem (DB-pluggable)	typed JSON/JSONB manifests
`publish`	`abstract_react` + `abstract_nginx`	SEO/OG metadata + static HTML

Overlapping capabilities are resolved to one owner (Whisper → hugpy; video download → webtools; summarize/keywords → hugpy).
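
One way to picture the "one owner, optional fallback" rule is an ordered provider list per capability. The sketch below is illustrative only: `PROVIDERS`, `resolve`, and the capability keys are hypothetical names, not part of the media_intelligence API, and the package names are taken from the table above.

```python
from importlib.util import find_spec

# Capability -> providers in preference order; the first entry is the
# canonical owner. Keys and structure are assumptions for illustration.
PROVIDERS = {
    "transcribe": ["hugpy", "abstract_ocr"],  # Whisper owner, then fallback
    "video_download": ["abstract_webtools"],
    "summarize": ["hugpy"],
    "keywords": ["hugpy"],
}


def _installed(module_name):
    """True if the top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


def resolve(capability, is_installed=_installed):
    """Return the first installed provider for `capability`, or None."""
    for module_name in PROVIDERS[capability]:
        if is_installed(module_name):
            return module_name
    return None
```

Passing `is_installed` as a parameter keeps the lookup easy to exercise without installing any of the sibling packages.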

Install

media_intelligence is just this src/ facade — it contains none of the engines. Each layer's owner is its own PyPI package, declared as an optional extra, so you install only what you use:

pip install media_intelligence              # zero third-party deps — facade only
pip install "media_intelligence[ocr,enrich]"  # just those layers
pip install "media_intelligence[all]"       # the full platform

The package has no required third-party dependencies: importing it is cheap (~20 ms) and pulls none of the backing packages. Each sibling is imported lazily, only when its layer is actually called; a missing one raises a clear MissingDependency naming the extra to install.
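
The lazy-import behaviour described above can be sketched roughly as follows. This is not the package's actual code: `load_layer` and the registry layout are hypothetical, and only the `MissingDependency` name and the extras syntax come from the text.

```python
import importlib


class MissingDependency(ImportError):
    """Raised when the package backing a layer is not installed."""


# Layer -> (module name, extra name). These pairs are assumptions.
_LAYERS = {
    "ocr": ("abstract_ocr", "ocr"),
    "enrich": ("hugpy", "enrich"),
}


def load_layer(layer, registry=_LAYERS):
    """Import the module backing `layer` on first use, not at package import."""
    module_name, extra = registry[layer]
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Name the extra so the fix is obvious from the error message alone.
        raise MissingDependency(
            f"layer '{layer}' needs '{module_name}'; "
            f'install it with: pip install "media_intelligence[{extra}]"'
        ) from exc
```

Because nothing is imported until `load_layer` is called, importing the facade itself stays cheap regardless of which extras are installed.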

Check what's usable in the current environment without importing any of the backing packages:

import media_intelligence as mi
mi.available()            # {'ingest': True, 'ocr': True, 'publish': False, ...}
mi.available("enrich")    # True / False
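
A check like this can be built on `importlib.util.find_spec`, which locates a top-level module without executing it. The sketch below is an assumption about how such a function might work, not the package's implementation; `LAYER_MODULES` is a hypothetical mapping based on the layer table.

```python
from importlib.util import find_spec

# Layer -> backing module, following the layer table; treat the exact
# mapping as an assumption.
LAYER_MODULES = {
    "ingest": "abstract_webtools",
    "ocr": "abstract_ocr",
    "documents": "abstract_pdfs",
    "video": "abstract_videos",
    "enrich": "hugpy",
}


def available(layer=None, modules=LAYER_MODULES):
    """Report which layers are installed without importing their packages."""
    if layer is not None:
        return find_spec(modules[layer]) is not None
    return {name: find_spec(mod) is not None for name, mod in modules.items()}
```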

Usage

Direct namespace access

import media_intelligence as mi

text = mi.ocr.image_to_text("page.png")
kw   = mi.enrich.keywords(text)
mi.documents.process_pdf("doc.pdf")
mi.ingest.download_video("https://site.com/v.mp4", download_directory="/data")

Orchestrated pipeline (idempotent + resumable)

from media_intelligence import MediaPipeline

pipe = MediaPipeline("https://site.com/video.mp4", out_root="/data")
pipe.ingest().extract().structure().enrich().persist().publish()
print(pipe.report.summary)
# ... or simply run every stage in one call:
pipe.run()

The pipeline autodetects media kind, dispatches each stage accordingly, skips stages already satisfied (idempotent), and rehydrates from a prior manifest on re-run (resumable). Results land in out_root/<media_id>/manifest.json.
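
The idempotent and resumable behaviour can be illustrated with a toy pipeline that records completed stages in manifest.json. `SketchPipeline` is a hypothetical stand-in written for this example; the real `MediaPipeline` internals are not shown in this README and may differ.

```python
import json
from pathlib import Path


class SketchPipeline:
    """Toy pipeline: each stage records its result in manifest.json once."""

    STAGES = ("ingest", "extract", "structure", "enrich")

    def __init__(self, media_id, out_root):
        self.path = Path(out_root) / media_id / "manifest.json"
        # Resumable: reload whatever a previous run already produced.
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}
        self.executed = []  # stages that actually ran in this process

    def _stage(self, name):
        if name not in self.state:  # idempotent: skip completed stages
            self.state[name] = f"{name} done"  # stand-in for real work
            self.executed.append(name)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.state, indent=2))
        return self  # returning self allows chained calls

    def run(self):
        for name in self.STAGES:
            self._stage(name)
        return self
```

Writing the manifest after every stage means an interrupted run loses at most the stage that was in progress.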

Persistence (DB-pluggable, two records)

Each item is persisted as two records so indexing stays cheap while aggregation stays simple:

manifest.json — lean index: ids, counts, text_chars, summary, keywords, SEO, asset pointers. (The JSONB metadata row.)
document.json — canonical content: full text, pages/segments, transcript. The single source of truth for search / aggregation / LLM datasets — one read per item, no re-stitching of per-owner on-disk files.

import media_intelligence as mi

# item, manifest, and document are assumed to exist already
store = mi.persist.FileStore("/data")
store.save_manifest(item.media_id, manifest)   # lean index
store.save_document(item.media_id, document)   # full body
doc = store.load_document(item.media_id)        # aggregation reads this

# later, identical interface, JSONB backend:
# store = mi.persist.PgStore(dsn=...)   # planned (abstract_database)
#   -> metadata in JSONB, body text in a full-text-indexed column

MediaPipeline.persist() writes both. On re-run, the body is rehydrated from document.json, so the extract and enrich stages are skipped (no repeated OCR or transcription).
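
To make the two-record split concrete, here is a minimal file-backed store with the same three method names used above. It is a sketch under stated assumptions: `SketchFileStore` and `build_records` are hypothetical, and the fields shown are a small subset of what the real manifest and document hold.

```python
import json
from pathlib import Path


class SketchFileStore:
    """Two JSON files per item: a lean manifest and the full document."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, media_id, name):
        folder = self.root / media_id
        folder.mkdir(parents=True, exist_ok=True)
        return folder / name

    def save_manifest(self, media_id, manifest):
        self._path(media_id, "manifest.json").write_text(json.dumps(manifest))

    def save_document(self, media_id, document):
        self._path(media_id, "document.json").write_text(json.dumps(document))

    def load_document(self, media_id):
        return json.loads(self._path(media_id, "document.json").read_text())


def build_records(media_id, text, keywords):
    """Split one item into the lean index record and the full-content record."""
    manifest = {"media_id": media_id, "text_chars": len(text), "keywords": keywords}
    document = {"media_id": media_id, "text": text}
    return manifest, document
```

The manifest carries only counts and pointers, so listing or indexing many items never requires loading the full text.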

Project details

Release history

0.1.2 (this version): Jun 25, 2026
0.1.1: Jun 25, 2026
0.1.0: Jun 24, 2026

Download files

Source Distribution

media_intelligence-0.1.2.tar.gz (26.2 kB)

Uploaded Jun 25, 2026 Source

Built Distribution

media_intelligence-0.1.2-py3-none-any.whl (31.4 kB)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file media_intelligence-0.1.2.tar.gz.

File metadata

Download URL: media_intelligence-0.1.2.tar.gz
Upload date: Jun 25, 2026
Size: 26.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for media_intelligence-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`3f15b03347e01b81006c47b0e5dfc757005c04ad14c53a06ff84237b408defc9`
MD5	`1da98fe2f622df7b6677c58f6864fcd6`
BLAKE2b-256	`be02da03cd78a7c90cf95f5d9166f7fc8ee2052eb0c609dea7bd91dcb77e3dc0`

File details

Details for the file media_intelligence-0.1.2-py3-none-any.whl.

File metadata

Download URL: media_intelligence-0.1.2-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 31.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for media_intelligence-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f2ea0719eb3f14a41f5e0915d34aa0080d6ac805e29dbf9c7f3adc8ca1c5e064`
MD5	`e66156af07dfdb468f6c1f7a99090890`
BLAKE2b-256	`6bd483132aa3a018b81c2979e2efa86b63a634c7856b027d891eaadd9e4fe047`
