Abstract Intelligence Platform — a unified, layered pipeline that turns raw media (PDFs, images, video) into structured, searchable, SEO-ready data.
media_intelligence — Abstract Intelligence Platform
A unified, layered facade that turns raw media — PDFs, images, and video — into structured, searchable, SEO-ready data. It does not reimplement any engine: it selects the best function of each sibling package and exposes it behind one clean, lazy API, plus an orchestrated pipeline.
Raw Media (PDF / Image / Video / URL)
    │
    ▼
ingest     → extract → structure → enrich  → persist   → publish
(webtools)   (ocr/     (typed      (hugpy)   (FS / DB)   (react/
              pdfs/     metadata)                         nginx)
              videos)
Layers → canonical owners
| Layer | Owner package | What it does |
|---|---|---|
| ingest | abstract_webtools | scrape pages, download video (yt-dlp/ffmpeg) |
| ocr | abstract_ocr | layout-aware, multi-engine OCR |
| documents | abstract_pdfs | PDF decomposition + manifests + HTML |
| video | abstract_videos | registry pipeline: download/frames/transcribe |
| transcribe | hugpy (→ abstract_ocr fallback) | Whisper speech-to-text |
| enrich | hugpy | summaries, keywords, vision captioning, SEO |
| persist | filesystem (DB-pluggable) | typed JSON/JSONB manifests |
| publish | abstract_react + abstract_nginx | SEO/OG metadata + static HTML |
Overlapping capabilities are resolved to one owner (Whisper → hugpy;
video download → webtools; summarize/keywords → hugpy).
Install
media_intelligence is just this src/ facade — it contains none of the
engines. Each layer's owner is its own PyPI package, declared as an optional
extra, so you install only what you use:
pip install media_intelligence # zero third-party deps — facade only
pip install "media_intelligence[ocr,enrich]" # just those layers
pip install "media_intelligence[all]" # the full platform
The package has no required third-party dependencies: importing it is cheap
(~20 ms) and pulls none of the backing packages. Each sibling is imported
lazily, only when its layer is actually called; a missing one raises a clear
MissingDependency naming the extra to install.
Check what's usable in the current environment without importing anything:
import media_intelligence as mi
mi.available() # {'ingest': True, 'ocr': True, 'publish': False, ...}
mi.available("enrich") # True / False
Usage
Direct namespace access
import media_intelligence as mi
text = mi.ocr.image_to_text("page.png")
kw = mi.enrich.keywords(text)
mi.documents.process_pdf("doc.pdf")
mi.ingest.download_video("https://site.com/v.mp4", download_directory="/data")
Orchestrated pipeline (idempotent + resumable)
from media_intelligence import MediaPipeline
pipe = MediaPipeline("https://site.com/video.mp4", out_root="/data")
pipe.ingest().extract().structure().enrich().persist().publish()
print(pipe.report.summary)
# ... or simply:
pipe.run()
The pipeline autodetects media kind, dispatches each stage accordingly, skips
stages already satisfied (idempotent), and rehydrates from a prior manifest on
re-run (resumable). Results land in out_root/<media_id>/manifest.json.
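The idempotent and resumable behavior can be illustrated with a toy pipeline. `TinyPipeline` below is a hypothetical stand-in, not the real `MediaPipeline`: it assumes each stage records its output under a key in the manifest, skips the stage when that key already exists, and reloads the manifest on construction.

```python
import json
import tempfile
from pathlib import Path


class TinyPipeline:
    """Toy staged pipeline; stage names and state keys are illustrative."""

    def __init__(self, media_id, out_root):
        self.path = Path(out_root) / media_id / "manifest.json"
        # Resumable: start from whatever a previous run recorded.
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}
        self.ran = []  # stages that actually did work in this run

    def _stage(self, key, work):
        # Idempotent: a stage whose output is already recorded is skipped.
        if key not in self.state:
            self.state[key] = work()
            self.ran.append(key)
        return self  # returning self is what allows chaining

    def extract(self):
        return self._stage("text", lambda: "hello world")

    def enrich(self):
        return self._stage("keywords", lambda: self.state["text"].split())

    def persist(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.state))
        return self


root = tempfile.mkdtemp()
first = TinyPipeline("demo", root).extract().enrich().persist()
second = TinyPipeline("demo", root).extract().enrich().persist()
print(first.ran)   # ['text', 'keywords']
print(second.ran)  # []
```

The second run does no work because every stage finds its output already present in the rehydrated manifest.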
Persistence (DB-pluggable, two records)
Each item is persisted as two records so indexing stays cheap while aggregation stays simple:
- manifest.json (lean index): ids, counts, text_chars, summary, keywords, SEO, asset pointers. This is the JSONB metadata row.
- document.json (canonical content): full text, pages/segments, transcript. This is the single source of truth for search, aggregation, and LLM datasets: one read per item, with no re-stitching of per-owner on-disk files.
store = mi.persist.FileStore("/data")
store.save_manifest(item.media_id, manifest) # lean index
store.save_document(item.media_id, document) # full body
doc = store.load_document(item.media_id) # aggregation reads this
# later, identical interface, JSONB backend:
# store = mi.persist.PgStore(dsn=...) # planned (abstract_database)
# -> metadata in JSONB, body text in a full-text-indexed column
MediaPipeline.persist() writes both records. On re-run, the body is rehydrated
from document.json, so the extract and enrich stages are skipped (no re-OCR or re-transcription).
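A minimal file-backed version of the two-record layout might look like the following. It is a sketch under stated assumptions: `TwoRecordStore` and `split_records` are hypothetical names, the manifest fields are illustrative, and the real `FileStore` may organize its files differently. Only the `save_*`/`load_*` method names and the two file names follow the text above.

```python
import json
import tempfile
from pathlib import Path


class TwoRecordStore:
    """Hypothetical file-backed store; not the real FileStore."""

    def __init__(self, root):
        self.root = Path(root)

    def _write(self, media_id, name, payload):
        target = self.root / media_id / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")

    def _read(self, media_id, name):
        return json.loads((self.root / media_id / name).read_text(encoding="utf-8"))

    def save_manifest(self, media_id, manifest):
        self._write(media_id, "manifest.json", manifest)

    def save_document(self, media_id, document):
        self._write(media_id, "document.json", document)

    def load_manifest(self, media_id):
        return self._read(media_id, "manifest.json")

    def load_document(self, media_id):
        return self._read(media_id, "document.json")


def split_records(media_id, text, pages):
    """Derive the lean index from the full body (field names are illustrative)."""
    document = {"media_id": media_id, "text": text, "pages": pages}
    manifest = {"media_id": media_id, "page_count": len(pages), "text_chars": len(text)}
    return manifest, document


store = TwoRecordStore(tempfile.mkdtemp())
manifest, document = split_records("demo", "alpha beta", ["alpha", "beta"])
store.save_manifest("demo", manifest)
store.save_document("demo", document)
print(store.load_manifest("demo")["text_chars"])  # 10
```

Keeping counts such as `text_chars` in the manifest lets an indexer scan many items without opening any document body, while aggregation reads one `document.json` per item.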