The `abstract_hugpy` module is designed to simplify working with Hugging Face models.
abstract_hugpy
Part of the Abstract Intelligence Platform
This module provides NLP and ML enrichment across the media pipeline.
abstract_hugpy focuses on:
- summarization and keyword extraction
- metadata generation (titles, descriptions, SEO)
- multimodal refinement (text, audio, video)
Full system: https://github.com/AbstractEndeavors/abstract-intelligence
abstract_hugpy is a Python module for local-first model orchestration, summarization, keyword extraction, transcription support, and media metadata enrichment.
It began as a Hugging Face-oriented utility layer, but has grown into a broader intelligence module that sits between raw media/text inputs and higher-level content workflows. In practice, it acts as a unified interface for loading models lazily, routing tasks through consistent APIs, and producing structured outputs for downstream systems.
Why
Most model wrappers stop at “load model, run inference.”
That is not usually enough in a real pipeline.
Media and document systems need more than raw generation:
- consistent model loading
- reusable singleton lifecycle management
- local path resolution and download control
- summarization presets
- keyword extraction presets
- fallback-friendly dependency handling
- structured outputs for SEO and metadata workflows
- text, audio, video, and PDF-aware utilities
abstract_hugpy exists to provide that layer.
What this module does
Summarization
- unified summarization API across multiple backends
- preset-driven request handling
- chunked long-text summarization
- lazy-loaded singleton model managers
- configurable input policies for short or invalid text
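The chunked long-text flow can be sketched roughly as follows. This is a minimal illustration using naive whitespace tokens; the real module works with model tokenizers and its own chunking helpers, and `summarize_chunk` here stands in for any backend call:

```python
def chunk_by_tokens(text, max_tokens):
    # Naive whitespace tokenization; stands in for a real tokenizer.
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def summarize_long(text, max_tokens=400, summarize_chunk=lambda c: c):
    # Summarize each chunk, then run a consolidation pass over the partials.
    partials = [summarize_chunk(chunk) for chunk in chunk_by_tokens(text, max_tokens)]
    combined = " ".join(partials)
    return summarize_chunk(combined) if len(partials) > 1 else combined
```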
Keyword extraction
- transformer-backed extraction with KeyBERT
- rule-based extraction with spaCy
- preset-based post-processing for SEO, metadata, and social use cases
- density analysis, hashtag generation, slug candidates, and filtered keyword tiers
Model orchestration
- central model registry
- environment-aware path resolution
- lazy local-or-hub source selection
- on-demand snapshot downloads
- reusable singleton-backed managers for expensive models
Media pipeline support
- Whisper-backed transcription loading
- audio extraction from video
- URL/media helpers
- PDF text analysis for summaries and keyword reports
- media URL derivation for public asset mapping
Code and generation support
- DeepCoder manager for code generation workflows
- generation config loading
- quantization-aware initialization
- single access point for repeated model usage
Core design goals
- local-first when possible
- lazy load everything expensive
- treat models as managed infrastructure, not loose function calls
- return structured results instead of ad hoc tuples/dicts
- support larger media/content systems, not just isolated scripts
- degrade cleanly when optional dependencies are unavailable
Architecture overview
Raw inputs
├── text
├── audio
├── video
├── pdf text
└── URLs
│
v
abstract_hugpy
├── model registry / path resolution
├── lazy dependency import layer
├── singleton model managers
├── summarization backends
├── keyword extraction backends
├── media + seo helpers
└── metadata refinement utilities
│
v
Structured outputs
├── summaries
├── keywords
├── hashtags
├── slug candidates
├── transcript text
├── metadata dictionaries
└── seo/media reports
Main module areas
abstract_hugpy/
├── imports/
│ ├── lazy dependency accessors
│ ├── token chunking
│ ├── web/media helpers
│ └── metadata helpers
├── models/
│ ├── model registry and path resolution
│ ├── shared imports/config
│ └── managers/
│ ├── summarizers/
│ ├── keybert_model.py
│ ├── whisper_model.py
│ ├── bigbird_module.py
│ └── deepcoder.py
└── utils/
└── seo/
└── pdf_utils.py
Features
1. Unified summarization registry
abstract_hugpy exposes a summarization registry where multiple backends share one contract.
Current design includes:
- T5-style chunked summarization
- Flan-based summarization
- Falconsai pipeline summarization
- preset-aware request resolution
- request objects for consistent parameter handling
This makes summarization feel like one API even when the underlying model changes.
Highlights
- `SummaryRequest`, `SummaryPreset`, and `InputPolicy` objects
- backend registry and lookup
- chunk-aware consolidation passes
- short / medium / long summarization modes
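The shared-contract idea behind the registry can be sketched as below. Names and internals here are illustrative, not the module's actual implementation:

```python
from typing import Callable, Dict

# Illustrative registry: every backend is a callable with the same text -> summary contract.
BACKENDS: Dict[str, Callable[[str], str]] = {}

def register_backend(name: str):
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("t5")
def t5_summarize(text: str) -> str:
    return text[:60]  # placeholder for the real chunked T5 call

def summarize(text: str, backend: str = "t5") -> str:
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    return BACKENDS[backend](text)
```

Swapping the underlying model then means registering a new callable, not changing every call site.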
2. Keyword extraction pipeline
Keyword extraction combines:
- a KeyBERT transformer-based backend
- a spaCy rule-based backend
- a refinement layer that classifies and formats results
This is especially useful because raw keyword extraction is often not enough for production use. The refinement layer turns extracted phrases into:
- primary keywords
- secondary keywords
- dropped terms
- density flags
- meta keyword strings
- hashtags
- slug candidates
Presets include:
- `seo`
- `metadata`
- `social`
- `long_tail`
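The hashtag and slug outputs follow familiar formatting conventions. A rough sketch of what that formatting step might look like (not the module's actual code):

```python
import re

def to_hashtag(phrase: str) -> str:
    # "media pipeline" -> "#MediaPipeline"
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    return "#" + "".join(w.capitalize() for w in words)

def to_slug(phrase: str) -> str:
    # "Media Pipeline!" -> "media-pipeline"
    return "-".join(re.findall(r"[a-z0-9]+", phrase.lower()))
```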
3. Model registry and storage control
The model config layer provides a registry mapping logical names to:
- hub IDs
- storage folders
- task types
- framework categories
It also supports:
- global model home configuration
- environment-variable overrides
- local path resolution
- lazy download behavior through `snapshot_download`
This keeps model locations predictable and centralized.
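A minimal sketch of what a registry with environment-variable override and local path resolution can look like. The entry, field names, and the `HUGPY_MODEL_HOME` variable are hypothetical examples, not the module's real configuration:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEntry:
    hub_id: str   # Hugging Face hub ID
    folder: str   # storage folder name under the model home
    task: str     # task type, e.g. "summarization"

# Illustrative registry entry; real names and fields may differ.
REGISTRY = {
    "t5_summarizer": ModelEntry("google/flan-t5-base", "flan-t5-base", "summarization"),
}

def model_home() -> str:
    # Hypothetical env var: overrides the default model home when set.
    return os.environ.get("HUGPY_MODEL_HOME", os.path.expanduser("~/models"))

def local_path(name: str) -> str:
    return os.path.join(model_home(), REGISTRY[name].folder)
```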
4. Lazy dependency management
One of the stronger parts of this package is the import layer.
Heavy libraries are not imported directly everywhere. Instead, the module exposes dedicated accessors like:
`get_torch()`, `get_transformers()`, `get_whisper()`, `get_sentence_transformers()`, `get_moviepy()`, `get_pdf2image()`
This matters because it keeps:
- startup lighter
- optional features optional
- failures localized and understandable
- large media/model workflows from eagerly loading things they do not need
The require() pattern is especially good here because it forces hard dependency failures to occur at manager initialization time instead of surfacing later as opaque runtime errors.
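A simplified sketch of the lazy-accessor and `require()` pattern (the real accessors are per-library functions like `get_torch()`; this generic version just illustrates the shape):

```python
import importlib

_modules: dict = {}

def lazy_import(name: str):
    """Import a module on first access, cache the result, return None if missing."""
    if name not in _modules:
        try:
            _modules[name] = importlib.import_module(name)
        except ImportError:
            _modules[name] = None
    return _modules[name]

def require(name: str):
    """Raise at manager-initialization time so missing deps fail loudly and early."""
    module = lazy_import(name)
    if module is None:
        raise ImportError(f"optional dependency {name!r} is required for this feature")
    return module
```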
5. Whisper and transcription support
The Whisper manager provides:
- singleton-backed model loading
- configurable model size
- model path override support
- direct transcription wrapper
- audio extraction from video through MoviePy
This makes it useful as the transcription stage in a larger media ingestion pipeline.
6. DeepCoder support
The DeepCoder manager extends the module beyond summarization and SEO into code-generation/classic LLM workflows.
It supports:
- lazy singleton initialization
- device-aware loading
- optional quantization
- tokenizer and generation config loading
- chat-template or direct-prompt generation
- output persistence helpers
- model info inspection
That gives the project a broader “ML operations utility layer” identity rather than a narrowly scoped summarization package.
7. PDF SEO and document analysis
The PDF utilities provide one of the clearest examples of this module’s role in a larger system.
They support:
- loading full-document text
- page-by-page text loading
- running summarization and keyword extraction per scope
- building structured full-report objects
Outputs are intentionally shaped for downstream consumption rather than only human display.
This makes the package fit naturally into archival, indexing, SEO, and document intelligence workflows.
8. Media URL and metadata helpers
The media layer includes filesystem-to-public-URL helpers and scaffolding for deriving metadata from media assets.
That includes functionality such as:
- filesystem path → public asset URL mapping
- basic video metadata derivation flow
- transcript → summary → refined title pipeline
- keyword extraction for media metadata generation
This is the point where abstract_hugpy clearly becomes part of a broader platform rather than a standalone model wrapper.
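The path-to-URL mapping is conceptually a root swap. A hedged sketch, where the media root and base URL are made-up examples:

```python
from pathlib import PurePosixPath

def to_public_url(path: str,
                  media_root: str = "/var/www/media",
                  base_url: str = "https://example.com/media") -> str:
    # Replace the served filesystem root with the public base URL.
    relative = PurePosixPath(path).relative_to(media_root)
    return f"{base_url}/{relative}"
```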
Installation
```shell
pip install abstract-hugpy
```
Or from source:
```shell
git clone <your-repo-url>
cd abstract_hugpy
pip install -e .
```
Dependencies
The package's currently declared dependencies are:
- abstract_utilities
- requests
- openai_whisper
In practice, several optional capabilities also rely on libraries such as:
- transformers
- torch
- sentence_transformers
- keybert
- spacy
- moviepy
- tiktoken
- huggingface_hub
- pydub
- speech_recognition
- pdf2image
- pytesseract
- easyocr
- paddleocr
Because the package uses lazy import accessors, not every environment needs every dependency installed at once.
Example usage
Summarize text
```python
from abstract_hugpy.models.managers.summarizers.summarizers import summarize

text = "Your long-form text goes here..."
result = summarize(text, backend="t5", preset="article")
print(result)
```
Use a structured summary request
```python
from abstract_hugpy.models.managers.summarizers.summarizers import (
    summarize,
    SummaryRequest,
    InputPolicy,
)

req = SummaryRequest(
    text="Example text to summarize",
    max_chunk_tokens=400,
    summary_mode="medium",
    input_policy=InputPolicy.WARN,
)
print(summarize(request=req, backend="t5"))
```
Extract refined keywords
```python
from abstract_hugpy.models.managers.keybert_model import refine_keywords

text = "Python-based media enrichment pipeline for summaries, keywords, and SEO metadata."
result = refine_keywords(text, preset="seo")
print(result.primary)
print(result.hashtags)
print(result.meta_keywords)
```
Transcribe audio with Whisper
```python
from abstract_hugpy.models.managers.whisper_model import whisper_transcribe

transcript = whisper_transcribe(
    audio_path="example.wav",
    model_size="small",
    language="english",
)
print(transcript)
```
Extract audio from video
```python
from abstract_hugpy.models.managers.whisper_model import extract_audio_from_video

audio_path = extract_audio_from_video("video.mp4", "audio.wav")
print(audio_path)
```
Use DeepCoder
```python
from abstract_hugpy.models.managers.deepcoder import get_deep_coder

coder = get_deep_coder()
result = coder.generate(
    prompt="Write a Python function that groups strings by length.",
    max_new_tokens=300,
)
print(result)
```
Analyze a PDF text directory
```python
from abstract_hugpy.utils.seo.pdf_utils import analyze_pdf

report = analyze_pdf("/path/to/pdf_dir")
print(report.to_dict())
```
Public concepts
A nice strength of this project is that it is moving toward explicit schemas instead of loose return conventions.
Important objects include:
`ModelConfig`, `SummaryRequest`, `SummaryPreset`, `KeywordRequest`, `KeywordResult`, `RefinedResult`, `PDFSeoResult`, `PDFSeoReport`
That makes the package easier to compose into larger systems and easier to test.
Why not just call Hugging Face pipelines directly?
Because the actual problem here is larger than inference.
Direct pipeline calls do not solve:
- lifecycle management
- dependency gating
- model path control
- preset systems
- output normalization
- media-aware utilities
- structured reporting
- consistent backends for multiple tasks
abstract_hugpy is the orchestration layer around those concerns.
Good fit for
- local-first AI utilities
- media enrichment pipelines
- SEO/document intelligence systems
- content analysis workflows
- transcription pipelines
- Python services that need managed access to multiple NLP/ML models
- systems where models should be loaded once and reused
Less about
- training models
- benchmarking frameworks
- browser-side inference
- pure research experimentation
- single-purpose one-off scripts
This package is more about operational reuse than experimentation.
Design philosophy
Lazy by default
Large models and large libraries should not initialize unless they are actually needed.
Singleton where expensive
If a model is costly to load, repeated calls should reuse it.
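"Singleton where expensive" can be as simple as a cached factory. A sketch of the idea (the counter exists only to show the load body runs once):

```python
from functools import lru_cache

load_count = {"n": 0}

@lru_cache(maxsize=None)
def get_model(size: str = "small"):
    # The body runs once per distinct argument set; all callers share the instance.
    load_count["n"] += 1
    return {"size": size, "weights": object()}  # stands in for an expensive load
```

Repeated `get_model()` calls with the same arguments return the same object without reloading.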
Structured outputs over loose values
Named dataclasses are easier to inspect, serialize, and integrate.
Presets over repeated argument noise
Common tasks should be easy to invoke consistently.
Pipeline-aware utilities
Text, audio, video, and PDF workflows should feel connected rather than siloed.
Current strengths
- coherent model-management layer
- strong lazy import pattern
- reusable summarization registry
- practical keyword extraction and refinement
- real downstream usefulness for SEO/media systems
- good fit for larger platform integration
Reasonable future split points
If the package is split later, the natural seams look like:
- `abstract_hugpy_models`
- `abstract_hugpy_summarizers`
- `abstract_hugpy_keywords`
- `abstract_hugpy_media`
- `abstract_hugpy_seo`
But for now, keeping it unified is still coherent because all of it serves the same higher-level role: media intelligence and model orchestration.
Roadmap ideas
- CLI entrypoints for summary/keyword/pdf workflows
- stronger public API exports at package root
- richer dependency extras in packaging
- improved test coverage around manager lifecycle
- clearer backend capability matrix
- stricter separation between experimental and stable modules
- normalized metadata interfaces for text/audio/video/PDF
Status
Actively evolving module for model-backed media intelligence workflows.
Originally closer to a Hugging Face helper layer. Now a broader orchestration package for summarization, keyword extraction, transcription support, SEO analysis, and managed inference inside the Abstract Media Intelligence Platform.