Universal asset memorizer: scrape URLs, memorize images/text/code/video as A-Z unlimited assets
aax
Universal Asset Memorizer: scrape images, text, code, video, and data from any URL and store them as A-Z unlimited labeled assets.
pip install aax
What is aax?
aax is a Python library that scrapes a URL and memorizes the assets it finds, grouped into the following types:
| Asset Type | What it captures |
|---|---|
| IMAGE | PNG, JPEG, WebP, GIF, SVG, AVIF, BMP, TIFF, … |
| TEXT | Paragraphs, headings, lists, captions, blockquotes |
| CODE | `<code>`, `<pre>`, inline snippets |
| VIDEO | `<video>`, iframes, embedded players |
| AUDIO | `<audio>`, podcast feeds |
| DATA | JSON-LD, meta tags, HTML tables (converted to structured JSON) |
| LINK | All hyperlinks with anchor text |
| DOC | Linked PDF, DOCX, XLSX, PPTX files |
Every asset gets a unique label from an unbounded A-Z sequence (A, B, C, …, Z, AA, AB, …).
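To make the asset types concrete, here is a minimal sketch of tag-based extraction using only the standard library's `html.parser`. It is not aax's memorizer: the `AssetCollector` class, its tag sets, and the `(kind, value)` tuple shape are all assumptions made for illustration, and it covers only four of the kinds listed above.

```python
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Hypothetical helper: collects (kind, value) pairs from an HTML string."""

    TEXT_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote", "figcaption"}
    CODE_TAGS = {"code", "pre"}

    def __init__(self):
        super().__init__()
        self.assets = []   # (kind, value) tuples in the order they are emitted
        self._stack = []   # text/code tags that are currently open
        self._buffer = []  # text fragments belonging to the outermost open tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.assets.append(("image", attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.assets.append(("link", attrs["href"]))
        elif tag in self.TEXT_TAGS or tag in self.CODE_TAGS:
            self._stack.append(tag)

    def handle_data(self, data):
        # Only keep text that sits inside a tracked text/code tag.
        if self._stack:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if not self._stack:  # outermost tag closed: emit one asset
                text = "".join(self._buffer).strip()
                self._buffer.clear()
                if text:
                    kind = "code" if tag in self.CODE_TAGS else "text"
                    self.assets.append((kind, text))


html = """
<h1>Demo</h1>
<p>Hello <a href="/about">world</a></p>
<img src="/logo.png" alt="logo">
<pre><code>print(1)</code></pre>
"""

collector = AssetCollector()
collector.feed(html)
for kind, value in collector.assets:
    print(kind, value)
```

Note that a link inside a paragraph is emitted before the paragraph text, because the link is recorded when its opening tag is seen while the text is recorded when the paragraph closes.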
Architecture
Built on three pillars (vision, image, and data), around a core engine and a storage vault:
aax/
├── core/            ← Memorizer engine + AssetSession
│   ├── memorizer    ← scrapes URLs, extracts all assets
│   ├── session      ← A-Z labeled container for results
│   └── types        ← Asset dataclass, AssetKind enum, index_label()
│
├── vision/          ← URL Vision Checker (inspired by torchvision / vision-main)
│   └── checker      ← image size, dominant colors, webpage meta
│
├── image/           ← Image processing (inspired by image-rs / image-main)
│   └── processor    ← load/transform/save images from URL or file
│
├── data/            ← Structured data builder (inspired by serde_json / json-master)
│   └── builder      ← serialize to JSON/JSONL/CSV/SQLite, full-text search
│
└── storage/         ← Persistent A-Z vault
    └── vault        ← disk-backed long-term asset memory
Quick Start
1. Memorize a URL
import aax
# Scrape everything from Wikipedia's main page
session = aax.memorize("https://id.wikipedia.org/wiki/Halaman_Utama")
print(session.summary())
# ═══ aax AssetSession ═══
# URL : https://id.wikipedia.org/wiki/Halaman_Utama
# Total : 847 assets (labeled A-AFO)
# text : 312
# link : 289
# image : 143
# data : 78
# ...
# Access by label
first_asset = session["A"]  # the first scraped asset, not necessarily an image
print(first_asset.kind)     # AssetKind.TEXT / IMAGE / etc.
# Typed views
for img in session.images:
    print(img.label, img.src)
for text in session.texts:
    print(text.label, text.content[:80])
# Search content
results = session.search("Indonesia")
print(f"{len(results)} assets mention 'Indonesia'")
# Save everything to disk
session.save("./wiki_assets", download_images=True)
2. Check URL Vision
import aax
# What's in this URL?
v = aax.vision("https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Flag_of_Indonesia.svg/320px-Flag_of_Indonesia.svg.png")
print(v.describe())
# [aax.vision] https://upload.wikimedia.org/...
# Type : image/png
# Kind : Image (PNG)
# Size : 320×213 px (ratio 1.5023)
# FileSize : 3.2 KB
# Colors : #ce1126, #ffffff, #f5f5f5, #d4d4d4, #e8e8e8
print(v.dominant_colors) # ['#ce1126', '#ffffff', ...]
print(v.size) # (320, 213)
print(v.is_image) # True
# Webpage vision
vw = aax.vision("https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama")
print(vw.describe())
# [aax.vision] https://id.wikipedia.org/...
# Type : text/html
# Kind : Webpage (HTML)
# Title : Pemerintahan Nasional Pertama - Wikipedia ...
# Desc : Pemerintahan Nasional Pertama adalah...
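The `Size` and `ratio` fields in the image report can be read from the file header alone. The sketch below shows one way to do that for PNG using only the standard library; it is an illustration, not aax's checker. `png_size`, `aspect_ratio`, and `fake_png_header` are hypothetical names, and the header bytes are built locally so that no network access is needed.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte type "IHDR",
    # then big-endian 32-bit width and height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def aspect_ratio(size: tuple[int, int]) -> float:
    """Width divided by height, rounded to four decimals."""
    width, height = size
    return round(width / height, 4)


def fake_png_header(width: int, height: int) -> bytes:
    """Build signature + IHDR chunk only: enough for png_size(), not a full image."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    crc = struct.pack(">I", zlib.crc32(chunk))
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + chunk + crc


header = fake_png_header(320, 213)
size = png_size(header)
print(size, aspect_ratio(size))  # (320, 213) 1.5023
```

Dominant colors need the decoded pixels, so they are out of scope for a header-only check like this one.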
3. Process Images
from aax.image import ImageProcessor
ip = ImageProcessor()
# Load from URL -> transform -> save
(ip.from_url("https://upload.wikimedia.org/wikipedia/commons/thumb/...")
   .resize(640, 480)
   .grayscale()
   .blur(1.5)
   .save("processed.jpg"))
# Batch download
handles = ip.batch_from_urls([
    "https://example.com/img1.jpg",
    "https://example.com/img2.png",
])
for h in handles:
    h.thumbnail(256).save(f"thumb_{h.source.split('/')[-1]}")
# Get image info
h = ip.from_url("https://...")
print(h.info())
# {'format': 'JPEG', 'size': (1920, 1080), 'mode': 'RGB', ...}
4. Build Structured Data
from aax.data import DataBuilder
import aax
session = aax.memorize("https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama")
db = DataBuilder()
db.ingest(session)
# Export formats
db.to_json("assets.json")            # full JSON
db.to_jsonl("assets.jsonl")          # one record per line
db.to_csv("texts.csv", kind="text")  # CSV of text assets
db.to_sqlite("assets.db")            # SQLite with FTS
# Search
results = db.query("kabinet")
for r in results:
    print(r["label"], r["content_text"][:60])
# Reload from disk
db2 = DataBuilder.from_json("assets.json")
db3 = DataBuilder.from_sqlite("assets.db")
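For readers who want to see what such exports look like, the following sketch writes JSONL and CSV and runs a case-insensitive substring search with the standard library. It does not reproduce `DataBuilder`: the records are made up, the `label` and `content_text` keys are borrowed from the example above, the `kind` key is an assumption, and a plain substring match stands in for real full-text search.

```python
import csv
import io
import json

# Made-up records for illustration only.
records = [
    {"label": "A", "kind": "text", "content_text": "Notes about the first kabinet"},
    {"label": "B", "kind": "image", "content_text": "https://example.com/flag.png"},
    {"label": "C", "kind": "text", "content_text": "Main page"},
]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def to_csv(rows, kind=None):
    """Serialize rows as CSV, optionally keeping only one asset kind."""
    selected = [r for r in rows if kind is None or r["kind"] == kind]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["label", "kind", "content_text"])
    writer.writeheader()
    writer.writerows(selected)
    return out.getvalue()


def query(rows, term):
    """Case-insensitive substring match over content_text."""
    term = term.casefold()
    return [r for r in rows if term in r["content_text"].casefold()]


jsonl_text = to_jsonl(records)
csv_text = to_csv(records, kind="text")
hits = query(records, "KABINET")

print(jsonl_text)
print(csv_text)
print([r["label"] for r in hits])
```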
5. Persistent Vault
from aax.storage import AssetVault
import aax
vault = AssetVault("./my_vault")
# Store sessions from multiple URLs
urls = [
    "https://id.wikipedia.org/wiki/Halaman_Utama",
    "https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama",
]
for url in urls:
    session = aax.memorize(url)
    stored = vault.store(session)
    print(f"Stored {stored} assets from {url}")
# Retrieve
asset = vault.get("A")
content = vault.get_content("B")
# Query
images = vault.list_by_kind("image")
indonesia_assets = vault.search("Indonesia")
# Stats
print(vault.stats())
# {'total': 1694, 'labels': 'A … BMD', 'disk_bytes': 4_200_000, ...}
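The idea of a disk-backed vault can be sketched as a directory holding one JSON file per label. `TinyVault` below is a hypothetical stand-in, not `AssetVault`: the file layout, the asset dict keys, and the shape of `stats()` are assumptions, and it simply overwrites a file when a label is reused, which may differ from how aax assigns labels across sessions.

```python
import json
import tempfile
from pathlib import Path


class TinyVault:
    """Hypothetical disk-backed store: one JSON file per asset label."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, assets):
        """Write each {'label', 'kind', 'content'} dict to <root>/<label>.json."""
        for asset in assets:
            path = self.root / f"{asset['label']}.json"
            path.write_text(json.dumps(asset), encoding="utf-8")
        return len(assets)

    def get(self, label):
        """Load one asset back from disk by its label."""
        path = self.root / f"{label}.json"
        return json.loads(path.read_text(encoding="utf-8"))

    def list_by_kind(self, kind):
        return [a for a in self._all() if a["kind"] == kind]

    def stats(self):
        files = list(self.root.glob("*.json"))
        return {"total": len(files),
                "disk_bytes": sum(f.stat().st_size for f in files)}

    def _all(self):
        files = sorted(self.root.glob("*.json"))
        return [json.loads(f.read_text(encoding="utf-8")) for f in files]


with tempfile.TemporaryDirectory() as tmp:
    vault = TinyVault(tmp)
    stored = vault.store([
        {"label": "A", "kind": "text", "content": "hello"},
        {"label": "B", "kind": "image", "content": "https://example.com/a.png"},
    ])
    reopened = TinyVault(tmp)  # a fresh instance sees the same files on disk
    first = reopened.get("A")
    image_labels = [a["label"] for a in reopened.list_by_kind("image")]
    total = reopened.stats()["total"]

print(stored, first["kind"], image_labels, total)
```

Because state lives in files rather than in the object, reopening the same directory later gives access to everything stored earlier.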
CLI Usage
# Memorize a URL
aax memorize https://id.wikipedia.org/wiki/Halaman_Utama --out ./assets
# Only scrape images and text
aax memorize https://id.wikipedia.org/wiki/Halaman_Utama --kinds IMAGE,TEXT
# Download images too
aax memorize https://example.com --download-images
# Follow internal links (depth 2)
aax memorize https://example.com --follow-links --depth 2
# Vision check
aax vision https://example.com/image.png
aax vision https://id.wikipedia.org/wiki/Halaman_Utama --json
# Vault management
aax vault ./my_vault stats
aax vault ./my_vault list --kind image --limit 20
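The `--follow-links --depth 2` option suggests a depth-limited crawl restricted to internal links. One common way to do this is a breadth-first traversal, sketched below over an in-memory link graph. How aax actually counts depth or decides what is internal is not documented here, so `crawl`, the `SITE` dictionary, and the same-host rule are assumptions.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Made-up site: page URL -> hrefs found on that page.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example.org/x"],
    "https://example.com/a": ["/c"],
    "https://example.com/b": ["/"],
    "https://example.com/c": ["/d"],
}


def crawl(start, max_depth, get_links):
    """Breadth-first crawl of same-host links, at most max_depth hops from start."""
    host = urlparse(start).netloc
    seen = {start}
    order = []  # (url, depth) in visiting order
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # visit this page but do not follow its links
        for href in get_links(url):
            target = urljoin(url, href)  # resolve relative hrefs
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


pages = crawl("https://example.com/", 2, lambda u: SITE.get(u, []))
for url, depth in pages:
    print(depth, url)
```

In this sketch the external link is skipped, the link back to the start page is ignored as already seen, and `/d` is never reached because it sits three hops from the start.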
Asset Labels: A-Z Unlimited
Assets are labeled like Excel columns, so the sequence never runs out:
A, B, C, … Z,
AA, AB, AC, … AZ,
BA, BB, … ZZ,
AAA, AAB, … ∞
from aax.core.types import index_label
index_label(0) # 'A'
index_label(25) # 'Z'
index_label(26) # 'AA'
index_label(701) # 'ZZ'
index_label(702) # 'AAA'
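The mapping above matches bijective base-26 numbering, the same scheme spreadsheets use for column names. The sketch below reproduces the five documented values; it is one possible implementation, not necessarily how `aax.core.types.index_label` is written, and `index_label_sketch` and `label_index_sketch` are hypothetical names.

```python
def index_label_sketch(index: int) -> str:
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', ... (bijective base 26)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    n = index + 1  # switch to 1-based numbering, like spreadsheet columns
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters))


def label_index_sketch(label: str) -> int:
    """Inverse mapping: 'A' -> 0, 'AA' -> 26, ..."""
    n = 0
    for ch in label:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid label character: {ch!r}")
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


for i in (0, 25, 26, 701, 702):
    print(i, index_label_sketch(i))
```

Unlike ordinary base 26, this scheme has no zero digit, which is why 'Z' is followed by 'AA' rather than 'BA'.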
Filtering Asset Kinds
import aax
from aax.core.types import AssetKind
url = "https://id.wikipedia.org/wiki/Halaman_Utama"
session = aax.memorize(url, kinds=[AssetKind.IMAGE, AssetKind.TEXT])
Available kinds: IMAGE, TEXT, CODE, VIDEO, AUDIO, DATA, LINK, DOC, FONT, STYLE, SCRIPT, ICON, IFRAME, UNKNOWN
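A comma-separated kind list such as the `IMAGE,TEXT` value passed to `--kinds` can be turned into enum members with a by-name lookup. The sketch below uses a stand-in `Kind` enum with only four members; the member values, `parse_kinds`, and the asset dict shape are assumptions and are not taken from aax's `AssetKind`.

```python
from enum import Enum


class Kind(Enum):
    """Stand-in for an asset-kind enum (values are assumed)."""
    IMAGE = "image"
    TEXT = "text"
    CODE = "code"
    LINK = "link"


def parse_kinds(spec: str) -> list[Kind]:
    """Turn 'IMAGE,TEXT' into [Kind.IMAGE, Kind.TEXT]; names are case-insensitive."""
    kinds = []
    for name in spec.split(","):
        name = name.strip().upper()
        if not name:
            continue  # tolerate stray commas
        try:
            kinds.append(Kind[name])  # lookup by member name
        except KeyError:
            raise ValueError(f"unknown asset kind: {name!r}") from None
    return kinds


# Made-up assets for illustration.
assets = [
    {"label": "A", "kind": Kind.TEXT},
    {"label": "B", "kind": Kind.LINK},
    {"label": "C", "kind": Kind.IMAGE},
]

wanted = parse_kinds("image, text")
kept = [a["label"] for a in assets if a["kind"] in wanted]
print(wanted, kept)
```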
Advanced: Multi-URL Scrape
import aax
from aax.data import DataBuilder
from aax.storage import AssetVault
urls = [
    "https://id.wikipedia.org/wiki/Halaman_Utama",
    "https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama",
]
vault = AssetVault("./vault")
db = DataBuilder()
for url in urls:
    print(f"Memorizing {url}")
    session = aax.memorize(url, verbose=True)
    vault.store(session)
    db.ingest(session)
    print(session.summary())
# Export everything
db.to_sqlite("all_assets.db")
print(f"\nVault total: {len(vault)} assets")
print(vault.stats())
Dependencies
Core (always installed):
- requests, aiohttp, httpx: HTTP
- beautifulsoup4, lxml: HTML parsing
- Pillow: image processing
- rich, click, tqdm: CLI/output
Optional (pip install aax[vision]):
- torch, torchvision: deep vision models
- transformers: image captioning, classification
- opencv-python: advanced image ops
Full (pip install aax[full]):
- All vision deps + yt-dlp, pytesseract, pdf2image
Inspired By
| Library | Role in aax |
|---|---|
| serde_json (json-master) | Structured data serialization, JSON A-Z asset records |
| image (image-main) | Image format support, decoding/encoding pipeline |
| torchvision (vision-main) | URL-based image loading, transform pipelines, vision checking |
License
MIT © aax
File details
Details for the file aax_vision_lib-1.0.0.tar.gz.
File metadata
- Download URL: aax_vision_lib-1.0.0.tar.gz
- Upload date:
- Size: 32.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c552416bfef109aba28b06a6675169c224d85f8317c286556a30f0f26a14883e |
| MD5 | 41619100432a024f211694d041e63b05 |
| BLAKE2b-256 | faf49876de2d71ee6707e0832efe96a09ffcb23de58cebf7bb6ad50fe1147306 |
File details
Details for the file aax_vision_lib-1.0.0-py3-none-any.whl.
File metadata
- Download URL: aax_vision_lib-1.0.0-py3-none-any.whl
- Upload date:
- Size: 31.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 91e7c46b1b641ded4320b46794784a8bf540f01c638cac81702b856243dfd3e9 |
| MD5 | a2d0994d847647c520fba07b4cc8ba1f |
| BLAKE2b-256 | 063323186d8d2ad28ca00f528461bca04f872b3a1c67671cfa4aa6c1b8d171e4 |