Universal asset memorizer: scrape URLs, memorize images/text/code/video as A-Z unlimited assets

Project description

aax 🗂️

Universal Asset Memorizer: scrape images, text, code, video, and data from any URL and store them as A–Z unlimited labeled assets.

pip install aax

What is aax?

aax is a Python library that memorizes everything on a URL:

Asset Type   What it captures
🖼️ IMAGE     PNG, JPEG, WebP, GIF, SVG, AVIF, BMP, TIFF …
📝 TEXT      Paragraphs, headings, lists, captions, blockquotes
💻 CODE      <code>, <pre>, inline snippets
🎥 VIDEO     <video>, iframes, embedded players
🎵 AUDIO     <audio>, podcast feeds
📊 DATA      JSON-LD, meta tags, HTML tables → structured JSON
🔗 LINK      All hyperlinks with anchor text
📄 DOC       PDF, DOCX, XLSX, PPTX linked files

Every asset gets a unique A–Z unlimited label (A, B, C … Z, AA, AB … ∞).
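
As a rough illustration of the extraction step, a few of the asset types listed above can be collected with a small HTML walker. This is a minimal sketch using only the standard library's html.parser; the class name TinyExtractor and the (kind, value) tuple layout are assumptions made for illustration and are not aax's actual internals.

```python
from html.parser import HTMLParser


class TinyExtractor(HTMLParser):
    """Collects (kind, value) pairs for a few asset types (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.assets = []       # (kind, value) tuples in document order
        self._href = None      # href of the <a> currently open, if any
        self._buffer = []      # text collected inside the open <a> or <code>/<pre>
        self._in_code = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.assets.append(("image", attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self._href, self._buffer = attrs["href"], []
        elif tag in ("code", "pre"):
            self._in_code, self._buffer = True, []

    def handle_data(self, data):
        if self._href is not None or self._in_code:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor_text = "".join(self._buffer).strip()
            self.assets.append(("link", (self._href, anchor_text)))
            self._href = None
        elif tag in ("code", "pre") and self._in_code:
            self.assets.append(("code", "".join(self._buffer)))
            self._in_code = False


html = '<p><a href="/wiki">Home</a> <img src="logo.png"> <code>x = 1</code></p>'
extractor = TinyExtractor()
extractor.feed(html)
extractor.close()
for kind, value in extractor.assets:
    print(kind, value)
```

Each link is stored together with its anchor text, mirroring the LINK row of the table. A real extractor would also need to handle nesting (for example a code element inside a link), which this sketch deliberately ignores.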


Architecture

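Built on a core engine, three pillars inspired by existing libraries (vision, image, data), and a persistent storage vault: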

aax/
├── core/          ← Memorizer engine + AssetSession
│   ├── memorizer  ← scrapes URLs, extracts all assets
│   ├── session    ← A-Z labeled container for results
│   └── types      ← Asset dataclass, AssetKind enum, index_label()
│
├── vision/        ← URL Vision Checker (inspired by torchvision / vision-main)
│   └── checker    ← image size, dominant colors, webpage meta
│
├── image/         ← Image processing (inspired by image-rs / image-main)
│   └── processor  ← load/transform/save images from URL or file
│
├── data/          ← Structured data builder (inspired by serde_json / json-master)
│   └── builder    ← serialize to JSON/JSONL/CSV/SQLite, full-text search
│
└── storage/       ← Persistent A-Z vault
    └── vault      ← disk-backed long-term asset memory
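
To make the core/ entries above more concrete, here is a minimal sketch of how an asset dataclass, a kind enum, and a label-addressable session might fit together. The names AssetKind and Asset are borrowed from the tree, but the definitions below (fields, the MiniSession class, its of_kind and search methods) are hypothetical stand-ins, not the library's real code.

```python
from dataclasses import dataclass
from enum import Enum


class AssetKind(Enum):
    # Only three of the documented kinds, to keep the sketch short.
    IMAGE = "image"
    TEXT = "text"
    LINK = "link"


@dataclass
class Asset:
    label: str        # "A", "B", ..., "AA", ...
    kind: AssetKind
    content: str      # text, or a URL for image/link assets


class MiniSession:
    """Label-addressable container, loosely mirroring the session role above."""

    def __init__(self, assets):
        self._by_label = {a.label: a for a in assets}

    def __getitem__(self, label):
        return self._by_label[label]

    def __len__(self):
        return len(self._by_label)

    def of_kind(self, kind):
        return [a for a in self._by_label.values() if a.kind is kind]

    def search(self, needle):
        # Case-insensitive substring match over asset content.
        needle = needle.lower()
        return [a for a in self._by_label.values() if needle in a.content.lower()]


session = MiniSession([
    Asset("A", AssetKind.TEXT, "Selamat datang di Wikipedia"),
    Asset("B", AssetKind.IMAGE, "https://example.com/logo.png"),
    Asset("C", AssetKind.TEXT, "Ensiklopedia bebas berbahasa Indonesia"),
])
print(session["B"].kind)
print([a.label for a in session.search("indonesia")])
```

Keying the container by label is what makes lookups such as session["A"] cheap; typed views then become simple filters over the same dictionary.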

Quick Start

1. Memorize a URL

import aax

# Scrape everything from the Indonesian Wikipedia main page
session = aax.memorize("https://id.wikipedia.org/wiki/Halaman_Utama")

print(session.summary())
# ━━━ aax AssetSession ━━━
#   URL   : https://id.wikipedia.org/wiki/Halaman_Utama
#   Total : 847 assets (labeled A–AFO)
#   text  : 312
#   link  : 289
#   image : 143
#   data  : 78
#   ...

# Access by label
first_asset = session["A"]      # the first scraped asset, not necessarily an image
print(first_asset.kind)         # AssetKind.TEXT / IMAGE / etc.

# Typed views
for img in session.images:
    print(img.label, img.src)

for text in session.texts:
    print(text.label, text.content[:80])

# Search content
results = session.search("Indonesia")
print(f"{len(results)} assets mention 'Indonesia'")

# Save everything to disk
session.save("./wiki_assets", download_images=True)

2. Check URL Vision

import aax

# What's in this URL?
v = aax.vision("https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Flag_of_Indonesia.svg/320px-Flag_of_Indonesia.svg.png")
print(v.describe())
# [aax.vision] https://upload.wikimedia.org/...
#   Type     : image/png
#   Kind     : Image (PNG)
#   Size     : 320×213 px  (ratio 1.5023)
#   FileSize : 3.2 KB
#   Colors   : #ce1126, #ffffff, #f5f5f5, #d4d4d4, #e8e8e8

print(v.dominant_colors)   # ['#ce1126', '#ffffff', ...]
print(v.size)              # (320, 213)
print(v.is_image)          # True

# Webpage vision
vw = aax.vision("https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama")
print(vw.describe())
# [aax.vision] https://id.wikipedia.org/...
#   Type     : text/html
#   Kind     : Webpage (HTML)
#   Title    : Pemerintahan Nasional Pertama – Wikipedia ...
#   Desc     : Pemerintahan Nasional Pertama adalah...
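
The dominant-color and aspect-ratio fields shown above can be approximated without any imaging library. The following is a minimal sketch that works on a plain list of RGB tuples; the function names, the bucket size (step=32), and the synthetic pixel data are all assumptions for illustration, and aax's own checker may use a different method entirely.

```python
from collections import Counter


def dominant_colors(pixels, top=3, step=32):
    """Return the `top` most common colors as hex strings after coarse quantization."""
    def quantize(channel):
        # Snap each channel to the centre of its bucket so near-identical shades merge.
        return min(255, (channel // step) * step + step // 2)

    counts = Counter(tuple(quantize(c) for c in px) for px in pixels)
    return ["#{:02x}{:02x}{:02x}".format(*rgb) for rgb, _ in counts.most_common(top)]


def aspect_ratio(width, height):
    """Width divided by height, rounded to four decimal places."""
    return round(width / height, 4)


# A tiny synthetic "image": mostly red, with some white and near-white pixels.
pixels = [(206, 17, 38)] * 6 + [(255, 255, 255)] * 3 + [(250, 250, 250)]
print(dominant_colors(pixels, top=2))   # ['#d01030', '#f0f0f0']
print(aspect_ratio(320, 213))           # 1.5023
```

Because of the quantization, the reported hex values are bucket centres rather than exact pixel colors; a smaller step gives more faithful colors at the cost of splitting similar shades into separate entries.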

3. Process Images

from aax.image import ImageProcessor

ip = ImageProcessor()

# Load from URL → transform → save
(ip.from_url("https://upload.wikimedia.org/wikipedia/commons/thumb/...")
   .resize(640, 480)
   .grayscale()
   .blur(1.5)
   .save("processed.jpg"))

# Batch download
handles = ip.batch_from_urls([
    "https://example.com/img1.jpg",
    "https://example.com/img2.png",
])
for h in handles:
    h.thumbnail(256).save(f"thumb_{h.source.split('/')[-1]}")

# Get image info
h = ip.from_url("https://...")
print(h.info())
# {'format': 'JPEG', 'size': (1920, 1080), 'mode': 'RGB', ...}

4. Build Structured Data

from aax.data import DataBuilder
import aax

session = aax.memorize("https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama")

db = DataBuilder()
db.ingest(session)

# Export formats
db.to_json("assets.json")                      # full JSON
db.to_jsonl("assets.jsonl")                    # one record per line
db.to_csv("texts.csv", kind="text")            # CSV of text assets
db.to_sqlite("assets.db")                      # SQLite with FTS

# Search
results = db.query("kabinet")
for r in results:
    print(r["label"], r["content_text"][:60])

# Reload from disk
db2 = DataBuilder.from_json("assets.json")
db3 = DataBuilder.from_sqlite("assets.db")
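
For readers curious what the export formats above amount to, here is a minimal standard-library sketch of JSONL, filtered CSV, and SQLite output. The record fields label and content_text follow the query example above; the kind field, the table schema, and the use of a plain LIKE query in place of full-text search are assumptions, not a description of how DataBuilder works internally.

```python
import csv
import io
import json
import sqlite3

records = [
    {"label": "A", "kind": "text", "content_text": "Kabinet Presidensial dibentuk pada 1945"},
    {"label": "B", "kind": "link", "content_text": "Halaman Utama"},
]

# JSONL: one JSON object per line.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# CSV restricted to one kind, analogous to to_csv(..., kind="text").
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["label", "kind", "content_text"])
writer.writeheader()
writer.writerows(r for r in records if r["kind"] == "text")

# SQLite with a plain LIKE query (a simple stand-in for full-text search).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (label TEXT PRIMARY KEY, kind TEXT, content_text TEXT)")
conn.executemany("INSERT INTO assets VALUES (:label, :kind, :content_text)", records)
hits = conn.execute(
    "SELECT label FROM assets WHERE content_text LIKE ?", ("%kabinet%",)
).fetchall()
conn.close()

print(jsonl)
print(buf.getvalue())
print(hits)
```

SQLite's LIKE is case-insensitive for ASCII by default, which is why the lowercase pattern matches "Kabinet". A real full-text index would additionally give ranking and tokenization.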

5. Persistent Vault

from aax.storage import AssetVault
import aax

vault = AssetVault("./my_vault")

# Store sessions from multiple URLs
urls = [
    "https://id.wikipedia.org/wiki/Halaman_Utama",
    "https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama",
]
for url in urls:
    session = aax.memorize(url)
    stored = vault.store(session)
    print(f"Stored {stored} assets from {url}")

# Retrieve
asset = vault.get("A")
content = vault.get_content("B")

# Query
images = vault.list_by_kind("image")
indonesia_assets = vault.search("Indonesia")

# Stats
print(vault.stats())
# {'total': 1694, 'labels': 'A … BMD', 'disk_bytes': 4_200_000, ...}
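
The idea of a disk-backed vault can be sketched in a few lines: each asset becomes a file named after its label, so a new process pointed at the same directory sees the same data. MiniVault, its one-JSON-file-per-label layout, and the dictionary shape of an asset are hypothetical choices for this sketch; the on-disk format AssetVault actually uses is not described here.

```python
import json
import tempfile
from pathlib import Path


class MiniVault:
    """Stores each asset as <label>.json inside a directory (illustrative only)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, assets):
        for asset in assets:
            path = self.root / f"{asset['label']}.json"
            path.write_text(json.dumps(asset, ensure_ascii=False), encoding="utf-8")
        return len(assets)

    def get(self, label):
        return json.loads((self.root / f"{label}.json").read_text(encoding="utf-8"))

    def stats(self):
        files = list(self.root.glob("*.json"))
        return {"total": len(files), "disk_bytes": sum(f.stat().st_size for f in files)}


with tempfile.TemporaryDirectory() as tmp:
    vault = MiniVault(tmp)
    stored = vault.store([
        {"label": "A", "kind": "text", "content": "Halaman Utama"},
        {"label": "B", "kind": "image", "content": "https://example.com/a.png"},
    ])
    # Reopening the same directory sees the same assets: the state lives on disk.
    reopened = MiniVault(tmp)
    first = reopened.get("A")
    stats = reopened.stats()

print(stored, first["kind"], stats["total"])
```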

CLI Usage

# Memorize a URL
aax memorize https://id.wikipedia.org/wiki/Halaman_Utama --out ./assets

# Only scrape images and text
aax memorize https://id.wikipedia.org/wiki/Halaman_Utama --kinds IMAGE,TEXT

# Download images too
aax memorize https://example.com --download-images

# Follow internal links (depth 2)
aax memorize https://example.com --follow-links --depth 2

# Vision check
aax vision https://example.com/image.png
aax vision https://id.wikipedia.org/wiki/Halaman_Utama --json

# Vault management
aax vault ./my_vault stats
aax vault ./my_vault list --kind image --limit 20
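
The --follow-links --depth 2 option above suggests a depth-limited crawl of internal links. One common way to do that is a breadth-first walk that stays on the starting host, sketched below. The crawl function, its get_links callback, and the in-memory fake_site graph are hypothetical; whether aax counts depth in exactly this way is an assumption.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links, max_depth=2):
    """Breadth-first walk over same-host links, at most `max_depth` hops from start."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages that sit at the depth limit
        for href in get_links(url):
            target = urljoin(url, href)  # resolve relative links against the page URL
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical in-memory link graph standing in for real HTTP fetches.
fake_site = {
    "https://example.com/": ["/a", "https://other.org/x"],
    "https://example.com/a": ["/b"],
    "https://example.com/b": ["/c"],
}
visited = crawl("https://example.com/", lambda u: fake_site.get(u, []), max_depth=2)
print(visited)
```

With max_depth=2 the walk reaches /a and /b but never fetches /c, and the link to other.org is skipped because it points at a different host.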

Asset Labels: A–Z Unlimited

Assets are labeled like Excel columns, so the label space never runs out:

A, B, C, … Z,
AA, AB, AC, … AZ,
BA, BB, … ZZ,
AAA, AAB, … ∞

from aax.core.types import index_label

index_label(0)    # 'A'
index_label(25)   # 'Z'
index_label(26)   # 'AA'
index_label(701)  # 'ZZ'
index_label(702)  # 'AAA'
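
The mapping shown above is the bijective base-26 numbering used for spreadsheet columns. A minimal sketch of such a function and its inverse follows; excel_label and label_to_index are hypothetical names chosen for this example, and the real index_label may be implemented differently while producing the same values.

```python
def excel_label(index):
    """Map a zero-based index to A, B, ..., Z, AA, AB, ... (bijective base 26)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    n = index + 1                   # switch to 1-based so that 1 -> 'A'
    letters = []
    while n > 0:
        n, rem = divmod(n - 1, 26)  # the -1 is what makes the numbering bijective
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters))


def label_to_index(label):
    """Inverse of excel_label: 'A' -> 0, 'AA' -> 26."""
    n = 0
    for ch in label:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


for i in (0, 25, 26, 701, 702):
    print(i, excel_label(i))
```

Unlike ordinary base 26 there is no zero digit, which is why Z is followed by AA rather than BA. With this scheme a collection of 847 assets ends at label AFO (index 846).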

Filtering Asset Kinds

import aax
from aax.core.types import AssetKind

url = "https://example.com"  # placeholder URL
session = aax.memorize(url, kinds=[AssetKind.IMAGE, AssetKind.TEXT])

Available kinds: IMAGE, TEXT, CODE, VIDEO, AUDIO, DATA, LINK, DOC, FONT, STYLE, SCRIPT, ICON, IFRAME, UNKNOWN
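
A comma-separated kind list such as the CLI's --kinds IMAGE,TEXT has to be turned into enum members before it can be used as a filter. The sketch below shows one way to do that; the AssetKind enum here is a hypothetical stand-in built from the names listed above, and parse_kinds is not part of aax.

```python
from enum import Enum

# Hypothetical stand-in for aax.core.types.AssetKind, built from the documented names.
AssetKind = Enum(
    "AssetKind",
    "IMAGE TEXT CODE VIDEO AUDIO DATA LINK DOC FONT STYLE SCRIPT ICON IFRAME UNKNOWN",
)


def parse_kinds(spec):
    """Parse a string like 'IMAGE,TEXT' into enum members, rejecting unknown names."""
    kinds = []
    for name in spec.split(","):
        name = name.strip().upper()
        if not name:
            continue  # tolerate stray commas
        if name not in AssetKind.__members__:
            raise ValueError(f"unknown asset kind: {name!r}")
        kinds.append(AssetKind[name])
    return kinds


wanted = parse_kinds("image, TEXT")
assets = [("A", AssetKind.TEXT), ("B", AssetKind.LINK), ("C", AssetKind.IMAGE)]
kept = [label for label, kind in assets if kind in wanted]
print(kept)
```

Failing loudly on an unknown name is a deliberate choice in this sketch: silently ignoring a typo such as IMGAE would produce an empty result that is hard to diagnose.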


Advanced: Multi-URL Scrape

import aax
from aax.data import DataBuilder
from aax.storage import AssetVault

urls = [
    "https://id.wikipedia.org/wiki/Halaman_Utama",
    "https://id.wikipedia.org/wiki/Pemerintahan_Nasional_Pertama",
]

vault = AssetVault("./vault")
db    = DataBuilder()

for url in urls:
    print(f"Memorizing {url}")
    session = aax.memorize(url, verbose=True)
    vault.store(session)
    db.ingest(session)
    print(session.summary())

# Export everything
db.to_sqlite("all_assets.db")
print(f"\nVault total: {len(vault)} assets")
print(vault.stats())

Dependencies

Core (always installed):

  • requests, aiohttp, httpx (HTTP)
  • beautifulsoup4, lxml (HTML parsing)
  • Pillow (image processing)
  • rich, click, tqdm (CLI/output)

Optional (pip install aax[vision]):

  • torch, torchvision (deep vision models)
  • transformers (image captioning, classification)
  • opencv-python (advanced image ops)

Full (pip install aax[full]):

  • All vision deps + yt-dlp, pytesseract, pdf2image

Inspired By

Library                     Role in aax
serde_json (json-master)    Structured data serialization, JSON A-Z asset records
image (image-main)          Image format support, decoding/encoding pipeline
torchvision (vision-main)   URL-based image loading, transform pipelines, vision checking

License

MIT © aax

