LlamaIndex reader for Plasmate SOM, providing structured web content for AI agents
LlamaIndex Plasmate Reader
A LlamaIndex reader for Plasmate SOM (Structured Object Model), providing clean, structured web content optimized for AI agents and RAG pipelines.
What is Plasmate SOM?
Plasmate SOM converts messy HTML into a clean, semantic structure that AI models can easily understand. Instead of parsing raw HTML with all its noise, you get structured content with:
- Semantic regions (headers, navigation, main content, footers)
- Clean text extraction from headings, paragraphs, links, lists, and tables
- Compact output, typically about 10x smaller than the raw HTML
- Consistent structure across any website
Installation
pip install llama-index-readers-plasmate
Quick Start
from llama_index_plasmate import PlasmateReader
# Initialize the reader
reader = PlasmateReader()
# Load documents from URLs
documents = reader.load_data(urls=[
"https://example.com/page1",
"https://example.com/page2",
])
# Use with LlamaIndex
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is on these pages?")
Configuration
Using the SOM Cache API (Recommended)
The reader uses the Plasmate SOM Cache API by default for fast, cached responses:
reader = PlasmateReader(
api_key="your-api-key", # Optional, for authenticated access
api_base="https://cache.plasmate.app", # Default
)
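To keep the key out of source code, the constructor arguments can be assembled from environment variables. This is a minimal sketch: the variable names `PLASMATE_API_KEY` and `PLASMATE_API_BASE` and the helper `reader_kwargs_from_env` are assumptions made for illustration, and the package is not documented to read them on its own.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_API_BASE = "https://cache.plasmate.app"  # default documented above


def reader_kwargs_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Build PlasmateReader keyword arguments from environment variables."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; an empty value is treated as "no key"
        "api_key": env.get("PLASMATE_API_KEY") or None,
        "api_base": env.get("PLASMATE_API_BASE") or DEFAULT_API_BASE,
    }


# A plain dict stands in for os.environ so the sketch runs anywhere
kwargs = reader_kwargs_from_env({"PLASMATE_API_KEY": "your-api-key"})
print(kwargs)
```

The resulting dictionary can then be passed along as `PlasmateReader(**kwargs)`.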
Using Local Plasmate CLI Fallback
If the API is unavailable, the reader automatically falls back to the local plasmate CLI, provided it is installed:
# Install plasmate CLI
npm install -g plasmate
The reader will use the CLI when:
- The API returns an error
- No API key is provided and the endpoint requires authentication
- You explicitly disable the API
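The fallback conditions above follow a common "try the remote service, then the local tool" pattern. The sketch below illustrates that pattern only; it is not the reader's actual implementation. The function `fetch_with_fallback`, the `use_api` flag, and both stand-in fetchers are hypothetical names, and the real CLI invocation is deliberately left out.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    fetch_api: Callable[[str], dict],
    fetch_cli: Callable[[str], dict],
    use_api: bool = True,
) -> dict:
    """Try the API first; use the CLI on any API error or when the API is disabled."""
    if use_api:
        try:
            return fetch_api(url)
        except Exception:
            # Any API failure (including authentication errors) falls through
            pass
    return fetch_cli(url)


# Stand-in fetchers so the sketch runs without network access or the CLI
def failing_api(url: str) -> dict:
    raise RuntimeError("API unavailable")


def local_cli(url: str) -> dict:
    return {"source": url, "via": "cli"}


result = fetch_with_fallback("https://example.com/page1", failing_api, local_cli)
print(result["via"])
```

Passing the fetchers in as callables keeps the decision logic separate from how each backend is actually called, which also makes the fallback easy to test.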
Document Metadata
Each document includes rich metadata:
doc = documents[0]
print(doc.metadata)
# {
# "source": "https://example.com/page1",
# "title": "Page Title",
# "som_version": "1.0",
# "compression_ratio": 12.5,
# "html_bytes": 125000,
# "som_bytes": 10000,
# }
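The sample values suggest that `compression_ratio` is `html_bytes` divided by `som_bytes`, although the formula is not stated here, so treat that as an assumption. A small sketch that recomputes the ratio from a metadata dictionary shaped like the one above (the helper name `ratio_from_metadata` is illustrative):

```python
from typing import Any, Dict


def ratio_from_metadata(metadata: Dict[str, Any]) -> float:
    """Recompute the size ratio, assuming it is raw HTML bytes over SOM bytes."""
    return metadata["html_bytes"] / metadata["som_bytes"]


# Same values as the sample metadata shown above
sample_metadata = {
    "source": "https://example.com/page1",
    "title": "Page Title",
    "som_version": "1.0",
    "compression_ratio": 12.5,
    "html_bytes": 125000,
    "som_bytes": 10000,
}

ratio = ratio_from_metadata(sample_metadata)
print(ratio)  # 12.5
```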
API Reference
PlasmateReader
PlasmateReader(
api_key: Optional[str] = None,
api_base: str = "https://cache.plasmate.app",
)
Parameters:
- api_key: Optional API key for authenticated access to the SOM Cache API
- api_base: Base URL for the SOM Cache API (default: https://cache.plasmate.app)
load_data
reader.load_data(
urls: List[str],
) -> List[Document]
Parameters:
urls: List of URLs to fetch and convert to documents
Returns:
List of LlamaIndex Document objects with extracted text and metadata.
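Because `load_data` takes a plain list of URLs, it can help to tidy that list first. The helper below is an optional, illustrative pre-processing step and not part of the package: it drops entries that are not absolute http(s) URLs and removes duplicates while keeping the original order.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def clean_urls(urls: Iterable[str]) -> List[str]:
    """Keep unique, absolute http(s) URLs in their original order."""
    seen = set()
    cleaned: List[str] = []
    for url in urls:
        url = url.strip()
        parts = urlsplit(url)
        # Skip anything without an http(s) scheme and a host
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


urls = clean_urls([
    "https://example.com/page1",
    " https://example.com/page1 ",
    "ftp://example.com/file",
    "not a url",
    "https://example.com/page2",
])
print(urls)
```

The cleaned list can then be passed as `reader.load_data(urls=urls)`.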
How It Works
- The reader sends URLs to the Plasmate SOM Cache API
- Plasmate fetches the page and converts HTML to SOM format
- The reader extracts readable text from semantic regions:
  - Headings (h1 through h6)
  - Paragraphs
  - Links (with href context)
  - Lists (ordered and unordered)
  - Tables
- Text is assembled into a clean document with source metadata
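The extraction step can be pictured with a small sketch. The SOM schema is not described here, so the dictionary layout below (`regions`, `elements`, `type`, `level`, `items`, `rows`) is entirely hypothetical, as are the function names; the point is only to show how text from each element kind might be flattened into one document string.

```python
from typing import Any, Dict, List


def element_to_text(el: Dict[str, Any]) -> str:
    """Render one element of the hypothetical SOM layout as plain text."""
    kind = el.get("type")
    if kind == "heading":
        return "#" * el.get("level", 1) + " " + el["text"]
    if kind == "paragraph":
        return el["text"]
    if kind == "link":
        return f"{el['text']} ({el['href']})"  # keep the href as context
    if kind == "list":
        return "\n".join(f"- {item}" for item in el["items"])
    if kind == "table":
        return "\n".join(" | ".join(row) for row in el["rows"])
    return ""  # unknown element kinds are skipped


def som_to_text(som: Dict[str, Any]) -> str:
    """Concatenate the text of every element in every region."""
    parts: List[str] = []
    for region in som.get("regions", []):
        for el in region.get("elements", []):
            text = element_to_text(el)
            if text:
                parts.append(text)
    return "\n".join(parts)


# Hypothetical SOM payload, invented for this example
som = {
    "regions": [{
        "role": "main",
        "elements": [
            {"type": "heading", "level": 1, "text": "Page Title"},
            {"type": "paragraph", "text": "Intro paragraph."},
            {"type": "link", "text": "Docs", "href": "https://example.com/docs"},
            {"type": "list", "items": ["a", "b"]},
            {"type": "table", "rows": [["Name", "Value"], ["x", "1"]]},
        ],
    }],
}

text = som_to_text(som)
print(text)
```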
License
Apache 2.0
File details
Details for the file llama_index_readers_plasmate-0.1.0.tar.gz.
File metadata
- Download URL: llama_index_readers_plasmate-0.1.0.tar.gz
- Upload date:
- Size: 9.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1d74334631853c023170dca46780b7d186c96931bc9ba806c2e11e36668fbc22 |
| MD5 | 5855b6f8b9007068eb629f0fb32b05f9 |
| BLAKE2b-256 | 7b290e183fb8e4af0612c30ea06c7b73dd71ad5e626e62a7f31ab6305b1cc3ba |
File details
Details for the file llama_index_readers_plasmate-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llama_index_readers_plasmate-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d0c0edeb038faa4ba06b3e135ef078933c378388da9639d02d1b10698007261 |
| MD5 | 472e30f19f4ef3fb92143304673fe94f |
| BLAKE2b-256 | ca8392a9710697dacf3b63b7204626de93f4595e0e594115cf5fcd8fd9773cd3 |