LlamaIndex reader for Plasmate SOM, providing structured web content for AI agents

LlamaIndex Plasmate Reader

A LlamaIndex reader for Plasmate SOM (Structured Object Model), providing clean, structured web content optimized for AI agents and RAG pipelines.

What is Plasmate SOM?

Plasmate SOM converts messy HTML into a clean, semantic structure that AI models can easily understand. Instead of parsing raw HTML with all its noise, you get structured content with:

  • Semantic regions (headers, navigation, main content, footers)
  • Clean text extraction from headings, paragraphs, links, lists, and tables
  • Compact output, typically around 10x smaller than the raw HTML
  • Consistent structure across any website

Installation

pip install llama-index-readers-plasmate

Quick Start

from llama_index_plasmate import PlasmateReader

# Initialize the reader
reader = PlasmateReader()

# Load documents from URLs
documents = reader.load_data(urls=[
    "https://example.com/page1",
    "https://example.com/page2",
])

# Build a vector index over the loaded documents and query it
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is on these pages?")
print(response)

Configuration

Using the SOM Cache API (Recommended)

The reader uses the Plasmate SOM Cache API by default for fast, cached responses:

reader = PlasmateReader(
    api_key="your-api-key",  # Optional, for authenticated access
    api_base="https://cache.plasmate.app",  # Default
)
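
To keep the key out of source code, one option is to read it from the environment. The sketch below is only an illustration: the variable names PLASMATE_API_KEY and PLASMATE_API_BASE and the helper reader_kwargs_from_env are assumptions made for this example, not something the reader is documented to look up on its own.

```python
import os

DEFAULT_API_BASE = "https://cache.plasmate.app"  # default stated in this README


def reader_kwargs_from_env(env=None):
    """Build keyword arguments for PlasmateReader from environment variables.

    PLASMATE_API_KEY and PLASMATE_API_BASE are hypothetical names chosen
    for this sketch; pick whatever fits your deployment.
    """
    env = os.environ if env is None else env
    return {
        # An unset or empty variable becomes None (unauthenticated access).
        "api_key": env.get("PLASMATE_API_KEY") or None,
        "api_base": env.get("PLASMATE_API_BASE") or DEFAULT_API_BASE,
    }


# Usage (requires the package to be installed):
# reader = PlasmateReader(**reader_kwargs_from_env())
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.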

Using Local Plasmate CLI Fallback

If the API is unavailable, the reader automatically falls back to the local plasmate CLI, provided it is installed. To install it:

# Install plasmate CLI
npm install -g plasmate

The reader will use the CLI when:

  • The API returns an error
  • No API key is provided and the endpoint requires authentication
  • You explicitly disable the API
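
The fallback rules above can be pictured as a small "try the API, then the CLI" wrapper. This is a sketch of the pattern only, not the reader's actual implementation: fetch_with_fallback, fetch_api, and fetch_cli are hypothetical names, and the two fetchers are stubbed so the example runs without network access or the CLI.

```python
from typing import Callable, Optional


def fetch_with_fallback(
    url: str,
    fetch_api: Optional[Callable[[str], dict]],
    fetch_cli: Callable[[str], dict],
) -> dict:
    """Try the API first; use the CLI if the API is disabled or fails."""
    if fetch_api is not None:  # None models "API explicitly disabled"
        try:
            return fetch_api(url)
        except Exception:
            # Any API failure, including an authentication error,
            # falls through to the CLI below.
            pass
    return fetch_cli(url)


# Stub fetchers, for illustration only.
def failing_api(url: str) -> dict:
    raise RuntimeError("401 Unauthorized")


def stub_cli(url: str) -> dict:
    return {"source": url, "via": "cli"}


result = fetch_with_fallback("https://example.com/page1", failing_api, stub_cli)
print(result["via"])  # cli
```

Injecting the fetchers as callables keeps the control flow testable in isolation. A real implementation would likely catch narrower exception types than Exception.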

Document Metadata

Each document carries metadata about the source page and the conversion:

doc = documents[0]
print(doc.metadata)
# {
#     "source": "https://example.com/page1",
#     "title": "Page Title",
#     "som_version": "1.0",
#     "compression_ratio": 12.5,
#     "html_bytes": 125000,
#     "som_bytes": 10000,
# }
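
The sample values suggest that compression_ratio is html_bytes divided by som_bytes (125000 / 10000 = 12.5). That relationship is inferred from the example above rather than stated by the package, so treat this check as a sketch under that assumption. The helper name is hypothetical.

```python
def compression_ratio(html_bytes: int, som_bytes: int) -> float:
    """Ratio of raw HTML size to SOM size (assumed definition)."""
    if som_bytes <= 0:
        raise ValueError("som_bytes must be positive")
    return html_bytes / som_bytes


# Values copied from the sample metadata above.
metadata = {"html_bytes": 125000, "som_bytes": 10000, "compression_ratio": 12.5}

ratio = compression_ratio(metadata["html_bytes"], metadata["som_bytes"])
print(ratio)  # 12.5
```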

API Reference

PlasmateReader

PlasmateReader(
    api_key: Optional[str] = None,
    api_base: str = "https://cache.plasmate.app",
)

Parameters:

  • api_key: Optional API key for authenticated access to the SOM Cache API
  • api_base: Base URL for the SOM Cache API (default: https://cache.plasmate.app)

load_data

reader.load_data(
    urls: List[str],
) -> List[Document]

Parameters:

  • urls: List of URLs to fetch and convert to documents

Returns:

List of LlamaIndex Document objects with extracted text and metadata.
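
This README does not say how load_data treats malformed or duplicate URLs, so it may be worth cleaning the list before passing it in. A minimal sketch using only the standard library; clean_urls is a hypothetical helper, not part of the package.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def clean_urls(urls: Iterable[str]) -> List[str]:
    """Keep absolute http(s) URLs, dropping duplicates but preserving order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        # Skip anything that is not an absolute http or https URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


urls = clean_urls([
    "https://example.com/page1",
    " https://example.com/page1 ",  # duplicate once whitespace is stripped
    "not-a-url",                    # no scheme or host, so it is dropped
    "https://example.com/page2",
])

# Then, with the package installed:
# documents = reader.load_data(urls=urls)
```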

How It Works

  1. The reader sends URLs to the Plasmate SOM Cache API
  2. Plasmate fetches the page and converts HTML to SOM format
  3. The reader extracts readable text from semantic regions:
    • Headings (h1 through h6)
    • Paragraphs
    • Links (with href context)
    • Lists (ordered and unordered)
    • Tables
  4. Text is assembled into a clean document with source metadata
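
Steps 3 and 4 can be sketched as a walk over regions and their elements. The SOM schema is not shown in this README, so the dict layout below ("regions", "elements", "type", "text", "items", "rows") is invented purely for illustration, as is the som_to_text helper. The real format and the reader's extraction logic may differ.

```python
def som_to_text(som: dict) -> str:
    """Flatten a SOM-like dict (hypothetical layout) into plain text."""
    lines = []
    for region in som.get("regions", []):
        for el in region.get("elements", []):
            kind = el.get("type")
            text = el.get("text", "").strip()
            if kind in ("heading", "paragraph"):
                lines.append(text)
            elif kind == "link":
                # Keep the target next to the link text as context.
                lines.append(f"{text} ({el.get('href', '')})")
            elif kind == "list":
                lines.extend(f"- {item}" for item in el.get("items", []))
            elif kind == "table":
                lines.extend(" | ".join(row) for row in el.get("rows", []))
    # Drop empty lines and join the rest into one document body.
    return "\n".join(line for line in lines if line)


# Hand-written sample in the assumed layout.
sample = {
    "regions": [
        {"role": "main", "elements": [
            {"type": "heading", "level": 1, "text": "Page Title"},
            {"type": "paragraph", "text": "Hello."},
            {"type": "link", "text": "Docs", "href": "https://example.com/docs"},
            {"type": "list", "items": ["one", "two"]},
            {"type": "table", "rows": [["a", "b"], ["1", "2"]]},
        ]},
    ]
}

text = som_to_text(sample)
print(text)
```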

License

Apache 2.0
