
LlamaIndex Readers Integration: Scrappey

Use Scrappey to load web pages into LlamaIndex as clean Markdown, bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, etc.).

Scrappey fetches the page through its anti-bot proxy and returns server-side Markdown directly (via the markdown: true flag). This reader reads that Markdown straight into a LlamaIndex Document. If a response ever lacks server-side Markdown, the reader falls back to local conversion with markdownify.
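The fallback described above can be sketched as a small decision function (a simplified illustration only, not the reader's actual implementation; the response field names "markdown" and "response" are assumptions):

```python
# Hypothetical sketch of the Markdown-vs-fallback logic described above.
# Field names ("markdown", "response") are assumed, not the reader's real schema.

def extract_text(solution: dict, as_markdown: bool = True) -> str:
    """Prefer server-side Markdown; otherwise fall back to local conversion."""
    if as_markdown:
        md = solution.get("markdown")
        if md:
            # Server-side Markdown was returned (the markdown: true flag worked).
            return md
        try:
            # Local fallback, if markdownify is installed.
            from markdownify import markdownify
            return markdownify(solution.get("response", ""))
        except ImportError:
            pass
    # as_markdown=False (or no converter available): return raw HTML.
    return solution.get("response", "")

print(extract_text({"markdown": "# Hello"}))
print(extract_text({"response": "<p>Hi</p>"}, as_markdown=False))
```

The real reader performs this choice internally; the sketch only shows the order of preference.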

Installation

pip install llama-index-readers-scrappey

Setup

  1. Sign up at scrappey.com and grab your API key.
  2. Pass it to the reader (or set SCRAPPEY_API_KEY and read from env):

Quickstart

import os
from llama_index.core import VectorStoreIndex
from llama_index.readers.scrappey import ScrappeyReader

reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])

documents = reader.load_data(
    [
        "https://example.com",
        "https://en.wikipedia.org/wiki/Web_scraping",
    ]
)

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize web scraping."))

Async

import asyncio

from llama_index.readers.scrappey import ScrappeyReader

async def main():
    reader = ScrappeyReader(api_key="...")
    docs = await reader.aload_data(["https://example.com"])
    print(docs[0].text[:200])

asyncio.run(main())

Raw HTML instead of Markdown

reader = ScrappeyReader(api_key="...", as_markdown=False)
docs = reader.load_data(["https://example.com"])
# docs[0].text is the raw HTML returned by Scrappey

Document schema

Each URL produces one Document:

| Field | Type | Value |
| --- | --- | --- |
| text | str | Markdown body (or raw HTML if as_markdown=False) |
| id_ | str | The source URL (stable ID for ingestion) |
| metadata["source"] | str | Always "scrappey" |
| metadata["url"] | str | The URL you asked for |
| metadata["current_url"] | str | Final URL after redirects |
| metadata["verified"] | bool | Scrappey's anti-bot verification flag |
| metadata["detected_antibot_providers"] | dict | e.g. {"primaryProvider": "cloudflare"} |
| metadata["session_id"] | str | Scrappey session ID (reusable for crawls) |
| metadata["time_elapsed_ms"] | int | How long Scrappey took to fetch the page |
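For example, the verified flag can be used to drop pages that failed anti-bot verification before indexing. A sketch, with plain dicts standing in for Document objects:

```python
# Sketch: filter out documents whose page failed Scrappey's anti-bot check.
# Plain dicts mimic the Document schema above (id_, metadata) for illustration.

def keep_verified(docs: list[dict]) -> list[dict]:
    return [d for d in docs if d["metadata"].get("verified", False)]

docs = [
    {"id_": "https://a.example", "metadata": {"source": "scrappey", "verified": True}},
    {"id_": "https://b.example", "metadata": {"source": "scrappey", "verified": False}},
]
print([d["id_"] for d in keep_verified(docs)])
```

With real Document objects the same check reads doc.metadata["verified"] instead of dict access.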

Constructor options

| Arg | Default | Purpose |
| --- | --- | --- |
| api_key | (required) | Scrappey API key |
| api_url | https://publisher.scrappey.com/api/v1 | Override for self-hosted / proxied endpoints |
| timeout | 120.0 | Per-request HTTP timeout in seconds |
| as_markdown | True | Return Markdown (server-side when available, local markdownify fallback); set False for raw HTML |
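One convenient pattern is to drive these options from the environment. A sketch (the SCRAPPEY_API_URL and SCRAPPEY_TIMEOUT variable names are hypothetical — the reader itself only sees what you pass to its constructor; defaults mirror the table above):

```python
# Hypothetical env-driven configuration. These variable names are illustrative;
# the reader does not consult them itself. Defaults match the option table.

def reader_kwargs(env: dict) -> dict:
    kwargs = {"api_key": env["SCRAPPEY_API_KEY"]}
    if "SCRAPPEY_API_URL" in env:
        # Override for self-hosted / proxied endpoints.
        kwargs["api_url"] = env["SCRAPPEY_API_URL"]
    kwargs["timeout"] = float(env.get("SCRAPPEY_TIMEOUT", "120.0"))
    return kwargs

print(reader_kwargs({"SCRAPPEY_API_KEY": "sk-test", "SCRAPPEY_TIMEOUT": "30"}))
```

You would then construct the reader with ScrappeyReader(**reader_kwargs(dict(os.environ))).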

Roadmap

v0.1 is intentionally minimal (URLs → Markdown Documents). Planned for later releases:

  • Session reuse (sessions.create / sessions.destroy) for cheaper crawls and consistent fingerprinting
  • proxyCountry, premiumProxy, browser configuration
  • browserActions, customHeaders, cookies, postData passthrough for JS-heavy and auth-gated pages
  • POST-body scraping via cmd: "request.post"
