
LlamaIndex Readers Integration: Scrappey

Use Scrappey to load web pages into LlamaIndex as clean Markdown, bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, etc.).

Scrappey fetches the page through its anti-bot proxy and returns server-side Markdown directly (via the markdown: true flag). This reader reads that Markdown straight into a LlamaIndex Document. If a response ever lacks server-side Markdown, the reader falls back to local conversion with markdownify.
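The fallback described above can be sketched as a small decision function (a simplified illustration only, not the reader's actual implementation; the response field names "markdown" and "response" are assumptions):

```python
# Hypothetical sketch of the Markdown-vs-fallback logic described above.
# Field names ("markdown", "response") are assumed, not the reader's real schema.

def extract_text(solution: dict, as_markdown: bool = True) -> str:
    """Prefer server-side Markdown; otherwise fall back to local conversion."""
    if as_markdown:
        md = solution.get("markdown")
        if md:
            # Server-side Markdown was returned (the markdown: true flag worked).
            return md
        try:
            # Local fallback, if markdownify is installed.
            from markdownify import markdownify
            return markdownify(solution.get("response", ""))
        except ImportError:
            pass
    # as_markdown=False (or no converter available): return raw HTML.
    return solution.get("response", "")

print(extract_text({"markdown": "# Hello"}))
print(extract_text({"response": "<p>Hi</p>"}, as_markdown=False))
```

The real reader performs this choice internally; the sketch only shows the order of preference.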

Installation

pip install llama-index-readers-scrappey

Setup

  1. Sign up at scrappey.com and grab your API key.
  2. Pass it to the reader (or set SCRAPPEY_API_KEY and read from env):

Quickstart

import os
from llama_index.core import VectorStoreIndex
from llama_index.readers.scrappey import ScrappeyReader

reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])

documents = reader.load_data(
    [
        "https://example.com",
        "https://en.wikipedia.org/wiki/Web_scraping",
    ]
)

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize web scraping."))

Async

import asyncio

from llama_index.readers.scrappey import ScrappeyReader

async def main():
    reader = ScrappeyReader(api_key="...")
    docs = await reader.aload_data(["https://example.com"])
    print(docs[0].text[:200])

asyncio.run(main())

Raw HTML instead of Markdown

reader = ScrappeyReader(api_key="...", as_markdown=False)
docs = reader.load_data(["https://example.com"])
# docs[0].text is the raw HTML returned by Scrappey

Document schema

Each URL produces one Document:

| Field | Type | Value |
| --- | --- | --- |
| text | str | Markdown body (or raw HTML if as_markdown=False) |
| id_ | str | The source URL (stable ID for ingestion) |
| metadata["source"] | str | Always "scrappey" |
| metadata["url"] | str | The URL you asked for |
| metadata["current_url"] | str | Final URL after redirects |
| metadata["verified"] | bool | Scrappey's anti-bot verification flag |
| metadata["detected_antibot_providers"] | dict | e.g. {"primaryProvider": "cloudflare"} |
| metadata["session_id"] | str | Scrappey session ID (reusable for crawls) |
| metadata["time_elapsed_ms"] | int | How long Scrappey took to fetch the page |
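For example, the verified flag can be used to drop pages that failed anti-bot verification before indexing. A sketch, with plain dicts standing in for Document objects:

```python
# Sketch: filter out documents whose page failed Scrappey's anti-bot check.
# Plain dicts mimic the Document schema above (id_, metadata) for illustration.

def keep_verified(docs: list[dict]) -> list[dict]:
    return [d for d in docs if d["metadata"].get("verified", False)]

docs = [
    {"id_": "https://a.example", "metadata": {"source": "scrappey", "verified": True}},
    {"id_": "https://b.example", "metadata": {"source": "scrappey", "verified": False}},
]
print([d["id_"] for d in keep_verified(docs)])
```

With real Document objects the same check reads doc.metadata["verified"] instead of dict access.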

Constructor options

| Arg | Default | Purpose |
| --- | --- | --- |
| api_key | (required) | Scrappey API key |
| api_url | https://publisher.scrappey.com/api/v1 | Override for self-hosted / proxied endpoints |
| timeout | 120.0 | Per-request HTTP timeout in seconds |
| as_markdown | True | Return Markdown (server-side when available, local markdownify fallback); set False for raw HTML |
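One convenient pattern is to drive these options from the environment. A sketch (the SCRAPPEY_API_URL and SCRAPPEY_TIMEOUT variable names are hypothetical — the reader itself only sees what you pass to its constructor; defaults mirror the table above):

```python
# Hypothetical env-driven configuration. These variable names are illustrative;
# the reader does not consult them itself. Defaults match the option table.

def reader_kwargs(env: dict) -> dict:
    kwargs = {"api_key": env["SCRAPPEY_API_KEY"]}
    if "SCRAPPEY_API_URL" in env:
        # Override for self-hosted / proxied endpoints.
        kwargs["api_url"] = env["SCRAPPEY_API_URL"]
    kwargs["timeout"] = float(env.get("SCRAPPEY_TIMEOUT", "120.0"))
    return kwargs

print(reader_kwargs({"SCRAPPEY_API_KEY": "sk-test", "SCRAPPEY_TIMEOUT": "30"}))
```

You would then construct the reader with ScrappeyReader(**reader_kwargs(dict(os.environ))).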

Roadmap

v0.1 is intentionally minimal (URLs → Markdown Documents). Planned for later releases:

  • Session reuse (sessions.create / sessions.destroy) for cheaper crawls and consistent fingerprinting
  • proxyCountry, premiumProxy, browser configuration
  • browserActions, customHeaders, cookies, postData passthrough for JS-heavy and auth-gated pages
  • POST-body scraping via cmd: "request.post"
