LlamaIndex Readers Integration: Scrappey
Use Scrappey to load web pages into LlamaIndex as clean Markdown, bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, etc.).
Scrappey fetches the page through its anti-bot proxy; this reader then converts the returned HTML to Markdown locally with markdownify, producing LLM-ready text.
Installation
pip install llama-index-readers-scrappey
Setup
- Sign up at scrappey.com and grab your API key.
- Pass it to the reader directly, or set SCRAPPEY_API_KEY and read it from the environment.
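Reading the key from the environment can be wrapped so that a missing variable fails with a clear message instead of a bare KeyError. The sketch below is only an illustration: `load_api_key` is a hypothetical helper and not part of this package.

```python
import os


def load_api_key(env_var: str = "SCRAPPEY_API_KEY") -> str:
    """Return the API key stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of llama-index-readers-scrappey.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the reader."
        )
    return key
```

The returned value can then be passed as `ScrappeyReader(api_key=load_api_key())`.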
Quickstart
import os
from llama_index.core import VectorStoreIndex
from llama_index.readers.scrappey import ScrappeyReader
reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])
documents = reader.load_data(
    [
        "https://example.com",
        "https://en.wikipedia.org/wiki/Web_scraping",
    ]
)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize web scraping."))
Async
import asyncio

from llama_index.readers.scrappey import ScrappeyReader


async def main():
    reader = ScrappeyReader(api_key="...")
    docs = await reader.aload_data(["https://example.com"])
    print(docs[0].text[:200])


asyncio.run(main())
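For longer URL lists, it may help to feed `aload_data` in smaller batches so that one slow or failing batch is easier to isolate. The following is a minimal sketch under that assumption; `load_in_batches` and its `batch_size` parameter are hypothetical and not part of this package. It relies only on the reader exposing `aload_data(list_of_urls)`, as shown above.

```python
import asyncio
from typing import Any, List, Sequence


async def load_in_batches(
    reader: Any, urls: Sequence[str], batch_size: int = 5
) -> List[Any]:
    """Call reader.aload_data on fixed-size batches, one batch at a time.

    Hypothetical helper: `reader` is any object with an async
    aload_data(list_of_urls) method, such as the reader shown above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    documents: List[Any] = []
    for start in range(0, len(urls), batch_size):
        batch = list(urls[start:start + batch_size])
        # Batches run sequentially; results keep the input order of batches.
        documents.extend(await reader.aload_data(batch))
    return documents
```

Inside an async function this would be called as `docs = await load_in_batches(reader, urls, batch_size=10)`.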
Raw HTML instead of Markdown
reader = ScrappeyReader(api_key="...", as_markdown=False)
docs = reader.load_data(["https://example.com"])
# docs[0].text is the raw HTML returned by Scrappey
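Raw HTML is useful when you want to run your own parsing instead of the built-in Markdown conversion. As a small illustration using only the standard library, the sketch below pulls the page title out of an HTML string; `TitleParser` and `extract_title` are hypothetical names and not part of this package.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the stripped <title> text, or an empty string if none exists."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

With a reader created using as_markdown=False, this could be applied as `extract_title(docs[0].text)`.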
Document schema
Each URL produces one Document:
| Field | Type | Value |
|---|---|---|
| text | str | Markdown body (or raw HTML if as_markdown=False) |
| id_ | str | The source URL (stable ID for ingestion) |
| metadata["source"] | str | Always "scrappey" |
| metadata["url"] | str | The URL you asked for |
| metadata["current_url"] | str | Final URL after redirects |
| metadata["verified"] | bool | Scrappey's anti-bot verification flag |
| metadata["detected_antibot_providers"] | dict | e.g. {"primaryProvider": "cloudflare"} |
| metadata["session_id"] | str | Scrappey session ID (reusable for crawls) |
| metadata["time_elapsed_ms"] | int | How long Scrappey took to fetch the page |
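The metadata fields above can drive simple checks before indexing. The sketch below separates documents by the metadata["verified"] flag, on the assumption that you want to inspect or refetch pages where the flag is not true. `split_by_verified` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real Document instances so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def split_by_verified(documents: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Split documents into (verified, unverified) using metadata["verified"].

    Hypothetical helper; a missing flag is treated as unverified.
    """
    verified: List[Any] = []
    unverified: List[Any] = []
    for doc in documents:
        if doc.metadata.get("verified") is True:
            verified.append(doc)
        else:
            unverified.append(doc)
    return verified, unverified


# Stand-in documents that mimic the schema in the table above.
docs = [
    SimpleNamespace(id_="https://example.com", metadata={"verified": True}),
    SimpleNamespace(id_="https://example.org", metadata={"verified": False}),
]
ok, needs_review = split_by_verified(docs)
```

Only the `ok` list would then be passed to `VectorStoreIndex.from_documents`.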
Constructor options
| Arg | Default | Purpose |
|---|---|---|
| api_key | required | Scrappey API key |
| api_url | https://publisher.scrappey.com/api/v1 | Override for self-hosted / proxied endpoints |
| timeout | 120.0 | Per-request HTTP timeout in seconds |
| as_markdown | True | Convert scraped HTML to Markdown locally |
Roadmap
v0.1 is intentionally minimal (URLs → Markdown Documents). Planned for later releases:
- Session reuse (sessions.create / sessions.destroy) for cheaper crawls and consistent fingerprinting
- proxyCountry, premiumProxy, browser configuration
- browserActions, customHeaders, cookies, postData passthrough for JS-heavy and auth-gated pages
- POST-body scraping via cmd: "request.post"