scrapedatshi-py
Official Python SDK for the scrapedatshi RAG pipeline API.
Scrape URLs, chunk documents, embed content, and inject into vector databases — all from a clean, typed Python interface.
Installation
pip install scrapedatshi
Requires Python 3.10+.
Quick Start
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient(api_key="sds_...")
# Chunk a URL to JSON (no embedding required)
result = client.pipeline.chunk_url("https://docs.example.com")
print(f"Got {result.total_chunks} chunks")
print(f"Cost: ${result.credits_used:.4f} | Remaining: ${result.credits_remaining:.4f}")
for chunk in result.chunks:
    print(chunk.content[:80])
Authentication
Pass your API key directly or set the SCRAPEDATSHI_API_KEY environment variable:
export SCRAPEDATSHI_API_KEY="sds_..."
# Explicit key
client = ScrapedatshiClient(api_key="sds_...")
# From environment variable
client = ScrapedatshiClient()
Get your API key at scrapedatshi.com/portal/register. New accounts receive $1.00 in free credits; no credit card is required.
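The two options above can be mirrored in application code when a key has to be resolved before the client is built, for example to fail fast at startup. The sketch below is a hypothetical helper, not part of the SDK; it assumes an explicit key should take precedence over the environment variable, which the text above does not state.

```python
import os
from typing import Optional

ENV_VAR = "SCRAPEDATSHI_API_KEY"  # variable name taken from the section above


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise the environment value.

    Hypothetical helper for illustration; the precedence order is an assumption.
    """
    key = explicit_key or os.environ.get(ENV_VAR)
    if not key:
        raise RuntimeError(f"No API key: pass one explicitly or set {ENV_VAR}")
    return key
```

The returned value can then be passed as `ScrapedatshiClient(api_key=resolve_api_key())`.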
Pricing
scrapedatshi uses a pay-per-use credit wallet — no subscriptions, no monthly fees. Credits are deducted after each successful API call. Failed requests are never charged.
| Operation | Rate | Applies To |
|---|---|---|
| URL Fetch | $0.0020 / URL | /v1/rag-chunk, /v1/crawl, /v1/crawl-chunk, /v1/sync, /v1/ingest |
| Spider Fetch | $0.0050 / URL | /v1/spider (replaces URL fetch) |
| Chunk Fee | $0.0005 / chunk | All routes — per chunk generated |
| Injection Fee | $0.0030 / chunk | /v1/sync, /v1/ingest — per chunk upserted to vector DB |
| Contextual Retrieval | $0.0030 / URL | When contextual_retrieval=True |
Example: sync() on 1 URL → 10 chunks = $0.0020 + (10 × $0.0005) + (10 × $0.0030) = $0.0370
Top up your balance at scrapedatshi.com/portal/billing.
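To budget a job before running it, the rates in the pricing table can be folded into a small estimator. This is a rough sketch rather than an SDK feature: `estimate_cost` and its parameters are hypothetical names, the chunk count must be guessed in advance, and every source is assumed to be billed as one fetch.

```python
from decimal import Decimal

# Rates copied from the pricing table (USD per unit).
URL_FETCH = Decimal("0.0020")
SPIDER_FETCH = Decimal("0.0050")
CHUNK_FEE = Decimal("0.0005")
INJECTION_FEE = Decimal("0.0030")
CONTEXTUAL_FEE = Decimal("0.0030")


def estimate_cost(urls: int, chunks: int, *, inject: bool = False,
                  spider: bool = False, contextual: bool = False) -> Decimal:
    """Estimate the credits for a request from URL and chunk counts."""
    # The spider rate replaces the plain URL fetch rate, it is not added to it.
    fetch_rate = SPIDER_FETCH if spider else URL_FETCH
    total = urls * fetch_rate + chunks * CHUNK_FEE
    if inject:
        total += chunks * INJECTION_FEE  # per chunk upserted
    if contextual:
        total += urls * CONTEXTUAL_FEE   # per URL
    return total


# Reproduces the sync() example: 1 URL, 10 chunks, injected.
print(estimate_cost(urls=1, chunks=10, inject=True))  # 0.0370
```

`Decimal` is used so that sums of small per-chunk fees do not pick up binary floating-point rounding noise.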
Pipeline Methods
Chunk to JSON
No embedding or vector DB required. Returns structured JSON chunks from any source.
Chunk a URL
result = client.pipeline.chunk_url("https://docs.example.com")
# result.chunks → list[Chunk]
# result.total_chunks → int
# result.source → str (the URL)
# result.credits_used → float
# result.credits_remaining → float
# result.content_truncated → bool (True if content exceeded ~75,000 words)
Chunk a local file
Supports PDF, DOCX, TXT, MD, and HTML.
result = client.pipeline.chunk_file("./docs/manual.pdf")
print(f"Got {result.total_chunks} chunks from {result.source}")
print(f"Cost: ${result.credits_used:.4f}")
Crawl a website
Crawls via sitemap and chunks all pages.
result = client.pipeline.crawl("https://example.com", max_pages=10)
print(f"Crawled {result.pages_crawled} pages → {result.total_chunks} chunks")
print(f"Cost: ${result.credits_used:.4f}")
Full Pipeline — Embed + Inject
Scrape, embed, and inject directly into your vector database in one call.
Sync a URL
result = client.pipeline.sync(
    url="https://docs.example.com",
    embedding_provider="openai",
    embedding_api_key="sk-...",
    vector_db="pinecone",
    vector_db_api_key="pc-...",
    index_name="my-docs",
)
print(f"Upserted {result.vectors_upserted} vectors ({result.total_tokens} tokens)")
print(f"Cost: ${result.credits_used:.4f}")
Ingest a local file
result = client.pipeline.ingest(
    file_path="./docs/manual.pdf",
    embedding_provider="openai",
    embedding_api_key="sk-...",
    vector_db="pinecone",
    vector_db_api_key="pc-...",
    index_name="my-docs",
)
Contextual Retrieval (RAG 2.0)
Prepend an LLM-generated document summary to every chunk before embedding, so each chunk carries document-level context that can improve retrieval accuracy.
Additional cost: $0.0030 per URL when contextual_retrieval=True.
result = client.pipeline.chunk_url(
    "https://docs.example.com",
    contextual_retrieval=True,
    llm_provider="openai",
    llm_api_key="sk-...",
    llm_model="gpt-4o-mini",
)
Supported LLM providers: openai, anthropic, gemini
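For intuition, the prepend step can be pictured as the local transformation below. This is only a conceptual sketch: the function name, the blank-line separator, and the use of plain strings are assumptions, and the service's actual prompt and output format are not documented here.

```python
def contextualize(chunks: list[str], summary: str) -> list[str]:
    """Prepend one document-level summary to every chunk (illustrative only)."""
    prefix = summary.strip()
    # Each chunk gets the same summary, separated by a blank line (assumed format).
    return [f"{prefix}\n\n{chunk}" for chunk in chunks]


chunks = ["Install with pip.", "Set the API key."]
for text in contextualize(chunks, "Setup guide for the Example SDK."):
    print(repr(text))
```

Because the summary is embedded together with each chunk, a short fragment such as "Set the API key." still carries information about which document it came from.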
Async Support
All methods have an _async variant for use with asyncio.
import asyncio
from scrapedatshi import ScrapedatshiClient
async def main():
    async with ScrapedatshiClient(api_key="sds_...") as client:
        result = await client.pipeline.chunk_url_async("https://docs.example.com")
        print(f"Got {result.total_chunks} chunks, cost ${result.credits_used:.4f}")

asyncio.run(main())
Parallel processing with asyncio.gather
import asyncio
from scrapedatshi import ScrapedatshiClient

async def main():
    async with ScrapedatshiClient(api_key="sds_...") as client:
        urls = [
            "https://docs.example.com/page1",
            "https://docs.example.com/page2",
            "https://docs.example.com/page3",
        ]
        # Start all requests concurrently; results keep the order of urls.
        results = await asyncio.gather(
            *[client.pipeline.chunk_url_async(url) for url in urls]
        )
        total = sum(r.total_chunks for r in results)
        total_cost = sum(r.credits_used for r in results)
        print(f"Processed {len(urls)} URLs → {total} total chunks, total cost ${total_cost:.4f}")

asyncio.run(main())
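With longer URL lists, launching every request at once may run into rate limits, and a single failure makes `asyncio.gather` raise. One possible refinement is to cap the number of in-flight calls and collect exceptions instead of raising. The helper below is a generic sketch with hypothetical names (`gather_bounded`, `limit`); it relies only on the standard library and works with any coroutine function.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    func: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    limit: int = 5,
) -> list:
    """Run func(item) for every item with at most `limit` calls in flight.

    Results keep the input order. A failed call yields its exception object
    in the result list instead of cancelling the other calls.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T):
        async with semaphore:  # wait here while `limit` calls are already running
            return await func(item)

    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
```

Inside `main()` this could be called as `results = await gather_bounded(client.pipeline.chunk_url_async, urls, limit=3)`, after which each entry should be checked with `isinstance(entry, Exception)` before its fields are read.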
Response Models
All methods return typed Pydantic models with full IDE autocomplete support.
Every response includes credits_used and credits_remaining for programmatic spend tracking.
ChunkResult
result.chunks # list[Chunk]
result.total_chunks # int
result.source # str
result.contextual_retrieval_used # bool
result.content_truncated # bool — True if content exceeded ~75,000 words
result.credits_used # float — credits deducted for this request
result.credits_remaining # float — account balance after this request
Chunk
chunk.content # str — the chunk text
chunk.token_estimate # int — estimated token count
chunk.metadata # dict — source URL, page number, etc.
CrawlChunkResult
result.chunks # list[Chunk]
result.total_chunks # int
result.pages_crawled # int
result.source_url # str
result.credits_used # float
result.credits_remaining # float
SyncResult / IngestResult
result.status # "success" | "partial" | "error"
result.chunks_created # int
result.vectors_upserted # int
result.total_tokens # int
result.embedding_provider # str
result.vector_db_provider # str
result.credits_used # float
result.credits_remaining # float
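Since every result model above exposes `credits_used` and `credits_remaining`, spend tracking can be layered on top with a few lines. The following is a minimal sketch, assuming only those two attributes; `SpendTracker` and `BudgetExceeded` are hypothetical names and not part of the SDK.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when recorded spend passes the configured budget."""


@dataclass
class SpendTracker:
    budget: float                           # maximum spend allowed for this job
    spent: float = 0.0                      # running total of credits_used
    last_remaining: Optional[float] = None  # latest reported account balance

    def record(self, result) -> None:
        """Add one response's cost; accepts any object with the two credit fields."""
        self.spent += result.credits_used
        self.last_remaining = result.credits_remaining
        if self.spent > self.budget:
            raise BudgetExceeded(
                f"spent {self.spent:.4f}, budget was {self.budget:.4f}"
            )
```

Calling `tracker.record(result)` after each pipeline call stops a batch job once the budget is passed. The check happens after the request has been charged, so the final total can overshoot the budget by one request.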
Error Handling
from scrapedatshi.exceptions import (
    AuthError,                 # Invalid or missing API key (401/403)
    InsufficientCreditsError,  # Balance too low; top up at portal/billing (402)
    RateLimitError,            # Per-request hard cap or rate limit exceeded (429)
    ValidationError,           # Bad request payload (422)
    ServerError,               # API server error (5xx)
    TimeoutError,              # Request timed out
    ScrapedatshiError,         # Base exception, catch-all
)
try:
    result = client.pipeline.sync(
        url="https://docs.example.com",
        embedding_provider="openai",
        embedding_api_key="sk-...",
        vector_db="pinecone",
        vector_db_api_key="pc-...",
        index_name="my-docs",
    )
except InsufficientCreditsError:
    print("Balance too low: top up at scrapedatshi.com/portal/billing")
except RateLimitError as e:
    print(f"Rate limit hit: {e.message}")
except ScrapedatshiError as e:
    print(f"API error {e.status_code}: {e.message}")
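Transient failures such as rate limits, server errors, and timeouts are often worth retrying, while credit and validation errors are not. The SDK's own retry behaviour is not described here, so the helper below is a generic, hypothetical sketch of exponential backoff with jitter that takes the retryable exception types as an argument.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            # Delay doubles each attempt; jitter spreads out concurrent clients.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2)
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

A call might look like `call_with_retry(lambda: client.pipeline.chunk_url(url), retry_on=(RateLimitError, ServerError, TimeoutError))`. A request rejected for exceeding a per-request hard cap will fail again on every retry, so that case is better handled by shrinking the request.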
Hard Caps
Per-request hard caps protect server stability and apply to all accounts:
| Cap | Limit |
|---|---|
| Max pages / crawl | 35 |
| Max pages / spider | 35 |
| Max chunks / request | 35,000 |
| Max content size | ~75,000 words (auto-truncated) |
Exceeding a hard cap returns HTTP 400. Content exceeding the size limit is automatically
truncated — check result.content_truncated to detect this.
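Some of these caps can be checked on the client side before a request is sent. The sketch below uses hypothetical helper names and the limits from the table above; counting words by splitting on whitespace is an assumption, since the server's exact counting method is not documented here.

```python
MAX_PAGES = 35      # crawl and spider cap from the table above
MAX_WORDS = 75_000  # approximate content cap from the table above


def clamp_pages(requested: int) -> int:
    """Limit a requested page count to the per-request cap."""
    if requested < 1:
        raise ValueError("requested pages must be at least 1")
    return min(requested, MAX_PAGES)


def likely_truncated(text: str) -> bool:
    """Guess whether text would exceed the content cap (whitespace word count)."""
    return len(text.split()) > MAX_WORDS


print(clamp_pages(100))  # 35
```

`clamp_pages` could feed the `max_pages` argument of `client.pipeline.crawl`, and `likely_truncated` gives an early warning for local text files; `result.content_truncated` remains the authoritative signal.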
Development
git clone https://github.com/mxchris18/scrapedatshi-py
cd scrapedatshi-py
pip install -e ".[dev]"
pytest
License
MIT — see LICENSE.