scrapedatshi-py

Official Python SDK for the scrapedatshi RAG pipeline API.

Scrape URLs, chunk documents, embed content, and inject the resulting vectors into vector databases, all from a clean, typed Python interface.

Installation

pip install scrapedatshi

Requires Python 3.10+.

Quick Start

from scrapedatshi import ScrapedatshiClient

client = ScrapedatshiClient(api_key="sds_...")

# Chunk a URL to JSON (all tiers — no embedding required)
result = client.pipeline.chunk_url("https://docs.example.com")

print(f"Got {result.total_chunks} chunks")
for chunk in result.chunks:
    print(chunk.content[:80])

Authentication

Pass your API key directly or set the SCRAPEDATSHI_API_KEY environment variable:

export SCRAPEDATSHI_API_KEY="sds_..."

# Explicit key
client = ScrapedatshiClient(api_key="sds_...")

# From environment variable
client = ScrapedatshiClient()

Get your API key at scrapedatshi.com/portal/register.
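The SDK's own lookup order is not spelled out here, so as a minimal sketch, a small pre-flight helper can make the precedence explicit before the client is constructed. `resolve_api_key` is a hypothetical helper, not part of the SDK; only the `SCRAPEDATSHI_API_KEY` variable name comes from this section.

```python
import os
from typing import Optional

ENV_VAR = "SCRAPEDATSHI_API_KEY"  # variable name taken from the section above


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Return the explicit key if given, else the environment value.

    Hypothetical helper (not part of the SDK): it fails early with a clear
    message instead of waiting for the first API call to be rejected.
    """
    key = (explicit or os.environ.get(ENV_VAR, "")).strip()
    if not key:
        raise RuntimeError(f"No API key: pass one explicitly or set {ENV_VAR}")
    return key
```

The result could then be passed as `ScrapedatshiClient(api_key=resolve_api_key())`.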

Pipeline Methods

Chunk to JSON (all tiers)

No embedding or vector DB required. Returns structured JSON chunks from any source.

Chunk a URL

result = client.pipeline.chunk_url("https://docs.example.com")

# result.chunks       → list[Chunk]
# result.total_chunks → int
# result.source       → str (the URL)

Chunk a local file

Supports PDF, DOCX, TXT, MD, and HTML.

result = client.pipeline.chunk_file("./docs/manual.pdf")

print(f"Got {result.total_chunks} chunks from {result.source}")
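When chunking a whole folder, it may help to filter out unsupported files before calling the API. The sketch below assumes the listed formats map to the usual file extensions; the SDK may detect file types differently, and `split_supported` is a hypothetical helper.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats listed above
# (PDF, DOCX, TXT, MD, HTML); the SDK may detect file types differently.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md", ".html"}


def split_supported(paths):
    """Partition paths into (supported, skipped) by file extension."""
    supported, skipped = [], []
    for raw in paths:
        path = Path(raw)
        # Compare case-insensitively so "manual.PDF" is accepted too.
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped
```

Each path in the first list could then be passed to `client.pipeline.chunk_file(str(path))`.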

Crawl a website (Basic tier+)

Discovers pages through the site's sitemap and chunks each one, up to max_pages.

result = client.pipeline.crawl("https://example.com", max_pages=10)

print(f"Crawled {result.pages_crawled} pages → {result.total_chunks} chunks")

Full Pipeline — Embed + Inject (Pro/Enterprise)

Scrape, embed, and inject directly into your vector database in one call.

Sync a URL

result = client.pipeline.sync(
    url="https://docs.example.com",
    embedding_provider="openai",
    embedding_api_key="sk-...",
    vector_db="pinecone",
    vector_db_api_key="pc-...",
    index_name="my-docs",
)

print(f"Upserted {result.vectors_upserted} vectors ({result.total_tokens} tokens)")

Ingest a local file

result = client.pipeline.ingest(
    file_path="./docs/manual.pdf",
    embedding_provider="openai",
    embedding_api_key="sk-...",
    vector_db="pinecone",
    vector_db_api_key="pc-...",
    index_name="my-docs",
)
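Since `sync` and `ingest` take the same provider arguments, one way to avoid hard-coding secrets is to build those arguments once from the environment. This is a sketch under assumptions: the environment variable names `OPENAI_API_KEY` and `PINECONE_API_KEY` and the helper `vector_pipeline_kwargs` are hypothetical, while the parameter names mirror the calls above.

```python
import os


def vector_pipeline_kwargs(index_name, env=os.environ):
    """Build the keyword arguments shared by sync() and ingest().

    The environment variable names are assumptions for this sketch;
    the dictionary keys mirror the parameters shown in the calls above.
    """
    required = ("OPENAI_API_KEY", "PINECONE_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "embedding_provider": "openai",
        "embedding_api_key": env["OPENAI_API_KEY"],
        "vector_db": "pinecone",
        "vector_db_api_key": env["PINECONE_API_KEY"],
        "index_name": index_name,
    }
```

A call might then look like `client.pipeline.sync(url="https://docs.example.com", **vector_pipeline_kwargs("my-docs"))`.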

Contextual Retrieval (RAG 2.0) — Basic tier+

Prepend an LLM-generated document summary to every chunk before embedding, so each chunk carries document-level context that can improve retrieval accuracy.

result = client.pipeline.chunk_url(
    "https://docs.example.com",
    contextual_retrieval=True,
    llm_provider="openai",
    llm_api_key="sk-...",
    llm_model="gpt-4o-mini",
)

Supported LLM providers: openai, anthropic, gemini
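To make the idea concrete, here is a local illustration of what prepending a summary to each chunk looks like. It is only a sketch: the real feature runs inside the API, and the template and separator it uses are not documented here. `contextualize` is a hypothetical function.

```python
def contextualize(summary, chunks, separator="\n\n"):
    """Prepend one document-level summary to each chunk's text.

    Illustration only: the actual template used by the API is unknown,
    so the blank-line separator here is an assumption.
    """
    summary = summary.strip()
    return [f"{summary}{separator}{chunk}" for chunk in chunks]


chunks = ["Set the timeout with --timeout.", "Retries default to 3."]
for text in contextualize("Reference for a command-line tool.", chunks):
    print(repr(text))
```

A chunk such as "Retries default to 3." says little on its own; with the summary attached, its embedding also reflects which document it came from.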

Async Support

All pipeline methods have an `_async` variant (for example, `chunk_url_async`) for use with asyncio.

import asyncio
from scrapedatshi import ScrapedatshiClient

async def main():
    async with ScrapedatshiClient(api_key="sds_...") as client:
        result = await client.pipeline.chunk_url_async("https://docs.example.com")
        print(f"Got {result.total_chunks} chunks")

asyncio.run(main())

Parallel processing with `asyncio.gather`

import asyncio
from scrapedatshi import ScrapedatshiClient

async def main():
    async with ScrapedatshiClient(api_key="sds_...") as client:
        urls = [
            "https://docs.example.com/page1",
            "https://docs.example.com/page2",
            "https://docs.example.com/page3",
        ]
        # Start all requests at once; results keep the order of `urls`
        results = await asyncio.gather(
            *[client.pipeline.chunk_url_async(url) for url in urls]
        )
        total = sum(r.total_chunks for r in results)
        print(f"Processed {len(urls)} URLs → {total} total chunks")

asyncio.run(main())
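A plain `asyncio.gather` starts every request at once, which may exceed the "Concurrent requests" figure listed under Tier Limits. How the API treats excess concurrent requests is not stated here, so the following is a sketch of one cautious approach: cap the number of in-flight calls with a semaphore. `gather_bounded` is a hypothetical helper, not part of the SDK.

```python
import asyncio


async def gather_bounded(func, items, limit):
    """Run func(item) for every item with at most `limit` calls in flight.

    Results come back in the same order as `items`, like asyncio.gather.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Each call waits here until one of the `limit` slots is free.
        async with semaphore:
            return await func(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

Inside `main()`, this could be used as `results = await gather_bounded(client.pipeline.chunk_url_async, urls, limit=3)`, with `limit` chosen to match your plan.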

Response Models

All methods return typed Pydantic models with full IDE autocomplete support.

`ChunkResult`

result.chunks              # list[Chunk]
result.total_chunks        # int
result.source              # str
result.contextual_retrieval_used  # bool

`Chunk`

chunk.content              # str — the chunk text
chunk.token_estimate       # int — estimated token count
chunk.metadata             # dict — source URL, page number, etc.
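As an example of working with these fields, `token_estimate` can be used to select chunks that fit a prompt budget. The sketch below uses a stand-in dataclass with the same three fields so it runs without the SDK; `FakeChunk` and `take_within_budget` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Stand-in with the same three fields as the Chunk model above."""
    content: str
    token_estimate: int
    metadata: dict = field(default_factory=dict)


def take_within_budget(chunks, max_tokens):
    """Return the leading chunks whose token_estimate sum fits max_tokens."""
    picked, used = [], 0
    for chunk in chunks:
        # Stop at the first chunk that would push the total over budget.
        if used + chunk.token_estimate > max_tokens:
            break
        picked.append(chunk)
        used += chunk.token_estimate
    return picked, used
```

Because the function only reads `token_estimate`, it should work the same way on `result.chunks` from a real call, keeping in mind that the counts are estimates.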

`CrawlChunkResult`

result.chunks              # list[Chunk]
result.total_chunks        # int
result.pages_crawled       # int
result.source_url          # str

`SyncResult` / `IngestResult`

result.status              # "success" | "error"
result.chunks_created      # int
result.vectors_upserted    # int
result.total_tokens        # int
result.embedding_provider  # str
result.vector_db_provider  # str

Error Handling

from scrapedatshi.exceptions import (
    AuthError,        # Invalid or missing API key (401/403)
    TierError,        # Feature not available on your plan (403)
    RateLimitError,   # Monthly or per-minute limit exceeded (429)
    ValidationError,  # Bad request payload (422)
    ServerError,      # API server error (5xx)
    TimeoutError,     # Request timed out
    ScrapedatshiError # Base exception — catch-all
)

try:
    result = client.pipeline.sync(
        url="https://docs.example.com",
        embedding_provider="openai",
        embedding_api_key="sk-...",
        vector_db="pinecone",
        vector_db_api_key="pc-...",
        index_name="my-docs",
    )
except TierError as e:
    print(f"Upgrade required: {e.message}")
except RateLimitError as e:
    print(f"Rate limit hit: {e.message}")
except ScrapedatshiError as e:
    print(f"API error {e.status_code}: {e.message}")
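Transient failures such as per-minute rate limits or server errors are often worth retrying. The sketch below is a generic retry loop with exponential backoff; it is not part of the SDK, and whether the API supplies a retry hint is not stated here. `call_with_retry` is a hypothetical helper that accepts any exception types.

```python
import time


def call_with_retry(call, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()`; on an exception in `retry_on`, wait and try again.

    Delays double each time (base_delay, 2 * base_delay, ...). The last
    failure is re-raised so callers still see the original exception.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

With the SDK this might be used as `call_with_retry(lambda: client.pipeline.chunk_url(url), retry_on=(RateLimitError, ServerError))`. Retrying will not help once a monthly limit is exhausted, so keep `attempts` small.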

Tier Limits

Feature	Free	Basic	Pro	Enterprise
Price	$0/mo	$9/mo	$29/mo	$49/mo + usage
Chunk to JSON	✓	✓	✓	✓
Sitemap Crawl	—	✓	✓	✓
Contextual Retrieval	—	✓	✓	✓
Full Pipeline	—	—	✓	✓
Deep Spider Crawl	—	—	✓	✓
Max pages / crawl	5	10	25	50
Max chunks / request	500	2,000	10,000	Unlimited
Concurrent requests	1	3	10	25
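The numeric limits in the table can be encoded client-side to avoid requests that a plan would not allow. This is a sketch only: the values are copied from the table, while the lowercase tier keys and the `clamp_max_pages` helper are hypothetical, and how the API responds to an over-limit request is not stated here.

```python
# Values copied from the table above; None stands for "Unlimited".
TIER_LIMITS = {
    "free": {"max_pages": 5, "max_chunks": 500, "concurrency": 1},
    "basic": {"max_pages": 10, "max_chunks": 2_000, "concurrency": 3},
    "pro": {"max_pages": 25, "max_chunks": 10_000, "concurrency": 10},
    "enterprise": {"max_pages": 50, "max_chunks": None, "concurrency": 25},
}


def clamp_max_pages(tier, requested):
    """Cap a requested page count at the tier's per-crawl maximum."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return min(requested, TIER_LIMITS[tier.lower()]["max_pages"])
```

For instance, `client.pipeline.crawl(url, max_pages=clamp_max_pages("basic", 40))` would request 10 pages rather than 40.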

Development

git clone https://github.com/mxchris18/scrapedatshi-py
cd scrapedatshi-py
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.

Release history

Version	Released
0.1.3	Jun 17, 2026
0.1.2	Jun 16, 2026 (this version)
0.1.1	Jun 16, 2026
0.1.0	Jun 16, 2026
