API-first SaaS content retriever — uniform Document stream from GitHub (issues + PRs + code, org-wide). No scraping, no Playwright.

saas-retriever

API-first SaaS content retriever. Yields a uniform Document stream from SaaS providers via their official APIs — no scraping, no Playwright, no Chrome. Downstream pipelines (pleno-anonymize, pleno-secret-scanner) consume the same Document shape they always have.

Heads up: saas-retriever is the API-only successor to saas-scraper (PyPI 0.1–0.5, deprecated). The browser-driven connectors are gone; everything in this package goes through documented APIs. Pin saas-scraper<=0.5 if you specifically need the old behaviour.

Install

uv add saas-retriever
# or, as a CLI:
pipx install saas-retriever

Usage

CLI

# Org-wide GitHub scan (default = code + issues + PRs across every repo)
GITHUB_TOKEN=ghp_... saas-retriever fetch github --owner plenoai

# Single repo, only issues
saas-retriever fetch github --owner plenoai --repo saas-retriever \
    --resource issues

# Filter to recently-updated content
saas-retriever fetch github --owner plenoai --since 7d

fetch streams Documents as NDJSON to stdout, or to a file with --out FILE. Each line is one Document with the fields ref, text or binary_b64, fetched_at, content_hash, created_by, and extra.
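As a rough illustration of consuming that stream downstream, the sketch below parses NDJSON lines and returns each document body as bytes. The helper names (iter_documents, payload_bytes) and the sample lines are hypothetical; it assumes ref is a JSON object with a path key and that binary_b64 is standard base64 of the raw bytes, neither of which is confirmed here.

```python
import base64
import io
import json
from typing import Any, Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield one decoded JSON object per non-empty NDJSON line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def payload_bytes(doc: dict[str, Any]) -> bytes:
    """Return the document body as bytes, whichever field carries it."""
    if doc.get("text") is not None:
        return doc["text"].encode("utf-8")
    # Assumption: binary_b64 holds standard base64 of the raw bytes.
    return base64.b64decode(doc.get("binary_b64") or "")


# Hypothetical sample lines; real output carries more fields and may
# structure ref differently.
sample = io.StringIO(
    '{"ref": {"path": "README.md"}, "text": "hello"}\n'
    '{"ref": {"path": "logo.png"}, "binary_b64": "iVBORw=="}\n'
)
docs = list(iter_documents(sample))
sizes = [len(payload_bytes(d)) for d in docs]
```

With real output, the same helper could read from sys.stdin after piping saas-retriever fetch into the script.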

Programmatic

import asyncio
from saas_retriever import registry

async def main() -> None:
    gh = registry.create(
        "github",
        owner="plenoai",
        resources={"code", "issues", "prs"},
    )
    try:
        async for doc in gh.discover_and_fetch():
            kind = doc.ref.metadata.get("resource_type")
            print(kind, doc.ref.path, len(doc.text or ""))
    finally:
        await gh.close()

asyncio.run(main())

Auth

Token resolution order:

  1. token= constructor argument (--token on the CLI)
  2. GITHUB_TOKEN environment variable
  3. gh auth token if the GitHub CLI is on PATH
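
The resolution order above can be sketched roughly as follows. This is an illustration of the lookup chain, not the package's actual code: resolve_token and its parameters are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional


def resolve_token(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first token found, or None for anonymous access."""
    env = os.environ if env is None else env

    # 1. Explicit argument (token= / --token) wins.
    if explicit:
        return explicit

    # 2. GITHUB_TOKEN environment variable.
    if env.get("GITHUB_TOKEN"):
        return env["GITHUB_TOKEN"]

    # 3. Ask the GitHub CLI, but only if it is on PATH.
    gh = shutil.which("gh")
    if gh is not None:
        try:
            result = subprocess.run(
                [gh, "auth", "token"],
                capture_output=True,
                text=True,
                timeout=10,  # arbitrary; avoids hanging on a stuck CLI
                check=False,
            )
        except (OSError, subprocess.TimeoutExpired):
            return None
        token = result.stdout.strip()
        if result.returncode == 0 and token:
            return token

    return None
```

Passing env explicitly keeps the function easy to test without touching the real process environment.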

Anonymous (token-less) requests work for public content but are rate-limited to 60 requests per hour: fine for a smoke test, not enough for an org-wide scan. For the minimum viable scope, use a fine-grained PAT with metadata:read + contents:read + issues:read + pull_requests:read.

Connectors

| Connector | Status | What it covers |
|---|---|---|
| github | implemented (v0.1) | Org-wide repo enumeration + per-repo code (recursive tree), issues (title + body + comments), pull requests (title + body + comments + review comments + diff). Default: all three resources. |

Slack, Jira, Confluence, Notion, GitLab, Bitbucket land in subsequent releases as standalone API-based connectors. The Document / DocumentRef / Connector protocol is stable and downstream consumers won't need to change.

Rate-limit handling

The connector reads the X-RateLimit-Remaining and X-RateLimit-Reset headers and, on a 403 secondary rate-limit response, sleeps until the bucket resets. Hard 429s honour Retry-After. 5xx errors are retried with exponential backoff (3 attempts).
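
A minimal sketch of that decision logic, written as a pure function so it can be tested without any HTTP traffic. The name retry_delay, the 1-second base delay, and the use of X-RateLimit-Remaining == "0" to recognise a rate-limit 403 are all assumptions made for illustration; the connector itself may detect and time these cases differently.

```python
import time
from typing import Mapping, Optional

MAX_ATTEMPTS = 3   # the "3 attempts" mentioned above
BASE_DELAY = 1.0   # assumed base for exponential backoff, in seconds


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    now: Optional[float] = None,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if no retry applies.

    `attempt` is 1-based. `headers` is a plain mapping here, so key
    lookups are case-sensitive (real HTTP clients usually are not).
    """
    now = time.time() if now is None else now

    if status == 429:
        # Assumption: Retry-After is given in seconds, not as a date.
        try:
            return float(headers.get("Retry-After", BASE_DELAY))
        except ValueError:
            return BASE_DELAY

    if status == 403 and headers.get("X-RateLimit-Remaining") == "0":
        # X-RateLimit-Reset is a Unix timestamp; wait until then.
        reset = float(headers.get("X-RateLimit-Reset", now))
        return max(0.0, reset - now)

    if 500 <= status < 600:
        if attempt >= MAX_ATTEMPTS:
            return None  # give up after the final attempt
        return BASE_DELAY * 2 ** (attempt - 1)

    return None
```

For example, a 503 on the first attempt yields a 1-second wait, on the second a 2-second wait, and on the third the function returns None so the caller can surface the error.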

Development

uv sync --all-extras
uv run ruff check
uv run mypy src
uv run pytest

The default pytest pass uses httpx.MockTransport for every HTTP call — no live API access in CI. A live smoke test against a real public org runs as a manual step before release.

Release

vX.Y.Z tag pushes trigger PyPI trusted publishing via GitHub Actions — no manual token. The first publish requires a one-time Trusted Publisher configuration at https://pypi.org/manage/account/publishing/:

| Field | Value |
|---|---|
| PyPI Project Name | saas-retriever |
| Owner | plenoai |
| Repository name | saas-retriever |
| Workflow name | release.yml |
| Environment name | pypi |

License

AGPL-3.0-or-later.


