API-first SaaS content retriever — uniform Document stream from GitHub (issues + PRs + code, org-wide). No scraping, no Playwright.
saas-retriever
API-first SaaS content retriever. Yields a uniform Document stream
from SaaS providers via their official APIs: no scraping, no
Playwright, no Chrome. Downstream pipelines (pleno-anonymize,
pleno-secret-scanner) consume the same Document shape they always have.
Heads up: saas-retriever is the API-only successor of saas-scraper (PyPI 0.1–0.5, deprecated). The browser-driven connectors are gone; everything in this package goes through documented APIs. Pin saas-scraper<=0.5 if you specifically need the old behaviour.
Install
uv add saas-retriever
# or, as a CLI:
pipx install saas-retriever
Usage
CLI
# Org-wide GitHub scan (default = code + issues + PRs across every repo)
GITHUB_TOKEN=ghp_... saas-retriever fetch github --owner plenoai
# Single repo, only issues
saas-retriever fetch github --owner plenoai --repo saas-retriever \
--resource issues
# Filter to recently-updated content
saas-retriever fetch github --owner plenoai --since 7d
fetch streams Documents as NDJSON to stdout (or --out FILE). One
line per Document: ref, text or binary_b64, fetched_at,
content_hash, created_by, extra.
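A downstream consumer of the NDJSON stream could look roughly like the sketch below. It assumes only what is stated above: one JSON object per line, carrying either text or binary_b64. The shape of ref in the sample, the added binary key, and the helper name read_documents are illustrative assumptions, not part of the package.

```python
import base64
import io
import json
from typing import Any, Iterator, TextIO


def read_documents(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield one decoded record per non-empty NDJSON line."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: binary payloads are standard base64 under `binary_b64`.
        if record.get("binary_b64") is not None:
            record["binary"] = base64.b64decode(record["binary_b64"])
        yield record


# Hypothetical sample lines; real output may carry more fields
# and a different `ref` layout.
sample = io.StringIO(
    json.dumps({"ref": {"path": "README.md"}, "text": "hello"}) + "\n"
    + json.dumps({
        "ref": {"path": "logo.png"},
        "binary_b64": base64.b64encode(b"\x89PNG").decode("ascii"),
    }) + "\n"
)
docs = list(read_documents(sample))
```

In a pipeline, the same function can read sys.stdin, for example when the output of saas-retriever fetch is piped into a script.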
Programmatic
import asyncio

from saas_retriever import registry


async def main() -> None:
    gh = registry.create(
        "github",
        owner="plenoai",
        resources={"code", "issues", "prs"},
    )
    try:
        async for doc in gh.discover_and_fetch():
            kind = doc.ref.metadata.get("resource_type")
            print(kind, doc.ref.path, len(doc.text or ""))
    finally:
        await gh.close()


asyncio.run(main())
Auth
Token resolution order:
1. token= constructor argument (--token on the CLI)
2. GITHUB_TOKEN environment variable
3. gh auth token, if the GitHub CLI is on PATH
Anonymous (token-less) requests work for public content but are
rate-limited to 60 requests per hour: fine for a smoke test, not
enough for an org-wide scan. For the minimum viable scope, use a
fine-grained PAT with metadata:read + contents:read + issues:read +
pull_requests:read.
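The resolution order above could be sketched as follows. This is an illustration of the documented order, not the package's actual code: the function name resolve_token and its parameters are hypothetical, and only the gh auth token command and the GITHUB_TOKEN variable come from the text.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional


def resolve_token(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first token found, or None for anonymous access."""
    # 1. Explicit argument wins.
    if explicit:
        return explicit

    # 2. Environment variable.
    env = os.environ if env is None else env
    if env.get("GITHUB_TOKEN"):
        return env["GITHUB_TOKEN"]

    # 3. GitHub CLI, only if `gh` can be found on PATH.
    gh = shutil.which("gh", path=env.get("PATH", ""))
    if gh:
        result = subprocess.run(
            [gh, "auth", "token"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout.strip()

    # Nothing found: the caller falls back to anonymous requests.
    return None
```

Passing env explicitly keeps the function easy to test without touching the real environment.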
Connectors
| Connector | Status | What it covers |
|---|---|---|
| github | implemented (v0.1) | Org-wide repo enumeration + per-repo code (recursive tree), issues (title + body + comments), pull requests (title + body + comments + review comments + diff). Default: all three resources. |
Slack, Jira, Confluence, Notion, GitLab, Bitbucket land in subsequent
releases as standalone API-based connectors. The Document /
DocumentRef / Connector protocol is stable and downstream consumers
won't need to change.
Rate-limit handling
The connector reads X-RateLimit-Remaining / X-RateLimit-Reset and
sleeps until the bucket resets on 403 secondary rate-limit
responses. Hard 429s honour Retry-After. 5xx errors retry with
exponential backoff (3 attempts).
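The policy described above can be written as a small pure function, shown here as a sketch. Several details are assumptions rather than the connector's actual behaviour: the name retry_delay, the 1-second base delay, treating X-RateLimit-Remaining of 0 as the signal on a 403, reading Retry-After as a number of seconds, and counting attempts from 1.

```python
import time
from typing import Mapping, Optional

MAX_ATTEMPTS = 3   # from the text: 5xx errors get 3 attempts
BASE_DELAY = 1.0   # assumption: base backoff in seconds


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    now: Optional[float] = None,
) -> Optional[float]:
    """Seconds to wait before retrying, or None to stop retrying.

    `attempt` is the number of the attempt that just failed (1-based).
    Header names are looked up exactly as spelled here.
    """
    now = time.time() if now is None else now

    # Hard 429: honour Retry-After (assumed to be in seconds).
    if status == 429:
        return float(headers.get("Retry-After", BASE_DELAY))

    # 403 with an exhausted bucket: sleep until the reset timestamp.
    if status == 403 and headers.get("X-RateLimit-Remaining") == "0":
        reset_at = float(headers.get("X-RateLimit-Reset", now))
        return max(0.0, reset_at - now)

    # 5xx: exponential backoff, giving up after MAX_ATTEMPTS.
    if 500 <= status < 600:
        if attempt >= MAX_ATTEMPTS:
            return None
        return BASE_DELAY * 2 ** (attempt - 1)

    # Anything else is not retried.
    return None
```

Keeping the decision separate from the sleep itself means the policy can be unit-tested with a fixed now value and no real waiting.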
Development
uv sync --all-extras
uv run ruff check
uv run mypy src
uv run pytest
The default pytest pass uses httpx.MockTransport for every HTTP
call — no live API access in CI. A live smoke test against a real
public org runs as a manual step before release.
Release
vX.Y.Z tag pushes trigger PyPI trusted publishing via GitHub Actions
— no manual token. The first publish requires a one-time Trusted
Publisher configuration at https://pypi.org/manage/account/publishing/:
| Field | Value |
|---|---|
| PyPI Project Name | saas-retriever |
| Owner | plenoai |
| Repository name | saas-retriever |
| Workflow name | release.yml |
| Environment name | pypi |
License
AGPL-3.0-or-later.