API-first SaaS content retriever — uniform Document stream from GitHub, GitLab, Bitbucket, Notion, Confluence, Jira, and Slack via official REST APIs. No scraping, no Playwright.
saas-retriever
API-first SaaS content retriever. Yields a uniform Document stream
from seven SaaS providers via their official REST APIs. Downstream
pipelines (pleno-anonymize,
pleno-dlp) consume the same
Document shape regardless of which provider produced it.
Install
uv add saas-retriever
# or, as a CLI:
pipx install saas-retriever
Connectors
| kind | targets | resources |
|---|---|---|
| github | org or single repo | code, issues, pull requests (title + body + comments + diff) |
| gitlab | group (recursive) or single project | code, issues, merge requests (with per-file diff) |
| bitbucket | Cloud workspace or Server project, optionally pinned to a repo | code, pull requests, issues (Cloud only) |
| notion | search / explicit pages / database query (combinable) | page tree + database row properties → Markdown |
| confluence | Cloud or Data Center; spaces enumerated then pages | page body (storage XHTML → text) + comments + attachment refs |
| jira | Cloud (/rest/api/3 + ADF) or Data Center (/rest/api/2 + storage XHTML) | issues + comments + attachment URLs |
| slack | xoxb (bot) or xoxp (user) tokens | channels → history → threads → optional file refs |
Usage
CLI
# Org-wide GitHub scan (default = code + issues + PRs across every repo)
GITHUB_TOKEN=ghp_... saas-retriever fetch github --owner plenoai
# Single repo, only issues
saas-retriever fetch github --owner plenoai --repo saas-retriever \
--resource issues
# Filter to recently-updated content
saas-retriever fetch github --owner plenoai --since 7d
fetch streams Documents as NDJSON to stdout (or --out FILE). One
line per Document: ref, text or binary_b64, fetched_at,
content_hash, created_by, extra.
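As a rough illustration of consuming that stream, the sketch below reads NDJSON lines and returns each Document's payload as bytes. It assumes that `binary_b64` is standard base64 and that exactly one of `text` / `binary_b64` is set per line; the helper names `iter_documents` and `document_bytes` are hypothetical and not part of saas-retriever.

```python
import base64
import json
from typing import Any, Dict, Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def document_bytes(doc: Dict[str, Any]) -> bytes:
    """Return the payload as bytes, whichever of text / binary_b64 is set."""
    if doc.get("text") is not None:
        return doc["text"].encode("utf-8")
    if doc.get("binary_b64") is not None:
        # Assumption: binary payloads use standard (not URL-safe) base64.
        return base64.b64decode(doc["binary_b64"])
    return b""
```

Piping the CLI into a script that calls `iter_documents(sys.stdin)` would process Documents one at a time without buffering the whole scan in memory.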
Programmatic
import asyncio

from saas_retriever import registry


async def main() -> None:
    gh = registry.create(
        "github",
        owner="plenoai",
        resources={"code", "issues", "prs"},
    )
    try:
        async for doc in gh.discover_and_fetch():
            kind = doc.ref.metadata.get("resource_type")
            print(kind, doc.ref.path, len(doc.text or ""))
    finally:
        await gh.close()


asyncio.run(main())
Every connector exposes the same Connector protocol — swap "github"
for "gitlab", "slack", etc. and the loop above keeps working.
Auth
Each connector accepts either a typed Credential or the discrete
constructor kwargs (token=, username=, email=, api_token=, …).
Credential payload keys are auto-redacted in repr/str.
| connector | accepted credential shapes |
|---|---|
| github | token= (PAT). CLI also resolves GITHUB_TOKEN env var or gh auth token. |
| gitlab | token= + auth= ∈ {pat, project, oauth}. Bearer for OAuth, PRIVATE-TOKEN otherwise. |
| bitbucket | Cloud: token= (Bearer) or username=/app_password= (Basic). Server: token= or username=/password=. |
| notion | token= (Bearer integration token). |
| confluence | Cloud: token= (Bearer) or email=/api_token= (Basic). DC: token= (Bearer PAT) or username=/password=. |
| jira | access_token= (Bearer); Cloud: email=/api_token=; DC: username=/password=. |
| slack | token= (xoxb-… or xoxp-…). |
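To illustrate what redaction in repr/str can look like, here is a minimal sketch. `RedactedCredential` is a hypothetical stand-in, not the library's Credential type, and it assumes that redaction means masking each payload value while keeping the key names visible.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True, repr=False)
class RedactedCredential:
    """Hypothetical stand-in; not saas-retriever's Credential class."""

    kind: str
    payload: Mapping[str, str] = field(default_factory=dict)

    def __repr__(self) -> str:
        # Show which keys are present, never their values.
        parts = [f"kind={self.kind!r}"]
        parts += [f"{key}='***'" for key in sorted(self.payload)]
        return f"RedactedCredential({', '.join(parts)})"

    # str() falls back to the same masked form.
    __str__ = __repr__
```

With this shape, logging a credential object by accident prints the key names only, so a stray `print(cred)` or an exception traceback does not leak the token.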
Cursors and incremental scans
Connectors that advertise Capabilities.incremental round-trip an
opaque resume token through discover(filter, cursor=...):
- gitlab / github: server-side filters where available.
- confluence / jira: JSON cursor anchored on version.when/updated. Stale or malformed cursors fall back to a full re-walk.
- slack: JSON {channel_id: latest_ts} per channel, fed back into Slack's oldest= parameter.
- notion: search cursor round-tripped on every emitted ref via metadata["_cursor"].
Persist cursor_after_run() (when the connector exposes it) and pass
the same string back on the next scan to resume.
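The per-channel cursor idea can be sketched as follows. This is an illustration under assumptions, not the connectors' actual code: the helper names are hypothetical, the cursor is assumed to be a flat JSON object mapping channel IDs to timestamp strings, and a malformed cursor is treated as empty, which corresponds to a full re-walk.

```python
import json
from decimal import Decimal
from typing import Dict, Optional


def load_cursor(raw: Optional[str]) -> Dict[str, str]:
    """Parse a JSON cursor; return {} (full re-walk) if missing or malformed."""
    if not raw:
        return {}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {str(key): str(value) for key, value in data.items()}


def advance_cursor(cursor: Dict[str, str], channel_id: str, ts: str) -> Dict[str, str]:
    """Return a new cursor with channel_id moved to ts when ts is newer."""
    current = cursor.get(channel_id)
    # Decimal keeps the microsecond part of the timestamp exact.
    if current is None or Decimal(ts) > Decimal(current):
        return {**cursor, channel_id: ts}
    return dict(cursor)


def dump_cursor(cursor: Dict[str, str]) -> str:
    """Serialize the cursor to the opaque string that gets persisted."""
    return json.dumps(cursor, sort_keys=True)
```

A caller would store the output of `dump_cursor` after a run and hand it back through `load_cursor` on the next one; older timestamps never move a channel backwards.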
Rate limiting
saas_retriever.AdaptiveTokenBucket + GlobalRateLimiter provide an
AIMD bucket per BucketKey(connector_kind, tenant_id). Connectors
raise RateLimited on persistent throttling (429 on most providers,
plus 503 on Atlassian Data Center, whose reverse proxy signals
overload with 503 rather than 429 by policy). Callers can shrink the
effective rate via on_throttle_signal(factor=0.5) and grow it back
with on_success(recovery=...).
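For readers unfamiliar with AIMD (additive increase, multiplicative decrease), the sketch below shows the general technique. `AimdRate` is a hypothetical class, not saas_retriever's AdaptiveTokenBucket; the method names mirror the ones above for readability, and treating `recovery` as an additive step in requests per second is an assumption.

```python
class AimdRate:
    """Hypothetical AIMD rate controller, for illustration only."""

    def __init__(self, rate: float, min_rate: float = 0.1, max_rate: float = 100.0) -> None:
        self.rate = rate
        self.min_rate = min_rate
        self.max_rate = max_rate

    def on_throttle_signal(self, factor: float = 0.5) -> None:
        """Multiplicative decrease: scale the rate down, never below min_rate."""
        self.rate = max(self.min_rate, self.rate * factor)

    def on_success(self, recovery: float = 1.0) -> None:
        """Additive increase: add a fixed step, never above max_rate."""
        self.rate = min(self.max_rate, self.rate + recovery)
```

Halving on a throttle signal backs off quickly when a provider pushes back, while the small additive step on success probes for spare capacity without overshooting again right away.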
Development
uv sync --all-extras
uv run ruff check
uv run mypy src
uv run pytest
The default pytest pass uses httpx.MockTransport for every HTTP
call — no live API access in CI.
Release
vX.Y.Z tag pushes trigger PyPI trusted publishing via GitHub Actions
— no manual token. The first publish requires a one-time Trusted
Publisher configuration at https://pypi.org/manage/account/publishing/:
| Field | Value |
|---|---|
| PyPI Project Name | saas-retriever |
| Owner | plenoai |
| Repository name | saas-retriever |
| Workflow name | release.yml |
| Environment name | pypi |
License
AGPL-3.0-or-later.