
API-first SaaS content retriever — uniform Document stream from GitHub, GitLab, Bitbucket, Notion, Confluence, Jira, and Slack via official REST APIs. No scraping, no Playwright.


saas-retriever

API-first SaaS content retriever. Yields a uniform Document stream from seven SaaS providers via their official REST APIs. Downstream pipelines (pleno-anonymize, pleno-dlp) consume the same Document shape regardless of which provider produced it.

Install

uv add saas-retriever
# or, as a CLI:
pipx install saas-retriever

Connectors

| kind | targets | resources |
|---|---|---|
| github | org or single repo | code, issues, pull requests (title + body + comments + diff) |
| gitlab | group (recursive) or single project | code, issues, merge requests (with per-file diff) |
| bitbucket | Cloud workspace or Server project, optionally pinned to a repo | code, pull requests, issues (Cloud only) |
| notion | search / explicit pages / database query (combinable) | page tree + database row properties → Markdown |
| confluence | Cloud or Data Center; spaces enumerated then pages | page body (storage XHTML → text) + comments + attachment refs |
| jira | Cloud (/rest/api/3 + ADF) or Data Center (/rest/api/2 + storage XHTML) | issues + comments + attachment URLs |
| slack | xoxb (bot) or xoxp (user) tokens | channels → history → threads → optional file refs |

Usage

CLI

# Org-wide GitHub scan (default = code + issues + PRs across every repo)
GITHUB_TOKEN=ghp_... saas-retriever fetch github --owner plenoai

# Single repo, only issues
saas-retriever fetch github --owner plenoai --repo saas-retriever \
    --resource issues

# Filter to recently-updated content
saas-retriever fetch github --owner plenoai --since 7d

fetch streams Documents as NDJSON to stdout (or to a file with --out FILE), one line per Document. Each record carries ref, text or binary_b64, fetched_at, content_hash, created_by, and extra.
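As a minimal sketch of consuming this output downstream, the snippet below parses NDJSON lines and returns each Document's payload as bytes. The field names come from the list above; the sample records, the contents of `ref`, and the helper name `iter_payloads` are hypothetical, and treating `text` and `binary_b64` as alternatives is an assumption.

```python
import base64
import json
from collections.abc import Iterable, Iterator


def iter_payloads(lines: Iterable[str]) -> Iterator[tuple[dict, bytes]]:
    """Yield (record, payload_bytes) for each non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("text") is not None:
            payload = record["text"].encode("utf-8")
        elif record.get("binary_b64") is not None:
            payload = base64.b64decode(record["binary_b64"])
        else:
            payload = b""  # assumption: a record may carry neither field
        yield record, payload


# Hypothetical sample records; real output carries more fields.
SAMPLE = [
    json.dumps({"ref": {"path": "README.md"}, "text": "hello"}),
    json.dumps(
        {
            "ref": {"path": "logo.png"},
            "binary_b64": base64.b64encode(b"\x89PNG").decode("ascii"),
        }
    ),
]

if __name__ == "__main__":
    for record, payload in iter_payloads(SAMPLE):
        print(record["ref"]["path"], len(payload))
```

In practice `lines` would be `sys.stdin` or an open handle on the file written with `--out`.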

Programmatic

import asyncio
from saas_retriever import registry

async def main() -> None:
    gh = registry.create(
        "github",
        owner="plenoai",
        resources={"code", "issues", "prs"},
    )
    try:
        async for doc in gh.discover_and_fetch():
            kind = doc.ref.metadata.get("resource_type")
            print(kind, doc.ref.path, len(doc.text or ""))
    finally:
        await gh.close()

asyncio.run(main())

Every connector exposes the same Connector protocol — swap "github" for "gitlab", "slack", etc. and the loop above keeps working.
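To illustrate why the loop is provider-agnostic, here is a minimal sketch of a structural type covering only the two methods used above, `discover_and_fetch()` and `close()`. The names `SupportsFetch`, `count_documents`, and `FakeConnector` are hypothetical and exist only for this example; the library's actual Connector protocol likely defines more than this.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any, Protocol


class SupportsFetch(Protocol):
    """Hypothetical subset of the connector surface used by the loop above."""

    def discover_and_fetch(self) -> AsyncIterator[Any]: ...

    async def close(self) -> None: ...


async def count_documents(connector: SupportsFetch) -> int:
    """Drain any connector and always release its resources."""
    count = 0
    try:
        async for _doc in connector.discover_and_fetch():
            count += 1
    finally:
        await connector.close()
    return count


class FakeConnector:
    """Stand-in for a real connector so the sketch runs offline."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs
        self.closed = False

    async def discover_and_fetch(self) -> AsyncIterator[str]:
        for doc in self.docs:
            yield doc

    async def close(self) -> None:
        self.closed = True


if __name__ == "__main__":
    print(asyncio.run(count_documents(FakeConnector(["a", "b", "c"]))))
```

Because the helper depends only on the method names, any object created through `registry.create(...)` could be passed in place of the stub, assuming it exposes the same two methods.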

Auth

Each connector accepts either a typed Credential or the discrete constructor kwargs (token=, username=, email=, api_token=, …). Credential payload keys are auto-redacted in repr/str.
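The redaction behaviour can be pictured with the following sketch. It is not the library's Credential class: `RedactedCredential` and its fields are hypothetical, and the code only shows one way to keep secret values out of `repr` and `str`.

```python
from dataclasses import dataclass, field


@dataclass(repr=False)
class RedactedCredential:
    """Hypothetical credential holder that hides payload values when printed."""

    kind: str
    payload: dict[str, str] = field(default_factory=dict)

    def __repr__(self) -> str:
        # Show which keys are present, never their values.
        parts = [f"kind={self.kind!r}"]
        parts += [f"{key}='***'" for key in sorted(self.payload)]
        return f"RedactedCredential({', '.join(parts)})"

    # str() falls back to __repr__, so both output paths are masked.


if __name__ == "__main__":
    print(RedactedCredential("github", {"token": "ghp_example"}))
```

Masking at the `repr` level means an accidental log line or traceback shows the key names but not the secrets.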

| connector | accepted credential shapes |
|---|---|
| github | token= (PAT). CLI also resolves GITHUB_TOKEN env var or gh auth token. |
| gitlab | token= + auth= ∈ {pat, project, oauth}. Bearer for OAuth, PRIVATE-TOKEN otherwise. |
| bitbucket | Cloud: token= (Bearer) or username=/app_password= (Basic). Server: token= or username=/password=. |
| notion | token= (Bearer integration token). |
| confluence | Cloud: token= (Bearer) or email=/api_token= (Basic). DC: token= (Bearer PAT) or username=/password=. |
| jira | access_token= (Bearer); Cloud: email=/api_token=; DC: username=/password=. |
| slack | token= (xoxb-… or xoxp-…). |
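The table mentions three header styles: Bearer, GitLab's PRIVATE-TOKEN, and Basic. As a sketch of what those look like on the wire, the helpers below build the corresponding HTTP headers. The function names are hypothetical and are not part of saas-retriever; how the connectors assemble headers internally is not shown in this document.

```python
import base64


def bearer_headers(token: str) -> dict[str, str]:
    """Authorization header for Bearer-style tokens."""
    return {"Authorization": f"Bearer {token}"}


def basic_headers(user: str, secret: str) -> dict[str, str]:
    """HTTP Basic: base64 of 'user:secret', e.g. email + api_token."""
    raw = f"{user}:{secret}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def gitlab_headers(token: str, auth: str = "pat") -> dict[str, str]:
    """Bearer for OAuth tokens, PRIVATE-TOKEN for the other token kinds."""
    if auth not in {"pat", "project", "oauth"}:
        raise ValueError(f"unsupported auth kind: {auth!r}")
    if auth == "oauth":
        return bearer_headers(token)
    return {"PRIVATE-TOKEN": token}


if __name__ == "__main__":
    print(gitlab_headers("glpat-example", auth="pat"))
    print(basic_headers("user", "pass"))
```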

Cursors and incremental scans

Connectors that advertise Capabilities.incremental round-trip an opaque resume token through discover(filter, cursor=...):

  • gitlab / github — server-side filters where available.
  • confluence / jira — JSON cursor anchored on version.when / updated. Stale or malformed cursors fall back to a full re-walk.
  • slack — JSON {channel_id: latest_ts} per channel, fed back into Slack's oldest= parameter.
  • notion — search cursor round-tripped on every emitted ref via metadata["_cursor"].

Persist cursor_after_run() (when the connector exposes it) and pass the same string back on the next scan to resume.

Rate limiting

saas_retriever.AdaptiveTokenBucket and GlobalRateLimiter provide one AIMD (additive-increase, multiplicative-decrease) bucket per BucketKey(connector_kind, tenant_id). Connectors raise RateLimited on persistent throttling: HTTP 429 on most providers, plus 503 on Atlassian Data Center, whose reverse proxy by policy signals overload with 503 rather than 429. Callers can shrink the effective rate via on_throttle_signal(factor=0.5) and grow it back with on_success(recovery=...).
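For readers unfamiliar with AIMD, the sketch below shows the rate-adjustment idea in isolation. It is not the library's AdaptiveTokenBucket: `AimdRate`, its bounds, and its defaults are hypothetical, and treating `recovery` as an additive step in requests per second is an assumption. Only the method names mirror the text above, and the token-bucket mechanics (refill, waiting) are omitted.

```python
from dataclasses import dataclass


@dataclass
class AimdRate:
    """Hypothetical AIMD controller for a permitted request rate."""

    rate: float = 10.0     # current permitted requests per second
    floor: float = 0.5     # never drop below this rate
    ceiling: float = 10.0  # never grow beyond this rate

    def on_throttle_signal(self, factor: float = 0.5) -> None:
        """Multiplicative decrease: cut the rate when the provider throttles."""
        if not 0.0 < factor < 1.0:
            raise ValueError("factor must be between 0 and 1 (exclusive)")
        self.rate = max(self.floor, self.rate * factor)

    def on_success(self, recovery: float = 1.0) -> None:
        """Additive increase: creep back up after successful calls."""
        self.rate = min(self.ceiling, self.rate + recovery)


if __name__ == "__main__":
    limiter = AimdRate(rate=8.0, floor=1.0, ceiling=8.0)
    limiter.on_throttle_signal(factor=0.5)
    limiter.on_success(recovery=0.5)
    print(limiter.rate)
```

The asymmetry is the point of the design: backing off quickly relieves the provider at once, while recovering in small steps avoids bouncing straight back into the limit.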

Development

uv sync --all-extras
uv run ruff check
uv run mypy src
uv run pytest

The default pytest pass uses httpx.MockTransport for every HTTP call — no live API access in CI.

Release

Pushing a vX.Y.Z tag triggers PyPI trusted publishing via GitHub Actions, so no manual token is needed. The first publish requires a one-time Trusted Publisher configuration at https://pypi.org/manage/account/publishing/:

| Field | Value |
|---|---|
| PyPI Project Name | saas-retriever |
| Owner | plenoai |
| Repository name | saas-retriever |
| Workflow name | release.yml |
| Environment name | pypi |

License

AGPL-3.0-or-later.

Download files


Source Distribution

saas_retriever-1.0.0.tar.gz (74.9 kB view details)

Uploaded Source

Built Distribution


saas_retriever-1.0.0-py3-none-any.whl (89.7 kB view details)

Uploaded Python 3

File details

Details for the file saas_retriever-1.0.0.tar.gz.

File metadata

  • Download URL: saas_retriever-1.0.0.tar.gz
  • Upload date:
  • Size: 74.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for saas_retriever-1.0.0.tar.gz
Algorithm Hash digest
SHA256 7a6581360e33dd48ec8d2f264648afadfdc283ab377629588baac189451b95e3
MD5 bad3eb9f989c30b5a3787c63307dd7e9
BLAKE2b-256 dfb434e08a1d926daa249793a7601866cae49e2d0e9d6194237f1ffb4e5ec608


Provenance

The following attestation bundles were made for saas_retriever-1.0.0.tar.gz:

Publisher: release.yml on plenoai/saas-retriever

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file saas_retriever-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: saas_retriever-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 89.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for saas_retriever-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a8542a990b354858446d0df29943a6976a7cc58ecd645dfb956e778a5d7f3ade
MD5 af5dd22dc4e4d3cee1229cac7a739d60
BLAKE2b-256 63636353e40ee3a8eff6ec43741597dca9598c86b249d4c0b7a9c22f330d7c61


Provenance

The following attestation bundles were made for saas_retriever-1.0.0-py3-none-any.whl:

Publisher: release.yml on plenoai/saas-retriever

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
