Chrome-driven SaaS content scraper — yields a uniform Document stream for downstream pipelines (pleno-anonymize, pleno-secret-scanner).
saas-scraper
Where API-based connectors stop (locked-down workspaces, SSO-only sessions,
content only visible in the UI), saas-scraper keeps going by driving a real
Chrome session via Playwright. It reuses your existing browser profile, so login,
MFA and SSO flows are inherited rather than re-implemented per provider.
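To illustrate the profile-reuse idea, here is a minimal sketch using Playwright's own `launch_persistent_context`. It is not saas-scraper's internal code: the `PROFILE_DIR` location and the `open_with_profile` helper are hypothetical names chosen for this example, and the real `BrowserSession` may configure Chrome differently.

```python
from pathlib import Path

# Hypothetical profile location, used only for this illustration.
PROFILE_DIR = Path.home() / ".cache" / "saas-scraper-demo-profile"


async def open_with_profile(url: str, profile_dir: Path = PROFILE_DIR) -> str:
    """Open `url` in Chromium with a persistent profile and return the page title."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # A persistent context stores cookies and local storage in profile_dir,
        # so a login completed in one run is still present in the next run.
        context = await p.chromium.launch_persistent_context(
            str(profile_dir),
            headless=False,
        )
        try:
            # A persistent context usually starts with one blank page; reuse it.
            page = context.pages[0] if context.pages else await context.new_page()
            await page.goto(url)
            return await page.title()
        finally:
            await context.close()
```

Calling `asyncio.run(open_with_profile("https://example.com"))` opens a visible Chromium window. Because the profile directory persists between runs, an interactive SSO or MFA login only needs to happen once.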
Install
uv add saas-scraper
# one-time browser binary install
uv run playwright install chromium
Or as a CLI:
pipx install saas-scraper
playwright install chromium
Usage
# List available connectors
saas-scraper list
# Scrape a Slack workspace and stream Documents to stdout (NDJSON)
saas-scraper fetch slack --workspace acme --since 7d
# Save to a file for downstream consumption
saas-scraper fetch notion --workspace acme > docs.ndjson
Programmatic use:
import asyncio

from saas_scraper import BrowserSession, registry


async def main() -> None:
    # One BrowserSession drives the Chrome instance for the whole run.
    async with BrowserSession() as session:
        connector = registry.create("slack", session=session, workspace="acme")
        async for doc in connector.discover_and_fetch():
            print(doc.ref.path, len(doc.text or ""))


asyncio.run(main())
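The NDJSON written by the CLI can be consumed one line at a time. The sketch below assumes only that each non-empty line is one JSON object; it makes no assumption about the field names saas-scraper emits, and `read_ndjson` is a hypothetical helper rather than part of the package.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterator


def read_ndjson(path: str | Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of an NDJSON file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a truncated scrape is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON: {exc}") from exc
            yield record
```

For example, `sum(1 for _ in read_ndjson("docs.ndjson"))` counts the records in a saved scrape without loading the whole file into memory.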
Connectors
| Connector | Status | Notes |
|---|---|---|
| slack | implemented (v0.2) | channel sidebar walk, message pane scrape |
| github | implemented (v0.5) | code (file tree) + issues + PRs (title/body/comments/diff). Pass resources={"code","issues","prs"} |
| gitlab | implemented (v0.3) | gitlab.com or self-hosted via base_url |
| bitbucket | implemented (v0.3) | bitbucket.org file walk |
| jira | implemented (v0.3) | Atlassian Cloud issue list + body |
| confluence | implemented (v0.3) | Atlassian Cloud space page-tree |
| notion | implemented (v0.3) | sidebar page enumeration + body |
All connectors share a single BrowserSession, so cookies and SSO state
carry over across providers. In v0.3, virtualised lists (Slack sidebar,
Notion sidebar) exposed only the currently visible portion; scroll-walking
landed in v0.4. GitHub issue / PR scraping landed in v0.5.
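The scroll-walking idea for virtualised lists can be sketched without a browser: read the rows currently rendered, scroll, and stop once several scrolls in a row reveal nothing new. This is an illustration of the general technique, not the connectors' actual code; `scroll_walk` and `FakeVirtualList` are hypothetical names, and the fake list stands in for a real sidebar.

```python
from typing import Callable, Sequence


def scroll_walk(
    visible_items: Callable[[], Sequence[str]],
    scroll_down: Callable[[], None],
    max_idle_scrolls: int = 2,
) -> list[str]:
    """Collect every item of a virtualised list by scrolling until nothing new appears."""
    seen: dict[str, None] = {}  # dict keeps first-seen order and de-duplicates
    idle = 0
    while idle < max_idle_scrolls:
        before = len(seen)
        for item in visible_items():
            seen.setdefault(item, None)
        # Count consecutive passes that revealed no new rows.
        idle = idle + 1 if len(seen) == before else 0
        scroll_down()
    return list(seen)


class FakeVirtualList:
    """Stand-in for a virtualised sidebar: only `window` rows are rendered at once."""

    def __init__(self, rows: Sequence[str], window: int = 3) -> None:
        self.rows = list(rows)
        self.window = window
        self.offset = 0

    def visible(self) -> list[str]:
        return self.rows[self.offset : self.offset + self.window]

    def scroll(self) -> None:
        # Advance one window, clamped so the last window stays fully populated.
        last_start = max(len(self.rows) - self.window, 0)
        self.offset = min(self.offset + self.window, last_start)


sidebar = FakeVirtualList([f"channel-{i}" for i in range(8)])
all_rows = scroll_walk(sidebar.visible, sidebar.scroll)
```

In a real connector the two callables would wrap page queries and a scroll action; the idle counter guards against stopping early when a scroll briefly renders no new rows.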
GitHub multi-resource example
import asyncio

from saas_scraper import BrowserSession, registry


async def main() -> None:
    async with BrowserSession() as session:
        gh = registry.create(
            "github",
            session=session,
            owner="plenoai",
            repo="saas-scraper",
            resources={"code", "issues", "prs"},
        )
        async for doc in gh.discover_and_fetch():
            kind = doc.ref.metadata.get("resource_type")
            print(kind, doc.ref.path, len(doc.text or ""))


asyncio.run(main())
metadata["resource_type"] is one of code, issue, pr. Issue and
PR documents concatenate title + body + every visible comment (PRs also
include the inline diff hunks) into a single Document.text so the
downstream secret/PII scanners run unchanged.
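Because every document carries `resource_type` in its metadata, a downstream step can bucket documents before scanning. The sketch below uses small stand-in dataclasses (`Ref` and `Doc`, both hypothetical) that mirror only the attributes used in this README (`doc.ref.path`, `doc.ref.metadata`, `doc.text`); the real Document protocol may carry more fields.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Ref:
    """Hypothetical stand-in for doc.ref."""

    path: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Doc:
    """Hypothetical stand-in for a Document."""

    ref: Ref
    text: Optional[str] = None


def group_by_resource_type(docs: Iterable[Doc]) -> Dict[str, List[Doc]]:
    """Bucket documents by metadata["resource_type"]; missing values go to "unknown"."""
    groups: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        kind = doc.ref.metadata.get("resource_type", "unknown")
        groups[kind].append(doc)
    return dict(groups)


sample = [
    Doc(Ref("README.md", {"resource_type": "code"}), "# saas-scraper"),
    Doc(Ref("issues/1", {"resource_type": "issue"}), "title\nbody\ncomment"),
    Doc(Ref("pull/2", {"resource_type": "pr"}), "title\nbody\ndiff hunk"),
    Doc(Ref("issues/3", {"resource_type": "issue"}), None),
]
grouped = group_by_resource_type(sample)
```

Here `grouped` has the keys `code`, `issue` and `pr`, with two documents under `issue`.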
The v0.1.0 release ships the Document protocol, the Chrome session manager,
and a working scaffold per connector. Additional providers and per-connector
hardening land in subsequent releases — see issues.
Why Chrome and not the API?
- Inherits SSO / MFA / SCIM-locked sessions that don't cleanly expose API tokens to a scanner role.
- Bypasses API quota tiers that throttle org-wide content enumeration.
- Reaches UI-only surfaces (Notion comments, Slack canvas, Jira views).
When an official API exists and is sufficient, prefer that — saas-scraper
is the fallback for the cases where it isn't.
Development
uv sync --all-extras
uv run playwright install chromium
uv run pytest
uv run ruff check
uv run mypy src
The default pytest pass exercises plumbing only (Document protocol,
registry wiring, CLI helpers). Live browser scrapes against real SaaS
providers are not part of CI; run them locally with
saas-scraper fetch <connector> --headed so a real Chromium window
opens for first-time SSO.
Release
vX.Y.Z tag pushes trigger PyPI trusted publishing via GitHub Actions —
no manual token. The first publish requires a one-time Trusted Publisher
configuration at https://pypi.org/manage/account/publishing/:
| Field | Value |
|---|---|
| PyPI Project Name | saas-scraper |
| Owner | plenoai |
| Repository name | saas-scraper |
| Workflow name | release.yml |
| Environment name | pypi |
After that, every tag matching v* will publish automatically.
License
AGPL-3.0-or-later.