paprika-client

Playwright-shape async Python client for the paprika browser fleet.

import asyncio
from paprika_client import async_paprika

async def main():
    async with async_paprika.connect() as cli:
        async with cli.session(initial_url="https://news.ycombinator.com") as page:
            await page.locator(".athing .titleline > a").click()
            state = await page.state()
            print(state["url"], "->", state["title"])
            await page.back()
            await page.screenshot(path="hn.png")

asyncio.run(main())

What is this?

paprika is a hub-and-spoke browser fleet: a hub orchestrates many workers, each of which runs N pre-warmed Chrome lanes. This client opens a Session against the hub: the hub reserves a Lane on some Worker, attaches CDP, and the client then drives that Chrome instance action by action over HTTP.

The action surface mirrors Playwright's so existing browser-automation intuition transfers. You don't get a local Chrome — you get a shared fleet, viewable over noVNC for free.
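To make the hub / worker / lane vocabulary concrete, here is a toy in-memory model of lane reservation. It is only a sketch: the class names, the `make_fleet` helper, and the "first free lane wins" policy are all assumptions for illustration, not the hub's real scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    """One pre-warmed Chrome slot on a worker (toy stand-in)."""
    lane_id: str
    busy: bool = False


@dataclass
class Worker:
    """A worker that owns a fixed number of lanes."""
    name: str
    lanes: list[Lane] = field(default_factory=list)


class Hub:
    """Toy hub: hands out the first free lane it finds (assumed policy)."""

    def __init__(self, workers: list[Worker]) -> None:
        self.workers = workers

    def reserve(self) -> Lane:
        for worker in self.workers:
            for lane in worker.lanes:
                if not lane.busy:
                    lane.busy = True
                    return lane
        raise RuntimeError("no free lane in the fleet")

    def release(self, lane: Lane) -> None:
        lane.busy = False


def make_fleet(n_workers: int, lanes_per_worker: int) -> Hub:
    """Build a hub with n_workers workers of lanes_per_worker lanes each."""
    workers = [
        Worker(f"w{w}", [Lane(f"w{w}-lane{i}") for i in range(lanes_per_worker)])
        for w in range(n_workers)
    ]
    return Hub(workers)
```

One session maps to one reserved lane in this model; closing the session corresponds to `release`.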

Install

From the paprika source tree:

pip install -e ./client/python

Or pull just this directory into your own project. The only runtime dependency is httpx.

API

Connect

cli = async_paprika.connect()                    # hub URL from PAPRIKA_HUB, else localhost:8000
async with cli:                                  # opens an httpx.AsyncClient
    ...

# Or pass an explicit URL:
cli = async_paprika.connect("http://hub.lan")
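The lookup order behind connect() can be sketched as a small standalone function. This is an illustration, not the client's code: `resolve_hub_url` and `DEFAULT_HUB` are hypothetical names, the `http://` scheme on the default is assumed, and so is the rule that an explicit URL wins over the environment variable.

```python
import os
from typing import Mapping, Optional

# Assumed scheme; the text above only says "localhost:8000".
DEFAULT_HUB = "http://localhost:8000"


def resolve_hub_url(explicit: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Pick the hub URL: explicit argument, then PAPRIKA_HUB, then the default."""
    url = explicit or env.get("PAPRIKA_HUB") or DEFAULT_HUB
    # Drop a trailing slash so paths such as "/jobs" can be appended safely.
    return url.rstrip("/")
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.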

Open a Session

A session reserves one Lane (one Chrome process) for the lifetime of the async with block.

async with cli.session(initial_url="https://example.com") as page:
    ...

Or manually:

page = await cli.open_session(initial_url="https://example.com")
try:
    ...
finally:
    await page.close()
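The two forms above are meant to be equivalent: the async with variant is the try / finally variant wrapped in a context manager. The sketch below shows that relationship with stand-ins only (`FakePage`, `open_session`, and `session` here are hypothetical and do not talk to a hub), and checks that the page is closed even when the body raises.

```python
import asyncio
from contextlib import asynccontextmanager


class FakePage:
    """Stand-in for a paprika page; only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def open_session() -> FakePage:
    """Stand-in for cli.open_session(); returns a fresh fake page."""
    return FakePage()


@asynccontextmanager
async def session():
    """Context-manager form: open, hand the page to the body, always close."""
    page = await open_session()
    try:
        yield page
    finally:
        await page.close()  # runs even if the body raised


async def demo() -> bool:
    page_ref = None
    try:
        async with session() as page:
            page_ref = page
            raise ValueError("action failed mid-session")
    except ValueError:
        pass
    return page_ref.closed


closed_after_error = asyncio.run(demo())
```

The practical point is the same for the real client: if you open a session manually, the finally block is what frees the lane after a failure.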

Actions

| Playwright | paprika |
| --- | --- |
| page.goto(url) | page.goto(url) |
| page.click(selector) | page.click(selector) |
| page.fill(selector, value) | page.fill(selector, value) |
| page.keyboard.press(key) | page.press(key) |
| page.go_back() | page.back() (alias go_back) |
| page.screenshot(path=...) | page.screenshot(path=...) |
| page.title() | page.title() |
| page.url | page.url (sync, cached) |
| page.locator(sel) | page.locator(sel) |
| page.get_by_role(role) | page.get_by_role(role) |
| page.get_by_text(text) | page.get_by_text(text) |

paprika-only:

await page.outline()       # text view with [@N] ids
await page.visited_urls()  # list of canonical URLs the session opened
await page.capture("snap") # persist HTML + PNG + outline server-side
await page.assets()        # URLs of images captured on this page (see below)
await page.save_assets("out/")  # download those captured images to disk
await page.cookies()       # current cookie jar (CDP-shaped), host-filtered
await page.save_cookies_to_host()  # promote those cookies into the Host registry
await page.network()       # media network log (image/audio/video responses)
await page.close_popups()  # close all non-default tabs (after a popup-spawning click)
page.novnc_url             # live noVNC URL for this session's lane

Captured assets (page.assets / page.save_assets)

paprika passively records every resource the page loads (images, video, …) — the same machinery Fetch mode uses. page.assets() flushes the worker's capture buffer and lists what was captured; it's the one-call equivalent of wiring up Playwright's page.on("response") yourself.

async with cli.session(
    initial_url="https://example.com/article",
    parent_job_id="my-crawl",        # assets need a job dir to land in
) as page:
    await page.scroll()              # trigger lazy-loaded images
    srcs = await page.assets()       # -> ["http://hub/jobs/.../img_001.jpg", ...]
    rows = await page.assets(details=True)  # -> list of dicts w/ size, source_url, mime
    await page.save_assets("out/images")    # download them to disk

| arg | default | meaning |
| --- | --- | --- |
| kind | "image" | image / video / audio / other, or None for all |
| absolute | True | absolute URLs (False -> hub-relative href) |
| refresh | True | flush newly-captured assets off the worker first |
| details | False | return full metadata dicts instead of URL strings |
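As a sketch of what you might do with the details=True rows, the helper below keeps only reasonably large images and orders them by size. The keys size, source_url, and mime come from the comment in the example above; the exact value formats, the `largest_images` helper, and the sample rows are assumptions made for illustration.

```python
from typing import Iterable


def largest_images(rows: Iterable[dict], min_bytes: int = 10_000) -> list[str]:
    """Return source URLs of image assets of at least min_bytes, largest first."""
    images = [
        r for r in rows
        if r.get("mime", "").startswith("image/") and r.get("size", 0) >= min_bytes
    ]
    images.sort(key=lambda r: r.get("size", 0), reverse=True)
    return [r["source_url"] for r in images]


# Hand-written sample in the assumed shape of page.assets(details=True).
sample_rows = [
    {"size": 48_213, "source_url": "https://example.com/hero.jpg", "mime": "image/jpeg"},
    {"size": 312, "source_url": "https://example.com/pixel.gif", "mime": "image/gif"},
    {"size": 901_554, "source_url": "https://example.com/clip.mp4", "mime": "video/mp4"},
    {"size": 120_004, "source_url": "https://example.com/chart.png", "mime": "image/png"},
]
```

A size threshold like this is a cheap way to drop tracking pixels and icons before calling save_assets or downloading anything yourself.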

Job-bound session required. Like get_state / set_state, the passive capture needs a parent job to store assets under. Scripts run by paprika-runner are bound automatically (PAPRIKA_JOB_ID); a raw cli.session(...) must pass parent_job_id=..., else page.assets() raises PaprikaActionError.

Multi-tab gotcha. await sess[-1].close() is NOT how you close a popup tab. Popups spawned by worker-side clicks are not in the SDK's local _pages cache, so sess[-1] resolves to sess itself, and Session.close() is unconditional — it kills the whole session. Use await sess.close_popups() instead.
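The gotcha is easier to see in a toy model. The class below is not the SDK's Session; it only mimics the two facts stated above (a local _pages cache that never learns about worker-side popups, and an unconditional close()), and everything else about it is hypothetical.

```python
class ToySession:
    """Toy model: the local page cache only knows pages the SDK opened itself."""

    def __init__(self) -> None:
        self._pages = [self]   # popups opened worker-side never land here
        self.closed = False

    def __getitem__(self, index: int) -> "ToySession":
        return self._pages[index]

    def close(self) -> None:
        self.closed = True     # unconditional: ends the whole session

    def close_popups(self) -> None:
        pass                   # would ask the worker to close extra tabs only


sess = ToySession()
last_tab_is_session = sess[-1] is sess      # the "last tab" is the session itself
sess.close_popups()
alive_after_close_popups = not sess.closed  # session survives close_popups()
sess[-1].close()
dead_after_indexed_close = sess.closed      # indexing + close() killed everything
```

Because the cache holds a single entry, any index that resolves at all resolves to the session, which is why close_popups() is the safe call.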

Jobs & assets (non-session API)

The session API drives a live browser. The job API is the other half: submit a one-shot fetch / codegen job, poll it, and read its captured assets after the fact. These are methods on the client, not the page:

async with async_paprika.connect() as cli:
    # fetch() = submit a fetch-mode job + wait for it to finish
    job = await cli.fetch("https://example.com/article", scroll=True)
    print(job["status"])                       # "completed"

    # collect the captured images (assets.json, kind=image)
    imgs = await cli.job_images(job["job_id"])  # -> [url, url, ...]
    rows = await cli.job_assets(job["job_id"], details=True)  # full metadata
    await cli.download_job_assets(job["job_id"], "out/images")

| method | endpoint | purpose |
| --- | --- | --- |
| cli.create_job(url, **opts) | POST /jobs | submit (returns immediately) |
| cli.fetch(url, wait=True, **opts) | POST /jobs (+poll) | submit fetch + wait |
| cli.get_job(id) / cli.list_jobs() | GET /jobs[/{id}] | status / listing |
| cli.wait_job(id) | poll GET /jobs/{id} | block until terminal |
| cli.job_result(id) | GET /jobs/{id}/result | final JobResult |
| cli.cancel_job(id) / cli.delete_job(id) | POST cancel / DELETE | lifecycle |
| cli.job_assets(id, kind=, details=) | GET /jobs/{id}/assets.json | captured assets |
| cli.job_images(id) | (assets, kind="image") | shorthand |
| cli.download_job_assets(id, dir) | GET /jobs/{id}/assets/* | save to disk |

**opts flow into JobOptions (mode=, scroll=, scroll_max=, use_profile=, cookies_from=, goal= for codegen / vision modes, …).

Anything not wrapped here (hosts / profiles / engines / settings / …) is still reachable via await cli._json("GET", "/hosts") etc. — the same thin HTTP helper every wrapper uses.
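The "poll until terminal" behaviour described for wait_job can be sketched generically. This is not the client's implementation: `wait_until_terminal` and `make_fake_get_job` are hypothetical helpers, and of the status strings used here only "completed" appears in this README; "queued", "running", "failed", and "cancelled" are assumed names.

```python
import asyncio
from typing import Awaitable, Callable

# Only "completed" is confirmed by the README; the others are assumed names.
TERMINAL = {"completed", "failed", "cancelled"}


async def wait_until_terminal(
    get_job: Callable[[str], Awaitable[dict]],
    job_id: str,
    interval_s: float = 0.01,
    timeout_s: float = 5.0,
) -> dict:
    """Poll get_job(job_id) until its status is terminal or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        job = await get_job(job_id)
        if job["status"] in TERMINAL:
            return job
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']!r} after {timeout_s}s")
        await asyncio.sleep(interval_s)


def make_fake_get_job(statuses: list[str]):
    """Fake hub call that walks through the given statuses, one per poll."""
    remaining = iter(statuses)

    async def get_job(job_id: str) -> dict:
        return {"job_id": job_id, "status": next(remaining)}

    return get_job


final = asyncio.run(
    wait_until_terminal(make_fake_get_job(["queued", "running", "completed"]), "job-1")
)
```

Taking the fetch function as a parameter keeps the loop independent of the transport, so the same shape would work with cli.get_job if its return value carries a status key as shown in the fetch example.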

Keep the session alive after the script exits

Call page.keepalive() (alias detach) before leaving the async with block. The hub keeps the lane held and the browser open so a human can take over via noVNC; the session auto-closes after idle_ttl_s seconds of no operator activity (mouse / key / clipboard through the noVNC viewer).

async with cli.session(initial_url="https://example.com") as page:
    await page.get_by_text("Login").click()
    await page.fill("input[name=user]", "alice")
    await page.keepalive(idle_ttl_s=120)   # default: 120s
    # leaving the `with` block here no longer kills the session.

The hub's screenshot grid shows three states:

  • RUNNING (red) — a script action or noVNC interaction is in flight
  • KEEPALIVE (orange) — alive but nobody is touching it
  • IDLE — closed and reaped (lane freed)

Fetch jobs submitted with options.keep_session=true (server-side crawl + human handoff) use a 60 s default idle TTL; SDK keepalive() defaults to 120 s and can be set per-call.
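The three grid states can be read as a small decision function over "is something in flight" and "how long since the last activity". The sketch below is an interpretation of the description above, not hub code; `grid_state` is a hypothetical name, and treating an idle time exactly equal to the TTL as expired is an assumption.

```python
def grid_state(in_flight: bool, idle_s: float, idle_ttl_s: float = 120.0) -> str:
    """Classify a kept-alive session the way the screenshot grid is described."""
    if in_flight:
        return "RUNNING"      # a script action or noVNC interaction is in progress
    if idle_s < idle_ttl_s:
        return "KEEPALIVE"    # lane still held, nobody touching it
    return "IDLE"             # TTL elapsed: session closed, lane freed
```

For example, a session with no activity for 61 seconds is still KEEPALIVE under the SDK default of 120 s but would already be IDLE under the 60 s default used by keep_session fetch jobs.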

Locators

page.locator(selector) returns a Locator. As in Playwright, it is lazy: the selector is resolved again each time you call .click() or another action on it.

btn = page.locator("button.primary")
await btn.click()

get_by_text("…") walks the current page outline to find the first interactive element with that visible label, then clicks the matching [data-paprika-id="N"]:

await page.get_by_text("Login").click()
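The outline walk behind get_by_text can be sketched as follows. The README only says the outline is a text view with [@N] ids, so the line shape used here ([@N], a role, a quoted label), the set of interactive roles, and exact-match comparison are all assumptions, and `selector_for_text` is a hypothetical helper.

```python
import re
from typing import Optional

# Hypothetical outline line shape: [@N] <role> "<label>"
OUTLINE_LINE = re.compile(r'^\s*\[@(\d+)\]\s+(\w+)\s+"(.*)"\s*$')
INTERACTIVE_ROLES = {"a", "button", "input", "select", "textarea"}  # assumed set


def selector_for_text(outline: str, text: str) -> Optional[str]:
    """Return the data-paprika-id selector of the first interactive match, if any."""
    for line in outline.splitlines():
        m = OUTLINE_LINE.match(line)
        if m is None:
            continue  # blank or unrecognised line
        node_id, role, label = m.groups()
        if role in INTERACTIVE_ROLES and label == text:
            return f'[data-paprika-id="{node_id}"]'
    return None


sample_outline = '''
[@1] h1 "Login"
[@2] input "Username"
[@3] button "Login"
[@4] a "Login"
'''
```

In this sample the heading with the same label is skipped because it is not interactive, and the button wins over the later link because it comes first.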

Errors

| Exception | When |
| --- | --- |
| PaprikaError | HTTP-level error (404, 5xx, network) |
| PaprikaActionError | The hub returned 200 but the action returned NO_MATCH or ERR: ... |

PaprikaActionError.status carries the raw string from browser_ops.
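A minimal sketch of the action-level failure path, using stand-in classes rather than the real ones. The class bodies, the subclass relationship between the two exceptions, the `check_action_status` and `try_action` helpers, and the "OK" success string are all assumptions; only the NO_MATCH / ERR: statuses and the .status attribute come from the text above.

```python
class PaprikaError(Exception):
    """Stand-in for the HTTP-level error."""


class PaprikaActionError(PaprikaError):  # subclassing is an assumption
    """Stand-in for an action-level failure; keeps the raw status string."""

    def __init__(self, status: str) -> None:
        super().__init__(status)
        self.status = status


def check_action_status(status: str) -> str:
    """Raise if the status is NO_MATCH or starts with ERR:, else return it."""
    if status == "NO_MATCH" or status.startswith("ERR:"):
        raise PaprikaActionError(status)
    return status


def try_action(status: str) -> str:
    """Show how a caller might branch on the raw status of a failed action."""
    try:
        return check_action_status(status)
    except PaprikaActionError as exc:
        return f"failed: {exc.status}"
```

In real scripts the useful distinction is the same: a NO_MATCH usually means the selector or label was wrong, while an HTTP-level error points at the hub or the network.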

Now implemented (built on page.evaluate)

page.evaluate(js) has landed, and the DOM surface is built on top of it (see REFERENCE.ja.md §7):

  • wait_for_selector
  • DOM reads: text_content / inner_text / get_attribute / input_value / count / is_visible / is_checked / …
  • JS-dispatched inputs: hover / dblclick / select_option / check / uncheck / focus
  • set_input_files (CDP)
  • cookies() (read)
  • Locator chain: first / last / nth / all / count
  • get_by_test_id / get_by_placeholder / get_by_title / get_by_alt_text

Still deferred (V1 → V2)

| Feature | Why deferred |
| --- | --- |
| page.wait_for_url, wait_for_load_state | Navigation-event hooks not wired |
| locator.bounding_box(), element screenshot() | Need a geometry/CDP path |
| page.context.add_cookies() | Use the Host registry + use_profile / cookies_from |
| iframe / frame_locator / multiple BrowserContext | Worker is single-frame, 1 session = 1 context |
| Real (trusted) input events / route() interception | Synthetic events + network() polling are the V1 stand-ins |
| Sync API (sync_paprika) | Async-first; sync wrapper can come later |

⚠️ Note: page.evaluate runs arbitrary JS in the browser. paprika is LAN-trusted (same model as cookie injection / profile upload), so it's exposed without the RFC-001 §12 auth gate.

See also

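  • API.ja.md: API reference for all public functions (arguments, return values, and examples for each function)
  • REFERENCE.ja.md: practical reference for fetching images and video (single-page fetch / fetching from Recent jobs / video / login-required sites)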
  • RFC-001 for the protocol design
  • The paprika hub admin UI at http://hub:8000/ for live VNC viewers
