Skip to main content

LionScraper bridge daemon, thin MCP stdio, and CLI — local HTTP + WebSocket to Chrome extension (Python)

Project description

LionScraper MCP + CLI + API service (Python)

简体中文

What is this?

LionScraper is a browser extension that can collect lists, articles, links, images, and more from web pages. This PyPI package provides the companion bridge to that extension in three ways:

  • MCP (lionscraper-mcp): connect your AI app (e.g. Cursor) so the model can call scraping tools in chat.
  • CLI (lionscraper): run daemon, scrape, and ping from a terminal over the same local HTTP/WebSocket port as the extension.
  • HTTP API: call the same tools over loopback JSON HTTP (/v1/...) when the daemon is running—useful for scripts, services, or any HTTP client without MCP or the CLI front-end.

The real scraping logic runs in the extension; this package connects and forwards.

Before you start

  1. Browser: Chrome or Edge (follow what the extension actually supports).
  2. LionScraper extension: Install and enable it from your browser’s store (the listing title may vary by storefront).
  3. Python: 3.10 or newer on your machine. If you do not have it yet, download an installer from the Python website and follow the prompts (check the option to add Python to PATH on Windows if the installer offers it).
  4. For MCP: an AI app that supports MCP (e.g. Cursor, Trae).
  5. For the HTTP API: same browser, extension, and daemon as the CLI; use curl, fetch, or any HTTP client against http://127.0.0.1:$PORT (see HTTP API (local REST) below).

This package uses aiohttp for outbound HTTP and WebSocket probes to the daemon. When Chrome/Edge are not detected and the extension is not connected, ping / scrape* can use a server-side HTTP fallback (no in-page JS).

Install (pip)

This package is published on PyPI as lionscraper.

pip install -U lionscraper

You may use a virtual environment (recommended) or pip install -U --user lionscraper if you prefer not to install into the system interpreter.

You get two commands; together they support three integration styles (MCP, CLI, HTTP API):

Command Role
lionscraper-mcp Thin MCP server (stdio) for AI apps
lionscraper CLI: daemon, stop, scrape, ping, … (also runs the process that serves the HTTP API)

If lionscraper-mcp is not on your PATH after install, use the full path to the script (e.g. under your venv’s Scripts/ or bin/), or the python -m lionscraper form below in MCP (still requires a normal pip install -U lionscraper).


MCP (AI applications)

Add MCP in your AI app

Examples assume lionscraper-mcp is on your PATH (UIs differ). In MCP JSON, every env value is a string.

Minimal config (omit env for built-in defaults; PORT defaults to 13808 and must match the extension bridge port):

{
  "mcpServers": {
    "lionscraper": {
      "command": "lionscraper-mcp"
    }
  }
}

Full env example (drop keys you do not need; empty strings behave like “unset” for most of these):

{
  "mcpServers": {
    "lionscraper": {
      "command": "lionscraper-mcp",
      "env": {
        "PORT": "13808",
        "TIMEOUT": "120000",
        "LANG": "en-US",
        "TOKEN": "",
        "DAEMON": ""
      }
    }
  }
}
  • PORT: HTTP + WebSocket listen port; default 13808; must match the extension bridge port.
  • TIMEOUT: Milliseconds to wait for a previous instance to release the port before forcing takeover; default 120000; 0 means force quickly.
  • LANG: Tool descriptions and stderr log language (en-US, zh-CN, or POSIX forms like en_US.UTF-8).
  • TOKEN: Bearer token shared with the daemon; empty means no Authorization header.
  • DAEMON: Only 0 disables auto-spawning lionscraper daemon from thin MCP; empty or other values match omitting the key (auto-start allowed).

Restart MCP or the app so the config applies.

If lionscraper-mcp is not on PATH

After pip install -U lionscraper, you can run the same routing with python -m lionscraper: no arguments after the module name selects thin MCP over stdio; any extra argument (including --debug) uses the CLI instead. Example:

{
  "mcpServers": {
    "lionscraper": {
      "command": "python",
      "args": ["-m", "lionscraper"]
    }
  }
}

Use the same python executable you used to install the package (or python3 on some systems).

Match the port in the browser extension

  1. Open LionScraper settings / options.
  2. Set bridge port to the same value as PORT (e.g. 13808).
  3. If needed, use Reconnect in the extension, or reload the extension / restart the browser.

Day-to-day use (MCP)

  1. Keep the extension enabled and target pages open as required.
  2. Ask in natural language, e.g. “Check if LionScraper is connected” or “Scrape lists / article / emails / phones / links / images from this page.”
  3. If you see not connected or timeout, retry a connection check and confirm PORT matches.

MCP tools (summary)

The server registers tools that mirror extension capabilities. Names and shapes are what your MCP client shows to the model.

Tool Purpose (short)
ping Check that the extension is connected and registered
scrape Auto-detect structure (lists, tables, …); supports pagination
scrape_article Article body (e.g. Markdown) and metadata
scrape_emails Email addresses on the page
scrape_phones Phone numbers (structured)
scrape_urls Hyperlink URLs
scrape_images Image URLs and basic metadata

Parameters: A full JSON schema for every field belongs in the tool definitions your client displays and in release-accurate docs; duplicating it here goes stale quickly. Use the tool list / descriptions in the AI app when calling MCP.

MCP Resources / Prompts

The thin MCP process (lionscraper-mcp) exposes Resources and Prompts in addition to Tools:

  • Resources: static Markdown at stable URIs, e.g. lionscraper://guide/connection (PORT alignment, ping troubleshooting), lionscraper://guide/when-to-use-tools (prefer LionScraper over WebFetch/curl/wget by scenario), lionscraper://guide/cli (terminal CLI), lionscraper://reference/tools, lionscraper://reference/common-params. Clients list/read them into context; they are served inside the stdio process and do not require the daemon HTTP path (works even if the extension is offline). For the loopback HTTP control plane (/v1/...), see HTTP API (local REST) below (not an MCP resource URI).
  • Prompts: workflow templates (e.g. ping-then-scrape, multi-URL, scrape_article, prefer_lionscraper_scraping, extension troubleshooting). Clients list/get prompts; UI varies by host (Cursor, Trae, …).

Copy follows LANG (e.g. zh-CN), same as tool metadata.


CLI (terminal)

The lionscraper command is the terminal front-end to the same stack as MCP: lionscraper daemon listens on PORT (default 13808) for HTTP (used by the CLI and by the thin lionscraper-mcp process) and WebSocket (used by the extension). Set PORT (and optional TOKEN) to match the extension bridge port and any MCP config. Use the CLI for scripts, CI, or quick one-off runs without opening an AI chat.

The CLI talks to the daemon HTTP API on http://127.0.0.1:$PORT (default port 13808, same as the extension). If you do not pass --api-url, a local daemon is auto-started when possible when you run scrape or ping.

Common commands:

lionscraper --help
lionscraper daemon              # keep running; HTTP + WebSocket on PORT
lionscraper stop                # stop daemon on configured PORT
lionscraper ping
lionscraper scrape -u https://www.example.com
lionscraper scrape --method article -u https://www.example.com
# Shorthand: lionscraper -u https://www.example.com   → same as scrape

--method selects which tool the daemon runs (default scrape): scrape, article, emails, phones, urls, images. Repeat -u / --url to pass several URLs in one run.

Set PORT (and optional TOKEN) in the environment so the CLI matches the extension and MCP. Use --api-url http://127.0.0.1:PORT if the daemon is not on the default base URL. Run lionscraper --help for every flag the binary accepts.

Scrape: parameters and richer examples

Below flags are forwarded to the extension (except --bridge-timeout-ms, which only caps how long the CLI waits on the bridge). Actual behavior still depends on the extension version.

Output and connection

Flag Meaning
--format json or pretty JSON one-line vs. indented (default json)
--raw Print the tool’s text block as returned, without re-formatting
-o / --output <file> Write the result to a file instead of stdout

Timing and load

Flag Meaning
--delay <ms> Wait after page load before scraping (dynamic content)
--timeout-ms <ms> Per-URL timeout on the extension side
--bridge-timeout-ms <ms> Max wait for this CLI → daemon call
--scrape-interval <ms> Delay between starting multiple URLs
--concurrency <n> Concurrency hint for multi-URL runs
--scroll-speed <px> Global scroll speed (extension semantics)
--max-pages <n> Pagination cap for list-style scrape

Lazy-loaded / infinite scroll (maps to waitForScroll in the tool payload)

Flag Meaning
--wait-scroll-speed <px> Pixels per step while scrolling
--wait-scroll-interval <ms> Delay between scroll steps
--wait-max-scroll-height <px> Optional max scroll distance
--scroll-container <selector> Optional scrollable container selector

Locale

Flag Meaning
--lang zh-CN or en-US Language hint for messages / extension UI strings

Extra payload

Flag Meaning
--include-html + true / false Ask for full-page HTML in meta when supported
--include-text + true / false Ask for full-page plain text in meta when supported

Method-specific filters (examples; see --help for the full set)

  • Emails (--method emails): --email-domain, --email-keyword, --email-limit
  • Phones (--method phones): --phone-type, --phone-area-code, --phone-keyword, --phone-limit
  • URLs (--method urls): --url-domain, --url-keyword, --url-pattern, --url-limit
  • Images (--method images): --img-min-width, --img-min-height, --img-format, --img-keyword, --img-limit

Optional browser automation hints (when the extension supports them): --auto-launch-browser, --no-auto-launch-browser, --post-launch-wait-ms.


Example A — List / table scrape with pagination, delays, and scroll assist

lionscraper scrape \
  -u https://www.example.com/items \
  --max-pages 5 \
  --delay 800 \
  --timeout-ms 90000 \
  --bridge-timeout-ms 180000 \
  --wait-scroll-speed 400 \
  --wait-scroll-interval 350 \
  --lang zh-CN \
  --format pretty \
  -o items.json
  • --max-pages: stop after at most five “pages” of list accumulation for this job.
  • --delay: give the page extra time after load before extraction.
  • --timeout-ms / --bridge-timeout-ms: extension vs. CLI-side waits; raise both for slow sites or heavy pages.
  • --wait-scroll-*: gentle scrolling so lazy-loaded rows can appear before scraping.
  • -o / --format pretty: save a human-readable JSON file for inspection.

Example B — Article body with optional HTML snapshot

lionscraper scrape --method article \
  -u https://www.example.com/blog/post-1 \
  --include-html true \
  --timeout-ms 120000 \
  --format json
  • --method article: maps to the scrape_article tool (Markdown-style body + metadata when the extension provides it).
  • --include-html true: requests additional HTML in the result meta where supported (larger payload).

Example C — Emails and URLs with filters, multiple pages

lionscraper scrape --method emails \
  -u https://www.example.com/contact \
  --email-domain example.com \
  --email-keyword support \
  --email-limit 30 \
  --format pretty

lionscraper scrape --method urls \
  -u https://www.example.com \
  -u https://www.example.com/docs \
  --url-domain example.com \
  --url-limit 200 \
  --scrape-interval 500
  • Email flags: narrow to addresses in a domain, matching a keyword, and cap the count.
  • Two -u values: two URLs processed in one invocation; --scrape-interval spaces out task starts.

Example D — Images with size / format filters

lionscraper scrape --method images \
  -u https://www.example.com/gallery \
  --img-min-width 240 \
  --img-min-height 240 \
  --img-format webp \
  --img-limit 40 \
  -o gallery-images.json
  • Filters out small thumbnails and non-WebP assets when the extension honors these fields; --img-limit caps how many entries are returned.

HTTP API (local REST)

Base URL: http://127.0.0.1:$PORT (default 13808). Binds 127.0.0.1 only. The extension must be connected on that port and a daemon must be listening (same process model as the rest of this package).

Method Path Response (success)
GET /v1/health { "ok", "identity", "bridgePort", "sessionCount" }
POST /v1/daemon/shutdown { "ok": true } then the daemon exits
POST /v1/tools/call Tool result JSON (below)

Anything else → 404 { "ok": false, "error": { "code": "NOT_FOUND", "message": "Not found" } }.

Auth: If env TOKEN is set on the daemon, every request needs Authorization: Bearer <TOKEN>; otherwise omit the header.

POST /v1/tools/call

  • Headers: Content-Type: application/json. For streaming, also Accept: application/x-ndjson.
  • Body:
{ "name": "<tool>", "arguments": { }, "progressToken": "<optional>" }
Field Meaning
name One of: ping, scrape, scrape_article, scrape_emails, scrape_phones, scrape_urls, scrape_images
arguments Tool payload; omit or {} if empty
progressToken Any string or number; with Accept: application/x-ndjson, the body is NDJSON (lines type: "progress", then a final type: "result" or type: "error")

arguments (JSON keys)

ping only: optional lang ("en-US" | "zh-CN"), autoLaunchBrowser (boolean), postLaunchWaitMs (number, 3000–60000).

All scrape-family tools (scrape, scrape_article, scrape_emails, scrape_phones, scrape_urls, scrape_images) share:

Key Type Constraint / note
url string | string[] Required — one URL or an array of URLs
lang "en-US" | "zh-CN" Optional
delay number Optional, ≥ 0
timeoutMs number Optional, ≥ 1000
bridgeTimeoutMs number Optional, ≥ 1000
includeHtml boolean Optional
includeText boolean Optional
scrapeInterval number Optional
concurrency number Optional
scrollSpeed number Optional
autoLaunchBrowser boolean Optional
postLaunchWaitMs number Optional, 3000–60000
waitForScroll object Optional; if set, must include scrollSpeed and scrollInterval; may include maxScrollHeight, scrollContainerSelector

Only scrape: optional maxPages (number, ≥ 1).

Optional filter object (only on the matching tool):

name filter properties
scrape_emails domain, keyword, limit (≥ 1)
scrape_phones type, areaCode, keyword, limit (≥ 1)
scrape_urls domain, keyword, pattern, limit (≥ 1)
scrape_images minWidth, minHeight (≥ 0), format, keyword, limit (≥ 1)

Unknown name, bad JSON, or schema violations → 400 { "code": "BAD_REQUEST", ... }. Bad bearer → 401 UNAUTHORIZED. Tool crash → 500 INTERNAL.

200 body: { "content": [ { "type": "text", "text": "..." } ], "isError"?: boolean }.

Examples (PORT=13808)

curl -sS "http://127.0.0.1:13808/v1/health"
curl -sS -X POST "http://127.0.0.1:13808/v1/tools/call" \
  -H "Content-Type: application/json" \
  -d '{"name":"ping","arguments":{}}'
curl -sS -X POST "http://127.0.0.1:13808/v1/tools/call" \
  -H "Content-Type: application/json" \
  -d '{"name":"scrape_article","arguments":{"url":"https://www.example.com","timeoutMs":120000}}'
# With TOKEN on the daemon:
curl -sS -X POST "http://127.0.0.1:13808/v1/tools/call" \
  -H "Content-Type: application/json" -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"name":"ping","arguments":{}}'

FAQ (plain language)

Q: Extension not connected or scraping fails?

  • Extension enabled?
  • PORT in MCP (or env for CLI) exactly matches the extension bridge port?
  • Avoid multiple conflicting MCP/CLI setups on one machine.

Q: Many “tools” visible in the AI app—does that mean the extension is connected?

Not necessarily. Tools only confirm AI → MCP server; the extension must still connect on the same port and register.

Q: CLI says it cannot reach the daemon?

Start lionscraper daemon in another terminal, or fix PORT / --api-url.

Q: I want to drive scraping from my own HTTP client?

Keep the daemon and extension on the same PORT, then POST /v1/tools/call with name and arguments as in HTTP API (local REST). GET /v1/health checks that the listener is LionScraper.

License

MIT (same as the lionscraper package on PyPI).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lionscraper-1.0.6.tar.gz (77.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lionscraper-1.0.6-py3-none-any.whl (79.1 kB view details)

Uploaded Python 3

File details

Details for the file lionscraper-1.0.6.tar.gz.

File metadata

  • Download URL: lionscraper-1.0.6.tar.gz
  • Upload date:
  • Size: 77.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for lionscraper-1.0.6.tar.gz
Algorithm Hash digest
SHA256 827176570ca4c4f2816b058d1d24eb302c0b6262e7011da17f578300bec9864b
MD5 72c77f88bcefbf16b46b78c18631bd0d
BLAKE2b-256 dc5a0fa39c615727570933fa894419d06812f9634cfab3701bcfeab172409766

See more details on using hashes here.

File details

Details for the file lionscraper-1.0.6-py3-none-any.whl.

File metadata

  • Download URL: lionscraper-1.0.6-py3-none-any.whl
  • Upload date:
  • Size: 79.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for lionscraper-1.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 feecb0522a047549af3cb3825fdf1f1526a9a7d5fc33e181e98acd65e5a3e94c
MD5 6f8ebcedfc5872746f6c9768574ac745
BLAKE2b-256 fe1ccfc3cd58666a07c125a2990555435e18f1b0b8391c2f9d6154ece22d5966

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page