Fast, fault-tolerant headless browser scraping at scale.
Project description
yoink
Fast, fault-tolerant headless browser scraping with built-in bot-detection bypass.
- One Chromium browser per worker process, N async page tabs per browser
- Stealth layer: patches
navigator.webdriver,window.chrome, plugins, WebGL, permissions, and more - State-based completion: wait for selectors, network idle, DOM stability, HTTP status, or compose your own
- Actions lifecycle: interact with pages before scraping (scroll, click, fill forms, block trackers)
- Auto-installs Playwright Chromium on first run — no manual setup
pip install python-yoink
Quick start
# Single URL — HTML to stdout
yoink https://example.com
# File of URLs, stream JSONL as each completes
yoink urls.txt --stream
# Save one HTML file per URL
yoink urls.txt --output ./results/
# 4 concurrent pages, save as archive
yoink urls.txt --pages 4 --tarball results.tar.gz
yk is an alias for yoink.
Python API
import yoink
# Single URL
html = yoink.get("https://example.com")
# Wait for a CSS selector before extracting
html = yoink.get("https://shop.example.com", state=yoink.Selector(".product-card"))
# Parallel fetch — results in completion order
results = yoink.get_all(["https://a.com", "https://b.com", "https://c.com"])
for r in results:
print(r.url, r.status, len(r.html))
# Stream results as they land
for result in yoink.stream(urls, state=yoink.Selector(".content")):
if result.ok:
process(result.html)
States — when to stop waiting
States define the condition yoink waits for before extracting HTML. They compose with & (all) and | (any).
from yoink import (
DOMContentLoaded, # initial HTML parsed (default)
NetworkIdle, # no network requests for 500ms
DOMStable, # no DOM mutations for 250ms
Selector, # CSS selector present in DOM
SubstringMatch, # visible text contains string
HTTPStatus, # response status matches
ResponseHeader, # response header matches
MinCount, # at least N elements match selector
URLMatches, # final URL matches pattern
TimeDelay, # unconditional wait
Not, # logical NOT
All, # all states must pass (alias for &)
Any, # any state must pass (alias for |)
)
# Wait for selector AND network to go quiet
state = Selector(".results") & NetworkIdle()
# Accept any 2xx status
state = HTTPStatus(lambda s: 200 <= s < 300)
# At least 12 product cards loaded
state = MinCount(".product-card", 12)
# Compose freely
state = (Selector(".data") | TimeDelay(3000)) & Not(SubstringMatch("loading"))
Actions — interact before scraping
Actions run against the page. pre_actions run before goto(), actions run after navigation but before state evaluation.
from yoink import Request
from yoink.actions import (
Click, Fill, Hover, PressKey, SelectOption,
Scroll, ScrollToBottom,
Wait, WaitForSelector,
EvaluateJS,
RouteBlock, # block URLs matching patterns (ads, trackers)
)
req = Request(
url="https://shop.example.com",
pre_actions=[
RouteBlock("**/ads/**", "**/analytics/**", "**doubleclick**"),
],
actions=[
WaitForSelector(".product-grid"),
ScrollToBottom(step_px=400, delay_ms=80),
Click(".load-more"),
],
state=MinCount(".product-card", 20),
retries=2,
)
results = yoink.get_all([req])
Engine — full control
from yoink import Engine, Config, load_config
from yoink import HTTPStatus, Not, SubstringMatch, DOMStable
cfg = load_config()
cfg.workers.count = 2
cfg.workers.page_limit = 6
with Engine(
config=cfg,
guard=HTTPStatus(200) & Not(SubstringMatch("captcha")), # fail fast on blocked pages
middleware_state=DOMStable(), # applied to every request
) as engine:
for url in urls:
engine.submit(Request(url=url, retries=2, timeout=20))
for result in engine.results():
if result.ok:
save(result.html)
elif result.terminal == "guard_failed":
handle_block(result.url)
elif result.terminal == "timeout":
retry_later(result.url)
Result fields
| Field | Type | Description |
|---|---|---|
ok |
bool |
True when terminal == "success" |
html |
str |
Extracted page HTML |
url |
str |
Final URL after redirects |
status |
int|None |
HTTP response status |
headers |
dict |
Response headers |
terminal |
str |
"success", "timeout", "error", "guard_failed" |
duration_ms |
int |
Total wall time including retries |
screenshot |
bytes|None |
PNG bytes if screenshot=True |
error |
Exception|None |
Exception on error terminal |
Request options
Request(
url="https://example.com",
state=None, # State | None — completion condition
timeout=30.0, # seconds before terminal="timeout"
retries=0, # retry count on error or timeout
tick_ms=250, # reconciler poll interval (ms)
headers={}, # extra request headers
cookies={}, # injected as Cookie header
proxy=None, # ProxyConfig(server, username, password)
pre_actions=[], # actions before goto()
actions=[], # actions after goto(), before state eval
screenshot=False, # capture PNG screenshot
clean_html=False, # strip <script>/<style>/<svg> tags
use_browser=True, # False → httpx fast path (no JS, no actions)
metadata={}, # pass-through dict, untouched
)
Stealth
Every browser context is patched before any page script runs:
| Signal | Patch |
|---|---|
navigator.webdriver |
undefined (not true) |
window.chrome |
Full runtime/app/csi object |
navigator.plugins |
5 PDF Viewer plugins |
navigator.mimeTypes |
PDF mime types |
navigator.languages |
['en-US', 'en'] |
navigator.deviceMemory |
8 |
navigator.userAgentData |
Chrome 124 brand list |
| WebGL vendor/renderer | Intel Inc. / Intel Iris OpenGL Engine |
Notification.permission |
'default' (not 'denied') |
Function.prototype.toString |
Native code appearance |
The --disable-blink-features=AutomationControlled Chromium flag is also set, which removes the automation infobar and additional blink-level automation signals.
Performance
Rule of thumb: 1 worker per machine, page_limit = cpu_cores / 2.
Each worker is one Chromium browser (a heavy multi-process application). Adding more workers means more competing browsers on the same CPU. Pages are cheap async tasks within one browser and scale well for network-bound targets.
workers pages req/s p50ms notes
1 1 3.1 306 sequential
1 4 9.4 336 async concurrency kicks in
1 8 11.9 506 approaching CPU saturation
2 4 10.7 483 two browsers compete, worse than 1×8
4 4 11.2 804 p50 triples vs 1×4
Benchmarked against a local server with 200ms simulated latency.
To scale beyond a single machine: run multiple yoink processes on separate hosts pointing at the same external work queue.
For non-JS pages, use_browser=False routes through httpx (no Chromium overhead):
req = Request(url="https://api.example.com/feed.json", use_browser=False)
Configuration
Config is loaded in order (later wins): defaults → ~/.config/yoink/config.toml → --config FILE → YK_ env vars.
[workers]
count = 1 # workers per machine (default: 1)
page_limit = 4 # concurrent pages per worker (default: cpu_count // 2)
headless = true
idle_timeout_secs = 300
persist_context = false # reuse one BrowserContext per worker (faster, less isolated)
[rate_limit]
default_delay_ms = 0
[rate_limit.per_domain]
"news.ycombinator.com" = 1000
"reddit.com" = 500
[log]
level = "INFO"
Environment variables use YK_SECTION__FIELD (double underscore):
YK_WORKERS__COUNT=2
YK_WORKERS__PAGE_LIMIT=6
YK_RATE_LIMIT__DEFAULT_DELAY_MS=500
YK_LOG__LEVEL=DEBUG
Docker
docker build -t yoink .
docker run --rm yoink https://example.com
docker run --rm yoink --stream --pages 4 - < urls.txt
Development
uv sync
uv run pytest tests/unit/ # 179 unit tests, no browser needed
uv run pytest tests/integration/ # requires Playwright Chromium
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file python_yoink-0.1.0.tar.gz.
File metadata
- Download URL: python_yoink-0.1.0.tar.gz
- Upload date:
- Size: 25.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8890565c3d7da8cc32180078856c5988c0320ba402a2918a711131770a86af64
|
|
| MD5 |
0aa6c11a0114c14f7e3dd11124d86303
|
|
| BLAKE2b-256 |
efba6a6e07a623e9c94894552af1f627c983afbc678be9401e10c7e699525851
|
File details
Details for the file python_yoink-0.1.0-py3-none-any.whl.
File metadata
- Download URL: python_yoink-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6cd7bf8460027e35dcd33c86962286f6b2921992f3890e0ded8ab8180e5b18eb
|
|
| MD5 |
4e09c88d2cb887801b26c6928292d52d
|
|
| BLAKE2b-256 |
801a7bf2d42f2ec36febc0f3e4b726900c92f0518dc50348930609421dab0aa5
|