Browserless, native-speed web crawler + extractor with a V8 JS-render tier (PyO3 binding over the turbo-surf Rust engine).

turbo-surf (Python)

Browserless, native-speed web crawler + extractor with a real V8 JS-render tier — a PyO3 binding over the turbo-surf Rust engine. Fetch-free: you pass a page's HTML in, and get a view out (Markdown, visible text, links, a typed extraction, an accessibility tree). For JS-gated pages, the engine runs the page's own scripts in a true V8 isolate over a native DOM — no headless Chromium.

import turbo_surf as ts

with open("page.html", encoding="utf-8") as f:
    html = f.read()

ts.markdown(html, base_url="https://example.com/")   # -> Markdown str
ts.text(html)                                        # -> visible text
ts.links(html, base_url="https://example.com/")      # -> list[str]

# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/")   # -> JSON str

# JS-gated page: run its own scripts, read the hydrated DOM.
script = ""  # placeholder value: the original snippet leaves `script` undefined
hydrated = ts.render(html, script, base_url="https://example.com/")
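
Hand-writing the schema as a JSON string gets error-prone as the number of fields grows. The following is a minimal sketch, assuming only what the snippet above shows: each field maps to an object with a "selector" key and an optional "list": true flag, and extract returns a JSON string. The helper names build_schema and extract_dict are hypothetical and not part of turbo_surf. The extraction function is passed in as an argument, so the helpers assume nothing about the shape of the decoded result.

```python
import json


def build_schema(fields):
    """Serialize {name: (selector, is_list)} into a schema JSON string."""
    schema = {}
    for name, (selector, is_list) in fields.items():
        spec = {"selector": selector}
        if is_list:
            spec["list"] = True  # mirrors the "list": true flag in the schema above
        schema[name] = spec
    return json.dumps(schema)


def extract_dict(extract_fn, html, fields, base_url=""):
    """Call extract_fn (for example turbo_surf.extract) and decode its JSON str."""
    raw = extract_fn(html, build_schema(fields), base_url=base_url)
    return json.loads(raw)
```

With the package installed, this could be called as extract_dict(ts.extract, html, {"title": ("h1", False), "prices": (".price", True)}, base_url="https://example.com/").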

Fatal faults (malformed schema JSON passed to extract, a render-tier failure) raise turbo_surf.TurboSurfError; the other non-JS views never raise.
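
Because only the schema and render paths can raise, it may be enough to guard just those calls. The following is a sketch of one way to do that, assuming the exception is reachable as turbo_surf.TurboSurfError as stated above. call_or_default is a hypothetical helper, not part of the library. The callable and the exception type are passed in, so the pattern can be tried without the package installed.

```python
def call_or_default(fn, error_type, default, *args, **kwargs):
    """Run fn(*args, **kwargs); return default if it raises error_type."""
    try:
        return fn(*args, **kwargs)
    except error_type:
        # Only the named fault type is swallowed; other exceptions propagate.
        return default
```

For example, call_or_default(ts.render, ts.TurboSurfError, html, html, script, base_url="https://example.com/") would fall back to the unrendered HTML (the third argument) if the render tier fails.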

Install

pip install turbo-surf

Prebuilt abi3 wheels (CPython 3.8+) are published for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64).
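
To confirm which release pip resolved, the standard library can report the installed distribution's version. The following is a small sketch using importlib.metadata, which is available from Python 3.8 and so matches the wheels' minimum version. installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if turbo-surf is installed, otherwise None.
print(installed_version("turbo-surf"))
```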

API

| function | returns | notes |
| --- | --- | --- |
| markdown(html, base_url="") | str | Markdown render |
| text(html) | str | visible text |
| title(html) | str | document <title> |
| html(html) | str | re-serialized HTML |
| links(html, base_url="") | list[str] | resolved hyperlink targets |
| interactive_elements(html, base_url="") | JSON str | links/buttons/inputs |
| accessibility_tree(html) | JSON str | a11y tree |
| hydration_state(html) | JSON str | hydration probe |
| detect(html) | JSON str | is the page JS-gated? |
| query(html, selector, kind=None) | JSON str | kind = "css"/"xpath"/auto |
| extract(html, schema_json, base_url="") | JSON str | typed extraction |
| evaluate(html, script) | str | run script over the DOM (sync) |
| render(html, script, base_url="") | str | hydrated HTML after page scripts run |
| transform(src, ts=False, jsx=False) | str | TS/JSX → classic JS (swc) |
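
The table suggests a two-tier flow: probe with detect, and pay for render only when the page is JS-gated. The following is a sketch of that routing, assuming detect returns a JSON string as listed. The README does not document the fields of that JSON, so the decision is delegated to a caller-supplied predicate. to_markdown and its parameter names are hypothetical, and the turbo_surf functions are passed in rather than imported.

```python
import json


def to_markdown(html, detect_fn, render_fn, markdown_fn, is_gated,
                script, base_url=""):
    """Return Markdown for html, rendering first only if the page is JS-gated."""
    report = json.loads(detect_fn(html))   # detect(html) -> JSON str
    if is_gated(report):                   # the caller interprets the report
        html = render_fn(html, script, base_url=base_url)
    return markdown_fn(html, base_url=base_url)
```

With the package installed, the three callables would be ts.detect, ts.render, and ts.markdown. The predicate has to be written against whatever detect actually returns.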

MIT licensed.

Download files

Source distribution

turbo_surf-0.3.3.tar.gz (530.7 kB)

Built distributions

turbo_surf-0.3.3-cp38-abi3-win_amd64.whl (28.1 MB): CPython 3.8+, Windows x86-64

turbo_surf-0.3.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31.4 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64

turbo_surf-0.3.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (32.8 MB): CPython 3.8+, manylinux (glibc 2.17+), ARM64

turbo_surf-0.3.3-cp38-abi3-macosx_11_0_arm64.whl (28.9 MB): CPython 3.8+, macOS 11.0+, ARM64
