Browserless, native-speed web crawler + extractor with a V8 JS-render tier (PyO3 binding over the turbo-surf Rust engine).

Project description

turbo-surf (Python)

turbo-surf is a browserless, native-speed web crawler and extractor with a real V8 JS-render tier, exposed to Python as a PyO3 binding over the turbo-surf Rust engine. It is fetch-free: you pass a page's HTML in and get a view out (Markdown, visible text, links, a typed extraction, or an accessibility tree). For JS-gated pages, the engine runs the page's own scripts in a true V8 isolate over a native DOM, with no headless Chromium involved.

import turbo_surf as ts

html = open("page.html").read()

ts.markdown(html, base_url="https://example.com/")   # -> Markdown str
ts.text(html)                                        # -> visible text
ts.links(html, base_url="https://example.com/")      # -> list[str]

# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/")   # -> JSON str

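# JS-gated page: run its own scripts, read the hydrated DOM.
# `script` is render()'s second argument (see API below); it is not defined
# in this snippet, so assign it before running this line.
hydrated = ts.render(html, script, base_url="https://example.com/")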
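
The schema in the example above is a hand-written JSON string, which becomes error-prone once selectors contain quotes. A minimal sketch of producing the same string from a Python dict with the standard `json` module follows. It uses only the `selector` and `list` keys shown above; `build_schema` is a hypothetical helper name and is not part of turbo-surf.

```python
import json


def build_schema(fields):
    """Serialize {field_name: (selector, is_list)} into a schema JSON string."""
    spec = {}
    for name, (selector, is_list) in fields.items():
        entry = {"selector": selector}
        if is_list:
            # Only emit "list" when it is true, mirroring the example schema.
            entry["list"] = True
        spec[name] = entry
    return json.dumps(spec)


schema = build_schema({
    "title": ("h1", False),
    "prices": (".price", True),
})
```

The resulting `schema` string can be passed wherever the hand-written one was, for example as the second argument of `ts.extract`.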

Fatal faults (malformed schema JSON, a render-tier failure) raise turbo_surf.TurboSurfError; the non-JS views never raise.
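
One way to handle these faults is to catch the error at the call site and decode the JSON string on success. The sketch below is an assumption-laden illustration: `extract_or_default` is a hypothetical helper, and the extraction function and error class are passed in as arguments so the code runs without the package installed. `fake_extract` and `FakeError` are stand-ins for demonstration only, not turbo-surf code.

```python
import json


def extract_or_default(extract, error_cls, html, schema_json, default=None):
    """Call an extract-style function and decode its JSON string result.

    Returns `default` when the call raises `error_cls`.
    """
    try:
        raw = extract(html, schema_json)
    except error_cls:
        return default
    return json.loads(raw)


# Stand-ins used only to exercise the helper (not part of turbo-surf).
class FakeError(Exception):
    pass


def fake_extract(html, schema_json):
    try:
        json.loads(schema_json)
    except ValueError as exc:
        raise FakeError("malformed schema") from exc
    return '{"title": "Hello"}'


ok = extract_or_default(fake_extract, FakeError, "<h1>Hello</h1>", '{"title": {"selector": "h1"}}')
bad = extract_or_default(fake_extract, FakeError, "<h1>Hello</h1>", "{not json")
```

With the package installed, the equivalent call would presumably be `extract_or_default(ts.extract, ts.TurboSurfError, html, schema)`.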

Install

pip install turbo-surf

Prebuilt abi3 wheels (CPython 3.8+) are published for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64).

API

| function | returns | notes |
| --- | --- | --- |
| markdown(html, base_url="") | str | Markdown render |
| text(html) | str | visible text |
| title(html) | str | document <title> |
| html(html) | str | re-serialized HTML |
| links(html, base_url="") | list[str] | resolved hyperlink targets |
| interactive_elements(html, base_url="") | JSON str | links/buttons/inputs |
| accessibility_tree(html) | JSON str | a11y tree |
| hydration_state(html) | JSON str | hydration probe |
| detect(html) | JSON str | is the page JS-gated? |
| query(html, selector, kind=None) | JSON str | kind = "css"/"xpath"/auto |
| extract(html, schema_json, base_url="") | JSON str | typed extraction |
| evaluate(html, script) | str | run script over the DOM (sync) |
| render(html, script, base_url="") | str | hydrated HTML after page scripts run |
| transform(src, ts=False, jsx=False) | str | TS/JSX → classic JS (swc) |
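
Because the library is fetch-free, deciding which of the URLs returned by `links` to visit next is left to the caller. The following is a minimal sketch of one such crawl step using only the standard library: it keeps same-host HTTP(S) links, drops fragments, and skips URLs already seen. `same_host_frontier` is a hypothetical helper, and the sample list stands in for the output of `links`.

```python
from urllib.parse import urldefrag, urlsplit


def same_host_frontier(links, base_url, seen=None):
    """Return unseen same-host links, fragments removed, in input order."""
    seen = set() if seen is None else seen
    host = urlsplit(base_url).netloc
    frontier = []
    for link in links:
        url, _fragment = urldefrag(link)
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar targets
        if parts.netloc != host or url in seen:
            continue
        seen.add(url)
        frontier.append(url)
    return frontier


# Stand-in for the list[str] that links() returns.
sample = [
    "https://example.com/a",
    "https://example.com/a#top",
    "https://other.test/b",
    "mailto:me@example.com",
]
frontier = same_host_frontier(sample, "https://example.com/")
```

Passing the same `seen` set across calls keeps the crawl from revisiting pages.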

MIT licensed.

Download files

Source Distribution

turbo_surf-0.2.7.tar.gz (503.6 kB)

Uploaded Source

Built Distributions

turbo_surf-0.2.7-cp38-abi3-win_amd64.whl (23.0 MB)

Uploaded CPython 3.8+ Windows x86-64

turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.1 MB)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (27.5 MB)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ARM64

turbo_surf-0.2.7-cp38-abi3-macosx_11_0_arm64.whl (23.7 MB)

Uploaded CPython 3.8+ macOS 11.0+ ARM64
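
The wheel names above follow the standard wheel filename layout (name, version, optional build tag, Python tag, ABI tag, platform tag). As a rough sketch, the tags can be pulled apart with plain string handling; `parse_wheel_name` is a hypothetical helper written for illustration, not a packaging-library function.

```python
def parse_wheel_name(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        # A compressed tag set lists several platforms separated by dots.
        "platforms": platform_tag.split("."),
    }


info = parse_wheel_name(
    "turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
```

Here `cp38` together with `abi3` is what marks the wheel as usable on CPython 3.8 and later.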

File details

Details for the file turbo_surf-0.2.7.tar.gz.

File metadata

  • Download URL: turbo_surf-0.2.7.tar.gz
  • Upload date:
  • Size: 503.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.14.1

File hashes

Hashes for turbo_surf-0.2.7.tar.gz
Algorithm Hash digest
SHA256 6f89b4d23ae5f2a25a8fe9bb5e177af7ff916645c00cf574247cd2021725fe00
MD5 787f6aabc71233a2b0c8b30ee9325895
BLAKE2b-256 3bf315ba9c13b7107390ce326e9a78d2401fd22067f8eee1f013d77f0f6e3342

See more details on using hashes here.
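
A downloaded file can be checked against the published SHA256 digest with the standard `hashlib` module. The sketch below reads the file in chunks so large wheels do not have to fit in memory; `sha256_of` and `matches_published` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution listed above, the check would be `matches_published("turbo_surf-0.2.7.tar.gz", "6f89b4d23ae5f2a25a8fe9bb5e177af7ff916645c00cf574247cd2021725fe00")`, assuming the file sits in the current directory.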

File details

Details for the file turbo_surf-0.2.7-cp38-abi3-win_amd64.whl.

File hashes

Hashes for turbo_surf-0.2.7-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 c7c2448fa091128e8a02391351f35cd731134b5843026def2553a75f6db446ed
MD5 262353740266fc7e317e4942d2613aa0
BLAKE2b-256 0f4dc3c449b9e648c14496b5104d988784009b2cfdf8c29552e68e3824b96b15

File details

Details for the file turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fad80b8c163b33c69f87a34007a4476bee71c755fef56a2d861fd0fcf2bdfdf2
MD5 69102280ad1b016a80c0686c5d97815b
BLAKE2b-256 a7ce95f8f13b0f96a4888a053b5d2a3aa0afa53950eb0f407d7aa35fd45ebc62

File details

Details for the file turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for turbo_surf-0.2.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 eac095f12397495f74e475bca2b386b93ba3ecca38fb8525e10eff08074d7b60
MD5 678e9bbdd74361540ad494f16c016307
BLAKE2b-256 fc835fc576dcf44154e0c3191404c796a44b6eafd8f152c04eff8e732aa30e23

File details

Details for the file turbo_surf-0.2.7-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for turbo_surf-0.2.7-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 16601ffe9178534cd40f80469dce380390275a10773c3ee2563f71575a376ce5
MD5 6a845c7892b9d3ae33c7799473b4aa7e
BLAKE2b-256 8b2cfbf10b58b6dffc349fb252d458f16919b97f9868adfe7b04a25fbd4d2ab2
