Browserless, native-speed web crawler + extractor with a V8 JS-render tier (PyO3 binding over the turbo-surf Rust engine).

Project description

turbo-surf (Python)

Browserless, native-speed web crawler + extractor with a real V8 JS-render tier, exposed as a PyO3 binding over the turbo-surf Rust engine. The binding does no fetching of its own: you pass a page's HTML in and get a view out (Markdown, visible text, links, a typed extraction, or an accessibility tree). For JS-gated pages, the engine runs the page's own scripts in a true V8 isolate over a native DOM, with no headless Chromium.

import turbo_surf as ts

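# Read the saved page; the UTF-8 encoding is an assumption about the file.
with open("page.html", encoding="utf-8") as f:
    html = f.read()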

ts.markdown(html, base_url="https://example.com/")   # -> Markdown str
ts.text(html)                                        # -> visible text
ts.links(html, base_url="https://example.com/")      # -> list[str]

# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/")   # -> JSON str

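# JS-gated page: run its own scripts, read the hydrated DOM.
# `script` is a caller-supplied JavaScript source string; this value is a placeholder.
script = "/* caller-supplied JavaScript */"
hydrated = ts.render(html, script, base_url="https://example.com/")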

Fatal faults (malformed schema JSON, a render-tier failure) raise turbo_surf.TurboSurfError; the non-JS views never raise.
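Since a malformed schema string is one of the faults that raises, it may help to build the schema with the standard json module instead of writing the JSON by hand. The following is a minimal sketch under that assumption; `build_schema` is a hypothetical helper, not part of turbo-surf, and it emits only the "selector" and "list" keys that appear in the example above.

```python
import json


def build_schema(fields):
    """Serialize {field_name: (selector, is_list)} into a schema JSON string.

    Hypothetical helper: only the "selector" and "list" keys from the README
    example are emitted; no other schema keys are assumed to exist.
    """
    schema = {}
    for name, (selector, is_list) in fields.items():
        spec = {"selector": selector}
        if is_list:
            spec["list"] = True
        schema[name] = spec
    # json.dumps guarantees well-formed JSON (quoting, true/false spelling).
    return json.dumps(schema)


schema = build_schema({"title": ("h1", False), "prices": (".price", True)})
print(schema)
```

The resulting string can be passed where the hand-written `schema` literal is used above.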

Install

pip install turbo-surf

Prebuilt abi3 wheels (CPython 3.8+) are published for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64).

API

| function | returns | notes |
| --- | --- | --- |
| markdown(html, base_url="") | str | Markdown render |
| text(html) | str | visible text |
| title(html) | str | document <title> |
| html(html) | str | re-serialized HTML |
| links(html, base_url="") | list[str] | resolved hyperlink targets |
| interactive_elements(html, base_url="") | JSON str | links/buttons/inputs |
| accessibility_tree(html) | JSON str | a11y tree |
| hydration_state(html) | JSON str | hydration probe |
| detect(html) | JSON str | is the page JS-gated? |
| query(html, selector, kind=None) | JSON str | kind = "css"/"xpath"/auto |
| extract(html, schema_json, base_url="") | JSON str | typed extraction |
| evaluate(html, script) | str | run script over the DOM (sync) |
| render(html, script, base_url="") | str | hydrated HTML after page scripts run |
| transform(src, ts=False, jsx=False) | str | TS/JSX → classic JS (swc) |
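Several of the functions listed return a JSON string, not a Python object, so callers will usually want to decode the result. A small sketch of one way to do that follows; `decoded` and `fake_view` are hypothetical names, and the stand-in payload is invented for the demo and says nothing about the library's real output shape.

```python
import json


def decoded(view_fn, *args, **kwargs):
    """Call a view function that returns a JSON string and parse the result."""
    return json.loads(view_fn(*args, **kwargs))


def fake_view(html):
    """Stand-in for a JSON-returning view; the payload shape is made up."""
    return json.dumps({"length": len(html)})


result = decoded(fake_view, "<h1>Hi</h1>")
print(result["length"])
```

With the package installed, the same wrapper would presumably be called as `decoded(ts.detect, html)` or `decoded(ts.extract, html, schema)`.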

MIT licensed.

Download files

Source Distribution

turbo_surf-0.2.5.tar.gz (467.9 kB): source

Built Distributions

turbo_surf-0.2.5-cp38-abi3-win_amd64.whl (23.0 MB): CPython 3.8+, Windows x86-64

turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.1 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64

turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (27.5 MB): CPython 3.8+, manylinux (glibc 2.17+), ARM64

turbo_surf-0.2.5-cp38-abi3-macosx_11_0_arm64.whl (23.7 MB): CPython 3.8+, macOS 11.0+, ARM64
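The wheel file names follow the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag). As a rough illustration of how to read them, here is a sketch that splits a file name into those parts; `parse_wheel_name` is a hypothetical helper written for this example, and pip does this selection for you in normal use.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its standard dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel file name: " + filename)
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        # A compressed platform tag lists several platforms joined by dots.
        "platforms": platform_tag.split("."),
    }


info = parse_wheel_name(
    "turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info["python"], info["abi"], info["platforms"])
```

The `cp38` and `abi3` tags together are what allow one wheel per platform to serve CPython 3.8 and later.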

File details

Details for the file turbo_surf-0.2.5.tar.gz.

File metadata

  • Download URL: turbo_surf-0.2.5.tar.gz
  • Upload date:
  • Size: 467.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.14.1

File hashes

Hashes for turbo_surf-0.2.5.tar.gz
Algorithm Hash digest
SHA256 ec3fb225254d20954fb78f41eeac473ea5a470d59acd304d12c7f184b61c2055
MD5 a9d38032a0d8cbdfd97d0bd3302accfe
BLAKE2b-256 fa7029eb15d52b99680032a0934aade2d322adc84466e432ecc38ce794c44244
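To check a downloaded file against a published SHA256 digest, something like the following sketch can be used. It relies only on the standard hashlib module; `sha256_of` and `matches_digest` are hypothetical helper names, and the file path you pass is whatever location you downloaded the archive to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("turbo_surf-0.2.5.tar.gz", "ec3fb225254d20954fb78f41eeac473ea5a470d59acd304d12c7f184b61c2055")` should return True for an intact copy of the source distribution. pip can also enforce hashes itself through its hash-checking mode.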

File details

Details for the file turbo_surf-0.2.5-cp38-abi3-win_amd64.whl.

File hashes

Hashes for turbo_surf-0.2.5-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 807b143180872c5f2a9016fdf4191da2c7bddcdaadd717148a7a975d34b21f08
MD5 40d0f283e4b2fe023901259551eb4463
BLAKE2b-256 266901dce07533f823e079cef234fb0ed5ac730918449aeeb8d0c4c68e85d546

File details

Details for the file turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7659e297d5b4d97f2716949c88351d91351f5274b12f6b3ebd5856d24fc15611
MD5 0c9baab8f45bd050129bba9cadc95dc0
BLAKE2b-256 3eef9a03510fc4dea3e54dfc4e37e07650b4cff70f9048037ae3a2630de99fb3

File details

Details for the file turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for turbo_surf-0.2.5-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 1174b525ba30d33acef0935c49e04fa9a1f8877fd166698eba908cf7af7c632e
MD5 148522dfddf7854255792c2822240322
BLAKE2b-256 1ec0468518df610b16b3ed08f59e094c2895d5b4391be89028d27008e812a02e

File details

Details for the file turbo_surf-0.2.5-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for turbo_surf-0.2.5-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 863aaf005394835f0fe932a066f9dc3c57c74e95df3bff6909bff44e0fcaf5bd
MD5 43d4340145aadea37323aff58cb5e6c1
BLAKE2b-256 59a82926eb22b6e690e42232e98ff09febcef65801843df01d464e97c8f7f774
