Browserless, native-speed web crawler + extractor with a V8 JS-render tier (PyO3 binding over the turbo-surf Rust engine).

Project description

turbo-surf (Python)

A browserless, native-speed web crawler and extractor with a real V8 JS-render tier, built as a PyO3 binding over the turbo-surf Rust engine. The library is fetch-free: you pass a page's HTML in and get a view out (Markdown, visible text, links, a typed extraction, or an accessibility tree). For JS-gated pages, the engine runs the page's own scripts in a true V8 isolate over a native DOM, with no headless Chromium.

import turbo_surf as ts

# Read the page from disk; the library does no fetching itself.
with open("page.html", encoding="utf-8") as f:
    html = f.read()

ts.markdown(html, base_url="https://example.com/")   # -> Markdown str
ts.text(html)                                        # -> visible text
ts.links(html, base_url="https://example.com/")      # -> list[str]

# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/")   # -> JSON str

# JS-gated page: run its own scripts, read the hydrated DOM.
# `script` must be defined by the caller; it is not set in this snippet.
hydrated = ts.render(html, script, base_url="https://example.com/")

Fatal faults (malformed schema JSON, a render-tier failure) raise turbo_surf.TurboSurfError; the non-JS views never raise.
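
The schema is passed as a JSON string, so one way to avoid hand-written quoting mistakes is to build it from a Python dict with json.dumps. The sketch below also wraps extract in a try/except for TurboSurfError. The names build_schema and extract_or_none are hypothetical helpers invented for this illustration, not part of turbo_surf, and the wrapper assumes the returned JSON string is decodable with json.loads.

```python
import json


def build_schema(fields):
    """Serialize a {field_name: selector_spec} mapping to a JSON string."""
    return json.dumps(fields)


# Same schema as the quick-start example, written as a Python dict.
schema = build_schema({
    "title": {"selector": "h1"},
    "prices": {"selector": ".price", "list": True},
})


def extract_or_none(html, schema_json, base_url=""):
    """Hypothetical wrapper: parsed extraction, or None on a fatal fault."""
    # Imported lazily so build_schema stays usable without the wheel installed.
    import turbo_surf as ts

    try:
        return json.loads(ts.extract(html, schema_json, base_url=base_url))
    except ts.TurboSurfError:
        # Raised for malformed schema JSON, per the note above.
        return None
```

Returning None is only one possible policy; re-raising or logging may suit a crawler better.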

Install

pip install turbo-surf

Prebuilt abi3 wheels (CPython 3.8+) are published for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64).

API

function                                 returns    notes
markdown(html, base_url="")              str        Markdown render
text(html)                               str        visible text
title(html)                              str        document <title>
html(html)                               str        re-serialized HTML
links(html, base_url="")                 list[str]  resolved hyperlink targets
interactive_elements(html, base_url="")  JSON str   links/buttons/inputs
accessibility_tree(html)                 JSON str   a11y tree
hydration_state(html)                    JSON str   hydration probe
detect(html)                             JSON str   is the page JS-gated?
query(html, selector, kind=None)         JSON str   kind = "css"/"xpath"/auto
extract(html, schema_json, base_url="")  JSON str   typed extraction
evaluate(html, script)                   str        run script over the DOM (sync)
render(html, script, base_url="")        str        hydrated HTML after page scripts run
transform(src, ts=False, jsx=False)      str        TS/JSX → classic JS (swc)
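
Several functions in the table return a JSON string rather than Python objects. A small decorator can decode them once at the call site. This is a minimal sketch: decoded is a hypothetical helper, and fake_query is a stand-in with an invented output shape, used only so the example runs without the wheel. The real shape of query's JSON is not documented here.

```python
import functools
import json


def decoded(fn):
    """Wrap a function that returns a JSON string so it returns parsed data."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return json.loads(fn(*args, **kwargs))
    return wrapper


def fake_query(html, selector, kind=None):
    """Stand-in for a "returns JSON str" function; the payload is made up."""
    return json.dumps([{"selector": selector, "kind": kind}])


query = decoded(fake_query)
result = query("<h1>Hi</h1>", "h1", kind="css")
```

With the package installed, the same wrapper would be applied as decoded(ts.query) or decoded(ts.detect).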

MIT licensed.

Download files

Source Distribution

turbo_surf-0.3.0.tar.gz (514.6 kB)

Uploaded Source

Built Distributions

turbo_surf-0.3.0-cp38-abi3-win_amd64.whl (27.2 MB)

Uploaded CPython 3.8+ Windows x86-64

turbo_surf-0.3.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.4 MB)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

turbo_surf-0.3.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (31.8 MB)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ARM64

turbo_surf-0.3.0-cp38-abi3-macosx_11_0_arm64.whl (28.0 MB)

Uploaded CPython 3.8+ macOS 11.0+ ARM64

File details

Details for the file turbo_surf-0.3.0.tar.gz.

File metadata

  • Download URL: turbo_surf-0.3.0.tar.gz
  • Upload date:
  • Size: 514.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.14.1

File hashes

Hashes for turbo_surf-0.3.0.tar.gz
Algorithm Hash digest
SHA256 ac9cf0fa87da104e041ea811d185eddb8934284ee47df77565f30a37db5cdd2e
MD5 fa75832b4d9f56a4584f3c43ad15e97f
BLAKE2b-256 e55871b59a40ac5c66c8434531278f0ab1abf9ac5c7f6a202f6944163f1f4a6b
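
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The sketch below is illustrative: sha256_of and matches_published_hash are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed above for turbo_surf-0.3.0.tar.gz.
EXPECTED_SHA256 = "ac9cf0fa87da104e041ea811d185eddb8934284ee47df77565f30a37db5cdd2e"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True when the file's digest equals the expected one (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For example, matches_published_hash("turbo_surf-0.3.0.tar.gz") should be True for an intact download of the source distribution.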

File details

Details for the file turbo_surf-0.3.0-cp38-abi3-win_amd64.whl.

File hashes

Hashes for turbo_surf-0.3.0-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 235a4048127431335140870843b275d051f87ad829074eaab10c22f9c7ecbc82
MD5 5fd2d9317ca3b05e6f1dae22adbd08c9
BLAKE2b-256 742843bf8920d901fe7af530635133ff8ccf9cc3ff0a5c4a94c420a88e23fc01

File details

Details for the file turbo_surf-0.3.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for turbo_surf-0.3.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e5af98e77d05342f471ff40531b8a60503d5b72bbddc729da4175e8255281ef1
MD5 5d851c1db6ab8d33a9f8ad1a84da8f2b
BLAKE2b-256 d2808f98638ae6df98d53f7a489f621a32bcccb057defb441a566d9ebf359034

File details

Details for the file turbo_surf-0.3.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for turbo_surf-0.3.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 00138fb33683ed994d4093cd719a5f06b3855584ffbb1ddb133ed233913b6918
MD5 fae2b716b6fa3d790f8bd670d5548937
BLAKE2b-256 68d37fff49b3c6fde752990161560009b7f1c951a3e491696608f8b3f8f0f626

File details

Details for the file turbo_surf-0.3.0-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for turbo_surf-0.3.0-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b055a7c3444e82d2f5ad9a378f6ad4a1c10d0e669372863149aa449ed6338dcc
MD5 48d199691195c351a553693315c9f058
BLAKE2b-256 47e585a0cddd60561496ac8b897c80b435f42f7bc1d9b0d379f18d0ac9cd5b02
