Browserless, native-speed web crawler + extractor with a V8 JS-render tier (PyO3 binding over the turbo-surf Rust engine).

turbo-surf (Python)

Browserless, native-speed web crawler + extractor with a real V8 JS-render tier, built as a PyO3 binding over the turbo-surf Rust engine. The library is fetch-free: you pass a page's HTML in and get a view out (Markdown, visible text, links, a typed extraction, or an accessibility tree). For JS-gated pages, the engine runs the page's own scripts in a true V8 isolate over a native DOM, with no headless Chromium.

import turbo_surf as ts

# Read the page from disk; the library itself never fetches.
with open("page.html", encoding="utf-8") as f:
    html = f.read()

ts.markdown(html, base_url="https://example.com/")   # -> Markdown str
ts.text(html)                                        # -> visible text
ts.links(html, base_url="https://example.com/")      # -> list[str]

# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/")   # -> JSON str

# JS-gated page: run its own scripts, read the hydrated DOM.
script = "/* your JavaScript source here */"   # the script argument to render()
hydrated = ts.render(html, script, base_url="https://example.com/")
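
Hand-writing the schema as a JSON string is easy to get wrong. The following is a minimal sketch, using only the standard library, of building the same schema as the example above from a Python dict. The helper name `build_schema` is hypothetical and not part of turbo_surf, and it assumes only the two spec keys shown above (`selector` and `list`).

```python
import json


def build_schema(fields):
    """Serialize a {field: (selector, is_list)} mapping to a schema JSON string.

    Hypothetical helper, not part of turbo_surf. It emits only the two
    keys that appear in the example schema: "selector" and "list".
    """
    schema = {}
    for name, (selector, is_list) in fields.items():
        spec = {"selector": selector}
        if is_list:
            spec["list"] = True
        schema[name] = spec
    # json.dumps always produces well-formed JSON, unlike a hand-typed string.
    return json.dumps(schema)


schema = build_schema({"title": ("h1", False), "prices": (".price", True)})
```

The resulting string can be passed as the second argument to `ts.extract`. Since `extract` returns a JSON string, `json.loads` should decode it into Python objects; the shape of the decoded result is not documented here.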

Fatal faults (malformed schema JSON, a render-tier failure) raise turbo_surf.TurboSurfError; the non-JS views never raise.
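
One way to handle such faults is a small wrapper that returns a fallback value instead of propagating the exception. This is a sketch under assumptions: `call_or_default` is a hypothetical helper, not part of turbo_surf, and the function and exception type are passed in as arguments so the snippet runs without the package installed.

```python
def call_or_default(fn, *args, error_type=Exception, default=None, **kwargs):
    """Call fn(*args, **kwargs); return `default` if it raises `error_type`.

    Hypothetical helper, not part of turbo_surf. Exceptions of any other
    type are not caught and propagate to the caller.
    """
    try:
        return fn(*args, **kwargs)
    except error_type:
        return default
```

With turbo_surf imported as `ts`, a call might look like `call_or_default(ts.extract, html, schema, error_type=ts.TurboSurfError)`, which would yield `None` on a malformed schema rather than raising.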

Install

pip install turbo-surf

Prebuilt abi3 wheels (CPython 3.8+) are published for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64).
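
To check ahead of time whether the current interpreter matches one of these wheels, a rough sketch using only the standard library follows. The helper `has_prebuilt_wheel` is hypothetical, and the machine strings are the values `platform.machine()` typically reports on each system. It does not check the C library or OS version, so treat a `True` result as a hint rather than a guarantee.

```python
import platform
import sys

# (system, machine) pairs with prebuilt wheels, per the list above.
PREBUILT = {
    ("Linux", "x86_64"),
    ("Linux", "aarch64"),
    ("Darwin", "arm64"),
    ("Windows", "AMD64"),
}


def has_prebuilt_wheel(system=None, machine=None, implementation=None, version=None):
    """Return True if a prebuilt abi3 wheel likely matches this interpreter.

    Hypothetical helper. Any argument left as None is read from the
    running interpreter.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    implementation = implementation or platform.python_implementation()
    version = tuple(version or sys.version_info)[:2]
    return (
        implementation == "CPython"
        and version >= (3, 8)
        and (system, machine) in PREBUILT
    )
```

When this returns `False`, pip would presumably fall back to building from source, which for a PyO3 project generally requires a Rust toolchain.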

API

| function | returns | notes |
| --- | --- | --- |
| markdown(html, base_url="") | str | Markdown render |
| text(html) | str | visible text |
| title(html) | str | document <title> |
| html(html) | str | re-serialized HTML |
| links(html, base_url="") | list[str] | resolved hyperlink targets |
| interactive_elements(html, base_url="") | JSON str | links/buttons/inputs |
| accessibility_tree(html) | JSON str | a11y tree |
| hydration_state(html) | JSON str | hydration probe |
| detect(html) | JSON str | is the page JS-gated? |
| query(html, selector, kind=None) | JSON str | kind = "css", "xpath", or auto |
| extract(html, schema_json, base_url="") | JSON str | typed extraction |
| evaluate(html, script) | str | run script over the DOM (sync) |
| render(html, script, base_url="") | str | hydrated HTML after page scripts run |
| transform(src, ts=False, jsx=False) | str | TS/JSX → classic JS (swc) |

MIT licensed.

Download files

Source distribution

turbo_surf-0.3.1.tar.gz (516.5 kB)

Built distributions

turbo_surf-0.3.1-cp38-abi3-win_amd64.whl (27.2 MB): CPython 3.8+, Windows x86-64
turbo_surf-0.3.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.4 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
turbo_surf-0.3.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (31.8 MB): CPython 3.8+, manylinux (glibc 2.17+), ARM64
turbo_surf-0.3.1-cp38-abi3-macosx_11_0_arm64.whl (28.0 MB): CPython 3.8+, macOS 11.0+, ARM64