Browserless, native-speed web crawler + extractor with a V8 JS-render tier (PyO3 binding over the turbo-surf Rust engine).

Project description

turbo-surf (Python)

Browserless, native-speed web crawler + extractor with a real V8 JS-render tier — a PyO3 binding over the turbo-surf Rust engine. Fetch-free: you pass a page's HTML in, and get a view out (Markdown, visible text, links, a typed extraction, an accessibility tree). For JS-gated pages, the engine runs the page's own scripts in a true V8 isolate over a native DOM — no headless Chromium.

import turbo_surf as ts

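# Read the page with an explicit encoding and close the file afterwards.
with open("page.html", encoding="utf-8") as f:
    html = f.read()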

ts.markdown(html, base_url="https://example.com/")   # -> Markdown str
ts.text(html)                                        # -> visible text
ts.links(html, base_url="https://example.com/")      # -> list[str]

# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/")   # -> JSON str

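# JS-gated page: run its own scripts, read the hydrated DOM.
# `script` is a JavaScript source string; the value below is only a placeholder.
script = "document.title;"
hydrated = ts.render(html, script, base_url="https://example.com/")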
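
The schema in the example above is a hand-written JSON string, which is easy to get wrong once it has more than a couple of fields. A minimal sketch of building it from a Python dict instead is shown below. `build_schema` is a hypothetical helper, not part of turbo_surf, and it only emits the two keys that appear in the example (`selector` and `list`).

```python
import json

def build_schema(fields):
    """Serialize a field map to a schema JSON string.

    Hypothetical helper: each value is either a selector string or a
    (selector, is_list) tuple. Only the "selector" and "list" keys shown
    in the example above are produced.
    """
    schema = {}
    for name, spec in fields.items():
        if isinstance(spec, str):
            schema[name] = {"selector": spec}
        else:
            selector, is_list = spec
            entry = {"selector": selector}
            if is_list:
                entry["list"] = True
            schema[name] = entry
    return json.dumps(schema)

schema = build_schema({"title": "h1", "prices": (".price", True)})
```

The resulting string can be passed as the second argument to `ts.extract` in place of the literal. Letting `json.dumps` produce the text removes quoting mistakes as a source of malformed schema JSON.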

Fatal faults (malformed schema JSON, a render-tier failure) raise turbo_surf.TurboSurfError; the non-JS views never raise.
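
One way to handle these fatal faults is to catch the error at the call site and fall back to a default. The sketch below uses stand-in names (`FakeTurboSurfError`, `fake_extract`) so it runs without the package installed; `call_with_fallback` is a hypothetical helper, not part of turbo_surf.

```python
def call_with_fallback(fn, error_type, *args, fallback=None, **kwargs):
    """Call fn(*args, **kwargs); return `fallback` if it raises error_type."""
    try:
        return fn(*args, **kwargs)
    except error_type:
        return fallback

# Stand-ins for turbo_surf.TurboSurfError and turbo_surf.extract (assumptions
# made only so this sketch is self-contained).
class FakeTurboSurfError(Exception):
    pass

def fake_extract(html, schema_json, base_url=""):
    raise FakeTurboSurfError("malformed schema JSON")

result = call_with_fallback(
    fake_extract, FakeTurboSurfError, "<h1>x</h1>", "{not json", fallback="{}"
)
```

With the real package, the equivalent call would presumably be `call_with_fallback(ts.extract, ts.TurboSurfError, html, schema, base_url="https://example.com/")`.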

Install

pip install turbo-surf

Prebuilt abi3 wheels (CPython 3.8+) are published for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64).

API

| function | returns | notes |
| --- | --- | --- |
| markdown(html, base_url="") | str | Markdown render |
| text(html) | str | visible text |
| title(html) | str | document <title> |
| html(html) | str | re-serialized HTML |
| links(html, base_url="") | list[str] | resolved hyperlink targets |
| interactive_elements(html, base_url="") | JSON str | links/buttons/inputs |
| accessibility_tree(html) | JSON str | a11y tree |
| hydration_state(html) | JSON str | hydration probe |
| detect(html) | JSON str | is the page JS-gated? |
| query(html, selector, kind=None) | JSON str | kind = "css"/"xpath"/auto |
| extract(html, schema_json, base_url="") | JSON str | typed extraction |
| evaluate(html, script) | str | run script over the DOM (sync) |
| render(html, script, base_url="") | str | hydrated HTML after page scripts run |
| transform(src, ts=False, jsx=False) | str | TS/JSX → classic JS (swc) |
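
Several of the functions above return a JSON string rather than a Python object, so callers will usually decode the result before using it. A minimal sketch of a decoding wrapper follows. `decoded` is a hypothetical helper, and `fake_query` is a stand-in whose output shape is invented purely for illustration; the real shape of each function's JSON is not specified here.

```python
import json

def decoded(fn, *args, **kwargs):
    """Call a function that returns a JSON string and return the parsed value."""
    return json.loads(fn(*args, **kwargs))

# Stand-in for a JSON-returning call such as turbo_surf.query. The payload
# below is made up and does not reflect the library's actual output.
def fake_query(html, selector, kind=None):
    return json.dumps([{"selector": selector, "kind": kind}])

rows = decoded(fake_query, "<p>hi</p>", "p", kind="css")
```

With the package installed, `decoded(ts.query, html, "h1")` would follow the same pattern.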

MIT licensed.

Download files

Source Distribution

turbo_surf-0.2.4.tar.gz (467.7 kB)

Uploaded Source

Built Distributions

turbo_surf-0.2.4-cp38-abi3-win_amd64.whl (23.0 MB)

Uploaded CPython 3.8+ Windows x86-64

turbo_surf-0.2.4-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.1 MB)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

turbo_surf-0.2.4-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (27.5 MB)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ARM64

turbo_surf-0.2.4-cp38-abi3-macosx_11_0_arm64.whl (23.7 MB)

Uploaded CPython 3.8+ macOS 11.0+ ARM64

File details

Details for the file turbo_surf-0.2.4.tar.gz.

File metadata

  • Download URL: turbo_surf-0.2.4.tar.gz
  • Upload date:
  • Size: 467.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.14.1

File hashes

Hashes for turbo_surf-0.2.4.tar.gz
Algorithm Hash digest
SHA256 70a1ffb9ec44313d6e4754290f8b9ab97b8dde4e14522b07dde1fcbcf7e5ae56
MD5 fd37d9db69b2c2776c479c114bd597cd
BLAKE2b-256 7d6f8ce7520c8c202ede454c37748e56a7b96812c4fd06999fecbab5c7d786b5
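
The published digests can be used to check a downloaded file before installing it. Below is a minimal sketch using only the standard library; `sha256_of` and `matches_published` are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value would be the SHA256 line listed above, for example `matches_published("turbo_surf-0.2.4.tar.gz", "70a1ffb9ec44313d6e4754290f8b9ab97b8dde4e14522b07dde1fcbcf7e5ae56")`.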

See more details on using hashes here.

File details

Details for the file turbo_surf-0.2.4-cp38-abi3-win_amd64.whl.

File hashes

Hashes for turbo_surf-0.2.4-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 8e7f6fe1491a65cc2606f4241a44299dda6b1897da4572309ad0e9c56d4a40f5
MD5 f57187fb081ec52102717f8460107106
BLAKE2b-256 afdec6157370032cfaf7e281f04863731a3859d69ca318dbf93288fad3238381

File details

Details for the file turbo_surf-0.2.4-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for turbo_surf-0.2.4-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 74170b4478f80c9a0fe78320cfdb1ad79ad159593beb8594c0146d2ed5565ecd
MD5 828ae52afecd07473a1c9c261f059c33
BLAKE2b-256 b5dd4cb67b8657af061fd1209c6a5dbd4c7d9729c6404a961b6b3269ee82b610

File details

Details for the file turbo_surf-0.2.4-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for turbo_surf-0.2.4-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 1571f7acc6e4da838526ff91280c1faf9619a9828aeed36ec2241cdf5153da7d
MD5 134bef83704f8a2683020bdaa573006f
BLAKE2b-256 a3f792b2e54ee9ec9bbd9fdcac5a28fb5b634fb41605ee936b2e4ef1c9135342

File details

Details for the file turbo_surf-0.2.4-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for turbo_surf-0.2.4-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9bd1fa7bf027773157ebe2463e900eb5bf440655624a9e47a5a5c5924283c335
MD5 71e6ba9971a3698c6cdcb78623784baf
BLAKE2b-256 041c672a3ba38efa40aea74c362a82f527230e69fe9f20fe380562e8b847a96b
