
Deep URL crawler for Python supporting dynamic and static content, domain restrictions, and callbacks.

Project description

LinkWalker

LinkWalker is a Python library for deep URL crawling. It supports both dynamic browser-based crawling (using Playwright) and static HTTP crawling, so you can traverse websites, extract links, and filter URLs with ease. It is aimed at developers building scrapers, bots, or web analyzers.

Features

  • Dynamic crawling with Playwright (handles JavaScript-heavy pages)
  • Static HTTP crawling using aiohttp for lightweight scraping
  • Deep crawling with configurable max depth
  • URL filtering:
    • Include or exclude URLs based on substrings
    • Clean URLs by removing query parameters
    • Blacklist certain file extensions
    • HTTPS-only option
  • Domain control: restrict crawling to specific domains or subdomains
  • Callbacks: execute custom logic on each page visited
  • Concurrency control with adjustable max parallel pages
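As a rough sketch of what the clean_url filter option does, stripping query parameters and fragments can be expressed with the standard library alone; LinkWalker's exact normalization may differ:

```python
from urllib.parse import urlsplit, urlunsplit

def clean_url(url: str) -> str:
    """Strip the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# clean_url("https://example.com/a?page=2#top") -> "https://example.com/a"
```

Cleaning URLs this way also deduplicates pages that differ only in tracking or pagination parameters.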

Installation

pip install linkwalker

If you plan to use the dynamic BrowserWalker, Playwright's browser binaries must also be installed:

playwright install

Example Usage

Dynamic Browser Walker

import asyncio
from linkwalker.spider.dynamic import BrowserWalker
from linkwalker.spider._types import BrowserWalkOptions
from playwright.async_api import Page

async def on_page(page: Page, html: str):
    print("Visited:", page.url)

async def main():
    walker = BrowserWalker(headless=True, max_pages=4)
    await walker.start()

    options: BrowserWalkOptions = {
        "https_only": False,
        "clean_url": True,
        "max_depth": 2,
        "on_page": on_page,
        "allow_all_domains": False,
    }

    urls = await walker.walk(origin_url="https://example.com", options=options)
    print(f"Found {len(urls)} URLs")

    await walker.close()

asyncio.run(main())

Static HTTP Walker

import asyncio
from linkwalker.spider.static import HTTPWalker
from linkwalker.spider._types import HTTPWalkOptions

async def on_page(url, html):
    print("Visited:", url)

async def main():
    walker = HTTPWalker(max_pages=5)
    await walker.start()

    options: HTTPWalkOptions = {
        "https_only": False,
        "clean_url": True,
        "max_depth": 2,
        "on_page": on_page,
        "allow_all_domains": False,
        "url_must_contain": ["/tag/", "/author/"],
        "url_must_not_contain": ["/page/"]
    }

    urls = await walker.walk(origin_url="https://quotes.toscrape.com", options=options)
    print(f"Found {len(urls)} URLs")

    await walker.close()

asyncio.run(main())

Contributing

Feel free to submit issues or pull requests. Contributions to improve crawling efficiency, filtering, or feature support are welcome.

License

MIT License

Project details


Download files

Download the file for your platform.

Source Distribution

linkwalker-0.9.0.tar.gz (8.9 kB)

Uploaded Source

Built Distribution


linkwalker-0.9.0-py3-none-any.whl (9.9 kB)

Uploaded Python 3

File details

Details for the file linkwalker-0.9.0.tar.gz.

File metadata

  • Download URL: linkwalker-0.9.0.tar.gz
  • Upload date:
  • Size: 8.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for linkwalker-0.9.0.tar.gz:

  • SHA256: b40a06b4e445e2f4fa39db5952b725b64186f60d52a0a7191d048d862df655ec
  • MD5: 75a542c4d5f7549ad4468b5c126bb239
  • BLAKE2b-256: d377642c46a664c098e65e7928b8fde28fdea67bd1d2b921247fdff184f76b4e
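To verify a downloaded archive against the published SHA256 digest before installing, a short standard-library check is enough (the file name below assumes the source distribution listed above):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published digest before installing, e.g.:
# assert sha256_of("linkwalker-0.9.0.tar.gz") == "b40a06b4..."
```

Reading in chunks keeps memory use constant even for large archives.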


Provenance

The following attestation bundles were made for linkwalker-0.9.0.tar.gz:

Publisher: publish.yml on cvcvka5/linkwalker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file linkwalker-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: linkwalker-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 9.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for linkwalker-0.9.0-py3-none-any.whl:

  • SHA256: 2f36d976b2a81feaa532dfaa65a8d0e67fdb6e37d1624d0b8a93aeac2d136c2e
  • MD5: bad4e540d55497d7e2a044241f2570e8
  • BLAKE2b-256: d8041ffbc25cfd0660895807c4fbea9715088910bf6e988e6ceb8f79cf693233


Provenance

The following attestation bundles were made for linkwalker-0.9.0-py3-none-any.whl:

Publisher: publish.yml on cvcvka5/linkwalker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
