Deep URL crawler for Python supporting dynamic and static content, domain restrictions, and callbacks.
LinkWalker
LinkWalker is a Python library for deep URL crawling and walking. It supports both dynamic browser-based crawling (using Playwright) and static HTTP crawling, so you can traverse websites, extract links, and filter URLs. It is aimed at developers building scrapers, bots, or web analyzers.
Features
- Dynamic crawling with Playwright (handles JavaScript-heavy pages)
- Static HTTP crawling using aiohttp for lightweight scraping
- Deep crawling with configurable max depth
- URL filtering:
  - Include or exclude URLs based on substrings
  - Clean URLs by removing query parameters
  - Blacklist certain file extensions
  - HTTPS-only option
- Domain control: restrict crawling to specific domains or subdomains
- Callbacks: execute custom logic on each page visited
- Concurrency control with adjustable max parallel pages
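To make the filtering and domain-control features concrete, here is a minimal standard-library sketch of what such checks could look like. It is not LinkWalker's implementation: `clean_url`, `same_site`, `keep_url`, and `BLOCKED_EXTENSIONS` are hypothetical names, and treating the include list as "match any substring" is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical names throughout; LinkWalker's real internals may differ.
BLOCKED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip")


def clean_url(url: str) -> str:
    """Drop the query string and fragment from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def same_site(url: str, origin: str) -> bool:
    """True if url's host equals origin's host or is a subdomain of it."""
    host = urlsplit(url).hostname or ""
    root = urlsplit(origin).hostname or ""
    return host == root or host.endswith("." + root)


def keep_url(url, origin, must_contain=(), must_not_contain=(), https_only=False):
    """Apply the filters from the feature list to a single URL."""
    parts = urlsplit(url)
    if https_only and parts.scheme != "https":
        return False
    if parts.path.lower().endswith(BLOCKED_EXTENSIONS):
        return False
    if not same_site(url, origin):
        return False
    # Assumption: a URL passes if it contains ANY of the required substrings.
    if must_contain and not any(s in url for s in must_contain):
        return False
    if any(s in url for s in must_not_contain):
        return False
    return True
```

For example, `keep_url("https://example.com/tag/love/page/2/", "https://example.com", must_contain=["/tag/"], must_not_contain=["/page/"])` is rejected because the exclude rule wins over the include rule in this sketch.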
Installation
pip install linkwalker
Example Usage
Dynamic Browser Walker
    import asyncio

    from linkwalker.spider.dynamic import BrowserWalker
    from linkwalker.spider._types import BrowserWalkOptions
    from playwright.async_api import Page


    # Called for every page the walker visits.
    async def on_page(page: Page, html):
        print("Visited:", page.url)


    async def main():
        # Headless browser with up to 4 pages open in parallel.
        walker = BrowserWalker(headless=True, max_pages=4)
        await walker.start()

        options: BrowserWalkOptions = {
            "https_only": False,
            "clean_url": True,
            "max_depth": 2,
            "on_page": on_page,
            "allow_all_domains": False,
        }

        urls = await walker.walk(origin_url="https://example.com", options=options)
        print(f"Found {len(urls)} URLs")

        await walker.close()


    asyncio.run(main())
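The `max_pages` argument in the example above caps how many pages are processed in parallel. One common way to enforce such a cap in asyncio code is a semaphore. The sketch below illustrates that general technique only. It is an assumption, not a description of how LinkWalker works internally, and `run_bounded`, `fake_visit`, and `demo` are hypothetical names.

```python
import asyncio


async def run_bounded(urls, visit, max_pages=4):
    """Run visit(url) for every URL, with at most max_pages running at once."""
    semaphore = asyncio.Semaphore(max_pages)

    async def guarded(url):
        # Wait for a free slot, then hold it for the duration of the visit.
        async with semaphore:
            return await visit(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def demo():
    active = 0  # visits currently in flight
    peak = 0    # highest number of simultaneous visits observed

    async def fake_visit(url):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for loading a page
        active -= 1
        return url

    urls = [f"https://example.com/{i}" for i in range(10)]
    results = await run_bounded(urls, fake_visit, max_pages=4)
    return results, peak
```

Running `asyncio.run(demo())` visits all ten placeholder URLs while `peak` never exceeds the configured limit.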
Static HTTP Walker
    import asyncio

    from linkwalker.spider.static import HTTPWalker
    from linkwalker.spider._types import HTTPWalkOptions


    # Called for every URL the walker fetches.
    async def on_page(url, html):
        print("Visited:", url)


    async def main():
        # Up to 5 requests in parallel.
        walker = HTTPWalker(max_pages=5)
        await walker.start()

        options: HTTPWalkOptions = {
            "https_only": False,
            "clean_url": True,
            "max_depth": 2,
            "on_page": on_page,
            "allow_all_domains": False,
            "url_must_contain": ["/tag/", "/author/"],
            "url_must_not_contain": ["/page/"],
        }

        urls = await walker.walk(origin_url="https://quotes.toscrape.com", options=options)
        print(f"Found {len(urls)} URLs")

        await walker.close()


    asyncio.run(main())
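A callback can do more than print. As a sketch, the following `on_page` variant for the static walker records each page's title using only the standard library. It assumes the `html` argument is the page source as a `str`, which the examples above suggest but do not state. `TitleParser` and `titles` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first non-empty title is recorded.
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


titles = {}  # url -> page title, filled in by the callback


async def on_page(url, html):
    # Assumption: html is the page source as a string.
    parser = TitleParser()
    parser.feed(html)
    titles[url] = parser.title.strip()
```

Passing this function as the `"on_page"` option would leave `titles` populated once the walk finishes, provided the callback is invoked as in the example above.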
Contributing
Feel free to submit issues or pull requests. Contributions to improve crawling efficiency, filtering, or feature support are welcome.
License
MIT License