# Web Crawler

An advanced async web crawler toolkit with powerful scraping capabilities and AI agent integration - inspect, discover, crawl, and extract web content.
## Features
- Async/Await Architecture - Built with `asyncio` for high-performance concurrent crawling
- Comprehensive Tools - Inspect, discover, extract, and crawl websites
- Stealth Mode - Built-in stealth capabilities to avoid detection
- Clean Content - Automatic HTML cleaning and markdown conversion
- AI Agent Ready - Easy integration with LangChain and CrewAI
- Flexible Scoping - Control crawl scope (domain, subdomain, path)
- Smart Discovery - Keyword-based link discovery with relevance scoring
## Installation

```bash
# Clone the repository
git clone <repository-url>
cd crawler

# Install dependencies with uv
uv sync

# Install Playwright browsers
playwright install chromium
```
## Core Tools
### `inspect_site(url)`

Analyze website structure, metadata, navigation, and sitemap.

```python
result = await tool.inspect_site("https://example.com")
print(result['metadata']['title'])
print(result['sitemap_summary']['total_urls'])
```
### `discover_links(url, keywords, scope="domain")`

Find relevant links based on keywords with relevance scoring.

```python
links = await tool.discover_links(
    url="https://example.com",
    keywords=["documentation", "api", "tutorial"]
)
```
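Each result carries a relevance score (see the API reference below), so the list can be sorted or filtered before crawling further. A minimal sketch, assuming the return format documented for `discover_links`; the threshold value is an arbitrary example:

```python
# Keep only the most relevant links (threshold of 3 is an arbitrary example value)
top_links = [link for link in links if link["score"] >= 3]

# Or sort by relevance and inspect the best matches
for link in sorted(links, key=lambda item: item["score"], reverse=True)[:5]:
    print(link["score"], link["url"], link["matches"])
```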
### `extract_links(url, topology="mesh", scope="subdomain", max_pages=50)`

Crawl a website to discover all internal pages.

```python
urls = await tool.extract_links(
    url="https://example.com",
    topology="mesh",  # BFS crawling
    max_pages=100
)
```
### `extract_content(url, click_selectors=None, screenshot=False)`

Extract clean markdown content from any webpage.

```python
content = await tool.extract_content("https://example.com")
print(content['markdown'])
```
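For pages that hide content behind expandable sections, the optional parameters from the signature above can help. A hedged sketch; the URL and CSS selector are placeholders, not a real site's markup:

```python
content = await tool.extract_content(
    "https://example.com/docs",
    click_selectors=[".accordion-toggle"],  # hypothetical selector for collapsed sections
    screenshot=True                         # also capture a screenshot alongside the markdown
)
print(content['metadata']['title'])
if content['screenshot']:
    print("Screenshot captured")
```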
## Quick Start

```python
import asyncio

import tool


async def main():
    # Extract content from a page
    result = await tool.extract_content("https://example.com")
    print(f"Title: {result['metadata']['title']}")
    print(f"Content:\n{result['markdown'][:500]}")


asyncio.run(main())
```
## Concurrent Crawling

```python
import asyncio

import tool


async def crawl_multiple():
    urls = [
        "https://example.com",
        "https://example.org",
        "https://example.net",
    ]

    # Fetch all concurrently for maximum speed
    results = await asyncio.gather(*[
        tool.extract_content(url) for url in urls
    ])
    return results


results = asyncio.run(crawl_multiple())
```
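For larger URL lists it is usually worth capping concurrency so the target site isn't overwhelmed. A minimal sketch using `asyncio.Semaphore`; the limit of 5 is an arbitrary example, not a setting exposed by the library:

```python
import asyncio

import tool


async def crawl_bounded(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)

    async def fetch(url):
        # Only `limit` extractions run at any one time
        async with semaphore:
            return await tool.extract_content(url)

    return await asyncio.gather(*[fetch(url) for url in urls])
```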
## AI Agent Integration
### LangChain
```python
from langchain_core.tools import Tool

import asyncio
import tool

langchain_tools = [
    Tool(
        name="extract_content",
        description="Extract content from a webpage",
        func=lambda url: asyncio.run(tool.extract_content(url))
    )
]

# Use with any LangChain agent
```
See `examples/langchain_agent.py` for a complete example.
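Because the crawler tools are coroutines, an async-capable agent can also call them directly instead of wrapping them in `asyncio.run`. A sketch, assuming `StructuredTool.from_function` from `langchain_core` (check the API of your installed LangChain version):

```python
import asyncio

from langchain_core.tools import StructuredTool

import tool

extract_tool = StructuredTool.from_function(
    coroutine=tool.extract_content,  # used when the agent runs asynchronously
    func=lambda url: asyncio.run(tool.extract_content(url)),  # synchronous fallback
    name="extract_content",
    description="Extract clean markdown content from a webpage",
)
```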
### CrewAI
```python
from crewai.tools import BaseTool

import asyncio
import tool


class CrawlerTool(BaseTool):
    name: str = "Web Crawler"
    description: str = "Crawls and extracts web content"

    def _run(self, url: str) -> str:
        result = asyncio.run(tool.extract_content(url))
        return result['markdown']

# Use with CrewAI agents
```
See `examples/crewai_agents.py` for a complete multi-agent example.
## Project Structure
```
crawler/
├── tool.py              # Main crawler tools
├── util.py              # Utility functions
├── main.py              # Simple example
├── tests/               # Test suite
│   ├── test_tool.py
│   ├── test_util.py
│   └── test_async.py
├── examples/            # Usage examples
│   ├── basic_usage.py
│   ├── concurrent_crawling.py
│   ├── langchain_agent.py
│   └── crewai_agents.py
└── pyproject.toml       # Dependencies
```
## Testing

```bash
# Run all tests
make test
# or
pytest tests/ -v

# Run a specific test file
pytest tests/test_tool.py -v
```
## Examples

All examples are in the `examples/` directory:

- `basic_usage.py` - Simple usage of all tools
- `concurrent_crawling.py` - Performance comparison
- `langchain_agent.py` - LangChain integration
- `crewai_agents.py` - Multi-agent CrewAI system

Run any example:

```bash
python examples/basic_usage.py
```
## Development

```bash
# Run main demo
make run

# Run tests
make test

# Format code
black .

# Type checking
mypy tool.py util.py
```
## API Reference

### Tool Functions
```python
async def inspect_site(url: str) -> dict
```

Returns:

```python
{
    "metadata": {
        "title": str,
        "description": str,
        "keywords": str
    },
    "navigation": {
        "header": [{"text": str, "url": str}],
        "nav": [...],
        "footer": [...]
    },
    "sitemap_summary": {
        "total_urls": int,
        "structure_hint": dict
    }
}
```
```python
async def discover_links(url: str, keywords: list[str], scope: str = "domain") -> list[dict]
```

Returns:

```python
[
    {
        "url": str,
        "text": str,
        "score": int,
        "matches": [str]
    }
]
```
```python
async def extract_links(url: str, topology: str = "mesh", scope: str = "subdomain", max_pages: int = 50) -> list[str]
```

Parameters:

- `topology`: "mesh" (BFS), "linear", "hub_and_spoke", "sidebar"
- `scope`: "subdomain", "domain", "path"

Returns: List of discovered URLs
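For example, a crawl limited to a documentation subtree might combine `scope` and `topology` like this; the URL and page limit are illustrative only:

```python
# Stay within the /docs/ path and stop after 25 pages (values are illustrative)
doc_urls = await tool.extract_links(
    url="https://example.com/docs/",
    topology="mesh",
    scope="path",
    max_pages=25
)
```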
```python
async def extract_content(url: str, click_selectors: list[str] = None, screenshot: bool = False) -> dict
```

Returns:

```python
{
    "markdown": str,
    "screenshot": str | None,
    "metadata": {
        "title": str,
        "url": str,
        "type": str  # html, pdf, json
    }
}
```
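Since `metadata["type"]` distinguishes HTML, PDF, and JSON responses, callers can branch on it before further processing. A minimal sketch, assuming only the return shape above; the URL is a placeholder:

```python
result = await tool.extract_content("https://example.com/report")

# Branch on the detected content type before further processing
doc_type = result["metadata"]["type"]
if doc_type == "pdf":
    print("PDF detected:", result["metadata"]["url"])
elif doc_type == "json":
    print("JSON payload:", result["markdown"])
else:
    print(result["markdown"][:500])
```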
## Features

- Async/await for concurrent operations
- Playwright stealth mode
- Automatic HTML cleaning
- Markdown conversion
- Sitemap parsing
- Robots.txt compliance
- Keyword-based discovery
- Multiple crawl topologies
- Scope control (domain/subdomain/path)
- Screenshot support
- Dynamic content handling
- PDF/JSON detection
## License
MIT License - See LICENSE
## Contributing
Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Add tests for new features
4. Submit a pull request
## Support
For issues and questions, please open a GitHub issue.