
Web Crawler

An advanced async web crawler toolkit with powerful scraping capabilities and AI agent integration.

🚀 Features

  • Async/Await Architecture - Built with asyncio for high-performance concurrent crawling
  • Comprehensive Tools - Inspect, discover, extract, and crawl websites
  • Stealth Mode - Built-in stealth capabilities to avoid detection
  • Clean Content - Automatic HTML cleaning and markdown conversion
  • AI Agent Ready - Easy integration with LangChain and CrewAI
  • Flexible Scoping - Control crawl scope (domain, subdomain, path)
  • Smart Discovery - Keyword-based link discovery with relevance scoring

📦 Installation

# Clone the repository
git clone <repository-url>
cd crawler

# Install dependencies with uv
uv sync

# Install Playwright browsers
playwright install chromium

🔧 Core Tools

inspect_site(url)

Analyze website structure, metadata, navigation, and sitemap.

result = await tool.inspect_site("https://example.com")
print(result['metadata']['title'])
print(result['sitemap_summary']['total_urls'])

discover_links(url, keywords, scope="domain")

Find links matching the given keywords, ranked by a relevance score.

links = await tool.discover_links(
    url="https://example.com",
    keywords=["documentation", "api", "tutorial"]
)
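
Each result carries a relevance score, so a natural follow-up is to rank the hits and keep only the strongest ones. A minimal sketch, assuming the list-of-dicts return shape documented in the API reference below:

# Rank discovered links by score and keep the top ten.
top_links = sorted(links, key=lambda link: link["score"], reverse=True)[:10]
for link in top_links:
    print(f"{link['score']:>3}  {link['url']}  (matched: {', '.join(link['matches'])})")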

extract_links(url, topology="mesh", scope="subdomain", max_pages=50)

Crawl a website to discover its internal pages (up to max_pages).

urls = await tool.extract_links(
    url="https://example.com",
    topology="mesh",  # BFS crawling
    max_pages=100
)

extract_content(url, click_selectors=None, screenshot=False)

Extract clean markdown content from any webpage.

content = await tool.extract_content("https://example.com")
print(content['markdown'])
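
The tools compose naturally: extract_links can feed extract_content to turn a whole section of a site into markdown. A hedged sketch combining the calls shown above (the function name site_to_markdown is illustrative, not part of the package):

import asyncio
import tool

async def site_to_markdown(start_url: str, max_pages: int = 10) -> dict[str, str]:
    # Discover internal pages, then pull clean markdown for each one.
    urls = await tool.extract_links(url=start_url, max_pages=max_pages)
    pages = await asyncio.gather(*[tool.extract_content(u) for u in urls])
    return {page["metadata"]["url"]: page["markdown"] for page in pages}

docs = asyncio.run(site_to_markdown("https://example.com"))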

🎯 Quick Start

import asyncio
import tool

async def main():
    # Extract content from a page
    result = await tool.extract_content("https://example.com")
    print(f"Title: {result['metadata']['title']}")
    print(f"Content:\n{result['markdown'][:500]}")

asyncio.run(main())

🚀 Concurrent Crawling

import asyncio
import tool

async def crawl_multiple():
    urls = [
        "https://example.com",
        "https://example.org",
        "https://example.net"
    ]
    
    # Fetch all concurrently for maximum speed
    results = await asyncio.gather(*[
        tool.extract_content(url) for url in urls
    ])
    
    return results

results = asyncio.run(crawl_multiple())
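
An unbounded gather can overwhelm a target host or trip rate limits. One common refinement, sketched here with asyncio.Semaphore (the limit of 5 is an arbitrary illustration), caps how many fetches run at once:

import asyncio
import tool

async def crawl_bounded(urls: list[str], limit: int = 5) -> list[dict]:
    # The semaphore caps in-flight requests so the crawl stays polite.
    semaphore = asyncio.Semaphore(limit)

    async def fetch(url: str) -> dict:
        async with semaphore:
            return await tool.extract_content(url)

    return await asyncio.gather(*[fetch(url) for url in urls])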

🤖 AI Agent Integration

LangChain

from langchain_core.tools import Tool
import asyncio
import tool

langchain_tools = [
    Tool(
        name="extract_content",
        description="Extract content from a webpage",
        func=lambda url: asyncio.run(tool.extract_content(url))
    )
]

# Use with any LangChain agent

See examples/langchain_agent.py for a complete example.
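
Note that the lambda above calls asyncio.run(), which raises an error if an event loop is already running (as it is inside async agent executors). langchain_core's Tool also accepts a coroutine argument for native async use; a hedged variant:

from langchain_core.tools import Tool
import tool

# Passing the coroutine lets async executors await the tool directly,
# avoiding a nested asyncio.run() inside a running event loop.
extract_tool = Tool(
    name="extract_content",
    description="Extract content from a webpage",
    func=None,  # no sync implementation; use the async path
    coroutine=tool.extract_content,
)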

CrewAI

from crewai.tools import BaseTool
import asyncio
import tool

class CrawlerTool(BaseTool):
    name: str = "Web Crawler"
    description: str = "Crawls and extracts web content"
    
    def _run(self, url: str) -> str:
        result = asyncio.run(tool.extract_content(url))
        return result['markdown']

# Use with CrewAI agents

See examples/crewai_agents.py for a complete multi-agent example.
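
Wiring the tool into an agent then follows CrewAI's standard pattern; a brief sketch (the role/goal/backstory strings are illustrative only):

from crewai import Agent

researcher = Agent(
    role="Web Researcher",
    goal="Summarize documentation pages",
    backstory="Reads web pages and reports what they contain.",
    tools=[CrawlerTool()],
)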

๐Ÿ“ Project Structure

crawler/
├── tool.py              # Main crawler tools
├── util.py              # Utility functions
├── main.py              # Simple example
├── tests/               # Test suite
│   ├── test_tool.py
│   ├── test_util.py
│   └── test_async.py
├── examples/            # Usage examples
│   ├── basic_usage.py
│   ├── concurrent_crawling.py
│   ├── langchain_agent.py
│   └── crewai_agents.py
└── pyproject.toml       # Dependencies

🧪 Testing

# Run all tests
make test
# or
pytest tests/ -v

# Run specific test file
pytest tests/test_tool.py -v
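
New async features should come with async tests. A minimal sketch of the pattern, assuming the suite uses the pytest-asyncio plugin (not confirmed by this README) and with example.com standing in for a real fixture:

import pytest
import tool

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_extract_content_returns_markdown():
    result = await tool.extract_content("https://example.com")
    assert "markdown" in result
    assert result["metadata"]["url"].startswith("https://")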

📚 Examples

All examples are in the examples/ directory (listed under Project Structure above).

Run any example with:

python examples/basic_usage.py

๐Ÿ› ๏ธ Development

# Run main demo
make run

# Run tests
make test

# Format code
black .

# Type checking
mypy tool.py util.py

๐Ÿ“ API Reference

Tool Functions

async def inspect_site(url: str) -> dict

Returns:

{
    "metadata": {
        "title": str,
        "description": str,
        "keywords": str
    },
    "navigation": {
        "header": [{"text": str, "url": str}],
        "nav": [...],
        "footer": [...]
    },
    "sitemap_summary": {
        "total_urls": int,
        "structure_hint": dict
    }
}
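
Given that shape, the navigation sections can be walked uniformly; a small illustration (awaited inside an async context, as in the earlier snippets):

result = await tool.inspect_site("https://example.com")
for section in ("header", "nav", "footer"):
    for link in result["navigation"][section]:
        print(f"[{section}] {link['text']} -> {link['url']}")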

async def discover_links(url: str, keywords: list[str], scope: str = "domain") -> list[dict]

Returns:

[
    {
        "url": str,
        "text": str,
        "score": int,
        "matches": [str]
    }
]

async def extract_links(url: str, topology: str = "mesh", scope: str = "subdomain", max_pages: int = 50) -> list[str]

Parameters:

  • topology: "mesh" (BFS), "linear", "hub_and_spoke", "sidebar"
  • scope: "subdomain", "domain", "path"

Returns: List of discovered URLs
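
For example, scope="path" presumably confines the crawl to the starting URL's path prefix, which suits documentation trees; an illustrative call:

doc_urls = await tool.extract_links(
    url="https://example.com/docs/",
    topology="sidebar",  # follow sidebar-style navigation
    scope="path",        # stay under /docs/
    max_pages=50
)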

async def extract_content(url: str, click_selectors: list[str] | None = None, screenshot: bool = False) -> dict

Returns:

{
    "markdown": str,
    "screenshot": str | None,
    "metadata": {
        "title": str,
        "url": str,
        "type": str  # html, pdf, json
    }
}
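
Since metadata["type"] distinguishes html, pdf, and json, callers can branch on it; a short sketch, assuming non-HTML sources are still delivered through the markdown field as the schema above suggests:

content = await tool.extract_content("https://example.com/report.pdf")
if content["metadata"]["type"] == "pdf":
    # Persist the extracted text; the output path is illustrative.
    with open("report.md", "w") as f:
        f.write(content["markdown"])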

🔒 Features

  • ✅ Async/await for concurrent operations
  • ✅ Playwright stealth mode
  • ✅ Automatic HTML cleaning
  • ✅ Markdown conversion
  • ✅ Sitemap parsing
  • ✅ Robots.txt compliance
  • ✅ Keyword-based discovery
  • ✅ Multiple crawl topologies
  • ✅ Scope control (domain/subdomain/path)
  • ✅ Screenshot support
  • ✅ Dynamic content handling
  • ✅ PDF/JSON detection

📄 License

MIT License - See LICENSE

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

📧 Support

For issues and questions, please open a GitHub issue.
