
Web Crawler

An advanced async web crawler toolkit with powerful scraping capabilities and AI agent integration.

🚀 Features

  • Async/Await Architecture - Built with asyncio for high-performance concurrent crawling
  • Comprehensive Tools - Inspect, discover, extract, and crawl websites
  • Stealth Mode - Built-in stealth capabilities to avoid detection
  • Clean Content - Automatic HTML cleaning and markdown conversion
  • AI Agent Ready - Easy integration with LangChain and CrewAI
  • Flexible Scoping - Control crawl scope (domain, subdomain, path)
  • Smart Discovery - Keyword-based link discovery with relevance scoring

📦 Installation

# Clone the repository
git clone <repository-url>
cd crawler

# Install dependencies with uv
uv sync

# Install Playwright browsers
playwright install chromium

🔧 Core Tools

inspect_site(url)

Analyze website structure, metadata, navigation, and sitemap.

result = await tool.inspect_site("https://example.com")
print(result['metadata']['title'])
print(result['sitemap_summary']['total_urls'])

discover_links(url, keywords, scope="domain")

Find links matching the given keywords, ranked by a relevance score.

links = await tool.discover_links(
    url="https://example.com",
    keywords=["documentation", "api", "tutorial"]
)
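
Each result carries a relevance score, so a natural follow-up is to rank the hits and keep only the strongest ones. A minimal sketch, assuming the list-of-dicts return shape documented in the API reference below:

# Rank discovered links by score and keep the top ten.
top_links = sorted(links, key=lambda link: link["score"], reverse=True)[:10]
for link in top_links:
    print(f"{link['score']:>3}  {link['url']}  (matched: {', '.join(link['matches'])})")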

extract_links(url, topology="mesh", scope="subdomain", max_pages=50)

Crawl a website to discover its internal pages (up to max_pages).

urls = await tool.extract_links(
    url="https://example.com",
    topology="mesh",  # BFS crawling
    max_pages=100
)

extract_content(url, click_selectors=None, screenshot=False)

Extract clean markdown content from any webpage.

content = await tool.extract_content("https://example.com")
print(content['markdown'])
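
The tools compose naturally: extract_links can feed extract_content to turn a whole section of a site into markdown. A hedged sketch combining the calls shown above (the function name site_to_markdown is illustrative, not part of the package):

import asyncio
import tool

async def site_to_markdown(start_url: str, max_pages: int = 10) -> dict[str, str]:
    # Discover internal pages, then pull clean markdown for each one.
    urls = await tool.extract_links(url=start_url, max_pages=max_pages)
    pages = await asyncio.gather(*[tool.extract_content(u) for u in urls])
    return {page["metadata"]["url"]: page["markdown"] for page in pages}

docs = asyncio.run(site_to_markdown("https://example.com"))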

🎯 Quick Start

import asyncio
import tool

async def main():
    # Extract content from a page
    result = await tool.extract_content("https://example.com")
    print(f"Title: {result['metadata']['title']}")
    print(f"Content:\n{result['markdown'][:500]}")

asyncio.run(main())

🚀 Concurrent Crawling

import asyncio
import tool

async def crawl_multiple():
    urls = [
        "https://example.com",
        "https://example.org",
        "https://example.net"
    ]
    
    # Fetch all concurrently for maximum speed
    results = await asyncio.gather(*[
        tool.extract_content(url) for url in urls
    ])
    
    return results

results = asyncio.run(crawl_multiple())
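
An unbounded gather can overwhelm a target host or trip rate limits. One common refinement, sketched here with asyncio.Semaphore (the limit of 5 is an arbitrary illustration), caps how many fetches run at once:

import asyncio
import tool

async def crawl_bounded(urls: list[str], limit: int = 5) -> list[dict]:
    # The semaphore caps in-flight requests so the crawl stays polite.
    semaphore = asyncio.Semaphore(limit)

    async def fetch(url: str) -> dict:
        async with semaphore:
            return await tool.extract_content(url)

    return await asyncio.gather(*[fetch(url) for url in urls])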

🤖 AI Agent Integration

LangChain

from langchain_core.tools import Tool
import asyncio
import tool

langchain_tools = [
    Tool(
        name="extract_content",
        description="Extract content from a webpage",
        func=lambda url: asyncio.run(tool.extract_content(url))
    )
]

# Use with any LangChain agent

See examples/langchain_agent.py for a complete example.
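
Note that the lambda above calls asyncio.run(), which raises an error if an event loop is already running (as it is inside async agent executors). langchain_core's Tool also accepts a coroutine argument for native async use; a hedged variant:

from langchain_core.tools import Tool
import tool

# Passing the coroutine lets async executors await the tool directly,
# avoiding a nested asyncio.run() inside a running event loop.
extract_tool = Tool(
    name="extract_content",
    description="Extract content from a webpage",
    func=None,  # no sync implementation; use the async path
    coroutine=tool.extract_content,
)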

CrewAI

from crewai.tools import BaseTool
import asyncio
import tool

class CrawlerTool(BaseTool):
    name: str = "Web Crawler"
    description: str = "Crawls and extracts web content"
    
    def _run(self, url: str) -> str:
        result = asyncio.run(tool.extract_content(url))
        return result['markdown']

# Use with CrewAI agents

See examples/crewai_agents.py for a complete multi-agent example.
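
Wiring the tool into an agent then follows CrewAI's standard pattern; a brief sketch (the role/goal/backstory strings are illustrative only):

from crewai import Agent

researcher = Agent(
    role="Web Researcher",
    goal="Summarize documentation pages",
    backstory="Reads web pages and reports what they contain.",
    tools=[CrawlerTool()],
)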

๐Ÿ“ Project Structure

crawler/
├── tool.py              # Main crawler tools
├── util.py              # Utility functions
├── main.py              # Simple example
├── tests/               # Test suite
│   ├── test_tool.py
│   ├── test_util.py
│   └── test_async.py
├── examples/            # Usage examples
│   ├── basic_usage.py
│   ├── concurrent_crawling.py
│   ├── langchain_agent.py
│   └── crewai_agents.py
└── pyproject.toml       # Dependencies

🧪 Testing

# Run all tests
make test
# or
pytest tests/ -v

# Run specific test file
pytest tests/test_tool.py -v
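
New async features should come with async tests. A minimal sketch of the pattern, assuming the suite uses the pytest-asyncio plugin (not confirmed by this README) and with example.com standing in for a real fixture:

import pytest
import tool

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_extract_content_returns_markdown():
    result = await tool.extract_content("https://example.com")
    assert "markdown" in result
    assert result["metadata"]["url"].startswith("https://")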

📚 Examples

All examples are in the examples/ directory (listed under Project Structure above).

Run any example with:

python examples/basic_usage.py

๐Ÿ› ๏ธ Development

# Run main demo
make run

# Run tests
make test

# Format code
black .

# Type checking
mypy tool.py util.py

๐Ÿ“ API Reference

Tool Functions

async def inspect_site(url: str) -> dict

Returns:

{
    "metadata": {
        "title": str,
        "description": str,
        "keywords": str
    },
    "navigation": {
        "header": [{"text": str, "url": str}],
        "nav": [...],
        "footer": [...]
    },
    "sitemap_summary": {
        "total_urls": int,
        "structure_hint": dict
    }
}
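
Given that shape, the navigation sections can be walked uniformly; a small illustration (awaited inside an async context, as in the earlier snippets):

result = await tool.inspect_site("https://example.com")
for section in ("header", "nav", "footer"):
    for link in result["navigation"][section]:
        print(f"[{section}] {link['text']} -> {link['url']}")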

async def discover_links(url: str, keywords: list[str], scope: str = "domain") -> list[dict]

Returns:

[
    {
        "url": str,
        "text": str,
        "score": int,
        "matches": [str]
    }
]

async def extract_links(url: str, topology: str = "mesh", scope: str = "subdomain", max_pages: int = 50) -> list[str]

Parameters:

  • topology: "mesh" (BFS), "linear", "hub_and_spoke", "sidebar"
  • scope: "subdomain", "domain", "path"

Returns: List of discovered URLs
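
For example, scope="path" presumably confines the crawl to the starting URL's path prefix, which suits documentation trees; an illustrative call:

doc_urls = await tool.extract_links(
    url="https://example.com/docs/",
    topology="sidebar",  # follow sidebar-style navigation
    scope="path",        # stay under /docs/
    max_pages=50
)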

async def extract_content(url: str, click_selectors: list[str] | None = None, screenshot: bool = False) -> dict

Returns:

{
    "markdown": str,
    "screenshot": str | None,
    "metadata": {
        "title": str,
        "url": str,
        "type": str  # html, pdf, json
    }
}
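
Since metadata["type"] distinguishes html, pdf, and json, callers can branch on it; a short sketch, assuming non-HTML sources are still delivered through the markdown field as the schema above suggests:

content = await tool.extract_content("https://example.com/report.pdf")
if content["metadata"]["type"] == "pdf":
    # Persist the extracted text; the output path is illustrative.
    with open("report.md", "w") as f:
        f.write(content["markdown"])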

🔒 Features

  • ✅ Async/await for concurrent operations
  • ✅ Playwright stealth mode
  • ✅ Automatic HTML cleaning
  • ✅ Markdown conversion
  • ✅ Sitemap parsing
  • ✅ Robots.txt compliance
  • ✅ Keyword-based discovery
  • ✅ Multiple crawl topologies
  • ✅ Scope control (domain/subdomain/path)
  • ✅ Screenshot support
  • ✅ Dynamic content handling
  • ✅ PDF/JSON detection

📄 License

MIT License - See LICENSE

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

📧 Support

For issues and questions, please open a GitHub issue.
