
Modern Python library for browser automation and intelligent content extraction with full JavaScript execution support

Project description

🕷️ Crawailer

The web scraper that doesn't suck at JavaScript ✨

Stop fighting modern websites. While requests gives you empty <div id="root"></div>, Crawailer actually executes JavaScript and extracts real content from React, Vue, and Angular apps. Finally, web scraping that works in 2025.

⚡ Claude Code's new best friend - Your AI assistant can now access ANY website

pip install crawailer


✨ Why Developers Choose Crawailer

🔥 JavaScript That Actually Works
While other tools time out or crash, Crawailer executes real JavaScript like a human browser

⚡ Stupidly Fast
5-10x faster than BeautifulSoup with C-based parsing that doesn't make you wait

🤖 AI Assistant Ready
Perfect markdown output that your Claude/GPT/local model will love

🎯 Zero Learning Curve
pip install → works immediately → no 47-page configuration guides

🧪 Production Battle-Tested
18 comprehensive test suites covering every edge case we could think of

🎨 Actually Enjoyable
Rich terminal output, helpful errors, progress bars that don't lie

🚀 Quick Start

(Honestly, you probably don't need to read these examples - just ask your AI assistant to figure it out. That's what models are for! But here they are anyway...)

🎬 See It In Action

Basic Usage Demo - Crawailer vs requests:

# View the demo locally
asciinema play demos/basic-usage.cast

Claude Code Integration - Give your AI web superpowers:

# View the Claude integration demo  
asciinema play demos/claude-integration.cast

Don't have asciinema? pip install asciinema or run the demos yourself:

# Clone the repo and run demos interactively
git clone https://git.supported.systems/MCP/crawailer.git
cd crawailer
python demo_basic_usage.py
python demo_claude_integration.py

import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# JavaScript execution for dynamic content
content = await web.get(
    "https://spa-app.com",
    script="document.querySelector('.dynamic-price').textContent"
)
print(f"Price: {content.script_result}")

# Batch processing with JavaScript
results = await web.get_many(
    ["url1", "url2", "url3"],
    script="document.title + ' | ' + document.querySelector('.description')?.textContent"
)
for result in results:
    print(f"{result.title}: {result.script_result}")

# Smart discovery with interaction
research = await web.discover(
    "AI safety papers", 
    script="document.querySelector('.show-more')?.click()",
    max_pages=10
)
# Returns the most relevant content with enhanced extraction

# Compare: Traditional scraping fails on modern sites
# requests.get("https://react-app.com") → Empty <div id="root"></div>
# Crawailer → Full content + dynamic data
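Note that the snippets above use top-level await, which only works in an async-aware REPL; in a plain script you wrap the calls in an async function and drive it with asyncio. A minimal sketch of that pattern, using a hypothetical `fake_get` stand-in for `web.get` so the example is self-contained:

```python
import asyncio

# Stand-in for crawailer's coroutine API; in real use you would
# `import crawailer as web` and await web.get / web.get_many instead.
async def fake_get(url: str) -> str:
    await asyncio.sleep(0)  # placeholder for the browser + network work
    return f"# Content from {url}"

async def main() -> list:
    urls = ["https://example.com/a", "https://example.com/b"]
    # get_many-style concurrency: fetch all URLs at once with gather,
    # which preserves the input order in its result list
    return await asyncio.gather(*(fake_get(u) for u in urls))

results = asyncio.run(main())
print(results)
```

The same `asyncio.run(main())` wrapper applies to every `await web.get(...)` example in this README.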

🧠 Claude Code MCP Integration

"Hey Claude, go grab that data from the React app" ← This actually works now

# Add to your Claude Code MCP server
import crawailer as web
from crawailer.mcp import create_mcp_server

@mcp_tool("web_extract")
async def extract_content(url: str, script: str = ""):
    """Extract content from any website with optional JavaScript execution"""
    content = await web.get(url, script=script)
    return {
        "title": content.title,
        "markdown": content.markdown,
        "script_result": content.script_result,
        "word_count": content.word_count
    }

# 🎉 No more "I can't access that site"
# 🎉 No more copy-pasting content manually
# 🎉 Your AI can now browse the web like a human

🎯 Design Philosophy

For Robots, By Humans

  • Predictive: Anticipates what you need and provides it
  • Forgiving: Handles errors gracefully with helpful suggestions
  • Efficient: Fast by default, with smart caching and concurrency
  • Composable: Small, focused functions that work well together

Perfect for AI Workflows

  • LLM-Optimized: Clean markdown, structured data, semantic chunking
  • Context-Aware: Extracts relationships and metadata automatically
  • Quality-Focused: Built-in content quality assessment
  • Archive-Ready: Designed for long-term storage and retrieval

📖 Use Cases

🤖 AI Agents & LLM Applications

Problem: Training data scattered across JavaScript-heavy academic sites

# Research assistant workflow with JavaScript interaction
research = await web.discover(
    "quantum computing breakthroughs",
    script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent"
)
for paper in research:
    # Rich content includes JavaScript-extracted data
    summary = await llm.summarize(paper.markdown)
    dynamic_content = paper.script_result  # JavaScript execution result
    insights = await llm.extract_insights(paper.content + dynamic_content)

🛒 E-commerce Price Monitoring

Problem: Product prices loaded via AJAX, requests sees loading spinners

# Monitor competitor pricing with dynamic content
products = await web.get_many(
    competitor_urls,
    script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}"
)
for product in products:
    if product.script_result['price'] != cached_price:
        await alert_price_change(product.url, product.script_result)

🔗 MCP Servers

Problem: Claude needs reliable web content extraction tools

# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools

📊 Social Media & Content Analysis

Problem: Posts and comments load infinitely via JavaScript

# Extract social media discussions with infinite scroll
content = await web.get(
    "https://social-platform.com/topic/ai-safety",
    script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length"
)
# Gets full thread content, not just initial page load

🛠️ Installation

# Basic installation
pip install crawailer

# With MCP server capabilities  
pip install crawailer[mcp]

# Everything
pip install crawailer[all]

# Post-install setup (installs Playwright browsers)
crawailer setup

๐Ÿ—๏ธ Architecture

Crawailer is built on modern, focused libraries:

  • ๐ŸŽญ Playwright: Reliable browser automation
  • โšก selectolax: 5-10x faster HTML parsing (C-based)
  • ๐Ÿ“ markdownify: Clean HTMLโ†’Markdown conversion
  • ๐Ÿงน justext: Intelligent content extraction and cleaning
  • ๐Ÿ”„ httpx: Modern async HTTP client
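As a rough, stdlib-only illustration of what the parse-and-clean stage does (the real pipeline uses selectolax and justext, which are far faster and smarter about boilerplate), the core idea is: walk the HTML, keep visible text, and drop script/style noise:

```python
# Simplified sketch of content extraction using only the standard library;
# this is an illustration of the concept, not Crawailer's implementation.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = "<h1>Title</h1><script>var x=1;</script><p>Body text</p>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # → Title Body text
```

selectolax does this with a C parser and CSS selectors; markdownify then turns the kept markup into clean Markdown.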

🧪 Battle-Tested Quality

Crawailer includes 18 comprehensive test suites with real-world scenarios:

  • Modern Frameworks: React, Vue, Angular demos with full JavaScript APIs
  • Mobile Compatibility: Safari iOS, Chrome Android, responsive designs
  • Production Edge Cases: Network failures, memory pressure, browser differences
  • Performance Testing: Stress tests, concurrency, resource management

Want to contribute? We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution.

๐Ÿ“ Future TODO: Move examples to dedicated repository for community contributions

🤝 Perfect for MCP Projects

MCP servers love Crawailer because it provides:

  • Focused tools: Each function does one thing well
  • Rich outputs: Structured data ready for LLM consumption
  • Smart defaults: Works out of the box with minimal configuration
  • Extensible: Easy to add domain-specific extraction logic

# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }

🥊 Crawailer vs Traditional Tools

| Challenge | requests & HTTP libs | Selenium | Crawailer |
|---|---|---|---|
| React/Vue/Angular | ❌ Empty templates | 🟡 Slow, complex setup | ✅ Just works |
| Dynamic Pricing | ❌ Shows loading spinner | 🟡 Requires waits/timeouts | ✅ Intelligent waiting |
| JavaScript APIs | ❌ No access | 🟡 Clunky WebDriver calls | ✅ Native page.evaluate() |
| Speed | 🟢 100-500ms | ❌ 5-15 seconds | ✅ 2-5 seconds |
| Memory | 🟢 1-5MB | ❌ 200-500MB | 🟡 100-200MB |
| AI-Ready Output | ❌ Raw HTML | ❌ Raw HTML | ✅ Clean Markdown |
| Developer Experience | 🟡 Manual parsing | ❌ Complex WebDriver | ✅ Intuitive API |

The bottom line: When JavaScript matters, Crawailer delivers. When it doesn't, use requests.

📖 See complete tool comparison → (includes Scrapy, Playwright, BeautifulSoup, and more)

🎉 What Makes It Delightful

JavaScript-Powered Intelligence

# Dynamic content extraction from SPAs
content = await web.get(
    "https://react-app.com",
    script="window.testData?.framework + ' v' + window.React?.version"
)
# Automatically detects: React application with version info
# Extracts: Dynamic content + framework details

# E-commerce with JavaScript-loaded prices
product = await web.get(
    "https://shop.com/product",
    script="document.querySelector('.dynamic-price')?.textContent",
    wait_for=".price-loaded"
) 
# Recognizes product page with dynamic pricing
# Extracts: Real-time price, reviews, availability, specs

Beautiful Output

✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment

Helpful Errors

try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    ...  # "💡 Try: await web.get(url, stealth=True)"
except web.PaywallDetected as e:
    ...  # "🔐 Found archived version: {e.archive_url}"

📚 Documentation

🤝 Contributing

We love contributions! Crawailer is designed to be:

  • Easy to extend: Add new content extractors and browser capabilities
  • Well-tested: Comprehensive test suite with real websites
  • Documented: Every feature has examples and use cases

See CONTRIBUTING.md for details.

📄 License

MIT License - see LICENSE for details.


🚀 Ready to Stop Losing Your Mind?

pip install crawailer
crawailer setup  # Install browser engines

Life's too short for empty <div> tags and "JavaScript required" messages.

Get content that actually exists. From websites that actually work.

⭐ Star us if this saves your sanity → git.supported.systems/MCP/crawailer


Built with ❤️ for the age of AI agents and automation

Crawailer: Because robots deserve delightful web experiences too 🤖✨

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crawailer-0.1.2.tar.gz (264.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crawailer-0.1.2-py3-none-any.whl (28.4 kB view details)

Uploaded Python 3

File details

Details for the file crawailer-0.1.2.tar.gz.

File metadata

  • Download URL: crawailer-0.1.2.tar.gz
  • Upload date:
  • Size: 264.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crawailer-0.1.2.tar.gz
Algorithm Hash digest
SHA256 876b5952a0a163f2610bd839bd0537411fa0142200a909274445799b541bf1ea
MD5 a45eaa0341b1f1acaae37b99df9230f4
BLAKE2b-256 240f40d9e8323548612bbf523b4d0e8abb841dce369a609268253c1b981ee7a7

See more details on using hashes here.

File details

Details for the file crawailer-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: crawailer-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 28.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crawailer-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 9417042864571d7796dac7aba93b683234a666b9800c5f29c46aad24bd17610d
MD5 7123e6e982281d44bb095dcb371258a6
BLAKE2b-256 4d95912c7dad41a26a369b78e06056bd031a8848cd5f6391b763806dbb5dd88f

See more details on using hashes here.
