
Modern Python library for browser automation and intelligent content extraction with full JavaScript execution support

Project description

๐Ÿ•ท๏ธ Crawailer

The web scraper that doesn't suck at JavaScript โœจ

Stop fighting modern websites. While requests gives you empty <div id="root"></div>, Crawailer actually executes JavaScript and extracts real content from React, Vue, and Angular apps. Finally, web scraping that works in 2025.

โšก Claude Code's new best friend - Your AI assistant can now access ANY website

pip install crawailer


✨ Why Developers Choose Crawailer

🔥 JavaScript That Actually Works
While other tools time out or crash, Crawailer executes real JavaScript like a human browser

⚡ Stupidly Fast
5-10x faster than BeautifulSoup, with C-based parsing that doesn't make you wait

🤖 AI Assistant Ready
Perfect markdown output that your Claude/GPT/local model will love

🎯 Zero Learning Curve
pip install → works immediately → no 47-page configuration guides

🧪 Production Battle-Tested
18 comprehensive test suites covering every edge case we could think of

🎨 Actually Enjoyable
Rich terminal output, helpful errors, progress bars that don't lie

🚀 Quick Start

(Honestly, you probably don't need to read these examples - just ask your AI assistant to figure it out. That's what models are for! But here they are anyway...)

🎬 See It In Action

Basic Usage Demo - Crawailer vs requests:

# View the demo locally
asciinema play demos/basic-usage.cast

Claude Code Integration - Give your AI web superpowers:

# View the Claude integration demo  
asciinema play demos/claude-integration.cast

Don't have asciinema? pip install asciinema or run the demos yourself:

# Clone the repo and run demos interactively
git clone https://git.supported.systems/MCP/crawailer.git
cd crawailer
python demo_basic_usage.py
python demo_claude_integration.py
import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# JavaScript execution for dynamic content
content = await web.get(
    "https://spa-app.com",
    script="document.querySelector('.dynamic-price').textContent"
)
print(f"Price: {content.script_result}")

# Batch processing with JavaScript
results = await web.get_many(
    ["url1", "url2", "url3"],
    script="document.title + ' | ' + document.querySelector('.description')?.textContent"
)
for result in results:
    print(f"{result.title}: {result.script_result}")

# Smart discovery with interaction
research = await web.discover(
    "AI safety papers", 
    script="document.querySelector('.show-more')?.click()",
    max_pages=10
)
# Returns the most relevant content with enhanced extraction

# Compare: Traditional scraping fails on modern sites
# requests.get("https://react-app.com") → Empty <div id="root"></div>
# Crawailer → Full content + dynamic data

🧠 Claude Code MCP Integration

"Hey Claude, go grab that data from the React app" ← This actually works now

# Add to your Claude Code MCP server
import crawailer as web
from crawailer.mcp import create_mcp_server

@mcp_tool("web_extract")
async def extract_content(url: str, script: str = ""):
    """Extract content from any website, with optional JavaScript execution."""
    content = await web.get(url, script=script)
    return {
        "title": content.title,
        "markdown": content.markdown,
        "script_result": content.script_result,
        "word_count": content.word_count,
    }

# 🎉 No more "I can't access that site"
# 🎉 No more copy-pasting content manually
# 🎉 Your AI can now browse the web like a human

🎯 Design Philosophy

For Robots, By Humans

  • Predictive: Anticipates what you need and provides it
  • Forgiving: Handles errors gracefully with helpful suggestions
  • Efficient: Fast by default, with smart caching and concurrency
  • Composable: Small, focused functions that work well together

Perfect for AI Workflows

  • LLM-Optimized: Clean markdown, structured data, semantic chunking
  • Context-Aware: Extracts relationships and metadata automatically
  • Quality-Focused: Built-in content quality assessment
  • Archive-Ready: Designed for long-term storage and retrieval

📖 Use Cases

🤖 AI Agents & LLM Applications

Problem: Training data scattered across JavaScript-heavy academic sites

# Research assistant workflow with JavaScript interaction
research = await web.discover(
    "quantum computing breakthroughs",
    script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent"
)
for paper in research:
    # Rich content includes JavaScript-extracted data
    summary = await llm.summarize(paper.markdown)
    dynamic_content = paper.script_result  # JavaScript execution result
    insights = await llm.extract_insights(paper.content + dynamic_content)

🛒 E-commerce Price Monitoring

Problem: Product prices are loaded via AJAX; requests sees only loading spinners

# Monitor competitor pricing with dynamic content
products = await web.get_many(
    competitor_urls,
    script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}"
)
for product in products:
    if product.script_result['price'] != cached_price:
        await alert_price_change(product.url, product.script_result)

🔗 MCP Servers

Problem: Claude needs reliable web content extraction tools

# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools

📊 Social Media & Content Analysis

Problem: Posts and comments load infinitely via JavaScript

# Extract social media discussions with infinite scroll
content = await web.get(
    "https://social-platform.com/topic/ai-safety",
    script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length"
)
# Gets full thread content, not just initial page load

๐Ÿ› ๏ธ Installation

# Basic installation
pip install crawailer

# With AI features (semantic search, entity extraction)
pip install crawailer[ai]

# With MCP server capabilities  
pip install crawailer[mcp]

# Everything
pip install crawailer[all]

# Post-install setup (installs Playwright browsers)
crawailer setup

๐Ÿ—๏ธ Architecture

Crawailer is built on modern, focused libraries:

  • ๐ŸŽญ Playwright: Reliable browser automation
  • โšก selectolax: 5-10x faster HTML parsing (C-based)
  • ๐Ÿ“ markdownify: Clean HTMLโ†’Markdown conversion
  • ๐Ÿงน justext: Intelligent content extraction and cleaning
  • ๐Ÿ”„ httpx: Modern async HTTP client
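To give a feel for the parse-then-extract pipeline these libraries implement, here is a toy stdlib-only sketch (not Crawailer's actual internals, and nowhere near selectolax's speed): pull the title, skip script/style noise, and keep the visible text.

```python
from html.parser import HTMLParser

class TitleAndText(HTMLParser):
    """Toy extractor: title + visible text, ignoring <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text_parts.append(data.strip())

p = TitleAndText()
p.feed("<html><head><title>Hi</title><script>var x=1;</script></head>"
       "<body><p>Hello world</p></body></html>")
# p.title is "Hi"; p.text_parts is ["Hello world"]
```

Crawailer layers justext-style cleaning and markdownify conversion on top of this idea, after Playwright has rendered the page.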

🧪 Battle-Tested Quality

Crawailer includes 18 comprehensive test suites with real-world scenarios:

  • Modern Frameworks: React, Vue, Angular demos with full JavaScript APIs
  • Mobile Compatibility: Safari iOS, Chrome Android, responsive designs
  • Production Edge Cases: Network failures, memory pressure, browser differences
  • Performance Testing: Stress tests, concurrency, resource management

Want to contribute? We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution.

๐Ÿ“ Future TODO: Move examples to dedicated repository for community contributions

๐Ÿค Perfect for MCP Projects

MCP servers love Crawailer because it provides:

  • Focused tools: Each function does one thing well
  • Rich outputs: Structured data ready for LLM consumption
  • Smart defaults: Works out of the box with minimal configuration
  • Extensible: Easy to add domain-specific extraction logic
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }

🥊 Crawailer vs Traditional Tools

| Challenge | requests & HTTP libs | Selenium | Crawailer |
|---|---|---|---|
| React/Vue/Angular | ❌ Empty templates | 🟡 Slow, complex setup | ✅ Just works |
| Dynamic Pricing | ❌ Shows loading spinner | 🟡 Requires waits/timeouts | ✅ Intelligent waiting |
| JavaScript APIs | ❌ No access | 🟡 Clunky WebDriver calls | ✅ Native page.evaluate() |
| Speed | 🟢 100-500ms | ❌ 5-15 seconds | ✅ 2-5 seconds |
| Memory | 🟢 1-5MB | ❌ 200-500MB | 🟡 100-200MB |
| AI-Ready Output | ❌ Raw HTML | ❌ Raw HTML | ✅ Clean Markdown |
| Developer Experience | 🟡 Manual parsing | ❌ Complex WebDriver | ✅ Intuitive API |

The bottom line: When JavaScript matters, Crawailer delivers. When it doesn't, use requests.

📖 See complete tool comparison → (includes Scrapy, Playwright, BeautifulSoup, and more)

🎉 What Makes It Delightful

JavaScript-Powered Intelligence

# Dynamic content extraction from SPAs
content = await web.get(
    "https://react-app.com",
    script="window.testData?.framework + ' v' + window.React?.version"
)
# Automatically detects: React application with version info
# Extracts: Dynamic content + framework details

# E-commerce with JavaScript-loaded prices
product = await web.get(
    "https://shop.com/product",
    script="document.querySelector('.dynamic-price')?.textContent",
    wait_for=".price-loaded"
) 
# Recognizes product page with dynamic pricing
# Extracts: Real-time price, reviews, availability, specs

Beautiful Output

✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment

Helpful Errors

try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    # "💡 Try: await web.get(url, stealth=True)"
    ...
except web.PaywallDetected as e:
    # "🔐 Found archived version: {e.archive_url}"
    ...

📚 Documentation

🤝 Contributing

We love contributions! Crawailer is designed to be:

  • Easy to extend: Add new content extractors and browser capabilities
  • Well-tested: Comprehensive test suite with real websites
  • Documented: Every feature has examples and use cases

See CONTRIBUTING.md for details.

📄 License

MIT License - see LICENSE for details.


🚀 Ready to Stop Losing Your Mind?

pip install crawailer
crawailer setup  # Install browser engines

Life's too short for empty <div> tags and "JavaScript required" messages.

Get content that actually exists. From websites that actually work.

โญ Star us if this saves your sanity โ†’ git.supported.systems/MCP/crawailer


Built with โค๏ธ for the age of AI agents and automation

Crawailer: Because robots deserve delightful web experiences too ๐Ÿค–โœจ

Project details


Download files

Download the file for your platform.

Source Distribution

crawailer-0.1.1.tar.gz (265.0 kB)

Uploaded Source

Built Distribution


crawailer-0.1.1-py3-none-any.whl (28.9 kB)

Uploaded Python 3

File details

Details for the file crawailer-0.1.1.tar.gz.

File metadata

  • Download URL: crawailer-0.1.1.tar.gz
  • Upload date:
  • Size: 265.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crawailer-0.1.1.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 67a49c4dcfadd646f876e13d2d727f54e65ecd67043721f9cae5b75fe3927a5c |
| MD5 | d918e76aba2eecfff5fee845e969e27d |
| BLAKE2b-256 | 4df07cfacbd807b1ac851a852d5b7f671e60259d4553884e4551ce566cafe222 |

See more details on using hashes here.
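To verify a downloaded file against a published digest, a minimal stdlib sketch (the filename and SHA256 value below are the ones listed on this page):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Digest published above for the source distribution:
expected = "67a49c4dcfadd646f876e13d2d727f54e65ecd67043721f9cae5b75fe3927a5c"
# After downloading:
# assert sha256_of("crawailer-0.1.1.tar.gz") == expected
```

Streaming in chunks keeps memory flat even for large archives; `hashlib.file_digest` (Python 3.11+) does the same in one call.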

File details

Details for the file crawailer-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: crawailer-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 28.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crawailer-0.1.1-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | f19747209ad8ccd911b0efbfce9ee67eb9ae4cc851a9e60a817ec272a07ecedf |
| MD5 | 3147d1c8ac49bccfb0000ef115fb1398 |
| BLAKE2b-256 | de6a63b755f6dd40afb45f0e4bb9b169492022ecc7fbcf8c7c9e8a0bf000c4cf |

See more details on using hashes here.
