Modern Python library for browser automation and intelligent content extraction with full JavaScript execution support
Crawailer
The web scraper that doesn't suck at JavaScript
Stop fighting modern websites. While requests gives you an empty <div id="root"></div>, Crawailer actually executes JavaScript and extracts real content from React, Vue, and Angular apps. Finally, web scraping that works in 2025.
Claude Code's new best friend - your AI assistant can now access ANY website
pip install crawailer
Why Developers Choose Crawailer
JavaScript That Actually Works
While other tools timeout or crash, Crawailer executes real JavaScript like a human browser
Stupidly Fast
5-10x faster than BeautifulSoup with C-based parsing that doesn't make you wait
AI Assistant Ready
Perfect markdown output that your Claude/GPT/local model will love
Zero Learning Curve
pip install → works immediately → no 47-page configuration guides
Production Battle-Tested
18 comprehensive test suites covering every edge case we could think of
Actually Enjoyable
Rich terminal output, helpful errors, progress bars that don't lie
Quick Start
(Honestly, you probably don't need to read these examples - just ask your AI assistant to figure it out. That's what models are for! But here they are anyway...)
See It In Action
Basic Usage Demo - Crawailer vs requests:
# View the demo locally
asciinema play demos/basic-usage.cast
Claude Code Integration - Give your AI web superpowers:
# View the Claude integration demo
asciinema play demos/claude-integration.cast
Don't have asciinema? pip install asciinema or run the demos yourself:
# Clone the repo and run demos interactively
git clone https://git.supported.systems/MCP/crawailer.git
cd crawailer
python demo_basic_usage.py
python demo_claude_integration.py
import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# JavaScript execution for dynamic content
content = await web.get(
    "https://spa-app.com",
    script="document.querySelector('.dynamic-price').textContent"
)
print(f"Price: {content.script_result}")

# Batch processing with JavaScript
results = await web.get_many(
    ["url1", "url2", "url3"],
    script="document.title + ' | ' + document.querySelector('.description')?.textContent"
)
for result in results:
    print(f"{result.title}: {result.script_result}")

# Smart discovery with interaction
research = await web.discover(
    "AI safety papers",
    script="document.querySelector('.show-more')?.click()",
    max_pages=10
)
# Returns the most relevant content with enhanced extraction

# Compare: traditional scraping fails on modern sites
# requests.get("https://react-app.com") → empty <div id="root"></div>
# Crawailer → full content + dynamic data
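Under the hood, a batch helper like get_many can fan requests out concurrently rather than fetching one URL at a time. A minimal stdlib sketch of that pattern, with the real browser fetch stubbed out (fetch_one is a hypothetical stand-in, not Crawailer's API):

```python
import asyncio

async def fetch_one(url: str) -> str:
    # Stand-in for a real browser fetch; sleeps briefly to simulate I/O
    await asyncio.sleep(0.01)
    return f"<title>{url}</title>"

async def fetch_many(urls: list[str]) -> list[str]:
    # Fan out all fetches at once and await them together;
    # gather preserves the input order in its results
    return await asyncio.gather(*(fetch_one(u) for u in urls))

results = asyncio.run(fetch_many(["url1", "url2", "url3"]))
print(results[0])  # <title>url1</title>
```

Because the fetches overlap, total wall time is roughly the slowest single fetch rather than the sum of all of them.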
Claude Code MCP Integration
"Hey Claude, go grab that data from the React app" - this actually works now
# Add to your Claude Code MCP server
import crawailer as web
from crawailer.mcp import create_mcp_server

@mcp_tool("web_extract")
async def extract_content(url: str, script: str = ""):
    """Extract content from any website with optional JavaScript execution"""
    content = await web.get(url, script=script)
    return {
        "title": content.title,
        "markdown": content.markdown,
        "script_result": content.script_result,
        "word_count": content.word_count,
    }

# No more "I can't access that site"
# No more copy-pasting content manually
# Your AI can now browse the web like a human
Design Philosophy
For Robots, By Humans
- Predictive: Anticipates what you need and provides it
- Forgiving: Handles errors gracefully with helpful suggestions
- Efficient: Fast by default, with smart caching and concurrency
- Composable: Small, focused functions that work well together
Perfect for AI Workflows
- LLM-Optimized: Clean markdown, structured data, semantic chunking
- Context-Aware: Extracts relationships and metadata automatically
- Quality-Focused: Built-in content quality assessment
- Archive-Ready: Designed for long-term storage and retrieval
Use Cases
AI Agents & LLM Applications
Problem: Training data scattered across JavaScript-heavy academic sites
# Research assistant workflow with JavaScript interaction
research = await web.discover(
    "quantum computing breakthroughs",
    script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent"
)
for paper in research:
    # Rich content includes JavaScript-extracted data
    summary = await llm.summarize(paper.markdown)
    dynamic_content = paper.script_result  # JavaScript execution result
    insights = await llm.extract_insights(paper.content + dynamic_content)
E-commerce Price Monitoring
Problem: Product prices load via AJAX, so requests only sees loading spinners
# Monitor competitor pricing with dynamic content
products = await web.get_many(
    competitor_urls,
    script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}"
)
for product in products:
    if product.script_result['price'] != cached_price:
        await alert_price_change(product.url, product.script_result)
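The cached_price comparison can be fleshed out with a small helper that remembers the last seen price per URL. This is a hypothetical sketch, not part of Crawailer's API; the in-memory dict stands in for whatever store (SQLite, Redis, ...) you'd use in production:

```python
# Hypothetical price-change check against a simple in-memory cache
price_cache: dict[str, str] = {}

def price_changed(url: str, new_price: str) -> bool:
    """Return True when the scraped price differs from the cached one."""
    old = price_cache.get(url)
    price_cache[url] = new_price  # always remember the latest sighting
    return old is not None and old != new_price
```

The first sighting of a URL returns False (there is nothing to compare against yet), which avoids firing an alert on the initial crawl.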
MCP Servers
Problem: Claude needs reliable web content extraction tools
# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server
server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools
Social Media & Content Analysis
Problem: Posts and comments load infinitely via JavaScript
# Extract social media discussions with infinite scroll
content = await web.get(
    "https://social-platform.com/topic/ai-safety",
    script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length"
)
# Gets full thread content, not just initial page load
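For feeds that keep loading more content as you scroll, a common pattern is to repeat the scroll until the post count stops growing. A sketch of that loop with a stubbed page object (FakePage and scroll_and_count are illustrative stand-ins, not Crawailer's API):

```python
import asyncio

class FakePage:
    """Stub standing in for a real browser page during this sketch."""
    def __init__(self, batches):
        self._batches = batches  # posts revealed by each scroll
        self.post_count = 0

    async def scroll_and_count(self) -> int:
        # Simulate one scroll revealing the next batch of posts
        if self._batches:
            self.post_count += self._batches.pop(0)
        return self.post_count

async def load_full_feed(page, max_rounds: int = 20) -> int:
    seen = -1
    for _ in range(max_rounds):
        count = await page.scroll_and_count()
        if count == seen:  # nothing new appeared: feed exhausted
            break
        seen = count
    return seen

total = asyncio.run(load_full_feed(FakePage([10, 10, 5])))
print(total)  # 25
```

The max_rounds cap matters on truly endless feeds, where "post count stopped growing" may never happen.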
Installation
# Basic installation
pip install crawailer
# With AI features (semantic search, entity extraction)
pip install crawailer[ai]
# With MCP server capabilities
pip install crawailer[mcp]
# Everything
pip install crawailer[all]
# Post-install setup (installs Playwright browsers)
crawailer setup
Architecture
Crawailer is built on modern, focused libraries:
- Playwright: Reliable browser automation
- selectolax: 5-10x faster HTML parsing (C-based)
- markdownify: Clean HTML→Markdown conversion
- justext: Intelligent content extraction and cleaning
- httpx: Modern async HTTP client
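markdownify does the real HTML→Markdown work. To illustrate the core idea, here is a toy converter built on the stdlib html.parser; it handles only headings and paragraphs and is nowhere near the real library:

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy HTML→Markdown converter: headings and paragraphs only."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        # h1-h3 become "# ", "## ", "### " prefixes on the next text run
        if tag in ("h1", "h2", "h3"):
            self._prefix = "#" * int(tag[1]) + " "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)
            self._prefix = ""

    def markdown(self) -> str:
        # Blank line between blocks, as Markdown expects
        return "\n\n".join(self.out)

p = TinyMarkdown()
p.feed("<h1>Crawailer</h1><p>Browser automation for humans.</p>")
print(p.markdown())
```

A real converter also has to handle links, lists, tables, nesting, and whitespace rules, which is exactly why delegating to markdownify makes sense.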
Battle-Tested Quality
Crawailer includes 18 comprehensive test suites with real-world scenarios:
- Modern Frameworks: React, Vue, Angular demos with full JavaScript APIs
- Mobile Compatibility: Safari iOS, Chrome Android, responsive designs
- Production Edge Cases: Network failures, memory pressure, browser differences
- Performance Testing: Stress tests, concurrency, resource management
Want to contribute? We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution.
Future TODO: Move examples to a dedicated repository for community contributions
Perfect for MCP Projects
MCP servers love Crawailer because it provides:
- Focused tools: Each function does one thing well
- Rich outputs: Structured data ready for LLM consumption
- Smart defaults: Works out of the box with minimal configuration
- Extensible: Easy to add domain-specific extraction logic
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }
Crawailer vs Traditional Tools

| Challenge | requests & HTTP libs | Selenium | Crawailer |
|---|---|---|---|
| React/Vue/Angular | ❌ Empty templates | 🟡 Slow, complex setup | ✅ Just works |
| Dynamic Pricing | ❌ Shows loading spinner | 🟡 Requires waits/timeouts | ✅ Intelligent waiting |
| JavaScript APIs | ❌ No access | 🟡 Clunky WebDriver calls | ✅ Native page.evaluate() |
| Speed | 🟢 100-500ms | ❌ 5-15 seconds | ✅ 2-5 seconds |
| Memory | 🟢 1-5MB | ❌ 200-500MB | 🟡 100-200MB |
| AI-Ready Output | ❌ Raw HTML | ❌ Raw HTML | ✅ Clean Markdown |
| Developer Experience | 🟡 Manual parsing | ❌ Complex WebDriver | ✅ Intuitive API |
The bottom line: when JavaScript matters, Crawailer delivers. When it doesn't, use requests.

See the complete tool comparison → (includes Scrapy, Playwright, BeautifulSoup, and more)
What Makes It Delightful
JavaScript-Powered Intelligence
# Dynamic content extraction from SPAs
content = await web.get(
    "https://react-app.com",
    script="window.testData?.framework + ' v' + window.React?.version"
)
# Automatically detects: React application with version info
# Extracts: Dynamic content + framework details
# E-commerce with JavaScript-loaded prices
product = await web.get(
    "https://shop.com/product",
    script="document.querySelector('.dynamic-price')?.textContent",
    wait_for=".price-loaded"
)
# Recognizes product page with dynamic pricing
# Extracts: Real-time price, reviews, availability, specs
Beautiful Output
Found 15 high-quality sources
Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
Date range: 2023-2024 (recent research)
Average quality score: 8.7/10
Key topics: transformers, safety, alignment
Helpful Errors
try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    ...  # "Try: await web.get(url, stealth=True)"
except web.PaywallDetected as e:
    ...  # "Found archived version: {e.archive_url}"
Documentation
- Tool Comparison: How Crawailer compares to Scrapy, Selenium, BeautifulSoup, etc.
- Getting Started: Installation and first steps
- JavaScript API: Complete JavaScript execution guide
- API Reference: Complete function documentation
- Benchmarks: Performance comparison with other tools
- MCP Integration: Building MCP servers with Crawailer
- Examples: Real-world usage patterns
- Architecture: How Crawailer works internally
Contributing
We love contributions! Crawailer is designed to be:
- Easy to extend: Add new content extractors and browser capabilities
- Well-tested: Comprehensive test suite with real websites
- Documented: Every feature has examples and use cases
See CONTRIBUTING.md for details.
License
MIT License - see LICENSE for details.
Ready to Stop Losing Your Mind?
pip install crawailer
crawailer setup # Install browser engines
Life's too short for empty <div> tags and "JavaScript required" messages.
Get content that actually exists. From websites that actually work.
Star us if this saves your sanity → git.supported.systems/MCP/crawailer
Built with ❤️ for the age of AI agents and automation
Crawailer: because robots deserve delightful web experiences too
File details
Details for the file crawailer-0.1.1.tar.gz.
File metadata
- Download URL: crawailer-0.1.1.tar.gz
- Upload date:
- Size: 265.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 67a49c4dcfadd646f876e13d2d727f54e65ecd67043721f9cae5b75fe3927a5c |
| MD5 | d918e76aba2eecfff5fee845e969e27d |
| BLAKE2b-256 | 4df07cfacbd807b1ac851a852d5b7f671e60259d4553884e4551ce566cafe222 |
File details
Details for the file crawailer-0.1.1-py3-none-any.whl.
File metadata
- Download URL: crawailer-0.1.1-py3-none-any.whl
- Upload date:
- Size: 28.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f19747209ad8ccd911b0efbfce9ee67eb9ae4cc851a9e60a817ec272a07ecedf |
| MD5 | 3147d1c8ac49bccfb0000ef115fb1398 |
| BLAKE2b-256 | de6a63b755f6dd40afb45f0e4bb9b169492022ecc7fbcf8c7c9e8a0bf000c4cf |