Modern Python library for browser automation and intelligent content extraction with full JavaScript execution support
🕷️ Crawailer
The JavaScript-first web scraper that actually works with modern websites
Finally! A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When requests fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution.
pip install crawailer
✨ Features
- 🎯 JavaScript-First: Executes real JavaScript on React, Vue, Angular sites (unlike requests)
- ⚡ Lightning Fast: 5-10x faster HTML processing with C-based selectolax
- 🤖 AI-Optimized: Clean markdown output perfect for LLM training and RAG
- 🔧 Three Ways to Use: Library, CLI tool, or MCP server - your choice
- 📦 Zero Config: Works immediately with sensible defaults
- 🧪 Battle-Tested: 18 comprehensive test suites with 70+ real-world scenarios
- 🎨 Developer Joy: Rich terminal output, helpful errors, progress tracking
🚀 Quick Start
import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# JavaScript execution for dynamic content
content = await web.get(
    "https://spa-app.com",
    script="document.querySelector('.dynamic-price').textContent"
)
print(f"Price: {content.script_result}")

# Batch processing with JavaScript
results = await web.get_many(
    ["url1", "url2", "url3"],
    script="document.title + ' | ' + document.querySelector('.description')?.textContent"
)
for result in results:
    print(f"{result.title}: {result.script_result}")

# Smart discovery with interaction
research = await web.discover(
    "AI safety papers",
    script="document.querySelector('.show-more')?.click()",
    max_pages=10
)
# Returns the most relevant content with enhanced extraction

# Compare: Traditional scraping fails on modern sites
# requests.get("https://react-app.com") → Empty <div id="root"></div>
# Crawailer → Full content + dynamic data
🎯 Design Philosophy
For Robots, By Humans
- Predictive: Anticipates what you need and provides it
- Forgiving: Handles errors gracefully with helpful suggestions
- Efficient: Fast by default, with smart caching and concurrency
- Composable: Small, focused functions that work well together
Perfect for AI Workflows
- LLM-Optimized: Clean markdown, structured data, semantic chunking
- Context-Aware: Extracts relationships and metadata automatically
- Quality-Focused: Built-in content quality assessment
- Archive-Ready: Designed for long-term storage and retrieval
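To make the "semantic chunking" idea concrete, here is a minimal stdlib-only sketch of splitting extracted markdown into heading-delimited chunks for RAG ingestion. This is an illustration of the concept, not Crawailer's actual API; the function name and parameters are hypothetical.

```python
import re

def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split markdown into heading-delimited chunks, merging small neighbors.

    Toy illustration of semantic chunking; not Crawailer's real implementation.
    """
    # Split immediately before every markdown heading line
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in filter(None, (s.strip() for s in sections)):
        if chunks and len(chunks[-1]) + len(section) + 2 <= max_chars:
            chunks[-1] += "\n\n" + section  # merge short adjacent sections
        else:
            chunks.append(section)
    return chunks

doc = "# Title\nIntro text.\n## Part A\nDetails.\n## Part B\nMore details."
print(chunk_markdown(doc, max_chars=30))  # three heading-delimited chunks
```

Each chunk keeps its heading, so downstream retrieval can cite the section a passage came from.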
📚 Use Cases
🤖 AI Agents & LLM Applications
Problem: Training data scattered across JavaScript-heavy academic sites
# Research assistant workflow with JavaScript interaction
research = await web.discover(
    "quantum computing breakthroughs",
    script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent"
)
for paper in research:
    # Rich content includes JavaScript-extracted data
    summary = await llm.summarize(paper.markdown)
    dynamic_content = paper.script_result  # JavaScript execution result
    insights = await llm.extract_insights(paper.content + dynamic_content)
🛒 E-commerce Price Monitoring
Problem: Product prices loaded via AJAX, requests sees loading spinners
# Monitor competitor pricing with dynamic content
products = await web.get_many(
    competitor_urls,
    script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}"
)
for product in products:
    if product.script_result['price'] != cached_price:
        await alert_price_change(product.url, product.script_result)
🔌 MCP Servers
Problem: Claude needs reliable web content extraction tools
# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server
server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools
📱 Social Media & Content Analysis
Problem: Posts and comments load infinitely via JavaScript
# Extract social media discussions with infinite scroll
content = await web.get(
    "https://social-platform.com/topic/ai-safety",
    script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length"
)
# Gets full thread content, not just initial page load
🛠️ Installation
# Basic installation
pip install crawailer
# With AI features (semantic search, entity extraction)
pip install crawailer[ai]
# With MCP server capabilities
pip install crawailer[mcp]
# Everything
pip install crawailer[all]
# Post-install setup (installs Playwright browsers)
crawailer setup
🏗️ Architecture
Crawailer is built on modern, focused libraries:
- 🎭 Playwright: Reliable browser automation
- ⚡ selectolax: 5-10x faster HTML parsing (C-based)
- 📝 markdownify: Clean HTML→Markdown conversion
- 🧹 justext: Intelligent content extraction and cleaning
- 🌐 httpx: Modern async HTTP client
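To illustrate what the HTML→Markdown stage of such a pipeline does, here is a deliberately tiny stdlib-only stand-in built on `html.parser`. Crawailer itself uses selectolax and markdownify for this step; everything below is a simplified sketch, not the library's code.

```python
from html.parser import HTMLParser

class ToyMarkdown(HTMLParser):
    """Crude HTML-to-Markdown converter: headings, paragraphs, list items."""

    def __init__(self):
        super().__init__()
        self.out: list[str] = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        # Drop script/style bodies and whitespace-only runs
        if not self._skip and data.strip():
            self.out.append(data)

def html_to_markdown(html: str) -> str:
    parser = ToyMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

html = "<h1>Title</h1><p>Hello <b>world</b>.</p><script>ignored()</script>"
print(html_to_markdown(html))
# # Title
# Hello world.
```

Real converters additionally handle links, tables, nested lists, and boilerplate removal (the justext step), which is exactly why Crawailer delegates to dedicated libraries.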
🧪 Battle-Tested Quality
Crawailer includes 18 comprehensive test suites with real-world scenarios:
- Modern Frameworks: React, Vue, Angular demos with full JavaScript APIs
- Mobile Compatibility: Safari iOS, Chrome Android, responsive designs
- Production Edge Cases: Network failures, memory pressure, browser differences
- Performance Testing: Stress tests, concurrency, resource management
Want to contribute? We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution.
📝 Future TODO: Move examples to dedicated repository for community contributions
🤝 Perfect for MCP Projects
MCP servers love Crawailer because it provides:
- Focused tools: Each function does one thing well
- Rich outputs: Structured data ready for LLM consumption
- Smart defaults: Works out of the box with minimal configuration
- Extensible: Easy to add domain-specific extraction logic
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }
🔥 Crawailer vs Traditional Tools
| Challenge | requests & HTTP libs | Selenium | Crawailer |
|---|---|---|---|
| React/Vue/Angular | ❌ Empty templates | 🟡 Slow, complex setup | ✅ Just works |
| Dynamic Pricing | ❌ Shows loading spinner | 🟡 Requires waits/timeouts | ✅ Intelligent waiting |
| JavaScript APIs | ❌ No access | 🟡 Clunky WebDriver calls | ✅ Native page.evaluate() |
| Speed | 🟢 100-500ms | ❌ 5-15 seconds | ✅ 2-5 seconds |
| Memory | 🟢 1-5MB | ❌ 200-500MB | 🟡 100-200MB |
| AI-Ready Output | ❌ Raw HTML | ❌ Raw HTML | ✅ Clean Markdown |
| Developer Experience | 🟡 Manual parsing | ❌ Complex WebDriver | ✅ Intuitive API |
The bottom line: When JavaScript matters, Crawailer delivers. When it doesn't, use requests.
📊 See the complete tool comparison → (includes Scrapy, Playwright, BeautifulSoup, and more)
🌟 What Makes It Delightful
JavaScript-Powered Intelligence
# Dynamic content extraction from SPAs
content = await web.get(
    "https://react-app.com",
    script="window.testData?.framework + ' v' + window.React?.version"
)
# Automatically detects: React application with version info
# Extracts: Dynamic content + framework details

# E-commerce with JavaScript-loaded prices
product = await web.get(
    "https://shop.com/product",
    script="document.querySelector('.dynamic-price')?.textContent",
    wait_for=".price-loaded"
)
# Recognizes product page with dynamic pricing
# Extracts: Real-time price, reviews, availability, specs
Beautiful Output
✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment
Helpful Errors
try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    # "💡 Try: await web.get(url, stealth=True)"
    content = await web.get("problematic-site.com", stealth=True)
except web.PaywallDetected as e:
    # "📄 Found archived version: {e.archive_url}"
    content = await web.get(e.archive_url)
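Beyond these typed exceptions, flaky sites often just need a retry. Here is a generic stdlib-only backoff wrapper one might put around calls like `web.get`; the `with_retries` helper and the stand-in `flaky_fetch` are illustrative, not part of Crawailer's API.

```python
import asyncio

async def with_retries(coro_factory, attempts: int = 3, base_delay: float = 0.1):
    """Retry an async operation with exponential backoff.

    `coro_factory` is a zero-arg callable returning a fresh coroutine,
    e.g. lambda: web.get(url). Demonstrated with a stand-in fetcher.
    """
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            await asyncio.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

async def flaky_fetch():
    # Fails twice, then succeeds, simulating a transient network error
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "content"

print(asyncio.run(with_retries(flaky_fetch)))  # content
```

A factory (rather than a coroutine object) is required because a coroutine can only be awaited once, so each retry must create a fresh one.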
📚 Documentation
- Tool Comparison: How Crawailer compares to Scrapy, Selenium, BeautifulSoup, etc.
- Getting Started: Installation and first steps
- JavaScript API: Complete JavaScript execution guide
- API Reference: Complete function documentation
- Benchmarks: Performance comparison with other tools
- MCP Integration: Building MCP servers with Crawailer
- Examples: Real-world usage patterns
- Architecture: How Crawailer works internally
🤝 Contributing
We love contributions! Crawailer is designed to be:
- Easy to extend: Add new content extractors and browser capabilities
- Well-tested: Comprehensive test suite with real websites
- Documented: Every feature has examples and use cases
See CONTRIBUTING.md for details.
📄 License
MIT License - see LICENSE for details.
🚀 Ready to Stop Fighting JavaScript?
pip install crawailer
crawailer setup # Install browser engines
Join the revolution: Stop losing data to requests.get() failures. Start extracting real content from real websites that actually use JavaScript.
⭐ Star us on GitHub if Crawailer saves your scraping sanity!
Built with ❤️ for the age of AI agents and automation
Crawailer: Because robots deserve delightful web experiences too 🤖✨
File details
Details for the file crawailer-0.1.0.tar.gz.
File metadata
- Download URL: crawailer-0.1.0.tar.gz
- Upload date:
- Size: 261.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9555427c7fd773a177b4e163da92e9831667291b4abccff6a78b22efba87d00a |
| MD5 | fe67bb2d7db9e0bc1a92a2a4142ad622 |
| BLAKE2b-256 | 3014fe228f4e9f837abc5f2ab684787791d313e21ce043bd0fb9e79ee6c6ebe1 |
File details
Details for the file crawailer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: crawailer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 28.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b354d8c2a6cd6c74a5a7d7af8fe09fdaca2c7e8e8c7d6b7004486794c1d3f9e2 |
| MD5 | 335d816e3f8b4dfc9e459bc9738788c3 |
| BLAKE2b-256 | b3fd0ea65923e48291ca4a969329f51c21db4b33eafa3b173dde7e0bf295e42f |