
Modern Python library for browser automation and intelligent content extraction with full JavaScript execution support

Project description

๐Ÿ•ท๏ธ Crawailer

The JavaScript-first web scraper that actually works with modern websites

Finally! A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When requests fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution.

pip install crawailer


✨ Features

  • 🎯 JavaScript-First: Executes real JavaScript on React, Vue, Angular sites (unlike requests)
  • ⚡ Lightning Fast: 5-10x faster HTML processing with C-based selectolax
  • 🤖 AI-Optimized: Clean markdown output perfect for LLM training and RAG
  • 🔧 Three Ways to Use: Library, CLI tool, or MCP server - your choice
  • 📦 Zero Config: Works immediately with sensible defaults
  • 🧪 Battle-Tested: 18 comprehensive test suites with 70+ real-world scenarios
  • 🎨 Developer Joy: Rich terminal output, helpful errors, progress tracking

🚀 Quick Start

import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# JavaScript execution for dynamic content
content = await web.get(
    "https://spa-app.com",
    script="document.querySelector('.dynamic-price').textContent"
)
print(f"Price: {content.script_result}")

# Batch processing with JavaScript
results = await web.get_many(
    ["url1", "url2", "url3"],
    script="document.title + ' | ' + document.querySelector('.description')?.textContent"
)
for result in results:
    print(f"{result.title}: {result.script_result}")

# Smart discovery with interaction
research = await web.discover(
    "AI safety papers", 
    script="document.querySelector('.show-more')?.click()",
    max_pages=10
)
# Returns the most relevant content with enhanced extraction

# Compare: Traditional scraping fails on modern sites
# requests.get("https://react-app.com") → Empty <div id="root"></div>
# Crawailer → Full content + dynamic data

🎯 Design Philosophy

For Robots, By Humans

  • Predictive: Anticipates what you need and provides it
  • Forgiving: Handles errors gracefully with helpful suggestions
  • Efficient: Fast by default, with smart caching and concurrency
  • Composable: Small, focused functions that work well together

Perfect for AI Workflows

  • LLM-Optimized: Clean markdown, structured data, semantic chunking
  • Context-Aware: Extracts relationships and metadata automatically
  • Quality-Focused: Built-in content quality assessment
  • Archive-Ready: Designed for long-term storage and retrieval
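
The semantic chunking mentioned above happens inside the library, but the idea is easy to sketch. A minimal, hypothetical paragraph-based chunker (not Crawailer's actual implementation) that splits clean markdown into bounded-size pieces for a RAG pipeline:

```python
# Hypothetical sketch: split markdown into chunks no larger than
# max_chars, keeping whole paragraphs together where possible.
def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        # Start a new chunk when adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A paragraph longer than max_chars still lands in its own chunk; a production chunker would also split on sentence boundaries.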

📖 Use Cases

🤖 AI Agents & LLM Applications

Problem: Training data scattered across JavaScript-heavy academic sites

# Research assistant workflow with JavaScript interaction
research = await web.discover(
    "quantum computing breakthroughs",
    script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent"
)
for paper in research:
    # Rich content includes JavaScript-extracted data
    summary = await llm.summarize(paper.markdown)
    dynamic_content = paper.script_result  # JavaScript execution result
    insights = await llm.extract_insights(paper.content + dynamic_content)

🛒 E-commerce Price Monitoring

Problem: Product prices loaded via AJAX, requests sees loading spinners

# Monitor competitor pricing with dynamic content
products = await web.get_many(
    competitor_urls,
    script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}"
)
for product in products:
    if product.script_result['price'] != cached_price:
        await alert_price_change(product.url, product.script_result)
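
The cached_price comparison above is left abstract. One way to back it is a small in-memory cache; this PriceCache class is a hypothetical helper, not part of Crawailer:

```python
# Hypothetical helper: remember the last-seen price per URL and
# report whether a newly scraped price differs from it.
class PriceCache:
    def __init__(self):
        self._prices: dict[str, str] = {}

    def has_changed(self, url: str, price: str) -> bool:
        """Return True when price differs from the cached value, then update the cache."""
        changed = self._prices.get(url) != price
        self._prices[url] = price
        return changed
```

The first lookup for a URL always reports a change, which conveniently doubles as "new product seen".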

🔗 MCP Servers

Problem: Claude needs reliable web content extraction tools

# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools

📊 Social Media & Content Analysis

Problem: Posts and comments load infinitely via JavaScript

# Extract social media discussions with infinite scroll
content = await web.get(
    "https://social-platform.com/topic/ai-safety",
    script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length"
)
# Gets full thread content, not just initial page load

๐Ÿ› ๏ธ Installation

# Basic installation
pip install crawailer

# With AI features (semantic search, entity extraction)
pip install crawailer[ai]

# With MCP server capabilities  
pip install crawailer[mcp]

# Everything
pip install crawailer[all]

# Post-install setup (installs Playwright browsers)
crawailer setup

๐Ÿ—๏ธ Architecture

Crawailer is built on modern, focused libraries:

  • 🎭 Playwright: Reliable browser automation
  • ⚡ selectolax: 5-10x faster HTML parsing (C-based)
  • 📝 markdownify: Clean HTML→Markdown conversion
  • 🧹 justext: Intelligent content extraction and cleaning
  • 🔄 httpx: Modern async HTTP client
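
Conceptually, these pieces form a pipeline: parse the rendered HTML, drop non-content nodes like scripts and styles, then emit the title and clean text. A toy pure-stdlib sketch of that parse-and-extract step (Crawailer itself uses selectolax and justext, which are far faster and more capable):

```python
from html.parser import HTMLParser


class TitleAndText(HTMLParser):
    """Toy extractor: pull <title> and visible text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._skip_depth = 0  # nesting depth inside <script>/<style>
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


parser = TitleAndText()
parser.feed(
    "<html><head><title>Demo</title><script>track()</script></head>"
    "<body><p>Hello</p></body></html>"
)
```

Here parser.title is "Demo" and parser.text_parts contains only the visible paragraph text, with the script body filtered out.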

🧪 Battle-Tested Quality

Crawailer includes 18 comprehensive test suites with real-world scenarios:

  • Modern Frameworks: React, Vue, Angular demos with full JavaScript APIs
  • Mobile Compatibility: Safari iOS, Chrome Android, responsive designs
  • Production Edge Cases: Network failures, memory pressure, browser differences
  • Performance Testing: Stress tests, concurrency, resource management

Want to contribute? We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution.

๐Ÿ“ Future TODO: Move examples to dedicated repository for community contributions

๐Ÿค Perfect for MCP Projects

MCP servers love Crawailer because it provides:

  • Focused tools: Each function does one thing well
  • Rich outputs: Structured data ready for LLM consumption
  • Smart defaults: Works out of the box with minimal configuration
  • Extensible: Easy to add domain-specific extraction logic
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }

🥊 Crawailer vs Traditional Tools

| Challenge | requests & HTTP libs | Selenium | Crawailer |
|---|---|---|---|
| React/Vue/Angular | ❌ Empty templates | 🟡 Slow, complex setup | ✅ Just works |
| Dynamic Pricing | ❌ Shows loading spinner | 🟡 Requires waits/timeouts | ✅ Intelligent waiting |
| JavaScript APIs | ❌ No access | 🟡 Clunky WebDriver calls | ✅ Native page.evaluate() |
| Speed | 🟢 100-500ms | ❌ 5-15 seconds | ✅ 2-5 seconds |
| Memory | 🟢 1-5MB | ❌ 200-500MB | 🟡 100-200MB |
| AI-Ready Output | ❌ Raw HTML | ❌ Raw HTML | ✅ Clean Markdown |
| Developer Experience | 🟡 Manual parsing | ❌ Complex WebDriver | ✅ Intuitive API |

The bottom line: When JavaScript matters, Crawailer delivers. When it doesn't, use requests.

📖 See complete tool comparison → (includes Scrapy, Playwright, BeautifulSoup, and more)

🎉 What Makes It Delightful

JavaScript-Powered Intelligence

# Dynamic content extraction from SPAs
content = await web.get(
    "https://react-app.com",
    script="window.testData?.framework + ' v' + window.React?.version"
)
# Automatically detects: React application with version info
# Extracts: Dynamic content + framework details

# E-commerce with JavaScript-loaded prices
product = await web.get(
    "https://shop.com/product",
    script="document.querySelector('.dynamic-price')?.textContent",
    wait_for=".price-loaded"
) 
# Recognizes product page with dynamic pricing
# Extracts: Real-time price, reviews, availability, specs

Beautiful Output

✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment

Helpful Errors

try:
    content = await web.get("https://problematic-site.com")
except web.CloudflareProtected:
    ...  # "💡 Try: await web.get(url, stealth=True)"
except web.PaywallDetected as e:
    ...  # "🔍 Found archived version: {e.archive_url}"

📚 Documentation

๐Ÿค Contributing

We love contributions! Crawailer is designed to be:

  • Easy to extend: Add new content extractors and browser capabilities
  • Well-tested: Comprehensive test suite with real websites
  • Documented: Every feature has examples and use cases

See CONTRIBUTING.md for details.

📄 License

MIT License - see LICENSE for details.


🚀 Ready to Stop Fighting JavaScript?

pip install crawailer
crawailer setup  # Install browser engines

Join the revolution: Stop losing data to requests.get() failures. Start extracting real content from real websites that actually use JavaScript.

โญ Star us on GitHub if Crawailer saves your scraping sanity!


Built with โค๏ธ for the age of AI agents and automation

Crawailer: Because robots deserve delightful web experiences too 🤖✨

Project details


Download files

Download the file for your platform.

Source Distribution

crawailer-0.1.0.tar.gz (261.9 kB)

Uploaded Source

Built Distribution


crawailer-0.1.0-py3-none-any.whl (28.3 kB)

Uploaded Python 3

File details

Details for the file crawailer-0.1.0.tar.gz.

File metadata

  • Download URL: crawailer-0.1.0.tar.gz
  • Upload date:
  • Size: 261.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crawailer-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9555427c7fd773a177b4e163da92e9831667291b4abccff6a78b22efba87d00a
MD5 fe67bb2d7db9e0bc1a92a2a4142ad622
BLAKE2b-256 3014fe228f4e9f837abc5f2ab684787791d313e21ce043bd0fb9e79ee6c6ebe1


File details

Details for the file crawailer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: crawailer-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for crawailer-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b354d8c2a6cd6c74a5a7d7af8fe09fdaca2c7e8e8c7d6b7004486794c1d3f9e2
MD5 335d816e3f8b4dfc9e459bc9738788c3
BLAKE2b-256 b3fd0ea65923e48291ca4a969329f51c21db4b33eafa3b173dde7e0bf295e42f

