
🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper

Project description

🔥🕷️ Crawl4AI: LLM Friendly Web Crawler & Scraper

Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

New in 0.3.74 ✨

  • 🚀 Blazing Fast Scraping: Significantly improved scraping speed.
  • 📥 Download Manager: Integrated file crawling, downloading, and tracking within CrawlResult.
  • 📝 Markdown Strategy: Flexible system for custom markdown generation and formats.
  • 🔗 LLM-Friendly Citations: Auto-converts links to numbered citations with reference lists.
  • 🔎 Markdown Filter: BM25-based content extraction for cleaner, more relevant markdown.
  • 🖼️ Image Extraction: Supports srcset, picture, and responsive image formats.
  • 🗂️ Local/Raw HTML: Crawl file:// paths and raw HTML (raw:) directly (see the sketch after this list).
  • 🤖 Browser Control: Custom browser setups with stealth integration to bypass bot detection.
  • ☁️ API & Cache Boost: CORS, static serving, and enhanced filesystem-based caching.
  • 🐳 API Gateway: Run as an API service with secure token authentication.
  • 🛠️ Database Upgrades: Optimized for larger content sets with faster caching.
  • 🐛 Bug Fixes: Resolved browser context issues and memory leaks, and improved error handling.

Try it Now!

✨ Play around with it in our Open In Colab notebook

✨ Visit our Documentation Website

Features ✨

  • 🆓 Completely free and open-source
  • 🚀 Blazing fast performance, outperforming many paid services
  • 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
  • 🌐 Multi-browser support (Chromium, Firefox, WebKit)
  • 🌐 Supports crawling multiple URLs simultaneously
  • 🎨 Extracts and returns all media tags (images, audio, and video)
  • 🔗 Extracts all external and internal links
  • 📚 Extracts metadata from the page
  • 🔄 Custom hooks for authentication, headers, and page modifications
  • 🕵️ User-agent customization
  • 🖼️ Takes screenshots of pages with enhanced error handling (see the sketch after this list)
  • 📜 Executes multiple custom JavaScript snippets before crawling
  • 📊 Generates structured output without an LLM using JsonCssExtractionStrategy
  • 📚 Various chunking strategies: topic-based, regex, sentence, and more
  • 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
  • 🎯 CSS selector support for precise data extraction
  • 📝 Passes instructions/keywords to refine extraction
  • 🔒 Proxy support with authentication for enhanced access
  • 🔄 Session management for complex multi-page crawling
  • 🌐 Asynchronous architecture for improved performance
  • 🖼️ Improved image processing with lazy-loading detection
  • 🕰️ Enhanced handling of delayed content loading
  • 🔑 Custom headers support for LLM interactions
  • 🖼️ iframe content extraction for comprehensive analysis
  • ⏱️ Flexible timeout and delayed content retrieval options

Installation 🛠️

Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.

Using pip 🐍

Choose the installation option that best fits your needs:

Basic Installation

For basic web crawling and scraping tasks:

pip install crawl4ai

By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.

👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can install it manually using one of these methods:

  1. Through the command line:

    playwright install
    
  2. If the above doesn't work, try this more specific command:

    python -m playwright install chromium
    

This second method has proven to be more reliable in some cases.
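
If you want to confirm that the Playwright browsers are set up correctly before running a crawl, a quick sanity check like the one below (plain Playwright, no Crawl4AI involved) can help:

import asyncio
from playwright.async_api import async_playwright

async def check_playwright():
    # Launch and immediately close a headless Chromium instance to verify the install
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        await browser.close()
    print("Playwright Chromium is installed and working.")

if __name__ == "__main__":
    asyncio.run(check_playwright())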

Installation with Synchronous Version

If you need the synchronous version using Selenium:

pip install crawl4ai[sync]
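
A minimal sketch of synchronous usage, assuming the Selenium-based WebCrawler class with warmup() and run() as in earlier releases (the exact API may differ in your installed version):

from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()  # prepares the crawler before the first run

result = crawler.run(url="https://www.nbcnews.com/business")
print(result.markdown)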

Development Installation

For contributors who plan to modify the source code:

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .

One-Click Deployment 🚀

Deploy your own instance of Crawl4AI with one click:

Deploy to DigitalOcean

💡 Recommended specs: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.

The deployment will:

  • Set up a Docker container with Crawl4AI
  • Configure Playwright and all dependencies
  • Start the FastAPI server on port 11235
  • Set up health checks and auto-deployment

Using Docker 🐳

Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.

Option 1: Docker Hub (Recommended)

# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic    # Basic crawling features
docker pull unclecode/crawl4ai:all      # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu      # GPU-enabled version

# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic  # Replace 'basic' with your chosen version

# To allocate more shared memory to the container
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic

Option 2: Build from Repository

# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build the image (INSTALL_TYPE options: basic, all)
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .

# Run your local build
docker run -p 11235:11235 crawl4ai:local

Quick test (works for both options):

import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
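
In practice the crawl job may still be running when you first query the task endpoint, so you will usually want to poll until it completes. The sketch below assumes the /task/{task_id} response includes a "status" field and a "result" payload, and that an API token (if you enabled token authentication) is passed as a Bearer header; adjust the field names to match your deployment.

import time
import requests

headers = {}  # e.g. {"Authorization": "Bearer <your-token>"} if token auth is enabled

response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10},
    headers=headers,
)
task_id = response.json()["task_id"]

# Poll until the job finishes (give up after ~60 seconds)
for _ in range(60):
    task = requests.get(f"http://localhost:11235/task/{task_id}", headers=headers).json()
    if task.get("status") == "completed":
        print(task["result"]["markdown"][:500])  # assumed shape of the completed payload
        break
    time.sleep(1)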

For advanced configuration, environment variables, and usage examples, see our Docker Deployment Guide.

Quick Start 🚀

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Advanced Usage 🔬

Executing JavaScript and Using CSS Selectors

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Using a Proxy

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Extracting Structured Data without LLM

The JsonCssExtractionStrategy allows for precise extraction of structured data from web pages using CSS selectors.

import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())

For more advanced usage examples, check out our Examples section in the documentation.

Extracting Structured Data with OpenAI

import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output tokens for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),            
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Session Management and Dynamic Content Crawling

Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:

import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""
    async def on_execution_started(page):
        nonlocal first_commit 
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js_code=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())

This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.

For more advanced usage examples, check out our Examples section in the documentation.

Speed Comparison 🚀

Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.

We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:

Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49

Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49

Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89

As you can see, Crawl4AI outperforms Firecrawl significantly:

  • Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
  • With JavaScript execution: Even when executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.

You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
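
If you want to reproduce a rough version of the simple-crawl timing on your own machine, a minimal sketch looks like this (results will vary with network conditions and the target page):

import asyncio
import time
from crawl4ai import AsyncWebCrawler

async def timed_crawl(url: str) -> None:
    async with AsyncWebCrawler(verbose=False) as crawler:
        start = time.perf_counter()
        result = await crawler.arun(url=url, bypass_cache=True)
        elapsed = time.perf_counter() - start
        print(f"Time taken: {elapsed:.2f} seconds")
        print(f"Content length: {len(result.markdown or '')} characters")

if __name__ == "__main__":
    asyncio.run(timed_crawl("https://www.nbcnews.com/business"))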

Documentation 📚

For detailed documentation, including installation instructions, advanced features, and API reference, visit our Documentation Website.

Crawl4AI Roadmap 🗺️

For detailed information on our development plans and upcoming features, check out our Roadmap.

Advanced Crawling Systems 🔧

  • 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
  • 1. Question-Based Crawler: Natural language driven web discovery and content extraction
  • 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
  • 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations

Specialized Features 🛠️

  • 4. Automated Schema Generator: Convert natural language to extraction schemas
  • 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
  • 6. Web Embedding Index: Semantic search infrastructure for crawled content

Development Tools 🔨

  • 7. Interactive Playground: Web UI for testing and comparing strategies with AI assistance
  • 8. Performance Monitor: Real-time insights into crawler operations
  • 9. Cloud Integration: One-click deployment solutions across cloud providers

Community & Growth 🌱

  • 10. Sponsorship Program: Structured support system with tiered benefits
  • 11. Educational Content: "How to Crawl" video series and interactive tutorials

Contributing 🤝

We welcome contributions from the open-source community. Check out our contribution guidelines for more information.

License 📄

Crawl4AI is released under the Apache 2.0 License.

Contact 📧

For questions, suggestions, or feedback, feel free to reach out through the GitHub repository (https://github.com/unclecode/crawl4ai).

Happy Crawling! 🕸️🚀

Mission

Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data remains largely uncapitalized as a true asset.

Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradeable assets, we're creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.

This democratization of data represents the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we're building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.

For a detailed exploration of our vision, opportunities, and pathway forward, please see our full mission statement.

Key Opportunities

  • Data Capitalization: Transform digital footprints into valuable assets that can appear on personal and enterprise balance sheets
  • Authentic Data: Unlock the vast reservoir of real human insights and knowledge for AI advancement
  • Shared Economy: Create new value streams where data creators directly benefit from their contributions

Development Pathway

  1. Open-Source Foundation: Building transparent, community-driven data extraction tools
  2. Data Capitalization Platform: Creating tools to structure and value digital assets
  3. Shared Data Marketplace: Establishing an economic platform for ethical data exchange

For a detailed exploration of our vision, challenges, and solutions, please see our full mission statement.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crawl4ai-0.3.741.tar.gz (113.7 kB)

Uploaded Source

Built Distribution

Crawl4AI-0.3.741-py3-none-any.whl (119.7 kB)

Uploaded Python 3

File details

Details for the file crawl4ai-0.3.741.tar.gz.

File metadata

  • Download URL: crawl4ai-0.3.741.tar.gz
  • Upload date:
  • Size: 113.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.13

File hashes

Hashes for crawl4ai-0.3.741.tar.gz

  • SHA256: cb579ad2dfcb5c21957ee00d12f32a56e9fe29b39569216619713f9a6046c7c9
  • MD5: c4362efad195edfd742995bb98590a97
  • BLAKE2b-256: d69f04f11cf3bc88d8a891005324ab12cac09af58f431f2a12ecb77ba7fb7439

See more details on using hashes here.

File details

Details for the file Crawl4AI-0.3.741-py3-none-any.whl.

File metadata

  • Download URL: Crawl4AI-0.3.741-py3-none-any.whl
  • Upload date:
  • Size: 119.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.13

File hashes

Hashes for Crawl4AI-0.3.741-py3-none-any.whl

  • SHA256: d39eedc2aba64958e686752bcf1dcdc1aae7346d284966d9091cf6baa96aba62
  • MD5: d8b73e7682850b8573f044590dabc95a
  • BLAKE2b-256: 62218cd1c39a8242756461921b1a8d9543c589f45885bff69f7a02cd412522ff

See more details on using hashes here.
