Integrated website crawler and content analysis library

OpenCrawl

OpenCrawl is a web crawling and content analysis library: it crawls websites, analyzes their content with LLMs, and stores the resulting structured data for later use.

Features

  • Website crawling with Pathik
  • Content analysis using LLMs (Groq, OpenAI, etc.)
  • Structured data extraction from web pages
  • PostgreSQL storage of crawled data
  • Kafka integration for scalable processing

Installation

# Clone the repository
git clone https://github.com/yourusername/opencrawl.git
cd opencrawl

# Install the package
pip install -e .

Docker Setup

OpenCrawl includes a Docker Compose configuration for easy setup of required services:

# Start PostgreSQL and Kafka
docker-compose up -d

# View Kafka UI (optional); `open` is macOS-specific, use xdg-open on Linux
open http://localhost:9000

Troubleshooting Docker Setup

Errors with Docker volumes can include:

  • PostgreSQL compatibility issues between versions
  • Zookeeper snapshot/log inconsistency errors

If you run into either of these, use the included cleanup script:

# Run the cleanup script to remove incompatible volumes
./docker-cleanup.sh

# Then restart the services
docker-compose up -d

The script removes all existing Docker volumes, and fresh ones are created on the next start, so any data stored in the old volumes is lost. This is useful when upgrading or when volumes become corrupted.

Quick Start

import asyncio
import os
from opencrawl import OpenCrawl

async def main():
    # Initialize OpenCrawl with API key from environment variable
    crawler = OpenCrawl(
        content_analyzer_config={
            "api_key": os.getenv("GROQ_API_KEY")
        }
    )
    
    # Process a list of URLs
    results = await crawler.process_urls(
        urls=["https://example.com", "https://news.ycombinator.com"],
        verbose=True
    )
    
    # Print the results
    for result in results:
        print(f"URL: {result.get('url')}")
        content_analysis = result.get("content_analysis")
        if content_analysis:
            print(f"Title: {content_analysis.get('title')}")
            print(f"Topics: {', '.join(content_analysis.get('main_topics') or [])}")
            # Fall back to an empty string so slicing never fails on a missing summary
            print(f"Summary: {(content_analysis.get('summary') or '')[:100]}...")
        print("-" * 40)

if __name__ == "__main__":
    asyncio.run(main())
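
The printing loop above assumes each result carries a complete content_analysis dictionary. As a minimal sketch, the same output can be produced by a small formatter that tolerates missing or None fields. It relies only on the keys shown in the Quick Start (url, content_analysis, title, main_topics, summary); the helper name format_result is hypothetical and not part of OpenCrawl.

```python
from typing import Any, Dict


def format_result(result: Dict[str, Any], summary_chars: int = 100) -> str:
    """Render one result dict as text, tolerating missing or None fields.

    Hypothetical helper: the key names mirror the Quick Start example only.
    """
    lines = [f"URL: {result.get('url')}"]
    analysis = result.get("content_analysis") or {}
    if analysis:
        topics = analysis.get("main_topics") or []
        summary = analysis.get("summary") or ""
        lines.append(f"Title: {analysis.get('title')}")
        lines.append(f"Topics: {', '.join(topics)}")
        # Append an ellipsis only when the summary was actually truncated.
        suffix = "..." if len(summary) > summary_chars else ""
        lines.append(f"Summary: {summary[:summary_chars]}{suffix}")
    return "\n".join(lines)


sample = {
    "url": "https://example.com",
    "content_analysis": {"title": "Example Domain", "main_topics": None, "summary": None},
}
print(format_result(sample))
```

Inside the Quick Start loop, `print(format_result(result))` could stand in for the individual print calls.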

Customizing the ContentAnalyzer

You can customize the ContentAnalyzer with different LLM configurations:

from opencrawl import OpenCrawl

# Initialize with custom model configuration
crawler = OpenCrawl(
    content_analyzer_config={
        "api_key": "your-api-key",
        "model": "openai/gpt-4o",  # Change the model
        "max_tokens": 32000,        # Adjust token limit
        "max_concurrent": 10,       # Increase concurrent requests
        "extra_config": {           # Additional LLM configuration
            "temperature": 0.2,
            "response_format": {"type": "json_object"}
        }
    }
)

# Or configure later
crawler = OpenCrawl()
crawler.setup_content_analyzer(
    api_key="your-api-key",
    model="anthropic/claude-3-opus-20240229",
    max_tokens=100000
)
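
Hard-coding "your-api-key" is fine for illustration, but the Quick Start reads the key from the environment instead. The sketch below builds a content_analyzer_config dictionary the same way. GROQ_API_KEY comes from the Quick Start; OPENCRAWL_MODEL and OPENCRAWL_MAX_TOKENS are hypothetical variable names chosen for this example, and the helper itself is not part of OpenCrawl.

```python
import os
from typing import Any, Dict, Mapping, Optional


def analyzer_config_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build a content_analyzer_config dict from environment variables.

    GROQ_API_KEY appears in the Quick Start; OPENCRAWL_MODEL and
    OPENCRAWL_MAX_TOKENS are hypothetical names used only in this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY")
    if not api_key:
        raise ValueError("GROQ_API_KEY is not set")
    config: Dict[str, Any] = {"api_key": api_key}
    # Include optional keys only when set, leaving everything else unspecified.
    if env.get("OPENCRAWL_MODEL"):
        config["model"] = env["OPENCRAWL_MODEL"]
    if env.get("OPENCRAWL_MAX_TOKENS"):
        config["max_tokens"] = int(env["OPENCRAWL_MAX_TOKENS"])
    return config


print(analyzer_config_from_env({"GROQ_API_KEY": "demo-key", "OPENCRAWL_MAX_TOKENS": "32000"}))
```

The returned dictionary could then be passed as `OpenCrawl(content_analyzer_config=analyzer_config_from_env())`.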

Database Configuration

OpenCrawl automatically creates the necessary tables in your PostgreSQL database. The default database configuration is included in the Docker Compose setup.

Advanced Usage

Custom Kafka Configuration

from opencrawl import OpenCrawl

crawler = OpenCrawl(
    kafka_config={
        "brokers": "kafka:9092",
        "topic": "custom_topic",
        "max_request_size": 20971520,  # 20 MiB (20 * 1024 * 1024 bytes)
    }
)
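
The max_request_size value above is a raw byte count, which is easy to mistype. As a small sketch, the number can be computed rather than written out; mib_to_bytes is a hypothetical helper, and the dictionary keys are copied from the example above.

```python
def mib_to_bytes(mebibytes: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mebibytes * 1024 * 1024


# Keys mirror the Kafka example above; only the size computation is new.
kafka_config = {
    "brokers": "kafka:9092",
    "topic": "custom_topic",
    "max_request_size": mib_to_bytes(20),
}
print(kafka_config["max_request_size"])  # 20971520
```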

Processing URLs with Custom Thread and User IDs

# Run inside an async function, with `crawler` created as in the Quick Start
results = await crawler.process_urls(
    urls=["https://example.com"],
    user_id="custom-user-id",
    thread_id="custom-thread-id",
    thread_name="My Research Project",
    content_type="both",  # Extract both HTML and Markdown
    parallel=True,        # Process URLs in parallel
    verbose=True          # Show detailed logs
)
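
When each crawl should land in its own thread, the keyword arguments can be assembled ahead of time. The sketch below generates a fresh thread ID per call. It assumes thread_id accepts any unique string, which the original text does not confirm; build_process_kwargs is a hypothetical helper, and the parameter names are taken from the call above.

```python
import uuid
from typing import Any, Dict, List


def build_process_kwargs(urls: List[str], thread_name: str, user_id: str) -> Dict[str, Any]:
    """Assemble keyword arguments for process_urls with a fresh thread ID.

    Assumption: thread_id accepts any unique string; a UUID4 is one option.
    """
    return {
        "urls": list(urls),
        "user_id": user_id,
        "thread_id": str(uuid.uuid4()),
        "thread_name": thread_name,
        "content_type": "both",
        "parallel": True,
        "verbose": False,
    }


kwargs = build_process_kwargs(["https://example.com"], "My Research Project", "custom-user-id")
print(kwargs["thread_name"], len(kwargs["thread_id"]))
```

Inside an async function, these could be passed along as `await crawler.process_urls(**kwargs)`.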

License

MIT
