Integrated website crawler and content analysis library

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language

Project description

OpenCrawl

OpenCrawl is a web crawling and content analysis library. It crawls websites, analyzes their content with LLMs, and stores the resulting structured data for later use.

Features

Website crawling with Pathik
Content analysis using LLMs (Groq, OpenAI, etc.)
Structured data extraction from web pages
PostgreSQL storage of crawled data
Kafka integration for scalable processing

Installation

# Clone the repository
git clone https://github.com/yourusername/opencrawl.git
cd opencrawl

# Install the package
pip install -e .

Docker Setup

OpenCrawl includes a Docker Compose configuration for easy setup of required services:

# Start PostgreSQL and Kafka
docker-compose up -d

# View the Kafka UI (optional). `open` is macOS-specific; on other systems, visit the URL in a browser.
open http://localhost:9000
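
After `docker-compose up -d` returns, the containers may still need a few seconds before they accept connections. The following is a minimal sketch of a readiness check using only the standard library. The function name `wait_for_port` is hypothetical and not part of OpenCrawl, and the host ports you pass to it are assumptions that should be checked against the project's Docker Compose file.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; it only checks that the port accepts
    connections, not that the service behind it is fully initialized.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError) on failure
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 9092, timeout=60)` would wait up to a minute for a broker, assuming Kafka is published on host port 9092.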

Troubleshooting Docker Setup

Docker volumes can cause errors such as:

PostgreSQL compatibility issues between versions
Zookeeper snapshot/log inconsistency errors

If you encounter these, use the included cleanup script:

# Run the cleanup script to remove incompatible volumes
./docker-cleanup.sh

# Then restart the services
docker-compose up -d

This removes all existing Docker volumes and creates fresh ones, so any data stored in them is lost. It is useful when upgrading or when volumes become corrupted.

Quick Start

import asyncio
import os
from opencrawl import OpenCrawl

async def main():
    # Initialize OpenCrawl with API key from environment variable
    crawler = OpenCrawl(
        content_analyzer_config={
            "api_key": os.getenv("GROQ_API_KEY")
        }
    )
    
    # Process a list of URLs
    results = await crawler.process_urls(
        urls=["https://example.com", "https://news.ycombinator.com"],
        verbose=True
    )
    
    # Print the results
    for result in results:
        print(f"URL: {result.get('url')}")
        content_analysis = result.get("content_analysis")
        if content_analysis:
            print(f"Title: {content_analysis.get('title')}")
            print(f"Topics: {', '.join(content_analysis.get('main_topics', []))}")
            # Fall back to an empty string so a missing summary does not raise TypeError
            print(f"Summary: {(content_analysis.get('summary') or '')[:100]}...")
        print("-" * 40)

if __name__ == "__main__":
    asyncio.run(main())
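
The printing loop in the Quick Start can be factored into a small function that tolerates missing fields. This is a sketch only: `summarize_result` is a hypothetical helper, and it assumes results are dictionaries with the `url`, `content_analysis`, `title`, `main_topics`, and `summary` keys used above.

```python
def summarize_result(result: dict, max_summary_chars: int = 100) -> str:
    """Render one result dict as a short, printable block of text."""
    lines = [f"URL: {result.get('url')}"]
    # Treat a missing or None analysis as "nothing to report"
    analysis = result.get("content_analysis") or {}
    if analysis:
        topics = analysis.get("main_topics") or []
        summary = analysis.get("summary") or ""
        # Only add an ellipsis when the summary was actually truncated
        if len(summary) > max_summary_chars:
            summary = summary[:max_summary_chars] + "..."
        lines.append(f"Title: {analysis.get('title')}")
        lines.append(f"Topics: {', '.join(topics)}")
        lines.append(f"Summary: {summary}")
    return "\n".join(lines)
```

Inside `main()`, the loop body could then become `print(summarize_result(result))`.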

Customizing the ContentAnalyzer

You can customize the ContentAnalyzer with different LLM configurations:

from opencrawl import OpenCrawl

# Initialize with custom model configuration
crawler = OpenCrawl(
    content_analyzer_config={
        "api_key": "your-api-key",
        "model": "openai/gpt-4o",  # Change the model
        "max_tokens": 32000,        # Adjust token limit
        "max_concurrent": 10,       # Increase concurrent requests
        "extra_config": {           # Additional LLM configuration
            "temperature": 0.2,
            "response_format": {"type": "json_object"}
        }
    }
)

# Or configure later
crawler = OpenCrawl()
crawler.setup_content_analyzer(
    api_key="your-api-key",
    model="anthropic/claude-3-opus-20240229",
    max_tokens=100000
)
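
The examples above hard-code `"your-api-key"`, whereas the Quick Start reads the key from `GROQ_API_KEY`. One way to keep secrets out of source files is to build the config dictionary from environment variables, as in the sketch below. The helper `analyzer_config_from_env` and the variable names `OPENCRAWL_MODEL` and `OPENCRAWL_MAX_TOKENS` are hypothetical and chosen for illustration. Only the dictionary keys (`api_key`, `model`, `max_tokens`) come from the examples above.

```python
import os


def analyzer_config_from_env(env=None) -> dict:
    """Build a content_analyzer_config dict from environment variables.

    OPENCRAWL_MODEL and OPENCRAWL_MAX_TOKENS are made-up names for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    config = {"api_key": api_key}
    # Optional overrides; omitted keys are left to the library's defaults
    if env.get("OPENCRAWL_MODEL"):
        config["model"] = env["OPENCRAWL_MODEL"]
    if env.get("OPENCRAWL_MAX_TOKENS"):
        config["max_tokens"] = int(env["OPENCRAWL_MAX_TOKENS"])
    return config
```

The result could be passed as `OpenCrawl(content_analyzer_config=analyzer_config_from_env())`.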

Database Configuration

OpenCrawl automatically creates the necessary tables in your PostgreSQL database. The default database configuration is included in the Docker Compose setup.

Advanced Usage

Custom Kafka Configuration

crawler = OpenCrawl(
    kafka_config={
        "brokers": "kafka:9092",
        "topic": "custom_topic",
        "max_request_size": 20971520,  # 20MB
    }
)
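
The value `20971520` is 20 × 1024 × 1024 bytes. To avoid typing the byte count by hand, a small helper can compute it, as sketched here. `make_kafka_config` is a hypothetical function, and the dictionary keys are the ones shown in the example above.

```python
def make_kafka_config(brokers: str, topic: str, max_request_mib: int = 20) -> dict:
    """Return a kafka_config dict with max_request_size expressed in bytes."""
    if max_request_mib <= 0:
        raise ValueError("max_request_mib must be positive")
    return {
        "brokers": brokers,
        "topic": topic,
        # Convert mebibytes to bytes: 20 MiB -> 20 * 1024 * 1024
        "max_request_size": max_request_mib * 1024 * 1024,
    }
```

It could be used as `OpenCrawl(kafka_config=make_kafka_config("kafka:9092", "custom_topic"))`.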

Processing URLs with Custom Thread and User IDs

# Note: `await` must run inside an async function, such as main() in the Quick Start
results = await crawler.process_urls(
    urls=["https://example.com"],
    user_id="custom-user-id",
    thread_id="custom-thread-id",
    thread_name="My Research Project",
    content_type="both",  # Extract both HTML and Markdown
    parallel=True,        # Process URLs in parallel
    verbose=True          # Show detailed logs
)
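
Because `process_urls` is awaited, the call above has to run inside a coroutine. The sketch below shows one way to wrap it so that a plain synchronous script can call it. The wrapper names `crawl_for_thread` and `crawl_for_thread_sync` are hypothetical. The sketch assumes only that the crawler object exposes an async `process_urls` method accepting the keyword arguments shown above.

```python
import asyncio


async def crawl_for_thread(crawler, urls, thread_id, thread_name):
    """Await crawler.process_urls with the keyword arguments shown above."""
    return await crawler.process_urls(
        urls=list(urls),
        thread_id=thread_id,
        thread_name=thread_name,
        content_type="both",
        parallel=True,
        verbose=False,
    )


def crawl_for_thread_sync(crawler, urls, thread_id, thread_name):
    """Blocking wrapper for scripts that are not already running an event loop."""
    return asyncio.run(crawl_for_thread(crawler, urls, thread_id, thread_name))
```

Code that already runs inside an event loop should await `crawl_for_thread` directly, since `asyncio.run` cannot be called from a running loop.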

License

MIT

Release history

0.1.7 (Apr 3, 2025)
0.1.2 (Apr 1, 2025)
0.1.0 (Apr 1, 2025), this version

Download files

Download the file for your platform.

Source Distribution

opencrawl-0.1.0.tar.gz (25.5 kB)

Uploaded Apr 1, 2025 Source

Built Distribution

opencrawl-0.1.0-py3-none-any.whl (28.5 kB)

Uploaded Apr 1, 2025 Python 3

File details

Details for the file opencrawl-0.1.0.tar.gz.

File metadata

Download URL: opencrawl-0.1.0.tar.gz
Upload date: Apr 1, 2025
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for opencrawl-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6d015dde25260ee8fcca955dc6b30ffbe983ec70a376f7318ee70f5af44014d2`
MD5	`e9f5465f74e36c5fb2444fd339cbd3b1`
BLAKE2b-256	`2a6636018abb85197708a1e28b529f5aa71ce23fc10ce5f9434fea1fff1ae56a`
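
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard `hashlib` module. The function names `sha256_of_file` and `matches_sha256` are hypothetical, and the file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected value (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_sha256("opencrawl-0.1.0.tar.gz", "<SHA256 from the table above>")` should return `True` for an intact download in the current directory.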

File details

Details for the file opencrawl-0.1.0-py3-none-any.whl.

File metadata

Download URL: opencrawl-0.1.0-py3-none-any.whl
Upload date: Apr 1, 2025
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for opencrawl-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5048c354077e765d551a03c567251b2042cdff72404309f52d26cc1d65d734f8`
MD5	`cadbb2109e43d47084920aa4377e4be2`
BLAKE2b-256	`794fe0aa1c59b7be9038b3136ef00a1fe2520ce3166569a0428a36bf31be84b4`
