Integrated website crawler and content analysis library
OpenCrawl
A powerful web crawling and content analysis library that allows you to crawl websites, analyze their content using LLMs, and store the structured data for further use.
Features
- Website crawling with Pathik
- Content analysis using LLMs (Groq, OpenAI, etc.)
- Structured data extraction from web pages
- PostgreSQL storage of crawled data
- Kafka integration for scalable processing
Installation
# Clone the repository
git clone https://github.com/yourusername/opencrawl.git
cd opencrawl
# Install the package
pip install -e .
Docker Setup
OpenCrawl includes a Docker Compose configuration for easy setup of required services:
# Start PostgreSQL and Kafka
docker-compose up -d
# View Kafka UI (optional)
open http://localhost:9000
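Containers can take a few seconds to accept connections after `docker-compose up -d` returns. The sketch below polls TCP ports until the services respond. It assumes PostgreSQL is published on `localhost:5432` and Kafka on `localhost:9092`; those mappings are not stated in this README, so check `docker-compose.yml` for the real ones. The helper names are illustrative and not part of OpenCrawl.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll until every (host, port) pair accepts connections, or give up."""
    for _ in range(attempts):
        if all(port_is_open(host, port) for host, port in services):
            return True
        time.sleep(delay)
    return False


# Assumed host ports; check docker-compose.yml for the real mappings.
SERVICES = [("localhost", 5432), ("localhost", 9092)]
```

Calling `wait_for_services(SERVICES)` before creating the crawler gives a clear failure early if a service never comes up. An open port only shows that the process is listening, not that it is fully initialized.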
Troubleshooting Docker Setup
Docker volumes can cause errors such as:
- PostgreSQL compatibility issues between versions
- Zookeeper snapshot/log inconsistency errors
If you hit one of these, use the included cleanup script:
# Run the cleanup script to remove incompatible volumes
./docker-cleanup.sh
# Then restart the services
docker-compose up -d
The script removes all existing Docker volumes and creates fresh ones, so any data stored in them is lost. Use it when upgrading service versions or when volumes become corrupted.
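The contents of `docker-cleanup.sh` are not shown in this README. As a rough illustration only, the sketch below assumes the reset amounts to `docker-compose down -v` (which stops the stack and removes its volumes) followed by a fresh start. The real script may do more or less than this. The function defaults to a dry run, so nothing is deleted unless you opt in.

```python
import subprocess

# Assumption: the reset amounts to stopping the stack and deleting its volumes.
RESET_COMMANDS = [
    ["docker-compose", "down", "-v"],  # stop containers, remove the stack's volumes
    ["docker-compose", "up", "-d"],    # recreate services with fresh volumes
]


def reset_stack(dry_run: bool = True) -> list[str]:
    """Run (or, in a dry run, only list) the reset commands."""
    lines = [" ".join(cmd) for cmd in RESET_COMMANDS]
    if not dry_run:
        for cmd in RESET_COMMANDS:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
    return lines
```

`reset_stack()` returns the command lines without running them; `reset_stack(dry_run=False)` executes them and destroys the stored data.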
Quick Start
import asyncio
import os

from opencrawl import OpenCrawl


async def main():
    # Initialize OpenCrawl with API key from environment variable
    crawler = OpenCrawl(
        content_analyzer_config={
            "api_key": os.getenv("GROQ_API_KEY")
        }
    )

    # Process a list of URLs
    results = await crawler.process_urls(
        urls=["https://example.com", "https://news.ycombinator.com"],
        verbose=True
    )

    # Print the results
    for result in results:
        print(f"URL: {result.get('url')}")
        content_analysis = result.get("content_analysis")
        if content_analysis:
            print(f"Title: {content_analysis.get('title')}")
            print(f"Topics: {', '.join(content_analysis.get('main_topics', []))}")
            # Fall back to an empty string so a missing summary does not raise
            summary = content_analysis.get("summary") or ""
            print(f"Summary: {summary[:100]}...")
        print("-" * 40)


if __name__ == "__main__":
    asyncio.run(main())
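The Quick Start reads each result as a dictionary with a `url` key and an optional `content_analysis` dictionary holding `title`, `main_topics`, and `summary`. The sketch below wraps that access in a small formatting helper that tolerates missing or empty fields. `format_result` is a hypothetical helper, not part of OpenCrawl, and the sample dictionary is hand-written in the shape the Quick Start implies rather than real crawler output.

```python
def format_result(result: dict, summary_chars: int = 100) -> str:
    """Format one result dict defensively; absent fields become placeholders."""
    lines = [f"URL: {result.get('url', '<unknown>')}"]
    analysis = result.get("content_analysis") or {}
    if analysis:
        topics = analysis.get("main_topics") or []
        summary = analysis.get("summary") or ""
        lines.append(f"Title: {analysis.get('title', '<untitled>')}")
        lines.append(f"Topics: {', '.join(topics)}")
        lines.append(f"Summary: {summary[:summary_chars]}")
    return "\n".join(lines)


# Hand-written sample in the shape the Quick Start reads; not real output.
sample = {
    "url": "https://example.com",
    "content_analysis": {
        "title": "Example Domain",
        "main_topics": ["documentation", "examples"],
        "summary": None,
    },
}
print(format_result(sample))
```

Keeping the formatting in one function makes it easy to reuse for logging or for writing results to a file.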
Customizing the ContentAnalyzer
You can customize the ContentAnalyzer with different LLM configurations:
from opencrawl import OpenCrawl

# Initialize with custom model configuration
crawler = OpenCrawl(
    content_analyzer_config={
        "api_key": "your-api-key",
        "model": "openai/gpt-4o",  # Change the model
        "max_tokens": 32000,  # Adjust token limit
        "max_concurrent": 10,  # Increase concurrent requests
        "extra_config": {  # Additional LLM configuration
            "temperature": 0.2,
            "response_format": {"type": "json_object"}
        }
    }
)

# Or configure later
crawler = OpenCrawl()
crawler.setup_content_analyzer(
    api_key="your-api-key",
    model="anthropic/claude-3-opus-20240229",
    max_tokens=100000
)
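To keep API keys out of source code, the configuration dictionary can be built from environment variables, as the Quick Start does with `GROQ_API_KEY`. The sketch below is one way to do that. `OPENCRAWL_MODEL` and `OPENCRAWL_MAX_TOKENS` are hypothetical variable names chosen for illustration; OpenCrawl itself is not known to read them.

```python
import os


def analyzer_config_from_env(env=None) -> dict:
    """Build a content_analyzer_config dict from environment variables."""
    env = os.environ if env is None else env
    config = {"api_key": env.get("GROQ_API_KEY")}
    # OPENCRAWL_MODEL and OPENCRAWL_MAX_TOKENS are hypothetical names.
    if env.get("OPENCRAWL_MODEL"):
        config["model"] = env["OPENCRAWL_MODEL"]
    if env.get("OPENCRAWL_MAX_TOKENS"):
        config["max_tokens"] = int(env["OPENCRAWL_MAX_TOKENS"])
    return config
```

The result can then be passed as `OpenCrawl(content_analyzer_config=analyzer_config_from_env())`. Unset variables are simply left out, so the library's own defaults apply.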
Database Configuration
OpenCrawl automatically creates the necessary tables in your PostgreSQL database. The default database configuration is included in the Docker Compose setup.
Advanced Usage
Custom Kafka Configuration
crawler = OpenCrawl(
    kafka_config={
        "brokers": "kafka:9092",
        "topic": "custom_topic",
        "max_request_size": 20971520,  # 20MB
    }
)
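The `max_request_size` value above is 20 × 1024 × 1024 bytes. A small helper makes that arithmetic explicit and avoids mistyping the literal. `mib` is a hypothetical helper name, not part of OpenCrawl.

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return n * 1024 * 1024


kafka_config = {
    "brokers": "kafka:9092",
    "topic": "custom_topic",
    "max_request_size": mib(20),  # 20971520 bytes
}
```

The broker must also accept messages of that size, so a larger client-side limit only helps if the Kafka broker is configured to match.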
Processing URLs with Custom Thread and User IDs
# Run this inside an async function, as in the Quick Start
results = await crawler.process_urls(
    urls=["https://example.com"],
    user_id="custom-user-id",
    thread_id="custom-thread-id",
    thread_name="My Research Project",
    content_type="both",  # Extract both HTML and Markdown
    parallel=True,  # Process URLs in parallel
    verbose=True  # Show detailed logs
)
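Because `process_urls` is awaited, the call has to live inside a coroutine that is started with `asyncio.run`. The sketch below shows that wiring with a stand-in class so it runs without Kafka, PostgreSQL, or an API key. `StubCrawler` is purely illustrative: it only mimics the keyword-argument call shape used above and returns made-up dictionaries, not the real result format.

```python
import asyncio


class StubCrawler:
    """Stand-in for OpenCrawl so the sketch runs without external services."""

    async def process_urls(self, urls, **options):
        # Echo each URL back with the options it was called with.
        return [{"url": url, "options": options} for url in urls]


async def crawl(crawler, urls):
    # The await must live inside a coroutine such as this one.
    return await crawler.process_urls(
        urls=urls,
        user_id="custom-user-id",
        thread_id="custom-thread-id",
        parallel=True,
    )


results = asyncio.run(crawl(StubCrawler(), ["https://example.com"]))
print(results[0]["url"])
```

Swapping `StubCrawler()` for a configured `OpenCrawl` instance should keep the same structure, assuming the keyword arguments match those shown in this section.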
License
MIT
File details
Details for the file opencrawl-0.1.0.tar.gz.
File metadata
- Download URL: opencrawl-0.1.0.tar.gz
- Upload date:
- Size: 25.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6d015dde25260ee8fcca955dc6b30ffbe983ec70a376f7318ee70f5af44014d2 |
| MD5 | e9f5465f74e36c5fb2444fd339cbd3b1 |
| BLAKE2b-256 | 2a6636018abb85197708a1e28b529f5aa71ce23fc10ce5f9434fea1fff1ae56a |
File details
Details for the file opencrawl-0.1.0-py3-none-any.whl.
File metadata
- Download URL: opencrawl-0.1.0-py3-none-any.whl
- Upload date:
- Size: 28.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5048c354077e765d551a03c567251b2042cdff72404309f52d26cc1d65d734f8 |
| MD5 | cadbb2109e43d47084920aa4377e4be2 |
| BLAKE2b-256 | 794fe0aa1c59b7be9038b3136ef00a1fe2520ce3166569a0428a36bf31be84b4 |