Opencrawl
High-performance web crawler and text generator with Transformers
This project was created to crawl data in a meaningful way using open-source LLMs. Most crawlers rely on proprietary models from OpenAI or Anthropic; I have rarely seen crawlers built solely on open-source models, and that gap is what led to this project. It is still in its very early stages, but over time I will keep maintaining it and adding features to make crawling with open-source LLMs more comprehensive.
Installation
With pip
pip install opencrawl
With uv
uv add opencrawl
TODO
- Write tests
- Create more extraction strategies
- Add more proxy strategies
- Captcha bypasses
- Better model support from vLLM
Features
Crawler Features
High-Performance Web Crawling
- Async Architecture: Built on aiohttp and uvloop for maximum performance
- Concurrent Requests: Configurable concurrency limits with semaphore-based control
- Smart Retry Logic: Automatic retries with exponential backoff for failed requests
- Connection Management: Efficient connection pooling and timeout control
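A minimal sketch of how these options might be tuned together. Only max_concurrent_requests and extraction_strategy are confirmed by the examples further down; the retry and timeout field names are assumptions, so check the source for the real ones:

from opencrawl import CrawlerConfig, ExtractionType

# max_concurrent_requests and extraction_strategy are taken from the
# examples in this README; the commented-out fields are assumed names
# for the retry/timeout features described above.
config = CrawlerConfig(
    max_concurrent_requests=10,     # semaphore-based concurrency cap
    extraction_strategy=ExtractionType.MARKDOWN,
    # max_retries=3,                # assumed: retries with exponential backoff
    # request_timeout=30.0,         # assumed: per-request timeout in seconds
)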
Proxy Support
- Proxy Rotation: Automatic proxy rotation from a pool of proxies
- Proxy Validation: Built-in proxy health checking against test endpoints
- Multiple Input Methods: Load proxies from file or comma-separated string
- Proxy Filtering: Automatic removal of invalid proxies
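As a rough sketch of what proxy setup could look like. The README only states that proxies can come from a file or a comma-separated string; the keyword name below is an assumption:

from opencrawl import CrawlerConfig

# "proxies" is an assumed field name; the exact keyword for file-based
# or string-based proxy input is not documented here.
config = CrawlerConfig(
    proxies="http://1.2.3.4:8080,http://5.6.7.8:3128",
)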
Flexible Configuration
- Custom Headers & Cookies: Set default and per-request headers/cookies
- SSL Control: Enable/disable SSL verification as needed
- Redirect Handling: Configurable redirect following with max redirect limits
- User Agent: Customizable user agent strings
- Request Timeouts: Fine-grained timeout control for each request
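A hedged sketch of crawler-wide defaults plus a per-request override. Aside from CrawlRequest's url (used in the examples below), every keyword name here is an assumption:

from opencrawl import CrawlerConfig, CrawlRequest

# All field names except CrawlRequest's url are assumptions.
config = CrawlerConfig(
    headers={"Accept-Language": "en"},   # assumed: default headers
    user_agent="opencrawl/0.1",          # assumed: custom user agent string
    verify_ssl=True,                     # assumed: SSL verification toggle
    max_redirects=5,                     # assumed: redirect limit
)

request = CrawlRequest(
    url="https://example.com",
    headers={"X-Debug": "1"},            # assumed: per-request header override
)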
Content Extraction
- Multiple Extraction Types:
- HTML: Raw HTML extraction with cleaning options
- Content: Clean text content extraction
- Markdown: Convert HTML to markdown format
- Smart Cleaning: Configurable removal of scripts, styles, navigation, headers, footers
- Metadata Extraction: Automatic extraction of page metadata (title, description, keywords)
- Link & Image Preservation: Optional extraction of links and image URLs
- Minimum Text Filtering: Filter out elements below a minimum text length
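For illustration, selecting a strategy and a few cleaning options might look like this. ExtractionType.MARKDOWN appears in the examples below; the cleaning flag names are assumptions:

from opencrawl import CrawlerConfig, ExtractionType

# ExtractionType.MARKDOWN is used in the examples below; the
# commented-out cleaning flags are assumed names matching the
# bullet list above.
config = CrawlerConfig(
    extraction_strategy=ExtractionType.MARKDOWN,
    # remove_scripts=True,      # assumed: strip script/style elements
    # preserve_links=True,      # assumed: keep link URLs in the output
    # min_text_length=20,       # assumed: drop elements with shorter text
)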
Model Features
vLLM-Powered Inference
- Flexible & Easy: Built on vLLM for easy model integration
- Multi-GPU Support: Automatic device mapping across multiple GPUs
- Cross-Platform: Works on Linux, macOS (MPS), and CPU
- Batch Generation: Efficient batch processing for multiple requests
Model Configuration
- Flexible Model Loading: Support for any HuggingFace model
- Data Type Options: Choose between auto, float16, bfloat16, and float32
- Custom Download: Specify custom cache directories for models
- Device Mapping: Automatic or manual device mapping for multi-GPU setups
- Trust Remote Code: Option to trust remote code for specialized models
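Putting the model options together, a sketch might look like this. The model, dtype, and device_map fields appear in the examples below; download_dir and trust_remote_code are assumed names:

from opencrawl import ModelConfig

model_config = ModelConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # any HuggingFace model id
    dtype="bfloat16",                      # auto, float16, bfloat16, or float32
    device_map="auto",                     # automatic multi-GPU placement
    # download_dir="~/models",             # assumed: custom cache directory
    # trust_remote_code=True,              # assumed: opt in for custom model code
)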
Advanced Generation Control
- Temperature & Sampling: Fine-tune creativity with temperature, top_p, and top_k
- Token Control: Set min/max tokens, stop sequences, and EOS handling
- Penalties: Apply repetition and length penalties for better generation quality
- Multiple Outputs: Generate multiple sequences and control output diversity
- Stopping Criteria: Custom stopping criteria with stop strings support
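A sketch of the generation knobs. The temperature, max_new_tokens, and do_sample fields come from the structured output example below; the remaining names are assumptions:

from opencrawl import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,
    max_new_tokens=256,
    do_sample=True,
    # top_p=0.9,                 # assumed: nucleus sampling cutoff
    # top_k=50,                  # assumed: top-k filtering
    # repetition_penalty=1.1,    # assumed: discourage repeated phrases
    # stop_strings=["\n\n"],     # assumed: custom stopping criteria
)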
Chat & Structured Outputs
- Chat Templates: Built-in support for chat-style interactions
- Structured Outputs: Extract structured data using Pydantic models
- JSON Validation: Automatic parsing and validation of structured responses
- Batch Chat: Efficient batch processing of multiple conversations
Examples
Basic Crawling
Simple web crawling with markdown extraction:
import asyncio

from opencrawl import AsyncCrawler, CrawlerConfig, CrawlRequest, ExtractionType


async def crawl_example():
    # Configure the crawler
    config = CrawlerConfig(
        max_concurrent_requests=5,
        extraction_strategy=ExtractionType.MARKDOWN,
    )

    # Create crawler and fetch content
    crawler = AsyncCrawler(config)
    await crawler.setup()

    response = await crawler.fetch(
        CrawlRequest(url="https://example.com")
    )
    print(response.extracted.content)

    await crawler.cleanup()


asyncio.run(crawl_example())
Crawling with LLM Analysis
Combine web crawling with open-source LLM analysis:
import asyncio

from opencrawl import Spider, ModelConfig, CrawlerConfig, CrawlRequest, ExtractionType


async def llm_crawl_example():
    # Initialize spider with crawler and model
    spider = Spider(
        crawl_config=CrawlerConfig(
            max_concurrent_requests=5,
            extraction_strategy=ExtractionType.MARKDOWN,
        ),
        model_config=ModelConfig(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            dtype="float16",
            device_map="auto",
        ),
        output_path="output.json",
    )

    # Define task and crawl
    task = "Summarize the main content of this webpage."
    results = await spider.crawl(
        requests=[
            CrawlRequest(url="https://example.com"),
            CrawlRequest(url="https://example.org"),
        ],
        task=task,
    )

    for result in results:
        print(f"{result.url}: {result.content}")


asyncio.run(llm_crawl_example())
Structured Output Extraction
Extract structured data using Pydantic models:
import asyncio

from pydantic import BaseModel

from opencrawl import Spider, ModelConfig, GenerationConfig, CrawlerConfig, CrawlRequest


class ArticleData(BaseModel):
    title: str
    summary: str
    main_topics: list[str]


async def structured_extraction():
    spider = Spider(
        crawl_config=CrawlerConfig(),
        model_config=ModelConfig(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            dtype="float16",
            gen_config=GenerationConfig(
                temperature=0.7,
                max_new_tokens=512,
                do_sample=True,
                structured_outputs=ArticleData,
            ),
        ),
    )

    results = await spider.crawl(
        requests=[CrawlRequest(url="https://example.com/article")],
        task="Extract the article title, summary, and main topics.",
    )
    print(results[0].content)


asyncio.run(structured_extraction())
Contributions
This project is in its very early stages, but any contribution is highly appreciated. Just open a PR and I will take a look; if it fits the project's vision, I will gladly merge it.
Disclaimer
This software is provided "as is", without warranty of any kind, express or implied. The developers of OpenCrawl are not responsible for any damages, legal issues, or consequences arising from the use or misuse of this tool. Users are solely responsible for ensuring their use complies with applicable laws, terms of service, and ethical guidelines.
License
This project is licensed under the Apache 2.0 license. Please have a look at the license text if you are unsure about what it permits and requires.