
High-performance web crawler and text generator with Transformers

Project description



OpenCrawl

This project was created to crawl data in a meaningful way using open-source LLMs. Most crawlers rely on proprietary models from OpenAI or Anthropic; I have rarely seen one built solely on open-source models, and that gap is what led to this project. It is still in its very early stages, but I will keep maintaining it over time and adding features to make crawling with open-source LLMs more comprehensive.

Installation

With pip

pip install opencrawl

With uv

uv add opencrawl

TODO

  • Write tests
  • Create more extraction strategies
  • Add more proxy strategies
  • Add captcha bypasses
  • Improve model support via VLLM

Features

Crawler Features

High-Performance Web Crawling
  • Async Architecture: Built on aiohttp and uvloop for maximum performance
  • Concurrent Requests: Configurable concurrency limits with semaphore-based control
  • Smart Retry Logic: Automatic retries with exponential backoff for failed requests (sketched below)
  • Connection Management: Efficient connection pooling and timeout control
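
The concurrency and retry bullets above follow a standard asyncio pattern. A minimal sketch of that pattern with aiohttp (illustrative only, not OpenCrawl's actual internals):

import asyncio
import aiohttp

# Generic pattern sketch -- not OpenCrawl's internal code.
async def fetch_with_retries(session, semaphore, url, max_retries=3):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        for attempt in range(max_retries):
            try:
                async with session.get(
                    url, timeout=aiohttp.ClientTimeout(total=30)
                ) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
                await asyncio.sleep(2 ** attempt)

async def main():
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent requests
    urls = ["https://example.com", "https://example.org"]
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(
            *(fetch_with_retries(session, semaphore, url) for url in urls)
        )
    print(len(pages), "pages fetched")

asyncio.run(main())
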
Proxy Support
  • Proxy Rotation: Automatic proxy rotation from a pool of proxies
  • Proxy Validation: Built-in proxy health checking against test endpoints (see the sketch below)
  • Multiple Input Methods: Load proxies from file or comma-separated string
  • Proxy Filtering: Automatic removal of invalid proxies
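
A minimal sketch of the validate-then-rotate idea (again a generic pattern, not OpenCrawl's implementation; the test endpoint is just an example):

import asyncio
import itertools
import aiohttp

# Generic pattern sketch -- not OpenCrawl's internal code.
async def validate_proxy(proxy: str, test_url: str = "https://httpbin.org/ip") -> bool:
    # A proxy is kept only if the test endpoint responds through it.
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                test_url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=10)
            ) as resp:
                return resp.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False

async def build_rotation(proxies: list[str]):
    # Validate in parallel, drop the failures, rotate through the rest.
    results = await asyncio.gather(*(validate_proxy(p) for p in proxies))
    healthy = [p for p, ok in zip(proxies, results) if ok]
    return itertools.cycle(healthy)
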
Flexible Configuration
  • Custom Headers & Cookies: Set default and per-request headers/cookies (illustrated below)
  • SSL Control: Enable/disable SSL verification as needed
  • Redirect Handling: Configurable redirect following with max redirect limits
  • User Agent: Customizable user agent strings
  • Request Timeouts: Fine-grained timeout control for each request
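
For illustration, per-request configuration might look like the following. Every field name below except max_concurrent_requests (which appears in the examples further down) is an assumption inferred from this list, so verify against the actual CrawlerConfig/CrawlRequest signatures:

from opencrawl import CrawlerConfig, CrawlRequest

# All field names except max_concurrent_requests are assumptions
# inferred from the feature list -- verify against the real API.
config = CrawlerConfig(
    max_concurrent_requests=5,        # confirmed by the examples below
    user_agent="MyCrawler/1.0",       # assumed field name
    verify_ssl=False,                 # assumed field name
    max_redirects=5,                  # assumed field name
)

request = CrawlRequest(
    url="https://example.com",
    headers={"Accept-Language": "en-US"},  # assumed field name
    cookies={"session": "abc123"},         # assumed field name
    timeout=10.0,                          # assumed field name
)
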
Content Extraction
  • Multiple Extraction Types (see the sketch below):
    • HTML: Raw HTML extraction with cleaning options
    • Content: Clean text content extraction
    • Markdown: Convert HTML to markdown format
  • Smart Cleaning: Configurable removal of scripts, styles, navigation, headers, footers
  • Metadata Extraction: Automatic extraction of page metadata (title, description, keywords)
  • Link & Image Preservation: Optional extraction of links and image URLs
  • Minimum Text Filtering: Filter out elements below a minimum text length
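
In code, the extraction mode is selected via ExtractionType. Only ExtractionType.MARKDOWN is confirmed by the examples below; HTML and CONTENT are assumed member names based on the list above:

from opencrawl import CrawlerConfig, ExtractionType

# ExtractionType.MARKDOWN is confirmed by the examples further down;
# HTML and CONTENT are assumed member names based on the feature list.
markdown_config = CrawlerConfig(extraction_strategy=ExtractionType.MARKDOWN)
html_config = CrawlerConfig(extraction_strategy=ExtractionType.HTML)
text_config = CrawlerConfig(extraction_strategy=ExtractionType.CONTENT)
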

Model Features

VLLM-Powered Inference
  • Flexible & Easy: Built on VLLM for easy model integration
  • Multi-GPU Support: Automatic device mapping across multiple GPUs
  • Cross-Platform: Works on Linux, macOS (MPS), and CPU
  • Batch Generation: Efficient batch processing for multiple requests
Model Configuration
  • Flexible Model Loading: Support for any HuggingFace model (see the sketch below)
  • Data Type Options: Choose between auto, float16, bfloat16, and float32
  • Custom Download: Specify custom cache directories for models
  • Device Mapping: Automatic or manual device mapping for multi-GPU setups
  • Trust Remote Code: Option to trust remote code for specialized models
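
A hedged ModelConfig sketch: model, dtype, and device_map appear in the examples below, while download_dir and trust_remote_code are assumed field names inferred from the bullets above:

from opencrawl import ModelConfig

model_config = ModelConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any HuggingFace model ID
    dtype="bfloat16",                    # one of: auto, float16, bfloat16, float32
    device_map="auto",                   # automatic multi-GPU placement
    download_dir="/models/cache",        # assumed field name
    trust_remote_code=True,              # assumed field name
)
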
Advanced Generation Control
  • Temperature & Sampling: Fine-tune creativity with temperature, top_p, and top_k (see the sketch below)
  • Token Control: Set min/max tokens, stop sequences, and EOS handling
  • Penalties: Apply repetition and length penalties for better generation quality
  • Multiple Outputs: Generate multiple sequences and control output diversity
  • Stopping Criteria: Custom stopping criteria with stop strings support
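
A hedged GenerationConfig sketch: temperature, max_new_tokens, and do_sample appear in the structured-output example below, while the remaining parameter names are assumptions based on the bullets above:

from opencrawl import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,               # confirmed by the examples below
    max_new_tokens=512,            # confirmed by the examples below
    do_sample=True,                # confirmed by the examples below
    top_p=0.9,                     # assumed parameter name
    top_k=50,                      # assumed parameter name
    repetition_penalty=1.1,        # assumed parameter name
    stop_strings=["</answer>"],    # assumed parameter name
)
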
Chat & Structured Outputs
  • Chat Templates: Built-in support for chat-style interactions
  • Structured Outputs: Extract structured data using Pydantic models
  • JSON Validation: Automatic parsing and validation of structured responses
  • Batch Chat: Efficient batch processing of multiple conversations

Examples

Basic Crawling

Simple web crawling with markdown extraction:

import asyncio
from opencrawl import AsyncCrawler, CrawlerConfig, CrawlRequest, ExtractionType

async def crawl_example():
    # Configure the crawler
    config = CrawlerConfig(
        max_concurrent_requests=5,
        extraction_strategy=ExtractionType.MARKDOWN,
    )
    
    # Create crawler and fetch content
    crawler = AsyncCrawler(config)
    await crawler.setup()
    
    response = await crawler.fetch(
        CrawlRequest(url="https://example.com")
    )
    
    print(response.extracted.content)
    await crawler.cleanup()

asyncio.run(crawl_example())

Crawling with LLM Analysis

Combine web crawling with open-source LLM analysis:

import asyncio
from opencrawl import Spider, ModelConfig, CrawlerConfig, CrawlRequest, ExtractionType

async def llm_crawl_example():
    # Initialize spider with crawler and model
    spider = Spider(
        crawl_config=CrawlerConfig(
            max_concurrent_requests=5,
            extraction_strategy=ExtractionType.MARKDOWN,
        ),
        model_config=ModelConfig(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            dtype="float16",
            device_map="auto",
        ),
        output_path="output.json"
    )
    
    # Define task and crawl
    task = "Summarize the main content of this webpage."
    results = await spider.crawl(
        requests=[
            CrawlRequest(url="https://example.com"),
            CrawlRequest(url="https://example.org"),
        ],
        task=task,
    )
    
    for result in results:
        print(f"{result.url}: {result.content}")

asyncio.run(llm_crawl_example())

Structured Output Extraction

Extract structured data using Pydantic models:

import asyncio
from pydantic import BaseModel
from opencrawl import Spider, ModelConfig, GenerationConfig, CrawlerConfig, CrawlRequest

class ArticleData(BaseModel):
    title: str
    summary: str
    main_topics: list[str]

async def structured_extraction():
    spider = Spider(
        crawl_config=CrawlerConfig(),
        model_config=ModelConfig(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            dtype="float16",
            gen_config=GenerationConfig(
                temperature=0.7,
                max_new_tokens=512,
                do_sample=True,
                structured_outputs=ArticleData,
            ),
        ),
    )
    
    results = await spider.crawl(
        requests=[CrawlRequest(url="https://example.com/article")],
        task="Extract the article title, summary, and main topics.",
    )
    
    print(results[0].content)

asyncio.run(structured_extraction())

Contributions

This project is in its very early stages, but any contribution is highly appreciated. Just open a PR and I will have a look at it; if it fits the project's vision, I will gladly merge it in.

Disclaimer

This software is provided "as is", without warranty of any kind, express or implied. The developers of OpenCrawl are not responsible for any damages, legal issues, or consequences arising from the use or misuse of this tool. Users are solely responsible for ensuring their use complies with applicable laws, terms of service, and ethical guidelines.

License

This project is licensed under the Apache 2.0 license. Please have a look at the license if you're not sure what it requires.
