
High-performance web crawler and text generator with Transformers

Project description



OpenCrawl

This project was created to crawl data in a meaningful way using open-source LLMs. Most crawlers rely on proprietary models from OpenAI or Anthropic; I have rarely seen one built solely on open-source models, and that gap is what led to this project. It is still in its very early stages, but I will keep maintaining it over time and adding features to make crawling with open-source LLMs more comprehensive.

Installation

With pip

pip install opencrawl

With uv

uv add opencrawl

TODO

  • Write tests
  • Create more extraction strategies
  • Add more proxy strategies
  • Add captcha bypasses
  • Improve model support via VLLM

Features

Crawler Features

High-Performance Web Crawling
  • Async Architecture: Built on aiohttp and uvloop for maximum performance
  • Concurrent Requests: Configurable concurrency limits with semaphore-based control
  • Smart Retry Logic: Automatic retries with exponential backoff for failed requests (sketched below)
  • Connection Management: Efficient connection pooling and timeout control
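
The concurrency and retry bullets above follow a standard asyncio pattern. A minimal sketch of that pattern with aiohttp (illustrative only, not OpenCrawl's actual internals):

import asyncio
import aiohttp

# Generic pattern sketch -- not OpenCrawl's internal code.
async def fetch_with_retries(session, semaphore, url, max_retries=3):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        for attempt in range(max_retries):
            try:
                async with session.get(
                    url, timeout=aiohttp.ClientTimeout(total=30)
                ) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
                await asyncio.sleep(2 ** attempt)

async def main():
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent requests
    urls = ["https://example.com", "https://example.org"]
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(
            *(fetch_with_retries(session, semaphore, url) for url in urls)
        )
    print(len(pages), "pages fetched")

asyncio.run(main())
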
Proxy Support
  • Proxy Rotation: Automatic proxy rotation from a pool of proxies
  • Proxy Validation: Built-in proxy health checking against test endpoints (see the sketch below)
  • Multiple Input Methods: Load proxies from file or comma-separated string
  • Proxy Filtering: Automatic removal of invalid proxies
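
A minimal sketch of the validate-then-rotate idea (again a generic pattern, not OpenCrawl's implementation; the test endpoint is just an example):

import asyncio
import itertools
import aiohttp

# Generic pattern sketch -- not OpenCrawl's internal code.
async def validate_proxy(proxy: str, test_url: str = "https://httpbin.org/ip") -> bool:
    # A proxy is kept only if the test endpoint responds through it.
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                test_url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=10)
            ) as resp:
                return resp.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False

async def build_rotation(proxies: list[str]):
    # Validate in parallel, drop the failures, rotate through the rest.
    results = await asyncio.gather(*(validate_proxy(p) for p in proxies))
    healthy = [p for p, ok in zip(proxies, results) if ok]
    return itertools.cycle(healthy)
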
Flexible Configuration
  • Custom Headers & Cookies: Set default and per-request headers/cookies (illustrated below)
  • SSL Control: Enable/disable SSL verification as needed
  • Redirect Handling: Configurable redirect following with max redirect limits
  • User Agent: Customizable user agent strings
  • Request Timeouts: Fine-grained timeout control for each request
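
For illustration, per-request configuration might look like the following. Every field name below except max_concurrent_requests (which appears in the examples further down) is an assumption inferred from this list, so verify against the actual CrawlerConfig/CrawlRequest signatures:

from opencrawl import CrawlerConfig, CrawlRequest

# All field names except max_concurrent_requests are assumptions
# inferred from the feature list -- verify against the real API.
config = CrawlerConfig(
    max_concurrent_requests=5,        # confirmed by the examples below
    user_agent="MyCrawler/1.0",       # assumed field name
    verify_ssl=False,                 # assumed field name
    max_redirects=5,                  # assumed field name
)

request = CrawlRequest(
    url="https://example.com",
    headers={"Accept-Language": "en-US"},  # assumed field name
    cookies={"session": "abc123"},         # assumed field name
    timeout=10.0,                          # assumed field name
)
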
Content Extraction
  • Multiple Extraction Types (see the sketch below):
    • HTML: Raw HTML extraction with cleaning options
    • Content: Clean text content extraction
    • Markdown: Convert HTML to markdown format
  • Smart Cleaning: Configurable removal of scripts, styles, navigation, headers, footers
  • Metadata Extraction: Automatic extraction of page metadata (title, description, keywords)
  • Link & Image Preservation: Optional extraction of links and image URLs
  • Minimum Text Filtering: Filter out elements below a minimum text length
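
In code, the extraction mode is selected via ExtractionType. Only ExtractionType.MARKDOWN is confirmed by the examples below; HTML and CONTENT are assumed member names based on the list above:

from opencrawl import CrawlerConfig, ExtractionType

# ExtractionType.MARKDOWN is confirmed by the examples further down;
# HTML and CONTENT are assumed member names based on the feature list.
markdown_config = CrawlerConfig(extraction_strategy=ExtractionType.MARKDOWN)
html_config = CrawlerConfig(extraction_strategy=ExtractionType.HTML)
text_config = CrawlerConfig(extraction_strategy=ExtractionType.CONTENT)
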

Model Features

VLLM-Powered Inference
  • Flexible & Easy: Built on VLLM for easy model integration
  • Multi-GPU Support: Automatic device mapping across multiple GPUs
  • Cross-Platform: Works on Linux, macOS (MPS), and CPU
  • Batch Generation: Efficient batch processing for multiple requests
Model Configuration
  • Flexible Model Loading: Support for any HuggingFace model (see the sketch below)
  • Data Type Options: Choose between auto, float16, bfloat16, and float32
  • Custom Download: Specify custom cache directories for models
  • Device Mapping: Automatic or manual device mapping for multi-GPU setups
  • Trust Remote Code: Option to trust remote code for specialized models
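
A hedged ModelConfig sketch: model, dtype, and device_map appear in the examples below, while download_dir and trust_remote_code are assumed field names inferred from the bullets above:

from opencrawl import ModelConfig

model_config = ModelConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any HuggingFace model ID
    dtype="bfloat16",                    # one of: auto, float16, bfloat16, float32
    device_map="auto",                   # automatic multi-GPU placement
    download_dir="/models/cache",        # assumed field name
    trust_remote_code=True,              # assumed field name
)
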
Advanced Generation Control
  • Temperature & Sampling: Fine-tune creativity with temperature, top_p, and top_k (see the sketch below)
  • Token Control: Set min/max tokens, stop sequences, and EOS handling
  • Penalties: Apply repetition and length penalties for better generation quality
  • Multiple Outputs: Generate multiple sequences and control output diversity
  • Stopping Criteria: Custom stopping criteria with stop strings support
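
A hedged GenerationConfig sketch: temperature, max_new_tokens, and do_sample appear in the structured-output example below, while the remaining parameter names are assumptions based on the bullets above:

from opencrawl import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,               # confirmed by the examples below
    max_new_tokens=512,            # confirmed by the examples below
    do_sample=True,                # confirmed by the examples below
    top_p=0.9,                     # assumed parameter name
    top_k=50,                      # assumed parameter name
    repetition_penalty=1.1,        # assumed parameter name
    stop_strings=["</answer>"],    # assumed parameter name
)
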
Chat & Structured Outputs
  • Chat Templates: Built-in support for chat-style interactions
  • Structured Outputs: Extract structured data using Pydantic models
  • JSON Validation: Automatic parsing and validation of structured responses
  • Batch Chat: Efficient batch processing of multiple conversations

Examples

Basic Crawling

Simple web crawling with markdown extraction:

import asyncio
from opencrawl import AsyncCrawler, CrawlerConfig, CrawlRequest, ExtractionType

async def crawl_example():
    # Configure the crawler
    config = CrawlerConfig(
        max_concurrent_requests=5,
        extraction_strategy=ExtractionType.MARKDOWN,
    )
    
    # Create crawler and fetch content
    crawler = AsyncCrawler(config)
    await crawler.setup()
    
    response = await crawler.fetch(
        CrawlRequest(url="https://example.com")
    )
    
    print(response.extracted.content)
    await crawler.cleanup()

asyncio.run(crawl_example())

Crawling with LLM Analysis

Combine web crawling with open-source LLM analysis:

import asyncio
from opencrawl import Spider, ModelConfig, CrawlerConfig, CrawlRequest, ExtractionType

async def llm_crawl_example():
    # Initialize spider with crawler and model
    spider = Spider(
        crawl_config=CrawlerConfig(
            max_concurrent_requests=5,
            extraction_strategy=ExtractionType.MARKDOWN,
        ),
        model_config=ModelConfig(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            dtype="float16",
            device_map="auto",
        ),
        output_path="output.json"
    )
    
    # Define task and crawl
    task = "Summarize the main content of this webpage."
    results = await spider.crawl(
        requests=[
            CrawlRequest(url="https://example.com"),
            CrawlRequest(url="https://example.org"),
        ],
        task=task,
    )
    
    for result in results:
        print(f"{result.url}: {result.content}")

asyncio.run(llm_crawl_example())

Structured Output Extraction

Extract structured data using Pydantic models:

import asyncio
from pydantic import BaseModel
from opencrawl import Spider, ModelConfig, GenerationConfig, CrawlerConfig, CrawlRequest

class ArticleData(BaseModel):
    title: str
    summary: str
    main_topics: list[str]

async def structured_extraction():
    spider = Spider(
        crawl_config=CrawlerConfig(),
        model_config=ModelConfig(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            dtype="float16",
            gen_config=GenerationConfig(
                temperature=0.7,
                max_new_tokens=512,
                do_sample=True,
                structured_outputs=ArticleData,
            ),
        ),
    )
    
    results = await spider.crawl(
        requests=[CrawlRequest(url="https://example.com/article")],
        task="Extract the article title, summary, and main topics.",
    )
    
    print(results[0].content)

asyncio.run(structured_extraction())

Contributions

This project is in its very early stages, but any contribution is highly appreciated. Just open a PR and I will have a look at it; if it fits the project's vision, I will gladly merge it in.

Disclaimer

This software is provided "as is", without warranty of any kind, express or implied. The developers of OpenCrawl are not responsible for any damages, legal issues, or consequences arising from the use or misuse of this tool. Users are solely responsible for ensuring their use complies with applicable laws, terms of service, and ethical guidelines.

License

This project is licensed under the Apache 2.0 license. Please have a look at the license if you're not sure what it requires.
