Skip to main content

Pure Python async web crawling with BM25 + semantic relevance ranking.

Project description

CRL — Crawl Relevance Layers

Pure Python async web crawling + BM25 & semantic relevance ranking.

Python PyPI License: MIT Tests GitHub


What is CRL?

CRL is a pure Python library that combines async web crawling with intelligent relevance ranking. Give it URLs and a query — it crawls, parses, deduplicates, and returns pages ranked by how relevant they are to your query.

Key features:

  • Async BFS deep crawling with pagination following
  • BM25 keyword scoring + sentence-transformers semantic scoring
  • DuckDuckGo search integration (no API key needed)
  • Sitemap.xml auto-discovery
  • JS rendering via Playwright (React, Next.js, Vue, Angular)
  • Structured data extraction (Open Graph, JSON-LD, tables, lists)
  • Per-domain rate limiting, proxy rotation, robots.txt support
  • LRU + disk tiered cache
  • Full CLI with 5 subcommands
  • Output: JSON, CSV, Markdown, SQLite, plain text

Install

pip install crawl-relevance-layers

With semantic ranking (recommended):

pip install crawl-relevance-layers[semantic]

With JS rendering:

pip install crawl-relevance-layers playwright
playwright install chromium

Quick Start

from crl import crawl

results = crawl(
    urls=["https://example.com", "https://python.org"],
    query="python async programming",
    top_k=5,
    mode="both",   # "keyword", "semantic", or "both"
)

for r in results:
    print(r["url"], r["relevance_score"])

Core API

crawl() — fetch + rank

from crl import crawl

results = crawl(
    urls=["https://example.com"],
    query="your query",
    top_k=10,
    mode="both",              # "keyword" | "semantic" | "both"
    semantic_weight=0.5,      # 0.0 = pure keyword, 1.0 = pure semantic
    timeout=10,
    max_connections=20,
    retries=3,
    rate_limit=5.0,           # max requests/sec
    proxies=["http://proxy1:8080"],
    cache=None,               # TieredCache instance
    respect_robots=False,
    min_text_length=100,
    model_name="all-MiniLM-L6-v2",
)

Each result dict contains:

{
    "url": "https://...",
    "title": "Page Title",
    "text": "extracted text...",
    "language": "en",
    "relevance_score": 0.87,
    "keyword_score": 0.91,
    "semantic_score": 0.83,
    "links": ["https://..."],
    "meta": {"description": "..."},
    "structured": {
        "open_graph": {"title": "...", "image": "..."},
        "twitter_card": {"card": "summary"},
        "json_ld": [{"@type": "Article", ...}],
        "tables": [[{"col": "val"}]],
        "lists": [["item1", "item2"]],
        "canonical": "https://...",
        "favicon": "/favicon.ico",
    }
}

acrawl() — async version

import asyncio
from crl import acrawl

results = asyncio.run(acrawl(urls=[...], query="..."))

astream() — streaming, results as they arrive

import asyncio
from crl import astream

async def main():
    async for result in astream(urls=[...], query="python"):
        print(result["url"], result["relevance_score"])

asyncio.run(main())

Deep Crawl

Follow links recursively, auto-paginate, deduplicate content:

from crl import deep_search

results = deep_search(
    urls=["https://docs.python.org"],
    query="asyncio event loop",
    depth=2,                    # follow links 2 levels deep
    max_pages=200,
    max_pages_per_domain=50,
    follow_external=False,
    paginate=True,              # auto-follow next page links
    max_pagination_pages=5,
    similarity_threshold=0.85,  # content dedup threshold
    domain_rate_limits={"docs.python.org": 2.0},  # 2 req/sec for this domain
    use_mmap=True,              # memory-mapped storage for large crawls
    top_k=20,
)

Async version:

from crl import adeep_search
results = await adeep_search(urls=[...], query="...")

Search + Crawl (No URLs needed)

Search DuckDuckGo, crawl results, rank by relevance — all in one call:

from crl import search_and_crawl

results = search_and_crawl(
    query="python async web scraping",
    max_results=10,     # fetch 10 URLs from DDG
    top_k=5,
    timelimit="w",      # last week only: "d", "w", "m", "y"
    region="us-en",
)

Just get URLs from DDG:

from crl import search_urls, search_news_urls

urls = search_urls("python asyncio tutorial", max_results=20)
news = search_news_urls("AI news", max_results=10, timelimit="d")

Sitemap Discovery

from crl import fetch_sitemap_urls_sync

# Auto-discovers sitemap.xml, follows sitemap indexes, handles gzip
urls = fetch_sitemap_urls_sync("https://bbc.com", max_urls=1000)
print(f"Found {len(urls)} URLs")

# Then crawl them
results = deep_search(urls=urls[:50], query="technology news")

Async version:

from crl import fetch_sitemap_urls
urls = await fetch_sitemap_urls("https://bbc.com")

JS Rendering (React / Next.js / Vue / Angular)

For sites that require JavaScript to render content:

from crl.js_renderer import js_crawl_sync, is_available

if is_available():  # checks if playwright is installed
    results = js_crawl_sync(
        urls=["https://react-app.com"],
        query="your query",
        wait_until="networkidle",   # wait for JS to finish
        wait_for_selector="#content",  # optional: wait for element
        extra_wait_ms=500,          # extra wait for lazy JS
        block_resources=True,       # block images/fonts (faster)
        max_concurrent=3,
    )

Install Playwright first:

pip install playwright
playwright install chromium

Structured Data Extraction

Every crawled page automatically includes structured data. Access it directly:

results = crawl(urls=["https://example.com"], query="test")

for r in results:
    s = r["structured"]

    # Open Graph
    print(s["open_graph"].get("title"))
    print(s["open_graph"].get("image"))

    # JSON-LD / Schema.org
    for item in s["json_ld"]:
        print(item.get("@type"), item.get("name"))

    # Tables as list of dicts
    for table in s["tables"]:
        for row in table:
            print(row)

    # Lists
    for lst in s["lists"]:
        print(lst)

    print(s["canonical"])
    print(s["favicon"])

Extract from raw HTML directly:

from crl import extract_structured

data = extract_structured("<html>...</html>", url="https://example.com")

Cache

from crl import TieredCache, crawl

# L1: memory LRU, L2: disk shelve
cache = TieredCache(
    memory_size=500,        # max 500 entries in memory
    memory_ttl=3600,        # 1 hour TTL
    disk_path=".crl_cache", # disk cache path
    disk_ttl=86400,         # 1 day TTL
)

# First call fetches, subsequent calls hit cache
results = crawl(urls=[...], query="...", cache=cache)
results = crawl(urls=[...], query="...", cache=cache)  # instant

Per-Domain Rate Limiting

from crl import DomainRateLimiter, deep_search

results = deep_search(
    urls=["https://news.ycombinator.com"],
    query="python",
    rate_limit=10.0,                          # global: 10 req/sec
    domain_rate_limits={
        "news.ycombinator.com": 1.0,          # 1 req/sec for HN
        "github.com": 2.0,                    # 2 req/sec for GitHub
    },
)

Standalone:

from crl import DomainRateLimiter

limiter = DomainRateLimiter(default_rps=5.0, domain_rps={"slow.com": 1.0})
await limiter.wait("https://slow.com/page")  # waits if needed

Output Formats

from crl import to_json, to_text, to_csv, to_markdown, to_sqlite, save

results = crawl(urls=[...], query="...")

# Print
print(to_json(results))
print(to_text(results))
print(to_csv(results))
print(to_markdown(results))

# Save — format inferred from extension
save(results, "output.json")      # JSON
save(results, "output.csv")       # CSV
save(results, "output.md")        # Markdown
save(results, "output.db")        # SQLite database
save(results, "output.txt")       # Plain text

# SQLite — query with standard sqlite3
import sqlite3
conn = sqlite3.connect("output.db")
rows = conn.execute(
    "SELECT url, title, relevance_score FROM pages ORDER BY relevance_score DESC"
).fetchall()

Progress Reporting

from crl import ProgressReporter, crawl

progress = ProgressReporter()
results = crawl(urls=[...], query="...", progress=progress)
# Prints live progress bar to stderr:
# [CRL] [████████████░░░░░░░░] 6/10 | cached=2 err=0 | ETA 4s

Robots.txt

results = crawl(
    urls=[...],
    query="...",
    respect_robots=True,   # fetch and respect robots.txt per domain
)

Proxy Rotation

results = crawl(
    urls=[...],
    query="...",
    proxies=[
        "http://proxy1:8080",
        "http://user:pass@proxy2:8080",
        "socks5://proxy3:1080",
    ],
)

CLI

CRL ships with a full command-line interface:

crl crawl — fetch and rank URLs

crl crawl https://example.com https://python.org \
    --query "python async" \
    --top-k 5 \
    --mode both \
    --out results.json

# Save as Markdown
crl crawl https://example.com --query "python" --out results.md

# Save as SQLite
crl crawl https://example.com --query "python" --out results.db

crl deep — deep crawl with link following

crl deep https://docs.python.org \
    --query "asyncio" \
    --depth 2 \
    --max-pages 100 \
    --domain-rps docs.python.org:2.0 \
    --use-mmap \
    --out results.json

crl search — search DDG then crawl

crl search "python async web scraping" \
    --max-results 10 \
    --top-k 5 \
    --out results.md

# News search
crl search "AI news" --news --timelimit d --top-k 10

crl js — JS rendering with Playwright

crl js https://react-app.com \
    --query "your query" \
    --wait networkidle \
    --wait-for "#main-content" \
    --extra-wait 500

crl sitemap — discover URLs from sitemap.xml

crl sitemap https://bbc.com --max-urls 500
crl sitemap https://bbc.com --out urls.txt

Common options (all commands)

--mode          keyword | semantic | both (default: both)
--top-k         return top N results
--timeout       per-request timeout in seconds (default: 10)
--retries       retry attempts per URL (default: 3)
--rate-limit    max requests/sec
--proxy         proxy URL (repeat for rotation)
--cache         enable disk+memory cache
--robots        respect robots.txt
--min-text      skip pages shorter than N chars
--fmt           json | text | csv | markdown | sqlite
--out           save to file
--no-progress   disable progress bar

Architecture

crl/
├── __init__.py       # Public API: crawl, acrawl, astream, deep_search, ...
├── fetcher.py        # Async HTTP: retry, backoff, cache, robots, proxy, content-type filter
├── parser.py         # HTML parsing: lxml/BS4, text extraction, link resolution
├── extractor.py      # Structured data: Open Graph, JSON-LD, tables, lists
├── relevance.py      # BM25 + semantic scoring, minmax normalization
├── crawler.py        # Async BFS deep crawler with pagination
├── deduplicator.py   # Shingle hash + Jaccard content dedup
├── paginator.py      # Auto pagination detection (query params, path, next-link)
├── sitemap.py        # Sitemap.xml parser, index recursion, gzip support
├── search.py         # DuckDuckGo integration (free, no API key)
├── js_renderer.py    # Playwright JS rendering
├── cache.py          # MemoryCache (LRU) + DiskCache (shelve) + TieredCache
├── robots.py         # robots.txt fetch, parse, Crawl-delay support
├── ratelimiter.py    # Per-domain async token bucket rate limiter
├── progress.py       # Live progress bar with ETA
├── bridge.py         # ZeroCopyTokenizer, MMapStore, FastHTMLStripper
├── output.py         # JSON, CSV, Text, Markdown, SQLite export
└── cli.py            # Full CLI: crawl, deep, search, js, sitemap

Requirements

  • Python 3.9+
  • httpx[http2,brotli] — async HTTP with HTTP/2 and brotli compression
  • beautifulsoup4 + lxml — HTML parsing
  • rank_bm25 — BM25 keyword scoring
  • numpy — score normalization
  • duckduckgo-search — free search integration

Optional:

  • sentence-transformers + torch — semantic scoring (pip install crl[semantic])
  • playwright — JS rendering (pip install playwright && playwright install chromium)

License

MIT — free to use in commercial and open source projects.


Author

Made by @who_is_the_black_hat · GitHub

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crawl_relevance_layers-1.1.1.tar.gz (53.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crawl_relevance_layers-1.1.1-py3-none-any.whl (47.4 kB view details)

Uploaded Python 3

File details

Details for the file crawl_relevance_layers-1.1.1.tar.gz.

File metadata

  • Download URL: crawl_relevance_layers-1.1.1.tar.gz
  • Upload date:
  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for crawl_relevance_layers-1.1.1.tar.gz
Algorithm Hash digest
SHA256 e9579d86b56c88a531a026534653e8c339496a5715d38fb8c53f85352e875c5a
MD5 97fef6fe0ff00a6e7bbefab52ea6ccff
BLAKE2b-256 87b185cc0b0a4f9213f7eac6ee0320076b3dfe93f9ba339cabc64780aae56025

See more details on using hashes here.

File details

Details for the file crawl_relevance_layers-1.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for crawl_relevance_layers-1.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0567375893978abc5a754849090f3b5728dcf385a3ec7541504dad5a8f9e4e2c
MD5 e841347952b3cd0233dd936bf5d19718
BLAKE2b-256 94ab752d81d381b9701113398631f799600e917c0c2e018f515a51cf5eb92853

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page