CRL — Crawl Relevance Layers

Pure Python async web crawling + BM25 & semantic relevance ranking.

What is CRL?

CRL is a pure Python library that combines async web crawling with BM25 and semantic relevance ranking. Give it URLs and a query, and it crawls, parses, and deduplicates the pages, then returns them ranked by relevance to the query.

Key features:

Async BFS deep crawling with pagination following
BM25 keyword scoring + sentence-transformers semantic scoring
DuckDuckGo search integration (no API key needed)
Sitemap.xml auto-discovery
JS rendering via Playwright (React, Next.js, Vue, Angular)
Structured data extraction (Open Graph, JSON-LD, tables, lists)
Per-domain rate limiting, proxy rotation, robots.txt support
LRU + disk tiered cache
Full CLI with 5 subcommands
Output: JSON, CSV, Markdown, SQLite, plain text
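
For readers new to BM25, the keyword side of the ranking can be pictured with the sketch below. It is a minimal pure-Python Okapi BM25 with the common defaults `k1=1.5` and `b=0.75` and a smoothed, non-negative IDF. CRL lists `rank_bm25` as its dependency for this step, so the function name `bm25_scores`, the whitespace tokenizer, and the parameter values are illustrative assumptions rather than the library's code.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "python async programming with asyncio",
    "cooking pasta at home",
    "async python crawlers and python parsers",
]
scores = bm25_scores("python async", docs)
```

A document that shares no terms with the query scores exactly zero, which is why keyword-only ranking benefits from a semantic score alongside it.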

Install

pip install crawl-relevance-layers

With semantic ranking (recommended):

pip install "crawl-relevance-layers[semantic]"

With JS rendering:

pip install crawl-relevance-layers playwright
playwright install chromium

Quick Start

from crl import crawl

results = crawl(
    urls=["https://example.com", "https://python.org"],
    query="python async programming",
    top_k=5,
    mode="both",   # "keyword", "semantic", or "both"
)

for r in results:
    print(r["url"], r["relevance_score"])

Core API

`crawl()` — fetch + rank

from crl import crawl

results = crawl(
    urls=["https://example.com"],
    query="your query",
    top_k=10,
    mode="both",              # "keyword" | "semantic" | "both"
    semantic_weight=0.5,      # 0.0 = pure keyword, 1.0 = pure semantic
    timeout=10,
    max_connections=20,
    retries=3,
    rate_limit=5.0,           # max requests/sec
    proxies=["http://proxy1:8080"],
    cache=None,               # TieredCache instance
    respect_robots=False,
    min_text_length=100,
    model_name="all-MiniLM-L6-v2",
)

Each result dict contains:

{
    "url": "https://...",
    "title": "Page Title",
    "text": "extracted text...",
    "language": "en",
    "relevance_score": 0.87,
    "keyword_score": 0.91,
    "semantic_score": 0.83,
    "links": ["https://..."],
    "meta": {"description": "..."},
    "structured": {
        "open_graph": {"title": "...", "image": "..."},
        "twitter_card": {"card": "summary"},
        "json_ld": [{"@type": "Article", ...}],
        "tables": [[{"col": "val"}]],
        "lists": [["item1", "item2"]],
        "canonical": "https://...",
        "favicon": "/favicon.ico",
    }
}
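
One plausible way to read the three score fields: each raw score list is min-max normalized to the 0 to 1 range, and `relevance_score` is a weighted blend controlled by `semantic_weight`. The sketch below illustrates that idea only. The helper names `minmax` and `blend` are hypothetical, and the exact formula CRL uses is an assumption based on the parameter comments and the "minmax normalization" note in the architecture list.

```python
def minmax(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(keyword_scores, semantic_scores, semantic_weight=0.5):
    """Weighted mix: 0.0 is pure keyword, 1.0 is pure semantic (assumed formula)."""
    kw = minmax(keyword_scores)
    sem = minmax(semantic_scores)
    return [
        (1.0 - semantic_weight) * k + semantic_weight * s
        for k, s in zip(kw, sem)
    ]


# Three pages: raw BM25-style scores and raw cosine-style scores.
combined = blend([2.0, 6.0, 4.0], [0.75, 0.25, 0.5], semantic_weight=0.25)
```

With `semantic_weight=0.25` the second page, which has the strongest keyword score, ends up ranked first even though its semantic score is the weakest.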

`acrawl()` — async version

import asyncio
from crl import acrawl

results = asyncio.run(acrawl(urls=[...], query="..."))

`astream()` — streaming, results as they arrive

import asyncio
from crl import astream

async def main():
    async for result in astream(urls=[...], query="python"):
        print(result["url"], result["relevance_score"])

asyncio.run(main())
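
The "results as they arrive" behavior can be sketched with `asyncio.as_completed`, which yields tasks in completion order rather than submission order. This is only an illustration of the streaming pattern: `fake_fetch` and `stream_results` are hypothetical stand-ins, and nothing here claims to mirror how `astream()` is implemented.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for a network fetch: sleeps, then returns a result dict."""
    await asyncio.sleep(delay)
    return {"url": url}


async def stream_results(jobs):
    """Yield each result as soon as its task finishes."""
    tasks = [asyncio.create_task(fake_fetch(url, delay)) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect():
    jobs = [("a", 0.15), ("b", 0.01), ("c", 0.08)]
    return [result["url"] async for result in stream_results(jobs)]


order = asyncio.run(collect())
```

The slowest job was submitted first but is yielded last, so a consumer can start processing fast pages without waiting for the whole batch.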

Deep Crawl

Follow links recursively, auto-paginate, deduplicate content:

from crl import deep_search

results = deep_search(
    urls=["https://docs.python.org"],
    query="asyncio event loop",
    depth=2,                    # follow links 2 levels deep
    max_pages=200,
    max_pages_per_domain=50,
    follow_external=False,
    paginate=True,              # auto-follow next page links
    max_pagination_pages=5,
    similarity_threshold=0.85,  # content dedup threshold
    domain_rate_limits={"docs.python.org": 2.0},  # 2 req/sec for this domain
    use_mmap=True,              # memory-mapped storage for large crawls
    top_k=20,
)
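
To make the `depth`, `max_pages`, and `follow_external` options concrete, here is a small sketch of a depth-limited breadth-first traversal over an in-memory link graph. The function `bfs_crawl_order` and the `link_graph` dict are hypothetical; a real crawler would fetch pages and extract links instead of reading them from a dict, and CRL's own traversal may differ in its details.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl_order(seeds, link_graph, depth=2, max_pages=200, follow_external=False):
    """Return URLs in the order a depth-limited BFS would visit them."""
    allowed_hosts = {urlparse(u).netloc for u in seeds}
    queue = deque((u, 0) for u in seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue  # do not expand links beyond the depth limit
        for link in link_graph.get(url, []):
            if link in seen:
                continue
            if not follow_external and urlparse(link).netloc not in allowed_hosts:
                continue  # stay on the seed hosts
            seen.add(link)
            queue.append((link, level + 1))
    return visited


graph = {
    "https://a.test/": ["https://a.test/1", "https://b.test/x", "https://a.test/2"],
    "https://a.test/1": ["https://a.test/3"],
    "https://a.test/3": ["https://a.test/4"],
}
order = bfs_crawl_order(["https://a.test/"], graph, depth=2)
```

With `depth=2` the page at `/4` is never reached because it sits three links away from the seed, and the external `b.test` link is skipped.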

Async version:

import asyncio
from crl import adeep_search

results = asyncio.run(adeep_search(urls=[...], query="..."))
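
The `similarity_threshold` option used in the deep crawl suggests near-duplicate detection by set similarity. The sketch below shows the general technique with word shingles and Jaccard similarity. It compares shingle sets directly instead of hashing them, and the names `shingles`, `jaccard`, and `deduplicate` are illustrative assumptions, not CRL's API.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, similarity_threshold=0.85):
    """Keep the first page of every group of near-duplicates."""
    kept, kept_shingles = [], []
    for page in pages:
        current = shingles(page["text"])
        if any(jaccard(current, other) >= similarity_threshold for other in kept_shingles):
            continue  # too similar to a page we already kept
        kept.append(page)
        kept_shingles.append(current)
    return kept


pages = [
    {"url": "a", "text": "the quick brown fox jumps over the lazy dog"},
    {"url": "b", "text": "the quick brown fox jumps over the lazy dog"},
    {"url": "c", "text": "an entirely different page about sitemaps"},
]
unique = deduplicate(pages)
```

This pairwise comparison is quadratic in the number of kept pages, which is fine for a sketch but is the reason production systems usually hash or sample the shingles.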

Search + Crawl (No URLs needed)

Search DuckDuckGo, crawl results, rank by relevance — all in one call:

from crl import search_and_crawl

results = search_and_crawl(
    query="python async web scraping",
    max_results=10,     # fetch 10 URLs from DDG
    top_k=5,
    timelimit="w",      # last week only: "d", "w", "m", "y"
    region="us-en",
)

Just get URLs from DDG:

from crl import search_urls, search_news_urls

urls = search_urls("python asyncio tutorial", max_results=20)
news = search_news_urls("AI news", max_results=10, timelimit="d")

Sitemap Discovery

from crl import deep_search, fetch_sitemap_urls_sync

# Auto-discovers sitemap.xml, follows sitemap indexes, handles gzip
urls = fetch_sitemap_urls_sync("https://bbc.com", max_urls=1000)
print(f"Found {len(urls)} URLs")

# Then crawl them
results = deep_search(urls=urls[:50], query="technology news")

Async version:

import asyncio
from crl import fetch_sitemap_urls

urls = asyncio.run(fetch_sitemap_urls("https://bbc.com"))
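
For context on what sitemap discovery has to handle: a sitemap file is either a `urlset` listing pages or a `sitemapindex` listing further sitemaps to recurse into. The sketch below parses both shapes with the standard library. The function `parse_sitemap` is hypothetical and skips fetching, gzip handling, and recursion, which the library functions above are described as covering.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return ("urlset" or "index", list of <loc> values) for a sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]
    return kind, locs


URLSET = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)
INDEX = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-news.xml</loc></sitemap>"
    "</sitemapindex>"
)

kind, locs = parse_sitemap(URLSET)
index_kind, index_locs = parse_sitemap(INDEX)
```

When the result kind is `"index"`, each returned location is another sitemap that would need to be fetched and parsed in turn.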

JS Rendering (React / Next.js / Vue / Angular)

For sites that require JavaScript to render content:

from crl.js_renderer import js_crawl_sync, is_available

if is_available():  # checks if playwright is installed
    results = js_crawl_sync(
        urls=["https://react-app.com"],
        query="your query",
        wait_until="networkidle",   # wait for JS to finish
        wait_for_selector="#content",  # optional: wait for element
        extra_wait_ms=500,          # extra wait for lazy JS
        block_resources=True,       # block images/fonts (faster)
        max_concurrent=3,
    )

Install Playwright first:

pip install playwright
playwright install chromium
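
A guard like `is_available()` can be written without importing the optional package at all. The sketch below uses `importlib.util.find_spec` for that. The helper name `module_available` is hypothetical, and whether CRL performs the check this way is an assumption.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


has_playwright = module_available("playwright")
has_json = module_available("json")
```

Note that this only confirms the Python package is installed. The Chromium browser binary downloaded by `playwright install chromium` is a separate step that this check does not cover.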

Structured Data Extraction

Every crawled page automatically includes structured data. Access it directly:

from crl import crawl

results = crawl(urls=["https://example.com"], query="test")

for r in results:
    s = r["structured"]

    # Open Graph
    print(s["open_graph"].get("title"))
    print(s["open_graph"].get("image"))

    # JSON-LD / Schema.org
    for item in s["json_ld"]:
        print(item.get("@type"), item.get("name"))

    # Tables as list of dicts
    for table in s["tables"]:
        for row in table:
            print(row)

    # Lists
    for lst in s["lists"]:
        print(lst)

    print(s["canonical"])
    print(s["favicon"])

Extract from raw HTML directly:

from crl import extract_structured

data = extract_structured("<html>...</html>", url="https://example.com")
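
To show roughly where the `open_graph` and `json_ld` fields come from, the sketch below pulls both out of raw HTML with the standard library parser. The class `StructuredSketch` is hypothetical and covers only these two fields; CRL's architecture list mentions lxml/BS4 for parsing, so this is a simplified stand-in rather than the library's extractor.

```python
import json
from html.parser import HTMLParser


class StructuredSketch(HTMLParser):
    """Collect Open Graph meta tags and JSON-LD script blocks."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.json_ld = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.open_graph[prop[3:]] = attrs["content"]  # strip the "og:" prefix
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # ignore malformed JSON-LD blocks
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


html = (
    "<html><head>"
    '<meta property="og:title" content="Example">'
    '<meta property="og:image" content="https://example.com/a.png">'
    '<script type="application/ld+json">{"@type": "Article", "name": "Hello"}</script>'
    "</head><body></body></html>"
)
parser = StructuredSketch()
parser.feed(html)
```

After `feed()`, `parser.open_graph` and `parser.json_ld` hold data in the same shape as the corresponding keys of the `structured` dict shown earlier.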

Cache

from crl import TieredCache, crawl

# L1: memory LRU, L2: disk shelve
cache = TieredCache(
    memory_size=500,        # max 500 entries in memory
    memory_ttl=3600,        # 1 hour TTL
    disk_path=".crl_cache", # disk cache path
    disk_ttl=86400,         # 1 day TTL
)

# First call fetches, subsequent calls hit cache
results = crawl(urls=[...], query="...", cache=cache)
results = crawl(urls=[...], query="...", cache=cache)  # instant
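
The two-level idea behind a tiered cache can be sketched as a small LRU tier with a short TTL in front of a larger tier with a longer TTL. In the sketch below both tiers live in memory so the example stays self-contained; the second tier merely stands in for the disk store. The class names `LruTtlCache` and `TwoTierCache`, and the promote-on-hit policy, are assumptions for illustration and not `TieredCache` internals.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """LRU cache whose entries expire after `ttl` seconds."""

    def __init__(self, max_size, ttl, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]  # entry has expired
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry


class TwoTierCache:
    """Check the small fast tier first, then fall back to the larger slow tier."""

    def __init__(self, fast, slow):
        self.fast, self.slow = fast, slow

    def get(self, key):
        value = self.fast.get(key)
        if value is None:
            value = self.slow.get(key)
            if value is not None:
                self.fast.set(key, value)  # promote on a slow-tier hit
        return value

    def set(self, key, value):
        self.fast.set(key, value)
        self.slow.set(key, value)


now = [0.0]  # fake clock so the example is deterministic
fast = LruTtlCache(max_size=2, ttl=10, clock=lambda: now[0])
slow = LruTtlCache(max_size=100, ttl=100, clock=lambda: now[0])
cache = TwoTierCache(fast, slow)
for key in ("a", "b", "c"):
    cache.set(key, f"page {key}")
```

After the three writes the fast tier has already evicted `"a"`, but a lookup through `cache` still finds it in the slow tier and promotes it back.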

Per-Domain Rate Limiting

from crl import deep_search

results = deep_search(
    urls=["https://news.ycombinator.com"],
    query="python",
    rate_limit=10.0,                          # global: 10 req/sec
    domain_rate_limits={
        "news.ycombinator.com": 1.0,          # 1 req/sec for HN
        "github.com": 2.0,                    # 2 req/sec for GitHub
    },
)

Standalone:

import asyncio
from crl import DomainRateLimiter

async def main():
    limiter = DomainRateLimiter(default_rps=5.0, domain_rps={"slow.com": 1.0})
    await limiter.wait("https://slow.com/page")  # waits if needed

asyncio.run(main())
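
The architecture list describes the limiter as a per-domain async token bucket. The sketch below shows one way such a limiter can work: each host gets a bucket that refills at its configured rate, and a caller that finds the bucket empty sleeps for the time owed. `TokenBucket` and `SketchDomainLimiter` are hypothetical names, and the capacity of one token (no bursting) is an assumption rather than CRL's behavior.

```python
import asyncio
import time
from urllib.parse import urlparse


class TokenBucket:
    """Token bucket that lets callers reserve a slot and learn how long to wait."""

    def __init__(self, rate, capacity=1.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum stored tokens
        self.tokens = capacity
        self.updated = None

    def reserve(self, now):
        """Take one token and return the seconds the caller should wait first."""
        if self.updated is not None:
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate  # negative balance is time owed


class SketchDomainLimiter:
    """One bucket per host, with a default rate for unlisted hosts."""

    def __init__(self, default_rps, domain_rps=None):
        self.default_rps = default_rps
        self.domain_rps = domain_rps or {}
        self._buckets = {}

    async def wait(self, url):
        host = urlparse(url).hostname or ""
        if host not in self._buckets:
            rate = self.domain_rps.get(host, self.default_rps)
            self._buckets[host] = TokenBucket(rate)
        delay = self._buckets[host].reserve(time.monotonic())
        if delay > 0:
            await asyncio.sleep(delay)
        return delay


async def demo():
    limiter = SketchDomainLimiter(default_rps=1000.0, domain_rps={"slow.test": 500.0})
    first = await limiter.wait("https://slow.test/a")
    second = await limiter.wait("https://slow.test/b")
    return first, second


first_delay, second_delay = asyncio.run(demo())
```

Because the delay is computed before sleeping, concurrent callers each reserve their own future slot instead of all waking at once.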

Output Formats

from crl import crawl, to_json, to_text, to_csv, to_markdown, to_sqlite, save

results = crawl(urls=[...], query="...")

# Print
print(to_json(results))
print(to_text(results))
print(to_csv(results))
print(to_markdown(results))

# Save — format inferred from extension
save(results, "output.json")      # JSON
save(results, "output.csv")       # CSV
save(results, "output.md")        # Markdown
save(results, "output.db")        # SQLite database
save(results, "output.txt")       # Plain text

# SQLite — query with standard sqlite3
import sqlite3
conn = sqlite3.connect("output.db")
rows = conn.execute(
    "SELECT url, title, relevance_score FROM pages ORDER BY relevance_score DESC"
).fetchall()
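
To try the query above without running a crawl, the sketch below builds a comparable `pages` table in an in-memory database. Only the table name and the three queried columns come from the example; the column types, the primary key, and the helper `save_pages` are assumptions, and the real export may store more fields.

```python
import sqlite3


def save_pages(conn, results):
    """Write result dicts into a minimal `pages` table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, relevance_score REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, relevance_score) VALUES (?, ?, ?)",
        [(r["url"], r.get("title", ""), r.get("relevance_score", 0.0)) for r in results],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save_pages(conn, [
    {"url": "https://a.test", "title": "A", "relevance_score": 0.4},
    {"url": "https://b.test", "title": "B", "relevance_score": 0.9},
])
rows = conn.execute(
    "SELECT url, title, relevance_score FROM pages ORDER BY relevance_score DESC"
).fetchall()
```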

Progress Reporting

from crl import ProgressReporter, crawl

progress = ProgressReporter()
results = crawl(urls=[...], query="...", progress=progress)
# Prints live progress bar to stderr:
# [CRL] [████████████░░░░░░░░] 6/10 | cached=2 err=0 | ETA 4s
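
The progress line shown above can be reproduced with a few lines of string formatting. The function `render_progress` below is a hypothetical sketch of that format with a 20-character bar; it does not write to stderr or compute the ETA, both of which `ProgressReporter` is described as doing.

```python
def render_progress(done, total, cached=0, errors=0, eta_seconds=None, width=20):
    """Format one progress line in the style shown in the README."""
    filled = int(width * done / total) if total else 0
    bar = "█" * filled + "░" * (width - filled)
    eta = f"{eta_seconds:.0f}s" if eta_seconds is not None else "?"
    return f"[CRL] [{bar}] {done}/{total} | cached={cached} err={errors} | ETA {eta}"


line = render_progress(6, 10, cached=2, errors=0, eta_seconds=4)
```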

Robots.txt

from crl import crawl

results = crawl(
    urls=[...],
    query="...",
    respect_robots=True,   # fetch and respect robots.txt per domain
)
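
For a sense of what respecting robots.txt involves, the sketch below uses Python's standard `urllib.robotparser` to evaluate a small rule set, including `Crawl-delay`. It parses rules from a list of lines instead of fetching them, and the user agent string `crl-sketch` is made up. CRL has its own `robots.py`, so this only illustrates the checks, not the library's code.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines; no network request is made

blocked = parser.can_fetch("crl-sketch", "https://example.com/private/page")
allowed = parser.can_fetch("crl-sketch", "https://example.com/public")
delay = parser.crawl_delay("crl-sketch")
```

A crawler would skip any URL for which `can_fetch` returns `False` and space its requests to that host by at least the reported crawl delay.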

Proxy Rotation

from crl import crawl

results = crawl(
    urls=[...],
    query="...",
    proxies=[
        "http://proxy1:8080",
        "http://user:pass@proxy2:8080",
        "socks5://proxy3:1080",
    ],
)
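
The README does not say how proxies are rotated. A simple round-robin policy, sketched below with `itertools.cycle`, is one common choice. The helper `assign_proxies` is hypothetical, and the round-robin order is an assumption; the library could just as well pick proxies at random or skip failing ones.

```python
from itertools import cycle


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    if not proxies:
        return [(url, None) for url in urls]  # no proxies: connect directly
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


pairs = assign_proxies(
    ["https://a.test", "https://b.test", "https://c.test"],
    ["http://proxy1:8080", "socks5://proxy3:1080"],
)
```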

CLI

CRL ships with a full command-line interface:

`crl crawl` — fetch and rank URLs

crl crawl https://example.com https://python.org \
    --query "python async" \
    --top-k 5 \
    --mode both \
    --out results.json

# Save as Markdown
crl crawl https://example.com --query "python" --out results.md

# Save as SQLite
crl crawl https://example.com --query "python" --out results.db

`crl deep` — deep crawl with link following

crl deep https://docs.python.org \
    --query "asyncio" \
    --depth 2 \
    --max-pages 100 \
    --domain-rps docs.python.org:2.0 \
    --use-mmap \
    --out results.json
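
The `--domain-rps docs.python.org:2.0` flag pairs a host with a rate, which maps naturally onto the `domain_rate_limits` dict from the Python API. The sketch below shows one way to parse such values. `parse_domain_rps` is a hypothetical helper, and splitting on the last colon is an assumption about the format, not the CLI's actual parsing code.

```python
def parse_domain_rps(values):
    """Turn ["host:rps", ...] into {"host": rps} with float rates."""
    limits = {}
    for value in values:
        host, sep, rate = value.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected HOST:RPS, got {value!r}")
        limits[host] = float(rate)
    return limits


limits = parse_domain_rps(["docs.python.org:2.0", "github.com:1"])
```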

`crl search` — search DDG then crawl

crl search "python async web scraping" \
    --max-results 10 \
    --top-k 5 \
    --out results.md

# News search
crl search "AI news" --news --timelimit d --top-k 10

`crl js` — JS rendering with Playwright

crl js https://react-app.com \
    --query "your query" \
    --wait networkidle \
    --wait-for "#main-content" \
    --extra-wait 500

`crl sitemap` — discover URLs from sitemap.xml

crl sitemap https://bbc.com --max-urls 500
crl sitemap https://bbc.com --out urls.txt

Common options (all commands)

--mode          keyword | semantic | both (default: both)
--top-k         return top N results
--timeout       per-request timeout in seconds (default: 10)
--retries       retry attempts per URL (default: 3)
--rate-limit    max requests/sec
--proxy         proxy URL (repeat for rotation)
--cache         enable disk+memory cache
--robots        respect robots.txt
--min-text      skip pages shorter than N chars
--fmt           json | text | csv | markdown | sqlite
--out           save to file
--no-progress   disable progress bar

Architecture

crl/
├── __init__.py       # Public API: crawl, acrawl, astream, deep_search, ...
├── fetcher.py        # Async HTTP: retry, backoff, cache, robots, proxy, content-type filter
├── parser.py         # HTML parsing: lxml/BS4, text extraction, link resolution
├── extractor.py      # Structured data: Open Graph, JSON-LD, tables, lists
├── relevance.py      # BM25 + semantic scoring, minmax normalization
├── crawler.py        # Async BFS deep crawler with pagination
├── deduplicator.py   # Shingle hash + Jaccard content dedup
├── paginator.py      # Auto pagination detection (query params, path, next-link)
├── sitemap.py        # Sitemap.xml parser, index recursion, gzip support
├── search.py         # DuckDuckGo integration (free, no API key)
├── js_renderer.py    # Playwright JS rendering
├── cache.py          # MemoryCache (LRU) + DiskCache (shelve) + TieredCache
├── robots.py         # robots.txt fetch, parse, Crawl-delay support
├── ratelimiter.py    # Per-domain async token bucket rate limiter
├── progress.py       # Live progress bar with ETA
├── bridge.py         # ZeroCopyTokenizer, MMapStore, FastHTMLStripper
├── output.py         # JSON, CSV, Text, Markdown, SQLite export
└── cli.py            # Full CLI: crawl, deep, search, js, sitemap

Requirements

Python 3.9+
httpx[http2,brotli] — async HTTP with HTTP/2 and brotli compression
beautifulsoup4 + lxml — HTML parsing
rank_bm25 — BM25 keyword scoring
numpy — score normalization
duckduckgo-search — free search integration

Optional:

sentence-transformers + torch — semantic scoring (pip install "crawl-relevance-layers[semantic]")
playwright — JS rendering (pip install playwright && playwright install chromium)

License

MIT — free to use in commercial and open source projects.

Author

Made by @who_is_the_black_hat

Release history

1.1.1 (this version): Mar 20, 2026
1.1.0: Mar 20, 2026
