
Python client library for the Collaboration Tunnel Protocol (TCT)

Collaboration Tunnel Protocol - Python Client

A Python library for efficiently crawling websites that implement the Collaboration Tunnel Protocol (TCT), achieving up to 90% bandwidth savings through sitemap-first discovery and conditional requests.

Installation

pip install collab-tunnel

Quick Start

from collab_tunnel import CollabTunnelCrawler

# Initialize crawler
crawler = CollabTunnelCrawler(user_agent="MyBot/1.0")

# Fetch sitemap
sitemap = crawler.fetch_sitemap("https://example.com/llm-sitemap.json")

# Crawl items
for item in sitemap.items:
    if crawler.should_fetch(item):  # Zero-fetch optimization
        content = crawler.fetch_content(item['mUrl'], item['contentHash'])
        if content:
            print(f"Title: {content['title']}")
            print(f"Content: {content['content'][:200]}...")

# View stats
stats = crawler.get_stats()
print(f"Bandwidth saved: {stats['savings_percentage']}%")
print(f"Requests skipped: {stats['total_skips']}")

Features

  • Sitemap-First Discovery: Skip fetches for 90%+ of unchanged URLs
  • Conditional Requests: 304 Not Modified support
  • ETag Validation: Verify content integrity
  • Bandwidth Tracking: Monitor savings vs traditional crawling
  • Handshake Verification: Validate C-URL ↔ M-URL mapping
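The handshake check can also be reasoned about offline. The sketch below illustrates the idea with a hypothetical helper (`check_handshake` is not part of the library's API): given the HTML served at the C-URL and the `Link` header returned by the M-URL, both directions of the mapping must agree.

```python
import re

def check_handshake(c_url, m_url, c_html, m_link_header):
    """Illustrative offline check of the bidirectional C-URL <-> M-URL handshake.

    c_html is the HTML body served at the C-URL; m_link_header is the
    Link header returned by the M-URL.
    """
    # C-URL -> M-URL: look for <link rel="alternate" ... href="M-URL">.
    # Simplified regex for illustration; a real client should use an HTML parser
    # and handle arbitrary attribute order.
    alternate = re.search(
        r'<link\s[^>]*rel="alternate"[^>]*href="([^"]+)"', c_html
    )
    forward_ok = bool(alternate) and alternate.group(1) == m_url

    # M-URL -> C-URL: Link: <C-URL>; rel="canonical"
    backward_ok = f'<{c_url}>; rel="canonical"' in m_link_header

    return forward_ok and backward_ok

html = '<head><link rel="alternate" type="application/json" href="https://example.com/post/llm/"></head>'
link = '<https://example.com/post/>; rel="canonical"'
print(check_handshake("https://example.com/post/",
                      "https://example.com/post/llm/", html, link))  # True
```

The library's `verify_handshake(c_url, m_url)` performs the live version of this check, fetching both endpoints itself.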

Advanced Usage

Crawl Entire Site

from collab_tunnel import crawl_site

results = crawl_site(
    "https://example.com/llm-sitemap.json",
    limit=100,
    user_agent="MyBot/1.0"
)

for result in results:
    print(result['title'], result['canonical_url'])

Filter by Date

from datetime import datetime, timedelta
from collab_tunnel import CollabTunnelCrawler

crawler = CollabTunnelCrawler()
sitemap = crawler.fetch_sitemap("https://example.com/llm-sitemap.json")

# Get items modified in last 7 days
recent_items = sitemap.filter_by_date(
    datetime.now() - timedelta(days=7)
)

for item in recent_items:
    content = crawler.fetch_content(item['mUrl'])
    # Process recent content...

Verify Protocol Compliance

from collab_tunnel import ContentValidator

validator = ContentValidator()

# Check headers
headers = {
    'Content-Type': 'application/json; charset=UTF-8',
    'ETag': 'W/"sha256-abc123..."',
    'Link': '<https://example.com/post/>; rel="canonical"',
    'Cache-Control': 'max-age=0, must-revalidate, stale-while-revalidate=60, stale-if-error=86400',
    'Vary': 'Accept-Encoding'
}

results = validator.check_headers(headers)
if results['compliant']:
    print("✅ Protocol compliant!")
else:
    print("❌ Errors:", results['errors'])

Validate Profile Field

from collab_tunnel import CollabTunnelCrawler

crawler = CollabTunnelCrawler()

# Fetch M-URL content (returns None on failure)
content = crawler.fetch_content("https://example.com/post/llm/")

# Check profile field
profile = content.get('profile') if content else None
if profile == 'tct-1':
    print("✅ Recognized protocol version: tct-1")
elif profile:
    print(f"⚠️ Unknown protocol version: {profile} (forward compatibility)")
    # Future versions - client can decide how to handle
else:
    print("⚠️ No profile field (legacy or non-compliant endpoint)")

# Validate sitemap profile
sitemap = crawler.fetch_sitemap("https://example.com/llm-sitemap.json")
sitemap_profile = sitemap.data.get('profile')
if sitemap_profile == 'tct-1':
    print("✅ Sitemap protocol version: tct-1")

Protocol Overview

The Collaboration Tunnel Protocol (TCT) enables efficient content delivery through:

  1. Bidirectional Handshake

    • C-URL (HTML page) → M-URL (JSON endpoint) via <link rel="alternate">
    • M-URL → C-URL via Link: <C-URL>; rel="canonical" header
  2. Template-Invariant Fingerprinting

    • Content normalized through 6-step pipeline: decode entities, NFKC, casefold, remove Cc (except TAB/LF/CR), collapse ASCII whitespace, trim; then SHA-256
    • Weak ETag format: W/"sha256-..."
    • Stable across theme changes
  3. Sitemap-First Verification

    • JSON sitemap lists cUrl, mUrl, and contentHash for each page
    • Skip fetch if hash unchanged (90%+ skip rate)
  4. Conditional Request Discipline

    • If-None-Match takes precedence over If-Modified-Since
    • 304 Not Modified for unchanged content
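The fingerprinting pipeline in step 2 fits in a few lines of standard-library Python. This is a sketch of the 6 steps as described above, not the library's own `normalize_minimal`:

```python
import hashlib
import html
import re
import unicodedata

def normalize_text(text):
    """Sketch of the 6-step TCT normalization pipeline."""
    text = html.unescape(text)                       # 1. decode entities
    text = unicodedata.normalize("NFKC", text)       # 2. NFKC
    text = text.casefold()                           # 3. casefold
    text = "".join(                                  # 4. drop Cc except TAB/LF/CR
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\t\n\r"
    )
    text = re.sub(r"[ \t\n\r\x0b\x0c]+", " ", text)  # 5. collapse ASCII whitespace
    return text.strip()                              # 6. trim

def weak_etag(text):
    digest = hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
    return f'W/"sha256-{digest}"'

print(normalize_text("  Hello&nbsp;WORLD  "))  # hello world
```

Because markup, case, and whitespace differences all normalize away before hashing, the fingerprint survives theme and template changes that leave the text itself intact.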

Response Format

M-URL JSON Payload

{
  "profile": "tct-1",
  "llm_url": "https://example.com/post/llm/",
  "canonical_url": "https://example.com/post/",
  "hash": "sha256-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "title": "Article Title",
  "content": "Article content...",
  "modified": "2025-10-23T18:00:00Z"
}

Profile Field: "profile": "tct-1" enables protocol versioning for future compatibility.

HTTP Headers

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Link: <https://example.com/post/>; rel="canonical"
ETag: W/"sha256-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
Cache-Control: max-age=0, must-revalidate, stale-while-revalidate=60, stale-if-error=86400
Vary: Accept-Encoding

Weak ETag Format: W/"sha256-..." signals semantic (not byte-for-byte) equivalence, per RFC 9110 Section 8.8.1.

Sitemap Format

{
  "version": 1,
  "profile": "tct-1",
  "items": [
    {
      "cUrl": "https://example.com/post/",
      "mUrl": "https://example.com/post/llm/",
      "modified": "2025-10-23T18:00:00Z",
      "contentHash": "sha256-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    }
  ]
}
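The zero-fetch decision driven by this sitemap reduces to a hash comparison. `plan_fetches` below is an illustrative helper (the library exposes this logic as `should_fetch`); the cache and hash values are made-up examples:

```python
def plan_fetches(sitemap_items, cached_hashes):
    """Return only the items whose contentHash differs from our cache.

    cached_hashes maps cUrl -> last-seen contentHash.
    """
    return [
        item for item in sitemap_items
        if cached_hashes.get(item["cUrl"]) != item["contentHash"]
    ]

cache = {"https://example.com/post/": "sha256-aaa111"}
items = [
    {"cUrl": "https://example.com/post/", "mUrl": "https://example.com/post/llm/",
     "contentHash": "sha256-aaa111"},   # unchanged: skipped
    {"cUrl": "https://example.com/new/", "mUrl": "https://example.com/new/llm/",
     "contentHash": "sha256-bbb222"},   # new or changed: fetched
]
print([i["cUrl"] for i in plan_fetches(items, cache)])  # ['https://example.com/new/']
```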

API Reference

CollabTunnelCrawler

Methods:

  • fetch_sitemap(sitemap_url) - Fetch and parse sitemap
  • should_fetch(item) - Check if item needs fetching (zero-fetch logic)
  • fetch_content(m_url, expected_hash) - Fetch M-URL with conditional request
  • verify_handshake(c_url, m_url) - Verify bidirectional handshake
  • get_stats() - Get bandwidth savings statistics

SitemapParser

Properties:

  • items - List of sitemap items
  • version - Sitemap version
  • count - Total number of items

Methods:

  • filter_by_date(since) - Filter items by modification date
  • find_by_canonical(c_url) - Find item by canonical URL
  • get_stats() - Get sitemap statistics

ContentValidator

Static Methods:

  • validate_parity(sitemap_hash, etag, payload_hash) - Compliance: parity-only check
  • validate_etag(etag, content) - Diagnostic: recompute hash from content
  • normalize_minimal(text) - Normalization for diagnostics only (6-step TCT spec algorithm)
  • check_headers(headers) - Check protocol compliance
  • check_head_get_parity(get_headers, head_headers) - Ensure HEAD mirrors GET headers
  • validate_sitemap_item(item) - Validate sitemap item structure
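The parity check compares the three hash sources without recomputing anything. A minimal sketch, assuming the sitemap's `contentHash`, the payload's `hash` field, and the ETag all carry the same `sha256-`-prefixed digest (the ETag wrapped as `W/"..."`):

```python
def parity_check(sitemap_hash, etag, payload_hash):
    """Illustrative parity-only check: all three hashes must agree."""
    # Strip the weak-ETag wrapper W/"..." to recover the bare digest.
    etag_hash = etag
    if etag_hash.startswith('W/"') and etag_hash.endswith('"'):
        etag_hash = etag_hash[3:-1]
    return sitemap_hash == etag_hash == payload_hash

h = "sha256-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(parity_check(h, f'W/"{h}"', h))  # True
```

Keeping compliance checks parity-only means a validator never needs to re-run the normalization pipeline; `validate_etag` exists separately for diagnosing *why* a mismatch occurred.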

License

MIT License - See LICENSE file for details

Contributing

Contributions welcome! Please open an issue or submit a pull request.
