
Advanced browser-activity capture, behavioral profiling, alerting and reporting library for Python.


Sitewise Crawler 🕷️🧠


Sitewise Crawler is an enterprise-grade Python library for browser-activity monitoring, behavioral intelligence, and automated risk reporting. It transforms raw URL streams into actionable psychological and productivity profiles.


📖 Table of Contents

  1. Installation
  2. Day 1: Integration Blueprint
  3. Core Concepts & DTOs
  4. Deep Dive: Modules
  5. Academic & Institutional Value
  6. Configuration

📦 Installation

# Core library
pip install sitewise-crawler

# Required for dynamic/SPA site support
playwright install chromium

# Optional: Required for PDF report generation
pip install reportlab

🚀 Day 1: Integration Blueprint

If you are a developer integrating this into a new application, here is the standard workflow:

1. The Environment

The library requires a Groq API Key for behavioral analysis.

export GROQ_API_KEY="your_api_key_here"

2. Basic Usage (The "Hello World" of Intelligence)

import os
from sitewise_crawler import create_insight_engine, ProfileBlender, BehaviorProfileSnapshot

# 1. Initialize
engine = create_insight_engine(api_key=os.getenv("GROQ_API_KEY"))

# 2. Get a real-time risk assessment for a URL
risk_result = engine.quick_url_risk_sync("https://example.com")
print(f"Status: {risk_result['status']} | Category: {risk_result['category']}")

# 3. Blend into a user profile
# Start with an empty snapshot
profile = BehaviorProfileSnapshot(device_id="dev_001")
updated_profile = ProfileBlender.update_profile_from_risk(profile, risk_result)

print(f"New Productivity Score: {updated_profile.productivity_rating}")

🏗️ Core Concepts & DTOs

The library communicates via Data Transfer Objects (DTOs) implemented as Pydantic models, so your backend and the library share a single validated schema.

| Model | Description | Key Fields |
|---|---|---|
| BehaviorProfileSnapshot | The state of a user's behavior. | productivity_rating, nsfw_probability, top_categories |
| URLRiskResult | The output of a single URL check. | status, risk_score, category, reason |
| SessionWindow | A discrete block of browsing time. | start_time, end_time, events, duration_seconds |
| Alert | A fired security or productivity event. | alert_type, severity, message, evidence |

🧠 Deep Dive: Modules

InsightEngine (AI Analysis)

The InsightEngine handles both fast-path (heuristic) and deep-path (LLM) analysis.

  • Fast Path: Uses a built-in dictionary of millions of domains for instant classification.
  • Deep Path: Crawls the page, extracts the main content with trafilatura, and uses Llama 3.3 to classify the page's intent.
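
The two-path idea can be sketched roughly as follows. This is an illustration only, not the library's implementation: `KNOWN_DOMAINS`, `deep_classify`, and `classify` are hypothetical names, the tiny dictionary stands in for the built-in domain list, and the deep path is a placeholder that does no crawling or LLM call.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the library's built-in domain dictionary.
KNOWN_DOMAINS = {
    "github.com": "development",
    "youtube.com": "entertainment",
}


def deep_classify(url: str) -> str:
    """Placeholder for the deep path (crawl, clean, ask the LLM)."""
    return "unknown"


def classify(url: str, deep=deep_classify) -> tuple[str, str]:
    """Return (category, path_used) for a URL."""
    host = (urlparse(url).hostname or "").lower()
    # Strip a leading "www." so "www.github.com" matches "github.com".
    if host.startswith("www."):
        host = host[4:]
    # Fast path: a dictionary hit needs no network access.
    if host in KNOWN_DOMAINS:
        return KNOWN_DOMAINS[host], "fast"
    # Deep path: fall back to the slower content-based analysis.
    return deep(url), "deep"


print(classify("https://www.github.com/python/cpython"))
print(classify("https://example.com/article"))
```

Passing the deep classifier in as a parameter keeps the slow, networked step easy to replace in tests.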

SessionAnalyzer (Windowing)

Transforms a continuous stream of logs into sessions.

  • Gap Detection: Automatically starts a new session when the user has been idle for more than 30 minutes.
  • Aggregation: Computes session-level stats (e.g., "Most distracting hour").
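
Gap-based windowing can be sketched in a few lines. The function below is a hypothetical illustration (`split_sessions` and `IDLE_GAP` are not names from the library); it assumes events are plain timestamps and that a gap of exactly 30 minutes stays within the same session.

```python
from datetime import datetime, timedelta

# Idle threshold taken from the description above.
IDLE_GAP = timedelta(minutes=30)


def split_sessions(timestamps, gap=IDLE_GAP):
    """Group timestamps into sessions; a gap longer than `gap` starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # idle too long (or first event): new session
    return sessions


base = datetime(2024, 1, 1, 9, 0)
events = [
    base,
    base + timedelta(minutes=5),
    base + timedelta(minutes=50),   # 45 minutes after the previous event
    base + timedelta(minutes=60),
]
sessions = split_sessions(events)
for session in sessions:
    duration_seconds = (session[-1] - session[0]).total_seconds()
    print(len(session), "events,", duration_seconds, "seconds")
```

Session-level statistics such as duration can then be computed per group, as the loop at the end does.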

AlertEngine (Custom Rules)

You can extend the alerting logic by adding your own rules.

from sitewise_crawler import BaseAlertRule, Alert, Severity

class MyCustomRule(BaseAlertRule):
    def evaluate(self, ctx):
        if ctx.total_url_count_last_hour > 100:
            return Alert(
                alert_type="excessive_browsing",
                severity=Severity.MEDIUM,
                message="User is browsing at an extreme rate."
            )
        return None

🎓 Academic & Institutional Value

The library is designed around three properties that academic and institutional deployments typically require:

  1. Explainability (XAI): Every score or alert includes an evidence object, allowing admins to see why the AI flagged a user.
  2. Efficiency: Scores are updated incrementally using Exponential Moving Averages (EMA) and Trend Velocity. Instead of recalculating a profile from the full history, each update processes only the latest delta.
  3. Privacy-First: The library extracts only the "Semantic Core" of pages, ignoring personally identifiable information in headers and sidebars.
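
As a rough illustration of the incremental update in point 2, an EMA needs only the previous value and the newest observation. The sketch below is not the library's code: `ema_update` and `trend_velocity` are hypothetical names, the smoothing factor of 0.2 is an arbitrary choice, and "trend velocity" is assumed here to mean the change between consecutive EMA values.

```python
def ema_update(previous: float, observation: float, alpha: float = 0.2) -> float:
    """Blend the newest observation into the running average."""
    return alpha * observation + (1 - alpha) * previous


def trend_velocity(previous: float, current: float) -> float:
    """Assumed definition: change in the EMA from one update to the next."""
    return current - previous


rating = 0.5                          # starting productivity rating
for observation in [1.0, 1.0, 0.0]:   # newest per-URL scores, one at a time
    updated = ema_update(rating, observation)
    print(round(updated, 3), round(trend_velocity(rating, updated), 3))
    rating = updated
```

Because each step touches a single number, the cost of an update does not grow with the length of the browsing history.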

⚙️ Configuration

The CrawlerConfig allows fine-grained control:

| Option | Default | Description |
|---|---|---|
| use_playwright | False | Set to True for JavaScript-heavy (SPA) sites. |
| max_depth | 3 | BFS depth for site discovery. |
| timeout_ms | 30000 | Network timeout per page. |
| rate_limit_delay | 1.0 | Seconds to wait between requests (politeness). |
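
To show how max_depth and rate_limit_delay interact, here is a minimal depth-limited BFS over an in-memory link graph. It is a sketch under assumptions, not the library's crawler: `SketchConfig` is a hypothetical stand-in that only mirrors the option names in the table, `crawl` and `get_links` are made up for the example, and the start page is assumed to count as depth 0.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring the options listed above."""
    use_playwright: bool = False
    max_depth: int = 3
    timeout_ms: int = 30000
    rate_limit_delay: float = 1.0


def crawl(start, get_links, config, sleep=time.sleep):
    """Visit pages breadth-first, never expanding pages at max_depth."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if order:
            sleep(config.rate_limit_delay)   # politeness pause between requests
        order.append(url)
        if depth >= config.max_depth:
            continue                         # do not follow links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy site: each key maps a page to the links found on it.
LINKS = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/b": []}
config = SketchConfig(max_depth=2, rate_limit_delay=0.0)
visited = crawl("/", lambda url: LINKS.get(url, []), config)
print(visited)
```

With max_depth set to 2, the page "/a/1/x" is never requested because it sits three links away from the start page.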

🤝 Support & Contribution

This library is part of the AegiVara ecosystem. For bug reports or feature requests, please open an issue in the main repository.


License: MIT — Developed by TarXemo.
