Advanced browser-activity capture, behavioral profiling, alerting and reporting library for Python.
Sitewise Crawler 🕷️🧠
Sitewise Crawler is an enterprise-grade Python library for browser-activity monitoring, behavioral intelligence, and automated risk reporting. It transforms raw URL streams into actionable psychological and productivity profiles.
📖 Table of Contents
- Installation
- Day 1: Integration Blueprint
- Core Concepts & DTOs
- Deep Dive: Modules
- Academic & Institutional Value
- Configuration
📦 Installation
# Core library
pip install sitewise-crawler
# Required for dynamic/SPA site support
playwright install chromium
# Optional: Required for PDF report generation
pip install reportlab
🚀 Day 1: Integration Blueprint
If you are a developer integrating this into a new application, here is the standard workflow:
1. The Environment
The library requires a Groq API Key for behavioral analysis.
export GROQ_API_KEY="your_api_key_here"
2. Basic Usage (The "Hello World" of Intelligence)
import os
from sitewise_crawler import create_insight_engine, ProfileBlender, BehaviorProfileSnapshot
# 1. Initialize
engine = create_insight_engine(api_key=os.getenv("GROQ_API_KEY"))
# 2. Get a real-time risk assessment for a URL
risk_result = engine.quick_url_risk_sync("https://example.com")
print(f"Status: {risk_result['status']} | Category: {risk_result['category']}")
# 3. Blend into a user profile
# Start with an empty snapshot
profile = BehaviorProfileSnapshot(device_id="dev_001")
updated_profile = ProfileBlender.update_profile_from_risk(profile, risk_result)
print(f"New Productivity Score: {updated_profile.productivity_rating}")
🏗️ Core Concepts & DTOs
The library exchanges data as Data Transfer Objects (DTOs) implemented as Pydantic models, so your backend and the library share one validated schema.
| Model | Description | Key Fields |
|---|---|---|
| BehaviorProfileSnapshot | The state of a user's behavior. | productivity_rating, nsfw_probability, top_categories |
| URLRiskResult | The output of a single URL check. | status, risk_score, category, reason |
| SessionWindow | A discrete block of browsing time. | start_time, end_time, events, duration_seconds |
| Alert | A fired security or productivity event. | alert_type, severity, message, evidence |
🧠 Deep Dive: Modules
InsightEngine (AI Analysis)
The InsightEngine handles both fast-path (heuristic) and deep-path (LLM) analysis.
- Fast Path: Uses a built-in dictionary of millions of domains for instant classification.
- Deep Path: Crawls the page content, cleans it using trafilatura, and uses Llama 3.3 to understand the intent.
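The two-path idea can be sketched as a lookup with a fallback. This is a minimal illustration, not the library's implementation: `KNOWN_DOMAINS`, `deep_classify`, and `classify` are hypothetical names, the two-entry dictionary stands in for the built-in domain list, and the deep path is a placeholder rather than a real crawl and LLM call.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the built-in domain dictionary.
KNOWN_DOMAINS = {
    "github.com": "development",
    "youtube.com": "entertainment",
}


def deep_classify(url: str) -> str:
    """Placeholder for the deep path (crawl, clean, ask the LLM)."""
    return "unknown"


def classify(url: str, deep=deep_classify) -> tuple[str, str]:
    """Return (category, path_used) for a URL.

    The fast path checks the host and then each parent domain, so
    "m.youtube.com" matches the "youtube.com" entry. Anything not
    found falls through to the deep path.
    """
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in KNOWN_DOMAINS:
            return KNOWN_DOMAINS[candidate], "fast"
    return deep(url), "deep"


for url in ("https://www.github.com/x", "https://example.org/page"):
    print(url, classify(url))
```

Keeping the deep path behind a callable makes it possible to swap in a stub during tests and to avoid network or API usage for domains that are already known.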
SessionAnalyzer (Windowing)
Transforms a continuous stream of logs into sessions.
- Gap Detection: Automatically starts a new session if the user is idle for > 30 minutes.
- Aggregation: Computes session-level stats (e.g., "Most distracting hour").
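Gap-based windowing can be sketched in a few lines. The function below is an assumption about how such a split might work, not the SessionAnalyzer code itself: `split_sessions` and `IDLE_GAP` are hypothetical names, and events are reduced to bare timestamps.

```python
from datetime import datetime, timedelta

# Idle threshold taken from the description above.
IDLE_GAP = timedelta(minutes=30)


def split_sessions(timestamps, gap=IDLE_GAP):
    """Group timestamps into sessions.

    A new session starts whenever the time since the previous event
    is strictly greater than `gap`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def duration_seconds(session):
    """Elapsed time between the first and last event of a session."""
    return (session[-1] - session[0]).total_seconds()


base = datetime(2026, 1, 1, 9, 0)
events = [
    base,
    base + timedelta(minutes=10),
    base + timedelta(minutes=40),  # exactly 30 minutes idle: same session
    base + timedelta(minutes=71),  # 31 minutes idle: new session
]
for session in split_sessions(events):
    print(len(session), "events,", duration_seconds(session), "seconds")
```

Sorting first makes the split tolerant of out-of-order log delivery, at the cost of requiring the whole batch in memory.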
AlertEngine (Custom Rules)
You can extend the alerting logic by adding your own rules.
from sitewise_crawler import BaseAlertRule, Alert, Severity
class MyCustomRule(BaseAlertRule):
    def evaluate(self, ctx):
        # Fire when the user visited more than 100 URLs in the last hour.
        if ctx.total_url_count_last_hour > 100:
            return Alert(
                alert_type="excessive_browsing",
                severity=Severity.MEDIUM,
                message="User is browsing at an extreme rate."
            )
        # Returning None means the rule did not fire.
        return None
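To show how rules of this shape could be evaluated together, here is a self-contained sketch that does not import the library. `Severity`, `Alert`, `RuleContext`, `ExcessiveBrowsingRule`, and `run_rules` are local stand-ins defined only for illustration; the real classes and the way AlertEngine registers and runs rules may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Stand-in enum; only MEDIUM appears in the example above.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Alert:
    # Stand-in with the fields listed in the DTO table.
    alert_type: str
    severity: Severity
    message: str
    evidence: dict = field(default_factory=dict)


@dataclass
class RuleContext:
    # Hypothetical context carrying the one value the rule reads.
    total_url_count_last_hour: int


class ExcessiveBrowsingRule:
    threshold = 100

    def evaluate(self, ctx) -> Optional[Alert]:
        if ctx.total_url_count_last_hour > self.threshold:
            return Alert(
                alert_type="excessive_browsing",
                severity=Severity.MEDIUM,
                message="User is browsing at an extreme rate.",
                # Record the inputs that caused the rule to fire.
                evidence={
                    "total_url_count_last_hour": ctx.total_url_count_last_hour,
                    "threshold": self.threshold,
                },
            )
        return None


def run_rules(rules, ctx):
    """Evaluate every rule and keep only the alerts that fired."""
    return [alert for alert in (rule.evaluate(ctx) for rule in rules)
            if alert is not None]


alerts = run_rules([ExcessiveBrowsingRule()], RuleContext(150))
print([(a.alert_type, a.evidence) for a in alerts])
```

Filling `evidence` with the observed value and the threshold keeps each alert explainable without re-running the rule.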
🎓 Academic & Institutional Value
This library was built to satisfy high academic and institutional standards:
- Explainability (XAI): Every score or alert includes an evidence object, allowing admins to see why the AI flagged a user.
- Efficiency: Uses Exponential Moving Averages (EMA) and Trend Velocity. Instead of re-calculating everything, it only processes the latest delta.
- Privacy-First: The library only extracts the "Semantic Core" of pages, ignoring personally identifiable information in headers/sidebars.
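The incremental update behind an EMA can be sketched as follows. This illustrates the general technique only: the smoothing factor `alpha = 0.2`, the function names, and the reading of "trend velocity" as the change in the EMA per update are assumptions, not values or definitions taken from the library.

```python
def ema_update(previous: float, observation: float, alpha: float = 0.2) -> float:
    """Blend one new observation into a running average.

    Only the previous average and the new value are needed, so the
    history never has to be re-processed.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return alpha * observation + (1.0 - alpha) * previous


def trend_velocity(previous: float, current: float) -> float:
    """Assumed definition: change in the EMA caused by the latest update."""
    return current - previous


rating = 50.0
for observed in (100.0, 100.0, 20.0):
    updated = ema_update(rating, observed)
    print(round(updated, 2), round(trend_velocity(rating, updated), 2))
    rating = updated
```

A larger `alpha` reacts faster to recent behavior, while a smaller one smooths out single outliers.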
⚙️ Configuration
The CrawlerConfig allows fine-grained control:
| Option | Default | Description |
|---|---|---|
| use_playwright | False | Set to True for JavaScript-heavy (SPA) sites. |
| max_depth | 3 | BFS depth for site discovery. |
| timeout_ms | 30000 | Network timeout per page. |
| rate_limit_delay | 1.0 | Seconds to wait between requests (politeness). |
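How max_depth and rate_limit_delay interact can be illustrated with a depth-limited breadth-first traversal. This is a sketch under assumptions, not the crawler's code: `bfs_discover` and `fetch_links` are hypothetical names, and an in-memory link graph replaces real HTTP requests.

```python
import time
from collections import deque


def bfs_discover(start, fetch_links, max_depth=3, rate_limit_delay=1.0,
                 sleep=time.sleep):
    """Visit pages breadth-first, following links up to `max_depth` hops.

    `fetch_links(url)` returns the links found on a page. A pause of
    `rate_limit_delay` seconds is taken before every fetch except the first.
    Returns the visited (url, depth) pairs in visit order.
    """
    seen = {start}
    visited = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit_delay)  # politeness pause between requests
        links = fetch_links(url)
        visited.append((url, depth))
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited


# Tiny in-memory site used instead of network access.
SITE = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(bfs_discover("a", lambda u: SITE.get(u, []), max_depth=2,
                   rate_limit_delay=0.0))
```

With `max_depth=2`, page "d" is visited but its link to "e" is not followed, which bounds the crawl on large sites. Passing `sleep` as a parameter lets tests record the delays instead of waiting.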
🤝 Support & Contribution
This library is part of the AegiVara ecosystem. For bug reports or feature requests, please open an issue in the main repository.
License: MIT — Developed by Group 8 FYP 2026.