
Pulso delivers stateful web fetching with cache, hashes, and domain-aware rules


Pulso


Stateful web fetching with intelligent caching, content hashing, and domain-aware policies.

Pulso is a Python library that fetches web content once, remembers it, and only re-fetches when necessary. It's designed for data pipelines, content monitoring systems, and AI workflows where repeated requests and noisy HTML changes create unnecessary overhead.


Why Pulso

Most web scraping tools focus on getting content. Pulso focuses on not getting it again when nothing has changed.

Built For

  • Deterministic data pipelines - Ensure reproducible results across runs
  • Change detection - Monitor content updates without wasteful re-fetching
  • Content monitoring - Track website changes efficiently
  • AI workflows - Avoid reprocessing identical HTML repeatedly

Core Principles

  • Stateful by design - Every fetch maintains metadata and history
  • Domain-aware policies - Configure TTL and fetch behavior per domain
  • Hash-based identification - Content changes detected via normalized hashes, not timestamps
  • Change detection first - Built-in tracking of content modifications

Key Features

Smart Fetching

Automatic driver selection based on content type:

  • Static pages - Fast fetching with requests
  • Dynamic content - JavaScript rendering with playwright
  • Per-domain configuration - Set driver preference for each domain

import pulso

# Simple fetch with automatic caching
html = pulso.fetch("https://example.com")

Domain-Aware Caching

Configure time-to-live (TTL) and fetch behavior per domain:

pulso.register_domain(
    "example.com",
    ttl="1d",        # Cache for 1 day
    driver="requests"
)

pulso.register_domain(
    "dynamic-site.com",
    ttl="6h",        # Cache for 6 hours
    driver="playwright"
)

Supported TTL formats: 1d (day), 12h (hours), 30m (minutes), 60s (seconds)

Pulso automatically:

  • Returns cached content if still fresh (within TTL)
  • Re-fetches only after TTL expires
  • Respects domain-specific policies consistently
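
To make the TTL rule concrete, here is a minimal sketch of how strings such as "12h" could be turned into seconds and compared against the last fetch time. The helper names `parse_ttl` and `is_fresh` are hypothetical and are not part of Pulso's documented API; Pulso's own parsing may differ.

```python
import re
import time
from typing import Optional

# Seconds per unit for the four suffixes listed above (d, h, m, s).
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_ttl(ttl: str) -> int:
    """Convert a TTL string such as "12h" into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([dhms])", ttl.strip())
    if match is None:
        raise ValueError(f"Unsupported TTL format: {ttl!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def is_fresh(fetch_time: float, ttl: str, now: Optional[float] = None) -> bool:
    """Return True while the cached entry is still within its TTL."""
    now = time.time() if now is None else now
    return (now - fetch_time) < parse_ttl(ttl)
```

Under this sketch, an entry fetched at time `t` with `ttl="1h"` counts as fresh until `t + 3600` seconds, after which the next fetch would go to the network.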

Content Hashing

Intelligent change detection using normalized content hashes:

if pulso.has_changed("https://example.com"):
    print("Content has been updated!")

How it works:

  • HTML is normalized (whitespace, scripts, styles removed)
  • Content hashed with SHA-256
  • Same hash = no meaningful change
  • Different hash = real content update
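
The steps above can be sketched with the standard library alone. This is an illustration of the idea, not Pulso's actual normalizer: the function name `normalized_hash` and the regex-based stripping are assumptions, and a real implementation would more likely use an HTML parser.

```python
import hashlib
import re


def normalized_hash(html: str) -> str:
    """Hash HTML after dropping scripts, styles, and whitespace noise."""
    # Remove <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    # Collapse every run of whitespace to a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # SHA-256 of the normalized text, as a 64-character hex string.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two pages that differ only in an inline script or in the amount of whitespace hash to the same value here, while a change to the visible text produces a different hash.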

Change Tracking

Comprehensive metadata for every URL:

metadata = pulso.get_metadata(url)
# Returns:
# {
#   'content_hash': '8f3d9a...',
#   'fetch_time': 1234567890.0,
#   'change_time': 1234567890.0,
#   'change_count': 3
# }
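
As a rough illustration of how these fields could evolve between fetches, the sketch below updates a metadata dictionary from the previous record and a new content hash. The function `update_metadata` is hypothetical, and starting `change_count` at 0 on the first fetch is an assumption; the README does not say how Pulso counts the initial fetch.

```python
import time
from typing import Optional


def update_metadata(previous: Optional[dict], content_hash: str,
                    now: Optional[float] = None) -> dict:
    """Return the metadata record after a fetch (hypothetical helper)."""
    now = time.time() if now is None else now
    if previous is None:
        # First fetch: assumed to start with zero recorded changes.
        return {"content_hash": content_hash, "fetch_time": now,
                "change_time": now, "change_count": 0}
    changed = previous["content_hash"] != content_hash
    return {
        "content_hash": content_hash,
        "fetch_time": now,  # every fetch refreshes fetch_time
        # change_time and change_count move only when the hash differs
        "change_time": now if changed else previous["change_time"],
        "change_count": previous["change_count"] + (1 if changed else 0),
    }
```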

Create snapshots when content changes:

if pulso.has_changed(url):
    snapshot_path = pulso.snapshot(url)
    print(f"Snapshot saved: {snapshot_path}")

Cache Management

Granular cache control:

# Clear specific domain
pulso.cache.clear(domain="example.com")

# Clear specific URL
pulso.cache.clear(url="https://example.com/page")

# Clear entire cache
pulso.cache.clear()

# View registered domains
domains = pulso.get_registered_domains()

Installation

pip install pulso

For Playwright support (dynamic content):

pip install pulso
playwright install

Quick Start

import pulso

# Register domain with policy
pulso.register_domain(
    "news.example.com",
    ttl="12h",
    driver="playwright"
)

# Fetch content (cached automatically)
url = "https://news.example.com/article/123"
html = pulso.fetch(url)

# Check for changes
if pulso.has_changed(url):
    print("Article was updated!")
    pulso.snapshot(url)
else:
    print("No changes detected")

That's it. No manual cache handling, no cron jobs, no duplicate fetch logic.

Usage

Basic Fetching

import pulso

# Fetch with default settings (1 day TTL, requests driver)
html = pulso.fetch("https://example.com")

# Force refresh (bypass cache)
html = pulso.fetch("https://example.com", force=True)

Domain Configuration

# Register multiple domains
pulso.register_domain("api.service.com", ttl="5m", driver="requests")
pulso.register_domain("app.service.com", ttl="1h", driver="playwright")

# View all registered domains
domains = pulso.get_registered_domains()
for domain, policy in domains.items():
    print(f"{domain}: TTL={policy.ttl_seconds}s, Driver={policy.driver}")

Change Detection Workflow

import pulso

url = "https://blog.example.com/post/123"

# First fetch - creates cache entry
html = pulso.fetch(url)

# Later... check if content changed
if pulso.has_changed(url):
    # Content changed - get fresh version
    new_html = pulso.fetch(url, force=True)

    # Save snapshot
    snapshot_path = pulso.snapshot(url)

    # Process new content
    process_updated_content(new_html)

Metadata Inspection

metadata = pulso.get_metadata("https://example.com")

if metadata:
    print(f"Last fetched: {metadata['fetch_time']}")
    print(f"Last changed: {metadata['change_time']}")
    print(f"Total changes: {metadata['change_count']}")
    print(f"Content hash: {metadata['content_hash']}")

Error Handling and Retries

Pulso includes robust error handling with automatic retries and configurable fallback behavior:

import pulso

# Define error callback for monitoring/logging
def report_error(url, exception):
    print(f"Failed to fetch {url}: {exception}")
    # Send to monitoring system, log to file, etc.

# Register domain with error handling
pulso.register_domain(
    "unreliable-api.com",
    ttl="30m",
    driver="requests",
    max_retries=5,              # Retry up to 5 times
    retry_delay=2.0,            # Wait 2 seconds between retries
    fallback_on_error="return_cached",  # Return cached data on failure
    on_error=report_error       # Call this function on each error
)

# When fetch fails after all retries:
# - Logs warnings for each retry attempt
# - Calls on_error callback if provided
# - Returns last cached data (if fallback_on_error="return_cached")
html = pulso.fetch("https://unreliable-api.com/data")

Fallback behaviors:

  • return_cached (default) - Returns last successful fetch from cache, reports error but doesn't crash
  • raise_error - Raises FetchError exception for strict error handling
  • return_none - Returns None, allows graceful degradation

# Example: Graceful degradation
pulso.register_domain(
    "optional-service.com",
    fallback_on_error="return_none"
)

data = pulso.fetch("https://optional-service.com/api")
if data is None:
    print("Service unavailable, using defaults")
    data = get_default_data()
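
The retry and fallback rules described in this section can be approximated by a small loop. This is a sketch under stated assumptions, not Pulso's implementation: `fetch_with_retries`, its `fetcher`, `cached`, and `sleep` parameters, and the local `FetchError` class are stand-ins, and `max_retries` is treated here as the total number of attempts.

```python
import time
from typing import Callable, Optional


class FetchError(Exception):
    """Local stand-in for the FetchError mentioned above."""


def fetch_with_retries(url: str,
                       fetcher: Callable[[str], str],
                       cached: Optional[str] = None,
                       max_retries: int = 3,
                       retry_delay: float = 1.0,
                       fallback_on_error: str = "return_cached",
                       on_error: Optional[Callable] = None,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    last_exc: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return fetcher(url)
        except Exception as exc:
            last_exc = exc
            if on_error is not None:
                on_error(url, exc)  # report every failed attempt
            if attempt < max_retries - 1:
                sleep(retry_delay)  # wait before the next attempt
    # All attempts failed: apply the configured fallback.
    if fallback_on_error == "raise_error":
        raise FetchError(f"Failed to fetch {url}") from last_exc
    if fallback_on_error == "return_none":
        return None
    return cached  # "return_cached"; None if nothing was cached
```

Passing `sleep` and `fetcher` as parameters keeps the sketch testable without real network calls or real delays.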

Session-Based Caching

Isolate cache by user, tenant, or context using sessions:

import pulso

# Set session for user-specific caching
pulso.set_session("user_123")

# All cache operations now use user_123 session
html = pulso.fetch("https://example.com")

# Switch to different user
pulso.set_session("user_456")
# This fetches fresh data (different session)
html = pulso.fetch("https://example.com")

# Check current session
current_session = pulso.get_session()  # Returns: "user_456"

Use cases:

  • Multi-tenant applications (isolate cache per tenant)
  • User-specific data caching
  • A/B testing with different cache variants
  • Environment isolation (dev/staging/production)
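
One way to picture session isolation is a cache whose keys include the session ID, so the same URL maps to separate entries per session. The `SessionCache` class below is a hypothetical in-memory illustration; how Pulso actually namespaces its cache is not described here.

```python
from typing import Dict, Optional, Tuple


class SessionCache:
    """In-memory cache keyed by (session_id, url). Illustrative only."""

    def __init__(self, session_id: str = "default") -> None:
        self.session_id = session_id
        self._store: Dict[Tuple[str, str], str] = {}

    def set_session(self, session_id: str) -> None:
        self.session_id = session_id

    def put(self, url: str, html: str) -> None:
        self._store[(self.session_id, url)] = html

    def get(self, url: str) -> Optional[str]:
        # Entries stored under another session are invisible here.
        return self._store.get((self.session_id, url))
```

With this layout, switching from "user_123" to "user_456" causes a cache miss for a URL that "user_123" already fetched, which matches the behavior shown in the example above.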

Session via environment:

# .env file
PULSO_SESSION_ID=production
PULSO_CACHE_DIR=/custom/cache/path

Note: Pulso still reads the legacy environment variable names for backward compatibility, but prefer the PULSO_* names shown above.

import pulso

# Load from .env file
pulso.load_config(".env")
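
For readers unfamiliar with the .env format, the sketch below shows the kind of KEY=VALUE parsing a loader typically performs. The `parse_env` function is hypothetical and only illustrates the file format; it makes no claim about how `pulso.load_config` parses files or applies the values.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```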

Docker Support

Deploy Pulso in containers with Redis for distributed caching:

# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    environment:
      - PULSO_CACHE_BACKEND=redis
      - PULSO_REDIS_URL=redis://redis:6379/0
      - PULSO_SESSION_ID=production
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:

See DOCKER.md for complete deployment guide.

Examples

Complete working examples are available in the examples/ folder. See examples/README.md for detailed documentation on running each example.

API Reference

Core Functions

fetch(url: str, force: bool = False) -> str

Fetch web content with automatic caching.

Parameters:

  • url - URL to fetch
  • force - Force refresh, bypass cache (default: False)

Returns: HTML content as string

has_changed(url: str) -> bool

Check if content has changed since last fetch.

Parameters:

  • url - URL to check

Returns: True if content changed or URL not cached

snapshot(url: str, snapshot_dir: Optional[Path] = None) -> Optional[Path]

Create snapshot of cached HTML.

Parameters:

  • url - URL to snapshot
  • snapshot_dir - Optional snapshot directory

Returns: Path to snapshot file

get_metadata(url: str) -> Optional[dict]

Get metadata for cached URL.

Returns: Dictionary with metadata or None if not cached

register_domain(domain: str, ttl: str = "1d", driver: Literal["requests", "playwright"] = "requests", max_retries: int = 3, retry_delay: float = 1.0, fallback_on_error: Literal["return_cached", "raise_error", "return_none"] = "return_cached", on_error: Optional[Callable] = None) -> None

Register domain with fetch policy and error handling rules.

Parameters:

  • domain - Domain name (e.g., "example.com")
  • ttl - Time-to-live: "1d", "12h", "30m", "60s"
  • driver - Fetch driver: "requests" or "playwright"
  • max_retries - Maximum retry attempts on failure (default: 3)
  • retry_delay - Delay in seconds between retries (default: 1.0)
  • fallback_on_error - Error handling behavior:
    • "return_cached" - Return last cached data if available (default)
    • "raise_error" - Raise FetchError on failure
    • "return_none" - Return None on failure
  • on_error - Optional callback function(url, exception) for error reporting

get_registered_domains() -> Dict[str, DomainPolicy]

Get all registered domains and their policies.

Returns: Dictionary mapping domain names to DomainPolicy objects

set_session(session_id: str) -> None

Set the current session ID for isolated caching.

Parameters:

  • session_id - Unique identifier for this session

Example:

pulso.set_session("user_123")

get_session() -> str

Get the current session ID.

Returns: Current session ID

load_config(env_file: str = ".env") -> None

Load configuration from environment file.

Parameters:

  • env_file - Path to .env file (default: ".env")

Cache Manager

cache.clear(domain: Optional[str] = None, url: Optional[str] = None) -> None

Clear cache entries.

Parameters:

  • domain - Clear all entries for domain
  • url - Clear specific URL
  • (no params) - Clear entire cache

Cache Storage

Pulso stores cache at the user level, not within your project directory.

Locations

  • Linux / macOS: ~/.cache/pulso/
  • Windows: %LOCALAPPDATA%\pulso\

Organization

Cache is structured by domain and URL hashes:

~/.cache/pulso/
├── example.com/
│   ├── a3f2d9e1.json          # Metadata
│   ├── a3f2d9e1.html          # Content
│   └── ...
├── news.site/
│   └── ...
└── snapshots/
    └── ...

This structure makes the cache:

  • Inspectable - Easy to browse and debug
  • Portable - Safe to use across multiple projects
  • Manageable - Simple to clear or backup
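
To illustrate the layout above, the following sketch derives a metadata path and a content path from a URL. The helper `cache_paths` is hypothetical, and the choice of the first 8 hex characters of a SHA-256 digest of the URL is an assumption based only on the short names in the tree; Pulso's real naming scheme may differ.

```python
import hashlib
from pathlib import Path
from typing import Tuple
from urllib.parse import urlsplit


def cache_paths(url: str, root: Path) -> Tuple[Path, Path]:
    """Return (metadata_path, content_path) for a URL under root."""
    # One directory per domain, e.g. <root>/example.com/
    domain = urlsplit(url).hostname or "unknown"
    # Short, stable file stem derived from the full URL (assumed scheme).
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    domain_dir = root / domain
    return domain_dir / f"{digest}.json", domain_dir / f"{digest}.html"
```

For example, `cache_paths("https://example.com/page", Path.home() / ".cache" / "pulso")` yields two sibling files inside an `example.com/` directory that share the same stem.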

Architecture

Mental Model

Pulso is not a web crawler or scraping framework.

Think of it as:

requests + persistent memory + domain policies + content hashing

You call fetch() multiple times on the same URLs, and Pulso intelligently decides whether a network request is actually needed.
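
The mental model can be reduced to a few lines: keep the last result per URL and skip the network while it is still fresh. The `StatefulFetcher` class, its `download` callable, and the injectable `clock` are all hypothetical names used for illustration; this is a simplified sketch that leaves out persistence, hashing, and domain policies.

```python
import time
from typing import Callable, Dict, Tuple


class StatefulFetcher:
    """Fetch-through cache with a single TTL. Illustrative only."""

    def __init__(self, download: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._download = download
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def fetch(self, url: str, force: bool = False) -> str:
        now = self._clock()
        entry = self._cache.get(url)
        # Serve from memory when the entry exists and is within its TTL.
        if not force and entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise hit the network and remember the result.
        html = self._download(url)
        self._cache[url] = (now, html)
        return html
```

Calling `fetch()` twice in a row triggers one download; a third call after the TTL has elapsed, or with `force=True`, triggers another.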

Design Principles

Stateful over Stateless

  • Every fetch operation maintains state
  • Content history is preserved automatically
  • No need for external state management

Predictable over Clever

  • Explicit domain policies
  • No magic heuristics
  • Deterministic behavior

Hash-based over Time-based

  • Content identified by normalized hash
  • Immune to trivial HTML changes (whitespace, scripts)
  • Real changes always detected

What Pulso is NOT

  • ❌ Not a full-featured web scraping framework
  • ❌ Not a distributed crawler with spiders
  • ❌ Not a monitoring SaaS or alerting system
  • ❌ Not a proxy or request interceptor

Pulso is a library designed to be embedded in your own applications and data pipelines.

Roadmap

Features under development or consideration:

  • Rate limiting per domain
  • Conditional requests (ETag, Last-Modified headers)
  • DOM-level diffing for granular change detection
  • Change classification (minor vs. major)
  • CLI tools for cache inspection
  • Export adapters for AI/LLM pipelines
  • Async/await support
  • Custom hash functions
  • Webhook notifications

Contributing

Contributions are welcome! This project is in active development.

Development Setup

# Clone repository
git clone https://github.com/jhd3197/pulso.git
cd pulso

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install

# Run tests
pytest tests/

Guidelines

  • Write tests for new features
  • Follow existing code style (Black formatter)
  • Update documentation for API changes
  • Keep the API simple and predictable

License

MIT License - see LICENSE file for details.

Project Status

Status: Active Development

The public API is stabilizing around core functions (fetch, has_changed, snapshot) and domain policies. Breaking changes may occur before v1.0.0.


Built with a focus on predictability, state management, and intelligent caching.
