Pulso delivers stateful web fetching with cache, hashes, and domain-aware rules

Pulso

Stateful web fetching with intelligent caching, content hashing, and domain-aware policies.

Pulso is a Python library that fetches web content once, remembers it, and only re-fetches when necessary. It's designed for data pipelines, content monitoring systems, and AI workflows where repeated requests and noisy HTML changes create unnecessary overhead.

Why Pulso
Key Features
Installation
Quick Start
Usage
Examples
API Reference
Cache Storage
Architecture
Roadmap
Contributing
License

Why Pulso

Most web scraping tools focus on getting content. Pulso focuses on not getting it again when nothing has changed.

Built For

Deterministic data pipelines - Ensure reproducible results across runs
Change detection - Monitor content updates without wasteful re-fetching
Content monitoring - Track website changes efficiently
AI workflows - Avoid reprocessing identical HTML repeatedly

Core Principles

Stateful by design - Every fetch maintains metadata and history
Domain-aware policies - Configure TTL and fetch behavior per domain
Hash-based identification - Content changes detected via normalized hashes, not timestamps
Change detection first - Built-in tracking of content modifications

Key Features

Smart Fetching

Automatic driver selection based on content type:

Static pages - Fast fetching with requests
Dynamic content - JavaScript rendering with playwright
Per-domain configuration - Set driver preference for each domain

import pulso

# Simple fetch with automatic caching
html = pulso.fetch("https://example.com")

Domain-Aware Caching

Configure time-to-live (TTL) and fetch behavior per domain:

pulso.register_domain(
    "example.com",
    ttl="1d",        # Cache for 1 day
    driver="requests"
)

pulso.register_domain(
    "dynamic-site.com",
    ttl="6h",        # Cache for 6 hours
    driver="playwright"
)

Supported TTL formats: 1d (day), 12h (hours), 30m (minutes), 60s (seconds)
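As a rough illustration of the TTL strings listed above, the sketch below converts them to seconds. It is not Pulso's own parser: the function name `parse_ttl`, the restriction to whole numbers, and the error raised for unknown suffixes are assumptions made for this example.

```python
import re

# Seconds per unit suffix; the four suffixes are the ones listed above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_ttl(ttl: str) -> int:
    """Convert a TTL string such as "12h" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", ttl.strip().lower())
    if match is None:
        # Assumption: anything outside <digits><unit> is rejected.
        raise ValueError(f"Unsupported TTL format: {ttl!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(parse_ttl("1d"))   # 86400
print(parse_ttl("30m"))  # 1800
```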

Pulso automatically:

Returns cached content if still fresh (within TTL)
Re-fetches only after TTL expires
Respects domain-specific policies consistently

Content Hashing

Intelligent change detection using normalized content hashes:

if pulso.has_changed("https://example.com"):
    print("Content has been updated!")

How it works:

HTML is normalized (whitespace, scripts, styles removed)
Content hashed with SHA-256
Same hash = no meaningful change
Different hash = real content update
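The steps above can be sketched with the standard library alone. This is a simplified approximation, not Pulso's actual normalizer: the function name `normalized_hash` and the regular expressions are assumptions, and a real implementation may use an HTML parser and strip more than this.

```python
import hashlib
import re


def normalized_hash(html: str) -> str:
    """Hash HTML after dropping scripts, styles, and whitespace differences."""
    # Remove <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    # Drop whitespace that only separates tags, then collapse the rest.
    text = re.sub(r">\s+<", "><", text)
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = "<html><body><p>Hello</p><script>var t = 1;</script></body></html>"
b = "<html><body>\n  <p>Hello</p>\n<script>var t = 2;</script></body></html>"
c = "<html><body><p>Goodbye</p></body></html>"

print(normalized_hash(a) == normalized_hash(b))  # True
print(normalized_hash(a) == normalized_hash(c))  # False
```

Pages `a` and `b` differ only in indentation and in an inline script, so they hash identically; `c` changes the visible text and produces a different hash.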

Change Tracking

Comprehensive metadata for every URL:

metadata = pulso.get_metadata(url)
# Returns:
# {
#   'content_hash': '8f3d9a...',
#   'fetch_time': 1234567890.0,
#   'change_time': 1234567890.0,
#   'change_count': 3
# }
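One plausible way these four fields could evolve from fetch to fetch is sketched below. The helper `update_metadata` is hypothetical, and the choice to start `change_count` at 0 on the first fetch is an assumption; only the field names come from the output shown above.

```python
import time
from typing import Optional


def update_metadata(previous: Optional[dict], new_hash: str,
                    now: Optional[float] = None) -> dict:
    """Return the metadata record after a fetch that produced `new_hash`."""
    if now is None:
        now = time.time()
    if previous is None:
        # First fetch: nothing to compare against, so no change is counted.
        return {"content_hash": new_hash, "fetch_time": now,
                "change_time": now, "change_count": 0}
    changed = previous["content_hash"] != new_hash
    return {
        "content_hash": new_hash,
        "fetch_time": now,  # advances on every fetch
        "change_time": now if changed else previous["change_time"],
        "change_count": previous["change_count"] + (1 if changed else 0),
    }


first = update_metadata(None, "aaa", now=100.0)
same = update_metadata(first, "aaa", now=200.0)
changed = update_metadata(same, "bbb", now=300.0)
```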

Create snapshots when content changes:

if pulso.has_changed(url):
    snapshot_path = pulso.snapshot(url)
    print(f"Snapshot saved: {snapshot_path}")

Cache Management

Granular cache control:

# Clear specific domain
pulso.cache.clear(domain="example.com")

# Clear specific URL
pulso.cache.clear(url="https://example.com/page")

# Clear entire cache
pulso.cache.clear()

# View registered domains
domains = pulso.get_registered_domains()

Installation

pip install pulso

For Playwright support (dynamic content):

pip install pulso
playwright install

Quick Start

import pulso

# Register domain with policy
pulso.register_domain(
    "news.example.com",
    ttl="12h",
    driver="playwright"
)

# Fetch content (cached automatically)
url = "https://news.example.com/article/123"
html = pulso.fetch(url)

# Check for changes
if pulso.has_changed(url):
    print("Article was updated!")
    pulso.snapshot(url)
else:
    print("No changes detected")

That's it. No manual cache handling, no cron jobs, no duplicate fetch logic.

Usage

Basic Fetching

import pulso

# Fetch with default settings (1 day TTL, requests driver)
html = pulso.fetch("https://example.com")

# Force refresh (bypass cache)
html = pulso.fetch("https://example.com", force=True)

Domain Configuration

# Register multiple domains
pulso.register_domain("api.service.com", ttl="5m", driver="requests")
pulso.register_domain("app.service.com", ttl="1h", driver="playwright")

# View all registered domains
domains = pulso.get_registered_domains()
for domain, policy in domains.items():
    print(f"{domain}: TTL={policy.ttl_seconds}s, Driver={policy.driver}")

Change Detection Workflow

import pulso

url = "https://blog.example.com/post/123"

# First fetch - creates cache entry
html = pulso.fetch(url)

# Later... check if content changed
if pulso.has_changed(url):
    # Content changed - get fresh version
    new_html = pulso.fetch(url, force=True)

    # Save snapshot
    snapshot_path = pulso.snapshot(url)

    # Process new content
    process_updated_content(new_html)

Metadata Inspection

metadata = pulso.get_metadata("https://example.com")

if metadata:
    print(f"Last fetched: {metadata['fetch_time']}")
    print(f"Last changed: {metadata['change_time']}")
    print(f"Total changes: {metadata['change_count']}")
    print(f"Content hash: {metadata['content_hash']}")

Error Handling and Retries

Pulso retries failed fetches automatically and lets you configure what happens once every retry has failed:

import pulso

# Define error callback for monitoring/logging
def report_error(url, exception):
    print(f"Failed to fetch {url}: {exception}")
    # Send to monitoring system, log to file, etc.

# Register domain with error handling
pulso.register_domain(
    "unreliable-api.com",
    ttl="30m",
    driver="requests",
    max_retries=5,              # Retry up to 5 times
    retry_delay=2.0,            # Wait 2 seconds between retries
    fallback_on_error="return_cached",  # Return cached data on failure
    on_error=report_error       # Call this function on each error
)

# When fetch fails after all retries:
# - Logs warnings for each retry attempt
# - Calls on_error callback if provided
# - Returns last cached data (if fallback_on_error="return_cached")
html = pulso.fetch("https://unreliable-api.com/data")

Fallback behaviors:

return_cached (default) - Returns the last successful fetch from the cache; the error is reported but not raised
raise_error - Raises a FetchError exception for strict error handling
return_none - Returns None, which allows graceful degradation
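The retry loop and the three fallback behaviors can be modeled as follows. This is a sketch under stated assumptions rather than Pulso's code: `fetch_with_retries`, the injected `download` and `sleep` parameters, and the local `FetchError` class are hypothetical; `max_retries` is treated as the total number of attempts; and a missing cache entry under "return_cached" is assumed to raise.

```python
import time
from typing import Callable, Optional


class FetchError(Exception):
    """Local stand-in for the FetchError exception named above."""


def fetch_with_retries(
    url: str,
    download: Callable[[str], str],
    cached: Optional[str] = None,
    max_retries: int = 3,
    retry_delay: float = 1.0,
    fallback_on_error: str = "return_cached",
    on_error: Optional[Callable[[str, Exception], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return download(url)
        except Exception as exc:
            last_error = exc
            if on_error is not None:
                on_error(url, exc)  # called once per failed attempt
            if attempt < max_retries - 1:
                sleep(retry_delay)  # wait before the next attempt
    # Every attempt failed: apply the configured fallback.
    if fallback_on_error == "return_cached" and cached is not None:
        return cached
    if fallback_on_error == "return_none":
        return None
    raise FetchError(f"Failed to fetch {url}") from last_error
```

Injecting `sleep` keeps the function easy to test: passing `lambda seconds: None` skips the real delay.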

# Example: Graceful degradation
pulso.register_domain(
    "optional-service.com",
    fallback_on_error="return_none"
)

data = pulso.fetch("https://optional-service.com/api")
if data is None:
    print("Service unavailable, using defaults")
    data = get_default_data()

Session-Based Caching

Isolate cache by user, tenant, or context using sessions:

import pulso

# Set session for user-specific caching
pulso.set_session("user_123")

# All cache operations now use user_123 session
html = pulso.fetch("https://example.com")

# Switch to different user
pulso.set_session("user_456")
# This fetches fresh data (different session)
html = pulso.fetch("https://example.com")

# Check current session
current_session = pulso.get_session()  # Returns: "user_456"

Use cases:

Multi-tenant applications (isolate cache per tenant)
User-specific data caching
A/B testing with different cache variants
Environment isolation (dev/staging/production)
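One simple way to get this kind of isolation is to fold the session ID into the cache key, as sketched below. How Pulso actually scopes entries by session is not documented here; the function `cache_key` and the key format are assumptions for illustration.

```python
import hashlib


def cache_key(session_id: str, url: str) -> str:
    """Derive a cache key that differs whenever the session or the URL differs."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    raw = f"{session_id}\x00{url}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


key_a = cache_key("user_123", "https://example.com")
key_b = cache_key("user_456", "https://example.com")
print(key_a == key_b)  # False
```

Because the keys differ, the second session misses the cache and triggers a fresh fetch, which matches the behavior described in the example above.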

Session via environment:

# .env file
PULSO_SESSION_ID=production
PULSO_CACHE_DIR=/custom/cache/path

Note: Pulso still reads legacy environment variable names for backward compatibility, but prefer the PULSO_* names shown above.

import pulso

# Load from .env file
pulso.load_config(".env")
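For readers unfamiliar with the .env format, the sketch below shows the kind of KEY=VALUE parsing such a loader typically performs. It is an assumption about the file format, not a description of what `load_config` does internally, and `parse_env` is a hypothetical helper.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Surrounding quotes are optional in most .env dialects.
        values[key.strip()] = value.strip().strip("'\"")
    return values


sample = """# .env file
PULSO_SESSION_ID=production
PULSO_CACHE_DIR=/custom/cache/path
"""
config = parse_env(sample)
print(config["PULSO_SESSION_ID"])  # production
```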

Docker Support

Deploy Pulso in containers with Redis for distributed caching:

# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    environment:
      - PULSO_CACHE_BACKEND=redis
      - PULSO_REDIS_URL=redis://redis:6379/0
      - PULSO_SESSION_ID=production
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:

See DOCKER.md for complete deployment guide.

Examples

Complete working examples are available in the examples/ folder:

example.py - Basic usage with domain registration, fetching, and change detection
example_error_handling.py - Error handling patterns with retries and fallback behaviors
example_sessions.py - Session-based caching for multi-tenant applications
example_docker.py - Production Docker deployment with Redis

See the examples/README.md for detailed documentation on running each example.

API Reference

Core Functions

`fetch(url: str, force: bool = False) -> str`

Fetch web content with automatic caching.

Parameters:

url - URL to fetch
force - Force refresh, bypass cache (default: False)

Returns: HTML content as string (or None if the fetch fails and the domain's fallback_on_error is "return_none")

`has_changed(url: str) -> bool`

Check if content has changed since last fetch.

Parameters:

url - URL to check

Returns: True if content changed or URL not cached

`snapshot(url: str, snapshot_dir: Optional[Path] = None) -> Optional[Path]`

Create snapshot of cached HTML.

Parameters:

url - URL to snapshot
snapshot_dir - Optional snapshot directory

Returns: Path to snapshot file
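To make the idea of a snapshot concrete, the sketch below writes HTML to a timestamped file and returns its path. The helper `write_snapshot`, the file naming scheme, and the use of a truncated SHA-256 of the URL are all assumptions; Pulso's own snapshot layout may differ.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def write_snapshot(url: str, html: str, snapshot_dir: Path,
                   now: Optional[float] = None) -> Path:
    """Write `html` to a timestamped file and return its path."""
    # time.gmtime(None) uses the current time, so `now` is optional.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    # Hypothetical naming: short URL hash plus UTC timestamp.
    url_id = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    path = snapshot_dir / f"{url_id}-{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Including the timestamp in the file name means successive snapshots of the same URL do not overwrite each other, as long as they are taken at least one second apart.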

`get_metadata(url: str) -> Optional[dict]`

Get metadata for cached URL.

Returns: Dictionary with metadata or None if not cached

`register_domain(domain: str, ttl: str = "1d", driver: Literal["requests", "playwright"] = "requests", max_retries: int = 3, retry_delay: float = 1.0, fallback_on_error: Literal["return_cached", "raise_error", "return_none"] = "return_cached", on_error: Optional[Callable] = None) -> None`

Register a caching and fetch policy for a domain.

Parameters:

domain - Domain name (e.g., "example.com")
ttl - Time-to-live: "1d", "12h", "30m", "60s"
driver - Fetch driver: "requests" or "playwright"
max_retries - Maximum retry attempts on failure (default: 3)
retry_delay - Delay in seconds between retries (default: 1.0)
fallback_on_error - Error handling behavior:
- "return_cached" - Return last cached data if available (default)
- "raise_error" - Raise FetchError on failure
- "return_none" - Return None on failure
on_error - Optional callback function(url, exception) for error reporting

`get_registered_domains() -> Dict[str, DomainPolicy]`

Get all registered domains and their policies.

Returns: Dictionary mapping domain names to DomainPolicy objects
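A minimal model of a policy registry is sketched below. The attributes `ttl_seconds` and `driver` appear in the usage example earlier in this document, and the defaults follow the documented 1 day TTL and requests driver; everything else (the `register` and `policy_for` helpers, exact hostname matching with no subdomain inheritance) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlparse


@dataclass(frozen=True)
class DomainPolicy:
    ttl_seconds: int = 86400   # documented default: 1 day
    driver: str = "requests"   # or "playwright"


_registry: Dict[str, DomainPolicy] = {}


def register(domain: str, policy: DomainPolicy) -> None:
    """Store a policy under a lower-cased domain name."""
    _registry[domain.lower()] = policy


def policy_for(url: str) -> DomainPolicy:
    """Look up the policy by exact hostname, else fall back to the defaults."""
    host = (urlparse(url).hostname or "").lower()
    return _registry.get(host, DomainPolicy())


register("app.service.com", DomainPolicy(ttl_seconds=3600, driver="playwright"))
print(policy_for("https://app.service.com/dashboard").driver)  # playwright
print(policy_for("https://unregistered.example/").ttl_seconds)  # 86400
```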

`set_session(session_id: str) -> None`

Set the current session ID for isolated caching.

Parameters:

session_id - Unique identifier for this session

Example:

pulso.set_session("user_123")

`get_session() -> str`

Get the current session ID.

Returns: Current session ID

`load_config(env_file: str = ".env") -> None`

Load configuration from environment file.

Parameters:

env_file - Path to .env file (default: ".env")

Cache Manager

`cache.clear(domain: Optional[str] = None, url: Optional[str] = None) -> None`

Clear cache entries.

Parameters:

domain - Clear all entries for domain
url - Clear specific URL
(no params) - Clear entire cache
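The three modes can be modeled over a plain dictionary, as in the sketch below. The function `clear` here is a stand-in operating on an in-memory mapping, not Pulso's cache manager, and giving `url` precedence when both arguments are passed is an assumption.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def clear(entries: Dict[str, str], domain: Optional[str] = None,
          url: Optional[str] = None) -> int:
    """Remove matching entries from `entries` (URL -> content); return the count."""
    if url is not None:
        doomed = [u for u in entries if u == url]
    elif domain is not None:
        doomed = [u for u in entries if urlparse(u).hostname == domain]
    else:
        doomed = list(entries)  # no filter: clear everything
    for u in doomed:
        del entries[u]
    return len(doomed)


store = {
    "https://example.com/a": "A",
    "https://example.com/b": "B",
    "https://news.site/c": "C",
}
print(clear(store, url="https://example.com/a"))  # 1
print(clear(store, domain="example.com"))         # 1
print(clear(store))                               # 1
```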

Cache Storage

Pulso stores cache at the user level, not within your project directory.

Locations

Linux / macOS: ~/.cache/pulso/
Windows: %LOCALAPPDATA%\pulso\

Organization

Cache is structured by domain and URL hashes:

~/.cache/pulso/
├── example.com/
│   ├── a3f2d9e1.json          # Metadata
│   ├── a3f2d9e1.html          # Content
│   └── ...
├── news.site/
│   └── ...
└── snapshots/
    └── ...

This structure makes the cache:

Inspectable - Easy to browse and debug
Portable - Safe to use across multiple projects
Manageable - Simple to clear or backup
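The layout above suggests that each URL maps to a pair of files under its domain directory. The sketch below derives such paths; the helper `cache_paths` is hypothetical, and treating the 8-character file name as a truncated SHA-256 of the URL is an assumption, since the document does not say which hash produces it.

```python
import hashlib
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def cache_paths(cache_root: Path, url: str) -> Tuple[Path, Path]:
    """Return (metadata_path, content_path) for `url` under `cache_root`."""
    domain = urlparse(url).hostname or "unknown"
    # Assumption: the short file name is a truncated SHA-256 of the URL.
    url_id = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    base = cache_root / domain
    return base / f"{url_id}.json", base / f"{url_id}.html"


meta_path, content_path = cache_paths(
    Path.home() / ".cache" / "pulso", "https://example.com/page"
)
print(meta_path.parent.name)  # example.com
```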

Architecture

Mental Model

Pulso is not a web crawler or scraping framework.

Think of it as:

requests + persistent memory + domain policies + content hashing

You call fetch() multiple times on the same URLs, and Pulso intelligently decides whether a network request is actually needed.
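The mental model can be reduced to a few lines of in-memory code. The class below is an illustrative toy, not Pulso: `TinyFetcher`, its injected `download` and `clock` parameters, and the single TTL shared by every URL are assumptions, and it omits persistence, hashing, and domain policies.

```python
import time
from typing import Callable, Dict, Tuple


class TinyFetcher:
    """In-memory stand-in: one TTL for every URL, download function injected."""

    def __init__(self, download: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._download = download
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}
        self.network_calls = 0

    def fetch(self, url: str, force: bool = False) -> str:
        entry = self._cache.get(url)
        now = self._clock()
        if not force and entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no network request
        self.network_calls += 1
        html = self._download(url)
        self._cache[url] = (now, html)
        return html


fetcher = TinyFetcher(lambda url: f"<p>{url}</p>", ttl_seconds=60)
fetcher.fetch("https://example.com")
fetcher.fetch("https://example.com")
print(fetcher.network_calls)  # 1
```

The second call is served from memory, so only one download happens; passing `force=True` or letting the TTL elapse triggers another.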

Design Principles

Stateful over Stateless

Every fetch operation maintains state
Content history is preserved automatically
No need for external state management

Predictable over Clever

Explicit domain policies
No magic heuristics
Deterministic behavior

Hash-based over Time-based

Content identified by normalized hash
Immune to trivial HTML changes (whitespace, scripts)
Real changes always detected

What Pulso is NOT

❌ Not a full-featured web scraping framework
❌ Not a distributed crawler with spiders
❌ Not a monitoring SaaS or alerting system
❌ Not a proxy or request interceptor

Pulso is a library designed to be embedded in your own applications and data pipelines.

Roadmap

Features under development or consideration:

Rate limiting per domain
Conditional requests (ETag, Last-Modified headers)
DOM-level diffing for granular change detection
Change classification (minor vs. major)
CLI tools for cache inspection
Export adapters for AI/LLM pipelines
Async/await support
Custom hash functions
Webhook notifications

Contributing

Contributions are welcome! This project is in active development.

Development Setup

# Clone repository
git clone https://github.com/jhd3197/pulso.git
cd pulso

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install

# Run tests
pytest tests/

Guidelines

Write tests for new features
Follow existing code style (Black formatter)
Update documentation for API changes
Keep the API simple and predictable

License

MIT License - see LICENSE file for details.

Project Status

Status: Active Development

The public API is stabilizing around core functions (fetch, has_changed, snapshot) and domain policies. Breaking changes may occur before v1.0.0.

Built with a focus on predictability, state management, and intelligent caching.

Release history

0.1.2 - Dec 20, 2025
0.1.1 - Nov 27, 2025 (this version)
0.1.0 - Nov 27, 2025
