LLM-powered classification and extraction for web content

CatWeb

LLM-powered classification, extraction, and summarization for web content.

CatWeb is part of the CatLLM ecosystem. It is a thin wrapper around cat-stack that adds URL fetching and web-specific context injection.

Installation

pip install cat-web          # pulls in cat-stack automatically
pip install "cat-web[pdf]"   # with PDF support (quoted so shells such as zsh do not expand the brackets)

Quick Start

import catweb as cat

# Classify web pages by topic
results = cat.classify(
    categories=["News", "Opinion", "Tutorial", "Reference"],
    input_data=[
        "https://example.com/article1",
        "https://example.com/article2",
    ],
    api_key="your-api-key",
)

# Extract categories from web content
extracted = cat.extract(
    input_data=["https://example.com/page1", "https://example.com/page2"],
    description="Blog posts about technology",
    api_key="your-api-key",
)

# Summarize web pages
summaries = cat.summarize(
    input_data=["https://example.com/article1"],
    description="News articles",
    api_key="your-api-key",
)

How It Works

CatWeb accepts URLs as input, fetches the web content, strips HTML to plain text, and passes the text through cat-stack's classification/extraction/summarization pipeline. Original URLs are preserved in the output DataFrame's survey_input column.
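
The "strips HTML to plain text" step can be illustrated with the standard library alone. This is a minimal sketch, not CatWeb's actual implementation: the `html_to_text` helper and the `_TextExtractor` class are hypothetical names, and the choice to skip `<script>` and `<style>` contents is an assumption about what a reasonable stripper would do.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html: str) -> str:
    """Hypothetical helper: reduce an HTML document to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<p>Hello <b>world</b></p><script>var x = 1;</script>"
print(html_to_text(sample))
```

The resulting text is what would then be handed to the classification, extraction, or summarization pipeline in place of the raw markup.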

You can also pass pre-fetched text directly. CatWeb auto-detects whether the input is URLs or plain text.
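
One plausible way to tell URLs from plain text is to check for an http(s) scheme and a host. The sketch below is an assumption about how such detection could work, not the rule CatWeb applies; `looks_like_url` and `split_inputs` are hypothetical names (the package's own checker is `is_url`, shown under Web Utilities).

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    """Hypothetical check: True for strings with an http(s) scheme and a host."""
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        # urlparse rejects some malformed inputs; treat those as plain text.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def split_inputs(items):
    """Partition inputs into (urls, texts), preserving order within each group."""
    urls = [item for item in items if looks_like_url(item)]
    texts = [item for item in items if not looks_like_url(item)]
    return urls, texts


urls, texts = split_inputs([
    "https://example.com/article1",
    "A paragraph of text that was fetched earlier.",
])
print(urls)
print(texts)
```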

API Reference

classify(categories, input_data, api_key, ...)

Classify web content into predefined categories.

| Parameter | Type | Description |
| --- | --- | --- |
| categories | list | Category names for classification |
| input_data | list/Series | URLs or text strings to classify |
| api_key | str | API key for the model provider |
| source_domain | str | Source domain (injected as prompt context) |
| content_type | str | Content type, e.g. "news article", "blog post" |
| web_metadata | dict | Additional key-value context for the prompt |
| timeout | int | URL fetch timeout in seconds (default 30) |
| **kwargs | | All cat-stack classify() parameters (models, creativity, batch_mode, etc.) |
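
To make "injected as prompt context" concrete, the sketch below shows one way the three context parameters could be turned into a prompt preamble. The function name `build_web_context` and the exact wording of each line are hypothetical; the source does not document the format CatWeb actually uses.

```python
def build_web_context(source_domain=None, content_type=None, web_metadata=None):
    """Hypothetical helper: format web-specific context as a prompt preamble."""
    lines = []
    if source_domain:
        lines.append(f"Source domain: {source_domain}")
    if content_type:
        lines.append(f"Content type: {content_type}")
    # Extra key-value pairs are appended in insertion order.
    for key, value in (web_metadata or {}).items():
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


preamble = build_web_context(
    source_domain="example.com",
    content_type="news article",
    web_metadata={"language": "en"},
)
print(preamble)
```

Omitting all three arguments yields an empty preamble, so the context is purely additive.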

extract(input_data, api_key, ...)

Discover categories from web content.

| Parameter | Type | Description |
| --- | --- | --- |
| input_data | list/Series | URLs or text strings |
| api_key | str | API key |
| source_domain | str | Source domain context |
| content_type | str | Content type context |
| web_metadata | dict | Additional context |
| timeout | int | URL fetch timeout in seconds (default 30) |
| **kwargs | | All cat-stack extract() parameters |

explore(input_data, api_key, ...)

Raw category extraction (with duplicates) for saturation analysis.

Same parameters as extract(), plus all cat-stack explore() parameters.
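
Saturation analysis asks how quickly new categories stop appearing as more content is processed. The sketch below assumes the raw output can be reduced to a flat list of category strings in processing order; that shape is an assumption, since the return format of explore() is not described here, and `saturation_curve` is a hypothetical helper.

```python
def saturation_curve(raw_categories):
    """Return the cumulative count of distinct categories after each item."""
    seen = set()
    curve = []
    for category in raw_categories:
        # Normalize case and whitespace so near-identical labels count once.
        seen.add(category.strip().lower())
        curve.append(len(seen))
    return curve


raw = ["Politics", "Sports", "politics", "Technology", "Sports", "Politics"]
print(saturation_curve(raw))  # → [1, 2, 2, 3, 3, 3]
```

A curve that flattens, as in the last three positions here, suggests that additional pages are no longer surfacing new categories.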

summarize(input_data, ...)

Summarize web content.

| Parameter | Type | Description |
| --- | --- | --- |
| input_data | list/Series | URLs or text strings |
| source_domain | str | Source domain context |
| content_type | str | Content type context |
| web_metadata | dict | Additional context |
| timeout | int | URL fetch timeout in seconds (default 30) |
| **kwargs | | All cat-stack summarize() parameters (api_key, description, models, etc.) |

Web Utilities

from catweb import is_url, fetch_url_text, fetch_urls

# Check if a string is a URL
is_url("https://example.com")  # True
is_url("just text")            # False

# Fetch a single URL
text, error = fetch_url_text("https://example.com")

# Fetch multiple URLs
results = fetch_urls(["https://a.com", "https://b.com"])
# Returns: [(url, text, error), ...]
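
Because fetch_urls returns one (url, text, error) tuple per URL, failed fetches can be separated from successful ones before any model call. The sketch below uses hard-coded sample tuples so it runs without network access; it assumes that error is None (or empty) on success, which the source does not state, and `partition_fetch_results` is a hypothetical helper.

```python
def partition_fetch_results(results):
    """Split (url, text, error) tuples into fetched and failed mappings."""
    fetched = {}
    failed = {}
    for url, text, error in results:
        if error:
            failed[url] = error
        else:
            fetched[url] = text
    return fetched, failed


# Sample data in the documented tuple shape; the values are made up.
sample_results = [
    ("https://a.com", "Page text from A", None),
    ("https://b.com", None, "timeout"),
]

fetched, failed = partition_fetch_results(sample_results)
print(fetched)
print(failed)
```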

Multi-Model Ensemble

All cat-stack ensemble features work through **kwargs:

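import catweb as cat

# URLs to classify (placeholders, as in the Quick Start)
urls = ["https://example.com/article1", "https://example.com/article2"]

results = cat.classify(
    categories=["Positive", "Negative", "Neutral"],
    input_data=urls,
    # Each entry is (model name, provider, API key)
    models=[
        ("gpt-4o", "openai", "sk-..."),
        ("claude-sonnet-4-5-20250929", "anthropic", "sk-ant-..."),
    ],
    consensus_threshold="majority",
)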
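
As an illustration of what a "majority" threshold could mean, the sketch below implements a generic strict-majority vote over per-model labels. It is not cat-stack's consensus logic, which is not described here; `majority_label` is a hypothetical helper, and returning None when no label wins is an assumption.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


print(majority_label(["Positive", "Positive", "Neutral"]))  # → Positive
print(majority_label(["Positive", "Negative"]))             # → None
```

With only two models, as in the call above, a strict majority requires both to agree, which is worth keeping in mind when choosing how many models to include.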

License

GPL-3.0-or-later
