CatWeb

LLM-powered classification, extraction, and summarization for web content.

Part of the CatLLM ecosystem. Thin wrapper around cat-stack that adds URL fetching and web-specific context injection.

Installation

pip install cat-web          # pulls in cat-stack automatically
pip install "cat-web[pdf]"   # with PDF support (quotes stop shells such as zsh from expanding the brackets)

Quick Start

import catweb as cat

# Classify web pages by topic
results = cat.classify(
    categories=["News", "Opinion", "Tutorial", "Reference"],
    input_data=[
        "https://example.com/article1",
        "https://example.com/article2",
    ],
    api_key="your-api-key",
)

# Extract categories from web content
extracted = cat.extract(
    input_data=["https://example.com/page1", "https://example.com/page2"],
    description="Blog posts about technology",
    api_key="your-api-key",
)

# Summarize web pages
summaries = cat.summarize(
    input_data=["https://example.com/article1"],
    description="News articles",
    api_key="your-api-key",
)

How It Works

CatWeb accepts URLs as input, fetches the web content, strips HTML to plain text, and passes the text through cat-stack's classification/extraction/summarization pipeline. Original URLs are preserved in the output DataFrame's survey_input column.

You can also pass pre-fetched text directly; CatWeb auto-detects whether the input is URLs or plain text.
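
The detect-then-strip steps described above can be pictured with a small standard-library sketch. This is an illustration only, not CatWeb's actual implementation: `looks_like_url` and `html_to_text` are hypothetical names, and the real package may detect URLs and clean HTML differently. Fetching is left out so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def looks_like_url(value):
    """Hypothetical check: an http(s) string with a host counts as a URL."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its visible text, joined by spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(looks_like_url("https://example.com/article1"))
print(html_to_text(page))
```

Plain-text inputs would skip both steps and go to the pipeline unchanged.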

API Reference

classify(categories, input_data, api_key, ...)

Classify web content into predefined categories.

| Parameter | Type | Description |
| --- | --- | --- |
| categories | list | Category names for classification |
| input_data | list/Series | URLs or text strings to classify |
| api_key | str | API key for the model provider |
| source_domain | str | Source domain (injected as prompt context) |
| content_type | str | Content type, e.g. "news article", "blog post" |
| web_metadata | dict | Additional key-value context for the prompt |
| timeout | int | URL fetch timeout in seconds (default 30) |
| **kwargs | | All cat-stack classify() parameters (models, creativity, batch_mode, etc.) |
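
To make "injected as prompt context" concrete, here is a minimal sketch of how the three context parameters might be folded into a text block that precedes the content. The function `build_web_context` and the output layout are assumptions made for illustration; the source does not show the prompt format CatWeb actually uses.

```python
def build_web_context(source_domain=None, content_type=None, web_metadata=None):
    """Hypothetical helper: render the web context as one line per field."""
    lines = []
    if source_domain:
        lines.append(f"Source domain: {source_domain}")
    if content_type:
        lines.append(f"Content type: {content_type}")
    # Extra key-value pairs are appended in insertion order
    for key, value in (web_metadata or {}).items():
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


context = build_web_context(
    source_domain="example.com",
    content_type="news article",
    web_metadata={"language": "en"},
)
print(context)
```

Omitted parameters simply contribute no line, so a call with no context yields an empty string.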

extract(input_data, api_key, ...)

Discover categories from web content.

| Parameter | Type | Description |
| --- | --- | --- |
| input_data | list/Series | URLs or text strings |
| api_key | str | API key |
| source_domain | str | Source domain context |
| content_type | str | Content type context |
| web_metadata | dict | Additional context |
| timeout | int | URL fetch timeout in seconds (default 30) |
| **kwargs | | All cat-stack extract() parameters |

explore(input_data, api_key, ...)

Raw category extraction (with duplicates) for saturation analysis.

Same parameters as extract(), plus all cat-stack explore() parameters.
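
One common way to use raw, duplicate-containing categories for saturation analysis is to track how many distinct categories have appeared as more pages are processed; a curve that flattens suggests few new categories remain. The sketch below assumes the raw output has already been arranged as one list of category names per page. That shape, and the name `saturation_curve`, are assumptions and not the documented return format of explore().

```python
def saturation_curve(categories_per_item):
    """Return the cumulative count of distinct categories after each item."""
    seen = set()
    curve = []
    for categories in categories_per_item:
        for name in categories:
            # Normalize case and whitespace so "Politics" and "politics" match
            seen.add(name.strip().lower())
        curve.append(len(seen))
    return curve


raw = [
    ["Politics", "Economy"],
    ["Economy", "Sports"],
    ["politics"],
    ["Sports", "Economy"],
]
print(saturation_curve(raw))  # [2, 3, 3, 3]
```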

summarize(input_data, ...)

Summarize web content.

| Parameter | Type | Description |
| --- | --- | --- |
| input_data | list/Series | URLs or text strings |
| source_domain | str | Source domain context |
| content_type | str | Content type context |
| web_metadata | dict | Additional context |
| timeout | int | URL fetch timeout in seconds (default 30) |
| **kwargs | | All cat-stack summarize() parameters (api_key, description, models, etc.) |

Web Utilities

from catweb import is_url, fetch_url_text, fetch_urls

# Check if a string is a URL
is_url("https://example.com")  # True
is_url("just text")            # False

# Fetch a single URL
text, error = fetch_url_text("https://example.com")

# Fetch multiple URLs
results = fetch_urls(["https://a.com", "https://b.com"])
# Returns: [(url, text, error), ...]
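
Because fetch_urls returns (url, text, error) tuples, a typical next step is to separate pages that were fetched from those that failed. The sketch below works on stand-in data so it runs without network access. It assumes the error slot is empty (None or an empty string) on success, which the source does not state, and `split_fetch_results` is a hypothetical helper rather than part of catweb.

```python
def split_fetch_results(results):
    """Split (url, text, error) tuples into fetched texts and failures."""
    fetched, failed = {}, {}
    for url, text, error in results:
        if error:
            failed[url] = error
        else:
            fetched[url] = text
    return fetched, failed


# Stand-in data shaped like the documented return value of fetch_urls
sample = [
    ("https://a.com", "Page A text", None),
    ("https://b.com", None, "timeout after 30s"),
]
fetched, failed = split_fetch_results(sample)
print(fetched)
print(failed)
```

The texts in `fetched` could then be passed to classify(), extract(), or summarize() as plain-text input.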

Multi-Model Ensemble

All cat-stack ensemble features work through **kwargs:

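urls = [
    "https://example.com/article1",
    "https://example.com/article2",
]

# One (model, provider, API key) tuple per model in the ensemble
results = cat.classify(
    categories=["Positive", "Negative", "Neutral"],
    input_data=urls,
    models=[
        ("gpt-4o", "openai", "sk-..."),
        ("claude-sonnet-4-5-20250929", "anthropic", "sk-ant-..."),
    ],
    consensus_threshold="majority",
)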
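
As a rough picture of what a "majority" consensus means, the sketch below keeps a category only when more than half of the models assigned it. This is a generic strict-majority vote written for illustration; `majority_vote` and the vote format are assumptions, and cat-stack's own consensus logic may differ in detail, for example in how ties are handled.

```python
from collections import Counter


def majority_vote(votes):
    """Keep the categories assigned by more than half of the models.

    votes maps a model name to the set of categories that model assigned.
    """
    n_models = len(votes)
    counts = Counter(cat for assigned in votes.values() for cat in assigned)
    return sorted(cat for cat, n in counts.items() if n > n_models / 2)


votes = {
    "model_a": {"Positive"},
    "model_b": {"Positive", "Neutral"},
    "model_c": {"Negative"},
}
print(majority_vote(votes))  # ['Positive']
```

Under this strict-majority reading, an ensemble of two models keeps a category only when both agree on it.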

License

GPL-3.0-or-later
