LLM-powered classification and extraction for web content

CatWeb

LLM-powered classification, extraction, and summarization for web content.

CatWeb is part of the CatLLM ecosystem. It is a thin wrapper around cat-stack that adds URL fetching and web-specific context injection.

Installation

pip install cat-web        # pulls in cat-stack automatically
pip install "cat-web[pdf]" # with PDF support (quotes keep shells such as zsh from expanding the brackets)

Quick Start

import catweb as cat

# Classify web pages by topic
results = cat.classify(
    categories=["News", "Opinion", "Tutorial", "Reference"],
    input_data=[
        "https://example.com/article1",
        "https://example.com/article2",
    ],
    api_key="your-api-key",
)

# Extract categories from web content
extracted = cat.extract(
    input_data=["https://example.com/page1", "https://example.com/page2"],
    description="Blog posts about technology",
    api_key="your-api-key",
)

# Summarize web pages
summaries = cat.summarize(
    input_data=["https://example.com/article1"],
    description="News articles",
    api_key="your-api-key",
)

How It Works

CatWeb accepts URLs as input, fetches the web content, strips HTML to plain text, and passes the text through cat-stack's classification/extraction/summarization pipeline. Original URLs are preserved in the output DataFrame's survey_input column.

You can also pass pre-fetched text directly; CatWeb auto-detects whether the input is URLs or plain text.
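
The flow described above (detect URLs, strip HTML to plain text, pass the text on) can be sketched with the standard library alone. This is a minimal illustration, not CatWeb's implementation: `looks_like_url` and `html_to_text` are hypothetical names, and the fetching step is left out so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def looks_like_url(value):
    """Hypothetical check: treat http(s) strings with a host as URLs."""
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML document to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(looks_like_url("https://example.com/article1"))  # True
print(looks_like_url("just text"))                      # False
print(html_to_text(page))                               # Title Hello world
```

A real fetcher would also need to handle redirects, encodings, and timeouts, which this sketch ignores.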

API Reference

`classify(categories, input_data, api_key, ...)`

Classify web content into predefined categories.

Parameter	Type	Description
`categories`	list	Category names for classification
`input_data`	list/Series	URLs or text strings to classify
`api_key`	str	API key for the model provider
`source_domain`	str	Source domain (injected as prompt context)
`content_type`	str	Content type, e.g. "news article", "blog post"
`web_metadata`	dict	Additional key-value context for the prompt
`timeout`	int	URL fetch timeout in seconds (default 30)
`**kwargs`		All cat-stack classify() parameters (models, creativity, batch_mode, etc.)
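
How `source_domain`, `content_type`, and `web_metadata` end up in the prompt is not documented here. As a rough sketch of what "injected as prompt context" could mean, the hypothetical `build_web_context` below turns the three parameters into a short preamble. The wording and layout are assumptions, not CatWeb's actual prompt.

```python
from typing import Optional


def build_web_context(
    source_domain: Optional[str] = None,
    content_type: Optional[str] = None,
    web_metadata: Optional[dict] = None,
) -> str:
    """Hypothetical helper: render web context as a prompt preamble."""
    lines = []
    if source_domain:
        lines.append(f"Source domain: {source_domain}")
    if content_type:
        lines.append(f"Content type: {content_type}")
    # Extra key-value pairs are appended in insertion order.
    for key, value in (web_metadata or {}).items():
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


context = build_web_context(
    source_domain="example.com",
    content_type="news article",
    web_metadata={"language": "en"},
)
print(context)
```

Parameters that are left unset contribute nothing, so calling the helper with no arguments yields an empty string.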

`extract(input_data, api_key, ...)`

Discover categories from web content.

Parameter	Type	Description
`input_data`	list/Series	URLs or text strings
`api_key`	str	API key
`source_domain`	str	Source domain context
`content_type`	str	Content type context
`web_metadata`	dict	Additional context
`timeout`	int	URL fetch timeout (default 30)
`**kwargs`		All cat-stack extract() parameters

`explore(input_data, api_key, ...)`

Raw category extraction (with duplicates) for saturation analysis.

Same parameters as extract(), plus all cat-stack explore() parameters.
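
Saturation analysis asks how quickly new categories stop appearing as more content is processed. The return shape of explore() is not described here, so the sketch below assumes a plain list of category strings in extraction order, duplicates included. `saturation_curve` is a hypothetical helper, not part of CatWeb.

```python
def saturation_curve(raw_categories):
    """Return the cumulative count of distinct categories after each item."""
    seen = set()
    curve = []
    for category in raw_categories:
        # Normalize case and whitespace so "News" and " news" count once.
        seen.add(category.strip().lower())
        curve.append(len(seen))
    return curve


# Made-up sample data standing in for raw explore() output.
raw = ["News", "Opinion", "news", "Tutorial", "Opinion", "News"]
print(saturation_curve(raw))  # [1, 2, 2, 3, 3, 3]
```

A curve that flattens suggests that few new categories remain to be found.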

`summarize(input_data, ...)`

Summarize web content.

Parameter	Type	Description
`input_data`	list/Series	URLs or text strings
`source_domain`	str	Source domain context
`content_type`	str	Content type context
`web_metadata`	dict	Additional context
`timeout`	int	URL fetch timeout (default 30)
`**kwargs`		All cat-stack summarize() parameters (api_key, description, models, etc.)

Web Utilities

from catweb import is_url, fetch_url_text, fetch_urls

# Check if a string is a URL
is_url("https://example.com")  # True
is_url("just text")            # False

# Fetch a single URL
text, error = fetch_url_text("https://example.com")

# Fetch multiple URLs
results = fetch_urls(["https://a.com", "https://b.com"])
# Returns: [(url, text, error), ...]
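
Since fetch_urls returns (url, text, error) tuples, a common next step is to separate pages that loaded from those that failed. The sketch below assumes error is None (or empty) on success, which the text above does not state. `split_fetch_results` is a hypothetical helper, and the sample tuples are invented so the example runs without network access.

```python
def split_fetch_results(results):
    """Partition (url, text, error) tuples into successes and failures."""
    fetched = {}
    failed = {}
    for url, text, error in results:
        if error:
            failed[url] = error
        else:
            fetched[url] = text
    return fetched, failed


# Invented sample data in the documented (url, text, error) shape.
sample = [
    ("https://a.com", "Page A text", None),
    ("https://b.com", None, "timeout after 30s"),
]
fetched, failed = split_fetch_results(sample)
print(fetched)  # {'https://a.com': 'Page A text'}
print(failed)   # {'https://b.com': 'timeout after 30s'}
```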

Multi-Model Ensemble

All cat-stack ensemble features work through **kwargs:

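import catweb as cat

# Example input; replace with your own URLs or text strings
urls = ["https://example.com/article1", "https://example.com/article2"]

results = cat.classify(
    categories=["Positive", "Negative", "Neutral"],
    input_data=urls,
    # Each entry is (model name, provider, API key)
    models=[
        ("gpt-4o", "openai", "sk-..."),
        ("claude-sonnet-4-5-20250929", "anthropic", "sk-ant-..."),
    ],
    consensus_threshold="majority",
)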
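
How cat-stack resolves disagreement between models is not spelled out here. As an illustration of what a "majority" threshold could mean, the hypothetical `majority_label` below accepts a label only when more than half of the models agree. cat-stack's actual rule may differ.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


print(majority_label(["Positive", "Positive", "Neutral"]))  # Positive
print(majority_label(["Positive", "Negative"]))             # None
```

With only two models, a strict majority requires both to agree, so a split vote yields no consensus label in this sketch.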

License

GPL-3.0-or-later

Project details

Version 0.1.0, released Mar 19, 2026

Download files

Source Distribution

cat_web-0.1.0.tar.gz (6.6 kB)

Uploaded Mar 19, 2026 Source

Built Distribution

cat_web-0.1.0-py3-none-any.whl (11.7 kB)

Uploaded Mar 19, 2026 Python 3

File details

Details for the file cat_web-0.1.0.tar.gz.

File metadata

Download URL: cat_web-0.1.0.tar.gz
Upload date: Mar 19, 2026
Size: 6.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_web-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2dc64b67a64267b7aad49a2cefd45f3561b5a685943efe7e3b5cb7994b2c386b`
MD5	`3f85de2a63eef7226db3b22483599b72`
BLAKE2b-256	`ca6a013dadd86a981275e2196487aee68878310bc26d8b5a07db9eb4836360be`

File details

Details for the file cat_web-0.1.0-py3-none-any.whl.

File metadata

Download URL: cat_web-0.1.0-py3-none-any.whl
Upload date: Mar 19, 2026
Size: 11.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_web-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`66db076df6b2517de663fa69eeb6a8b9ec5606d94cc0cc9515e98fe3c4878f67`
MD5	`552b395706524fee0312d96075027e68`
BLAKE2b-256	`b21aaff59039e2164bc8356619c88e90e04aa2c5ce1394aa07ea0453652e9670`
