CatWeb
LLM-powered classification, extraction, and summarization for web content.
Part of the CatLLM ecosystem. Thin wrapper around cat-stack that adds URL fetching and web-specific context injection.
Installation
pip install cat-web # pulls in cat-stack automatically
pip install cat-web[pdf] # with PDF support
Quick Start
import catweb as cat
# Classify web pages by topic
results = cat.classify(
    categories=["News", "Opinion", "Tutorial", "Reference"],
    input_data=[
        "https://example.com/article1",
        "https://example.com/article2",
    ],
    api_key="your-api-key",
)
# Extract categories from web content
extracted = cat.extract(
    input_data=["https://example.com/page1", "https://example.com/page2"],
    description="Blog posts about technology",
    api_key="your-api-key",
)
# Summarize web pages
summaries = cat.summarize(
    input_data=["https://example.com/article1"],
    description="News articles",
    api_key="your-api-key",
)
How It Works
CatWeb accepts URLs as input, fetches the web content, strips HTML to plain text, and passes the text through cat-stack's classification/extraction/summarization pipeline. Original URLs are preserved in the output DataFrame's survey_input column.
You can also pass pre-fetched text directly: CatWeb auto-detects whether each input is a URL or plain text.
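The flow described above (detect URLs, fetch, strip HTML, hand plain text to the pipeline) can be sketched with the standard library alone. This is an illustration, not CatWeb's actual implementation: `looks_like_url`, `html_to_text`, and `prepare_input` are hypothetical names, and the real package may use different detection rules or a dedicated HTML parsing library.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def looks_like_url(value):
    """Return True for strings with an http(s) scheme and a host (assumed rule)."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to whitespace-joined plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def prepare_input(item, fetch):
    """Fetch and strip URLs; pass plain text through unchanged.

    `fetch` is any callable mapping a URL to an HTML string (hypothetical hook).
    """
    return html_to_text(fetch(item)) if looks_like_url(item) else item
```

Passing the fetcher in as a callable keeps the sketch testable without network access; in CatWeb itself the fetching is handled internally.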
API Reference
classify(categories, input_data, api_key, ...)
Classify web content into predefined categories.
| Parameter | Type | Description |
|---|---|---|
| `categories` | list | Category names for classification |
| `input_data` | list/Series | URLs or text strings to classify |
| `api_key` | str | API key for the model provider |
| `source_domain` | str | Source domain (injected as prompt context) |
| `content_type` | str | Content type, e.g. "news article", "blog post" |
| `web_metadata` | dict | Additional key-value context for the prompt |
| `timeout` | int | URL fetch timeout in seconds (default 30) |
| `**kwargs` | | All cat-stack classify() parameters (models, creativity, batch_mode, etc.) |
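The `source_domain`, `content_type`, and `web_metadata` parameters are described as prompt context. How CatWeb formats that context is not documented here, so the sketch below shows only one plausible rendering; `build_web_context` is a hypothetical helper, not part of the package.

```python
def build_web_context(source_domain=None, content_type=None, web_metadata=None):
    """Render optional web context as a short prompt preamble (assumed format)."""
    lines = []
    if source_domain:
        lines.append(f"Source domain: {source_domain}")
    if content_type:
        lines.append(f"Content type: {content_type}")
    # Extra key-value pairs are appended in insertion order.
    for key, value in (web_metadata or {}).items():
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


context = build_web_context(
    source_domain="example.com",
    content_type="news article",
    web_metadata={"language": "en"},
)
print(context)
```

In an actual call, the same three values would be passed as keyword arguments to `classify()` alongside `categories` and `input_data`.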
extract(input_data, api_key, ...)
Discover categories from web content.
| Parameter | Type | Description |
|---|---|---|
| `input_data` | list/Series | URLs or text strings |
| `api_key` | str | API key |
| `source_domain` | str | Source domain context |
| `content_type` | str | Content type context |
| `web_metadata` | dict | Additional context |
| `timeout` | int | URL fetch timeout (default 30) |
| `**kwargs` | | All cat-stack extract() parameters |
explore(input_data, api_key, ...)
Raw category extraction (with duplicates) for saturation analysis.
Same parameters as extract(), plus all cat-stack explore() parameters.
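Saturation analysis asks how quickly new categories stop appearing as more content is processed. The return type of `explore()` is not specified here, so the sketch below assumes a flat list of category strings in extraction order; `saturation_curve` is a hypothetical helper.

```python
def saturation_curve(raw_categories):
    """Return the cumulative count of distinct categories after each item."""
    seen = set()
    curve = []
    for category in raw_categories:
        # Normalize case and whitespace so near-identical labels count once.
        seen.add(category.strip().lower())
        curve.append(len(seen))
    return curve


# Hand-written sample standing in for raw explore() output.
raw = ["Pricing", "Support", "pricing", "Shipping", "Support", "Pricing"]
curve = saturation_curve(raw)
print(curve)  # [1, 2, 2, 3, 3, 3]
```

A curve that flattens suggests that further pages are contributing few new categories.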
summarize(input_data, ...)
Summarize web content.
| Parameter | Type | Description |
|---|---|---|
| `input_data` | list/Series | URLs or text strings |
| `source_domain` | str | Source domain context |
| `content_type` | str | Content type context |
| `web_metadata` | dict | Additional context |
| `timeout` | int | URL fetch timeout (default 30) |
| `**kwargs` | | All cat-stack summarize() parameters (api_key, description, models, etc.) |
Web Utilities
from catweb import is_url, fetch_url_text, fetch_urls
# Check if a string is a URL
is_url("https://example.com") # True
is_url("just text") # False
# Fetch a single URL
text, error = fetch_url_text("https://example.com")
# Fetch multiple URLs
results = fetch_urls(["https://a.com", "https://b.com"])
# Returns: [(url, text, error), ...]
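Because `fetch_urls` returns `(url, text, error)` tuples, a common next step is separating pages that loaded from those that failed. The sketch below assumes `error` is `None` or empty on success, which the description above does not state explicitly; `partition_fetch_results` is a hypothetical helper.

```python
def partition_fetch_results(results):
    """Split (url, text, error) tuples into fetched texts and failures."""
    fetched = {}
    failed = {}
    for url, text, error in results:
        if error:  # assumption: error is None or empty on success
            failed[url] = error
        else:
            fetched[url] = text
    return fetched, failed


# Hand-written sample in the documented (url, text, error) shape.
sample = [
    ("https://a.com", "Page A text", None),
    ("https://b.com", None, "timeout"),
]
fetched, failed = partition_fetch_results(sample)
```

The `fetched` texts can then be passed to `classify()` or `summarize()` as plain text, and the `failed` URLs logged or retried.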
Multi-Model Ensemble
All cat-stack ensemble features work through **kwargs:
results = cat.classify(
    categories=["Positive", "Negative", "Neutral"],
    input_data=urls,
    models=[
        ("gpt-4o", "openai", "sk-..."),
        ("claude-sonnet-4-5-20250929", "anthropic", "sk-ant-..."),
    ],
    consensus_threshold="majority",
)
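As a rough illustration of what a "majority" consensus means, the sketch below marks a category as present when more than half of the models assign it. This is a generic majority vote over assumed 0/1 per-category decisions, not cat-stack's implementation; `majority_vote` is a hypothetical name, and tie handling in the library may differ.

```python
def majority_vote(votes):
    """Return 1 if strictly more than half of the 0/1 votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))


# One row per document: each model's 0/1 decision for a single category.
model_votes = [
    [1, 1, 0],  # two of three models agree -> 1
    [0, 1, 0],  # one of three -> 0
    [1, 0],     # tie between two models -> 0 under a strict majority
]
consensus = [majority_vote(row) for row in model_votes]
print(consensus)  # [1, 0, 0]
```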
License
GPL-3.0-or-later
File details
Details for the file cat_web-0.1.0.tar.gz.
File metadata
- Download URL: cat_web-0.1.0.tar.gz
- Upload date:
- Size: 6.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2dc64b67a64267b7aad49a2cefd45f3561b5a685943efe7e3b5cb7994b2c386b |
| MD5 | 3f85de2a63eef7226db3b22483599b72 |
| BLAKE2b-256 | ca6a013dadd86a981275e2196487aee68878310bc26d8b5a07db9eb4836360be |
File details
Details for the file cat_web-0.1.0-py3-none-any.whl.
File metadata
- Download URL: cat_web-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 66db076df6b2517de663fa69eeb6a8b9ec5606d94cc0cc9515e98fe3c4878f67 |
| MD5 | 552b395706524fee0312d96075027e68 |
| BLAKE2b-256 | b21aaff59039e2164bc8356619c88e90e04aa2c5ce1394aa07ea0453652e9670 |