Policy document classification powered by LLMs
cat-pol
Political text classification and analysis powered by LLMs. A policy-specific wrapper around cat-stack with built-in access to 15 political data sources on HuggingFace.
Installation
pip install cat-pol
With optional extras:
pip install "cat-pol[pdf]" # PDF document processing
pip install "cat-pol[embeddings]" # Embedding-based similarity scoring
pip install "cat-pol[sources]" # Data source loading (datasets, huggingface_hub)
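The `sources` extra is described above as covering `datasets` and `huggingface_hub`. As a minimal sketch, the snippet below checks whether those two modules are importable before any source is loaded. The helper name `missing_modules` is hypothetical and not part of cat-pol.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Modules named in the description of the "sources" extra.
SOURCES_EXTRA_MODULES = ["datasets", "huggingface_hub"]

if __name__ == "__main__":
    missing = missing_modules(SOURCES_EXTRA_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print('Try: pip install "cat-pol[sources]"')
    else:
        print("Source-loading dependencies are available.")
```

The same check could be repeated for the other extras once their module names are known; they are not listed here, so the sketch leaves them out.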
Quick Start
Classify ordinances from a built-in source
import cat_pol as pol

results = pol.classify(
    source="city_san_diego",
    categories=["Housing", "Public Safety", "Infrastructure", "Finance"],
    doc_type="ordinance",
    since="2022-01-01",
    n=50,
    api_key="sk-...",
)
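The examples in this section pass the API key as a string literal for brevity. One way to keep the key out of source files is to read it from the environment, as in the sketch below. The variable name `CAT_POL_API_KEY` and the helper `load_api_key` are assumptions made for illustration; nothing here implies that cat-pol reads any environment variable on its own.

```python
import os


def load_api_key(env_var="CAT_POL_API_KEY"):
    """Read an API key from the environment, failing early if it is unset.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage with the call shown above (requires cat-pol to be installed):
# import cat_pol as pol
# results = pol.classify(
#     source="city_san_diego",
#     categories=["Housing", "Public Safety", "Infrastructure", "Finance"],
#     api_key=load_api_key(),
# )
```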
Classify raw text
results = pol.classify(
    input_data=[
        "The committee voted to approve the rezoning request for parcel 42.",
        "Motion to table the budget amendment until the next session.",
    ],
    categories=["Approval", "Rejection", "Deferral", "Amendment"],
    document_context="City council meeting minutes",
    api_key="sk-...",
)
Optimize prompts with user feedback
result = pol.prompt_tune(
    source="city_san_diego",
    categories=["Pro-Business", "Pro-Regulation", "Tax Increase", "Tax Decrease"],
    doc_type="ordinance",
    since="2020-01-01",
    n=100,
    api_key="sk-...",
    sample_size=15,
)

# Use the optimized prompt for full classification
results = pol.classify(
    source="city_san_diego",
    categories=["Pro-Business", "Pro-Regulation", "Tax Increase", "Tax Decrease"],
    system_prompt=result["system_prompt"],
    api_key="sk-...",
)
Summarize with different formats
# Bullet points
pol.summarize(source="federal_executive_orders", n=10, format="bullets", api_key="sk-...")
# Full report
pol.summarize(source="federal_laws", n=5, format="report", api_key="sk-...")
# One-liner
pol.summarize(source="social_trump_truth", since="2024-01-01", n=20, format="one-liner", api_key="sk-...")
Discover categories
result = pol.extract(
    source="city_berkeley",
    n=200,
    api_key="sk-...",
)
print(result["top_categories"])
Fetch raw data
# List all sources
pol.list_sources()
pol.list_sources(level="city")
pol.list_sources(level="federal")
# Fetch data
df = pol.fetch_source("city_san_diego", n=100, since="2020-01-01", doc_type="ordinance")
df = pol.fetch_source("federal_executive_orders", n=50)
df = pol.fetch_source("social_trump_truth", since="2024-01-01")
Data Sources
All datasets are public on HuggingFace — no authentication required.
California Cities
| Source | Rows | Types | Repo |
|---|---|---|---|
| city_san_diego | 87,983 | ordinances, resolutions | chrissoria/san-diego-ordinances |
| city_los_angeles | 34,427 | ordinances | chrissoria/la-ordinances |
| city_berkeley | 9,028 | ordinances | chrissoria/berkeley-ordinances |
| city_san_francisco | 4,033 | ordinances | chrissoria/sf-ordinances |
| city_long_beach | 3,898 | ordinances, resolutions | chrissoria/long-beach-ordinances |
| city_bakersfield | 2,655 | ordinances | chrissoria/bakersfield-ordinances |
| city_newport_beach | 2,719 | ordinances | chrissoria/newport-beach-ordinances |
| city_salinas | 2,574 | ordinances, resolutions | chrissoria/salinas-ordinances |
| city_clovis | 2,343 | ordinances | chrissoria/clovis-ordinances |
| city_oakland | 1,824 | ordinances | chrissoria/oakland-ordinances |
| city_fresno | 706 | ordinances, resolutions | chrissoria/fresno-ordinances |
Federal
| Source | Rows | Types | Repo |
|---|---|---|---|
| federal_laws | 5,915 | public laws (1995–present) | chrissoria/federal-public-laws |
| federal_executive_orders | 1,530+ | executive orders | chrissoria/executive-orders |
| federal_speeches | 305 | SOTU, inaugurals | chrissoria/presidential-speeches |
Social Media
| Source | Rows | Types | Repo |
|---|---|---|---|
| social_trump_truth | 32,000+ | Truth Social posts | chrissoria/trump-truth-social |
All sources are updated by automated scrapers: Truth Social daily at 9 AM, and every other source weekly on Sundays at 9 AM.
Trump Truth Social Dataset Columns
The social_trump_truth dataset is enriched with metadata, market data, and image descriptions:
Post metadata:
| Column | Description |
|---|---|
| date | Post date (YYYY-MM-DD) |
| time | Post time in UTC (HH:MM:SS) |
| day_of_week | Day name (Monday, Tuesday, etc.) |
| datetime | Full ISO timestamp |
| text | Post text content |
| url | Truth Social post URL |
| post_id | Unique post identifier |
| is_president | Whether Trump was president at time of post |
| is_president_elect | Whether Trump was president-elect at time of post |
| replies_count | Number of replies |
| reblogs_count | Number of reblogs |
| favourites_count | Number of favourites |
| media_urls | Image/video URLs attached to the post |
| has_media | Whether the post has media attachments |
| image_alt_text | AI-generated factual image description (alt-text format) |
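The `date` and `time` columns are described above as YYYY-MM-DD and UTC HH:MM:SS. As a small sketch, the two can be combined into a timezone-aware timestamp with the standard library. The row below is a plain dictionary with made-up values; only the column names come from the table, and `post_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timezone


def post_timestamp(row):
    """Combine the `date` and `time` columns into an aware UTC datetime."""
    naive = datetime.strptime(f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


# Illustrative row; the values are made up, only the column names are real.
example_row = {"date": "2024-03-04", "time": "13:45:10", "day_of_week": "Monday"}

ts = post_timestamp(example_row)
print(ts.isoformat())     # 2024-03-04T13:45:10+00:00
print(ts.strftime("%A"))  # Monday
```

Attaching UTC explicitly avoids ambiguity when the timestamp is later compared against market hours in another time zone.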
Market data (18 tickers):
Each ticker has 7 columns following the convention {ticker}_{metric}:
| Metric | Description |
|---|---|
| {ticker}_open | Daily open price |
| {ticker}_close | Daily close price |
| {ticker}_1hr_before | Price 1 hour before the post |
| {ticker}_5min_before | Price 5 minutes before the post |
| {ticker}_at_post | Price at time of post |
| {ticker}_5min_after | Price 5 minutes after the post |
| {ticker}_1hr_after | Price 1 hour after the post |
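The `{ticker}_{metric}` convention makes it straightforward to build column names programmatically. The sketch below generates the seven names for one ticker and computes a percent change between two of the metrics for a single row. The helpers `ticker_columns` and `pct_change` are hypothetical, the row is a plain dictionary with made-up prices, and missing values are assumed to appear as `None` (a DataFrame would use NaN instead).

```python
METRICS = [
    "open", "close", "1hr_before", "5min_before",
    "at_post", "5min_after", "1hr_after",
]


def ticker_columns(ticker):
    """Build the seven column names for one ticker, e.g. 'djt_at_post'."""
    return [f"{ticker}_{metric}" for metric in METRICS]


def pct_change(row, ticker, start="5min_before", end="5min_after"):
    """Percent price change between two metrics, or None if either is missing."""
    before = row.get(f"{ticker}_{start}")
    after = row.get(f"{ticker}_{end}")
    if before is None or after is None or before == 0:
        return None
    return (after - before) / before * 100.0


# Illustrative row with made-up prices.
row = {"djt_5min_before": 40.0, "djt_5min_after": 41.0}
print(ticker_columns("djt"))
print(pct_change(row, "djt"))  # 2.5
```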
Tickers included:
| Ticker | Name | Category |
|---|---|---|
| sp500 | S&P 500 (^GSPC) | Broad market |
| dia | SPDR Dow Jones Industrial Average ETF | Broad market |
| qqq | Invesco QQQ (Nasdaq-100) | Tech/growth |
| djt | Trump Media & Technology Group | Trump-linked |
| lmt | Lockheed Martin | Defense |
| war | Themes US Military Academy ETF | Defense |
| xli | Industrial Select Sector SPDR | Industrials |
| xlv | Health Care Select Sector SPDR | Healthcare |
| xph | SPDR S&P Pharmaceuticals ETF | Pharma |
| cnrg | SPDR S&P Kensho Clean Power ETF | Clean energy |
| gld | SPDR Gold Shares | Gold/commodities |
| uso | United States Oil Fund | Oil/energy |
| fxi | iShares China Large-Cap ETF | China/trade |
| eww | iShares MSCI Mexico ETF | Mexico/trade |
| vgk | Vanguard FTSE Europe ETF | Europe |
| ibit | iShares Bitcoin ETF | Crypto |
| tlt | iShares 20+ Year Treasury Bond ETF | Bonds/rates |
| uup | Invesco DB US Dollar Index | USD strength |
Intraday prices use the highest available resolution: 1-minute (last ~7 days), 5-minute (last ~60 days), or hourly (last ~2 years). Weekend/holiday posts use the most recent trading day's values. The sp500_resolution column indicates the data resolution used.
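The resolution tiers above can be read as a lookup on the age of a post. The sketch below is one interpretation, not the scraper's actual logic: the cutoffs of 7, 60, and 730 days are assumptions standing in for the approximate windows, and the weekend rollback ignores market holidays. Both function names are hypothetical.

```python
from datetime import date, timedelta


def expected_resolution(post_date, today):
    """Pick the finest intraday resolution by post age (approximate cutoffs)."""
    age_days = (today - post_date).days
    if age_days <= 7:
        return "1-minute"
    if age_days <= 60:
        return "5-minute"
    if age_days <= 730:  # roughly two years
        return "hourly"
    return None  # older than the intraday windows described above


def most_recent_weekday(day):
    """Roll a Saturday or Sunday back to Friday; market holidays are ignored."""
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


today = date(2025, 6, 30)
print(expected_resolution(date(2025, 6, 27), today))  # 1-minute
print(expected_resolution(date(2025, 5, 15), today))  # 5-minute
print(most_recent_weekday(date(2025, 6, 29)))         # 2025-06-27
```

In practice the `sp500_resolution` column is the authoritative record of which resolution was used for a given row; a function like this is only useful for sanity-checking that column.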
API
| Function | Description |
|---|---|
| classify() | Classify text into predefined categories |
| prompt_tune() | Optimize classification prompts via user feedback |
| extract() | Discover and normalize categories from text |
| explore() | Raw category extraction (no deduplication) |
| summarize() | Summarize text, PDFs, or image URLs with format options (paragraph, bullets, one-liner, structured, report, alt-text) |
| list_sources() | List available data sources |
| fetch_source() | Fetch raw data from a source |
The text-processing functions (classify(), prompt_tune(), extract(), explore(), summarize()) accept either input_data= (raw text, files, or directories) or source= (a built-in source pulled from HuggingFace). All cat-stack parameters (multi-model ensemble, batch mode, chain-of-thought, etc.) pass through via **kwargs.
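The wrapper pattern described here, one of two input modes plus pass-through keyword arguments, can be sketched in a few lines. Everything below is hypothetical: `engine_classify` and `load_source` are stand-ins rather than cat-stack or cat-pol functions, `batch_mode` is an invented option name, and rejecting calls that supply both or neither input is an assumption about behavior that the text does not spell out.

```python
def engine_classify(texts, categories, **options):
    """Stand-in for the underlying engine; it only echoes what it received."""
    return {"n_texts": len(texts), "categories": categories, "options": options}


def load_source(name, n=None):
    """Stand-in loader returning placeholder documents for a named source."""
    docs = [f"{name} document {i}" for i in range(3)]
    return docs[:n] if n is not None else docs


def classify(categories, input_data=None, source=None, n=None, **kwargs):
    """Accept exactly one of input_data/source and forward extra options."""
    if (input_data is None) == (source is None):
        raise ValueError("Pass exactly one of input_data= or source=.")
    texts = load_source(source, n=n) if source is not None else list(input_data)
    return engine_classify(texts, categories, **kwargs)


result = classify(["Approval", "Deferral"], source="city_example", n=2, batch_mode=True)
print(result["n_texts"], result["options"])  # 2 {'batch_mode': True}
```

Forwarding unknown keyword arguments untouched keeps the wrapper thin: new engine options become available without any change to the wrapper's signature.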
Ecosystem
| Package | Role |
|---|---|
| cat-stack | Domain-agnostic LLM classification engine |
| cat-pol | Political text classification (this package) |
| cat-vader | Social media text (Reddit, Twitter/X) |
| cat-ademic | Academic papers and citations |
License
GPL-3.0-or-later