Policy document classification powered by LLMs

cat-pol

Political text classification and analysis powered by LLMs. A policy-specific wrapper around cat-stack with built-in access to 15 political data sources on HuggingFace.

Installation

pip install cat-pol

With optional extras:

pip install "cat-pol[pdf]"         # PDF document processing
pip install "cat-pol[embeddings]"  # Embedding-based similarity scoring
pip install "cat-pol[sources]"     # Data source loading (datasets, huggingface_hub)

Quick Start

Classify ordinances from a built-in source

import catpol as pol

results = pol.classify(
    source="city_san_diego",
    categories=["Housing", "Public Safety", "Infrastructure", "Finance"],
    doc_type="ordinance",
    since="2022-01-01",
    n=50,
    api_key="sk-...",
)
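
The calls in this section pass the key inline as a placeholder. One way to keep a real key out of source files is to read it from an environment variable. The following is a minimal sketch: the helper load_api_key and the variable name CATPOL_API_KEY are hypothetical choices for illustration and are not part of cat-pol.

```python
import os


def load_api_key(env_var: str = "CATPOL_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is unset.

    The variable name is a hypothetical choice for this sketch;
    cat-pol itself is not known to read it.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The returned value can then be supplied as api_key=load_api_key() wherever the examples show api_key="sk-...".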

Classify raw text

results = pol.classify(
    input_data=[
        "The committee voted to approve the rezoning request for parcel 42.",
        "Motion to table the budget amendment until the next session.",
    ],
    categories=["Approval", "Rejection", "Deferral", "Amendment"],
    document_context="City council meeting minutes",
    api_key="sk-...",
)

Optimize prompts with user feedback

result = pol.prompt_tune(
    source="city_san_diego",
    categories=["Pro-Business", "Pro-Regulation", "Tax Increase", "Tax Decrease"],
    doc_type="ordinance",
    since="2020-01-01",
    n=100,
    api_key="sk-...",
    sample_size=15,
)

# Use the optimized prompt for full classification
results = pol.classify(
    source="city_san_diego",
    categories=["Pro-Business", "Pro-Regulation", "Tax Increase", "Tax Decrease"],
    system_prompt=result["system_prompt"],
    api_key="sk-...",
)

Summarize with different formats

# Bullet points
pol.summarize(source="federal_executive_orders", n=10, format="bullets", api_key="sk-...")

# Full report
pol.summarize(source="federal_laws", n=5, format="report", api_key="sk-...")

# One-liner
pol.summarize(source="social_trump_truth", since="2024-01-01", n=20, format="one-liner", api_key="sk-...")

Discover categories

result = pol.extract(
    source="city_berkeley",
    n=200,
    api_key="sk-...",
)
print(result["top_categories"])

Fetch raw data

# List all sources
pol.list_sources()
pol.list_sources(level="city")
pol.list_sources(level="federal")

# Fetch data
df = pol.fetch_source("city_san_diego", n=100, since="2020-01-01", doc_type="ordinance")
df = pol.fetch_source("federal_executive_orders", n=50)
df = pol.fetch_source("social_trump_truth", since="2024-01-01")

Data Sources

All datasets are public on HuggingFace; no authentication is required.

California Cities

Source              Rows    Types                    Repo
city_san_diego      87,983  ordinances, resolutions  chrissoria/san-diego-ordinances
city_los_angeles    34,427  ordinances               chrissoria/la-ordinances
city_berkeley       9,028   ordinances               chrissoria/berkeley-ordinances
city_san_francisco  4,033   ordinances               chrissoria/sf-ordinances
city_long_beach     3,898   ordinances, resolutions  chrissoria/long-beach-ordinances
city_bakersfield    2,655   ordinances               chrissoria/bakersfield-ordinances
city_newport_beach  2,719   ordinances               chrissoria/newport-beach-ordinances
city_salinas        2,574   ordinances, resolutions  chrissoria/salinas-ordinances
city_clovis         2,343   ordinances               chrissoria/clovis-ordinances
city_oakland        1,824   ordinances               chrissoria/oakland-ordinances
city_fresno         706     ordinances, resolutions  chrissoria/fresno-ordinances

Federal

Source                    Rows    Types                       Repo
federal_laws              5,915   public laws (1995–present)  chrissoria/federal-public-laws
federal_executive_orders  1,530+  executive orders            chrissoria/executive-orders
federal_speeches          305     SOTU, inaugurals            chrissoria/presidential-speeches

Social Media

Source              Rows     Types               Repo
social_trump_truth  32,000+  Truth Social posts  chrissoria/trump-truth-social
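
Because the repos are public, each source name can be mapped to its dataset page on the HuggingFace Hub. The sketch below copies the Source and Repo columns from the three tables above into a dictionary and builds the page URL. The names SOURCE_REPOS and dataset_url are hypothetical helpers for illustration; how cat-pol resolves sources internally is not shown here.

```python
# Source name -> HuggingFace dataset repo, copied from the tables above.
SOURCE_REPOS = {
    "city_san_diego": "chrissoria/san-diego-ordinances",
    "city_los_angeles": "chrissoria/la-ordinances",
    "city_berkeley": "chrissoria/berkeley-ordinances",
    "city_san_francisco": "chrissoria/sf-ordinances",
    "city_long_beach": "chrissoria/long-beach-ordinances",
    "city_bakersfield": "chrissoria/bakersfield-ordinances",
    "city_newport_beach": "chrissoria/newport-beach-ordinances",
    "city_salinas": "chrissoria/salinas-ordinances",
    "city_clovis": "chrissoria/clovis-ordinances",
    "city_oakland": "chrissoria/oakland-ordinances",
    "city_fresno": "chrissoria/fresno-ordinances",
    "federal_laws": "chrissoria/federal-public-laws",
    "federal_executive_orders": "chrissoria/executive-orders",
    "federal_speeches": "chrissoria/presidential-speeches",
    "social_trump_truth": "chrissoria/trump-truth-social",
}


def dataset_url(source: str) -> str:
    """Return the Hub dataset page URL for a cat-pol source name."""
    try:
        repo = SOURCE_REPOS[source]
    except KeyError:
        raise ValueError(f"Unknown source: {source!r}") from None
    # Dataset repos on the Hub live under the /datasets/ path.
    return f"https://huggingface.co/datasets/{repo}"
```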

All sources except Truth Social are updated weekly (Sundays at 9 AM) via automated scrapers. Truth Social is updated daily at 9 AM.
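
To estimate when fresh rows will next be available, the schedule above can be turned into a small calculation. This is a sketch under assumptions: the README does not state a timezone, so the function works on naive datetimes and treats 9 AM as being in whatever timezone the scrapers use. The name next_refresh is hypothetical.

```python
from datetime import datetime, timedelta


def next_refresh(now: datetime, daily: bool = False) -> datetime:
    """Next 9 AM refresh strictly after `now`.

    daily=True models the Truth Social schedule (every day);
    daily=False models the weekly schedule (Sundays only).
    """
    candidate = now.replace(hour=9, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's 9 AM slot has already passed.
        candidate += timedelta(days=1)
    if not daily:
        # weekday(): Monday is 0, Sunday is 6.
        candidate += timedelta(days=(6 - candidate.weekday()) % 7)
    return candidate
```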

Trump Truth Social Dataset Columns

The social_trump_truth dataset is enriched with metadata, market data, and image descriptions:

Post metadata:

Column              Description
date                Post date (YYYY-MM-DD)
time                Post time in UTC (HH:MM:SS)
day_of_week         Day name (Monday, Tuesday, etc.)
datetime            Full ISO timestamp
text                Post text content
url                 Truth Social post URL
post_id             Unique post identifier
is_president        Whether Trump was president at time of post
is_president_elect  Whether Trump was president-elect at time of post
replies_count       Number of replies
reblogs_count       Number of reblogs
favourites_count    Number of favourites
media_urls          Image/video URLs attached to the post
has_media           Whether the post has media attachments
image_alt_text      AI-generated factual image description (alt-text format)
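
The date and time columns can be combined into a single timezone-aware timestamp when the datetime column is not convenient to parse. The sketch below assumes the formats listed in the table (YYYY-MM-DD and HH:MM:SS) and assumes that the date column is expressed in UTC like the time column, which the table does not state explicitly. The function name and the example row are made up for illustration.

```python
from datetime import datetime, timezone


def post_timestamp(row: dict) -> datetime:
    """Combine the `date` and `time` columns into an aware UTC datetime."""
    naive = datetime.strptime(
        f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S"
    )
    # Assumption: both columns are in UTC.
    return naive.replace(tzinfo=timezone.utc)


# Made-up row containing only the two columns this sketch needs.
example_row = {"date": "2024-03-05", "time": "14:30:00"}
stamp = post_timestamp(example_row)
```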

Market data (18 tickers):

Each ticker has 7 columns following the convention {ticker}_{metric}:

Metric                Description
{ticker}_open         Daily open price
{ticker}_close        Daily close price
{ticker}_1hr_before   Price 1 hour before the post
{ticker}_5min_before  Price 5 minutes before the post
{ticker}_at_post      Price at time of post
{ticker}_5min_after   Price 5 minutes after the post
{ticker}_1hr_after    Price 1 hour after the post

Tickers included:

Ticker  Name                                   Category
sp500   S&P 500 (^GSPC)                        Broad market
dia     SPDR Dow Jones Industrial Average ETF  Broad market
qqq     Invesco QQQ (Nasdaq-100)               Tech/growth
djt     Trump Media & Technology Group         Trump-linked
lmt     Lockheed Martin                        Defense
war     Themes US Military Academy ETF         Defense
xli     Industrial Select Sector SPDR          Industrials
xlv     Health Care Select Sector SPDR         Healthcare
xph     SPDR S&P Pharmaceuticals ETF           Pharma
cnrg    SPDR S&P Kensho Clean Power ETF        Clean energy
gld     SPDR Gold Shares                       Gold/commodities
uso     United States Oil Fund                 Oil/energy
fxi     iShares China Large-Cap ETF            China/trade
eww     iShares MSCI Mexico ETF                Mexico/trade
vgk     Vanguard FTSE Europe ETF               Europe
ibit    iShares Bitcoin ETF                    Crypto
tlt     iShares 20+ Year Treasury Bond ETF     Bonds/rates
uup     Invesco DB US Dollar Index             USD strength
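
The {ticker}_{metric} convention makes the market columns easy to generate and to use in simple calculations. The sketch below builds the column names from the two tables above and computes the percentage move of one ticker after a post. The helper names are hypothetical, and the example values in the usage note are invented rather than taken from the dataset.

```python
METRICS = [
    "open", "close", "1hr_before", "5min_before",
    "at_post", "5min_after", "1hr_after",
]
TICKERS = [
    "sp500", "dia", "qqq", "djt", "lmt", "war", "xli", "xlv", "xph",
    "cnrg", "gld", "uso", "fxi", "eww", "vgk", "ibit", "tlt", "uup",
]


def ticker_columns(ticker: str) -> list[str]:
    """The 7 column names for one ticker, e.g. 'djt_at_post'."""
    return [f"{ticker}_{metric}" for metric in METRICS]


def all_market_columns() -> list[str]:
    """All 18 x 7 market-data column names."""
    return [col for ticker in TICKERS for col in ticker_columns(ticker)]


def pct_move_after_post(row: dict, ticker: str, window: str = "1hr_after"):
    """Percent change from the at-post price to the price at `window`.

    Returns None when either price is missing or the base price is zero.
    """
    base = row.get(f"{ticker}_at_post")
    later = row.get(f"{ticker}_{window}")
    if base is None or later is None or base == 0:
        return None
    return (later - base) / base * 100
```

For example, a row with djt_at_post of 50.0 and djt_1hr_after of 51.0 gives a move of about 2 percent, and a row missing either column gives None.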

Intraday prices use the highest available resolution: 1-minute (last ~7 days), 5-minute (last ~60 days), or hourly (last ~2 years). Weekend/holiday posts use the most recent trading day's values. The sp500_resolution column indicates the data resolution used.
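
The resolution rule above amounts to picking the finest interval whose lookback window still covers the post. The sketch below illustrates that selection. The cutoffs of 7, 60, and 730 days are taken from the approximate figures in the paragraph, and the labels are hypothetical: the actual values stored in sp500_resolution are not documented here.

```python
from datetime import datetime, timedelta

# Finest resolution first; windows are approximations from the text above.
RESOLUTION_WINDOWS = [
    ("1-minute", timedelta(days=7)),
    ("5-minute", timedelta(days=60)),
    ("hourly", timedelta(days=730)),
]


def pick_resolution(post_time: datetime, now: datetime):
    """Return the finest resolution whose window covers the post's age.

    Returns None when the post is older than every window.
    """
    age = now - post_time
    for label, window in RESOLUTION_WINDOWS:
        if age <= window:
            return label
    return None
```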

API

Function        Description
classify()      Classify text into predefined categories
prompt_tune()   Optimize classification prompts via user feedback
extract()       Discover and normalize categories from text
explore()       Raw category extraction (no deduplication)
summarize()     Summarize text, PDFs, or image URLs with format options (paragraph, bullets, one-liner, structured, report, alt-text)
list_sources()  List available data sources
fetch_source()  Fetch raw data from a source

The text-processing functions (classify(), prompt_tune(), extract(), explore(), summarize()) accept either input_data= (raw text, files, directories) or source= (pull from HuggingFace). All cat-stack parameters (multi-model ensemble, batch mode, chain-of-thought, etc.) pass through via **kwargs.
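
The pass-through described above is a common wrapper pattern: the wrapper resolves its own arguments and forwards everything else to the underlying engine untouched. The sketch below shows the pattern with stand-in functions. engine_classify and load_source are hypothetical placeholders, not cat-stack or cat-pol functions, and rejecting calls that pass both or neither of input_data= and source= is an assumption about the intended behavior.

```python
def engine_classify(texts, categories, **options):
    """Hypothetical stand-in for the underlying cat-stack call."""
    return {"n_texts": len(texts), "categories": categories, "options": options}


def load_source(name, **filters):
    """Hypothetical stand-in for fetching documents from a named source."""
    return [f"<document from {name}>"]


def classify(categories, input_data=None, source=None, *,
             n=None, since=None, doc_type=None, **kwargs):
    """Wrapper sketch: resolve the input, then forward extra options."""
    # Assumption: exactly one of the two input routes must be given.
    if (input_data is None) == (source is None):
        raise ValueError("Pass exactly one of input_data= or source=.")
    if source is not None:
        texts = load_source(source, n=n, since=since, doc_type=doc_type)
    else:
        texts = input_data
    # Anything the wrapper does not consume goes to the engine unchanged.
    return engine_classify(texts, categories, **kwargs)
```

With this shape, an engine option such as a hypothetical batch_mode=True never needs to be declared by the wrapper; it simply appears in the engine's options.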

Ecosystem

Package     Role
cat-stack   Domain-agnostic LLM classification engine
cat-pol     Political text classification (this package)
cat-vader   Social media text (Reddit, Twitter/X)
cat-ademic  Academic papers and citations

License

GPL-3.0-or-later
