Predict Soft News
notnews: Modern News Classification Library
A fast, modern Python library for classifying news articles as hard news vs. soft news using multiple approaches: URL patterns, machine learning models, and Large Language Models.
Features
🚀 Three Classification Methods:
- URL Pattern Analysis - Lightning-fast classification using URL structure
- ML Models - Trained scikit-learn models for US/UK news prediction
- LLM Classification - Flexible categorization using Claude or OpenAI
🌍 Multi-Region Support:
- US and UK news patterns and models
- Easily extensible to other regions
⚡ Modern Architecture:
- Unified API with consistent interface
- Click-based CLI for command-line usage
- Built with uv_build for 10-35x faster builds
- Type hints and comprehensive error handling
Streamlit Demo: https://notnews-notnews-streamlitstreamlit-app-u8j3a6.streamlit.app/
Quick Start
Python API
import pandas as pd
import notnews
# Load your data
df = pd.read_csv("news_articles.csv")
# Method 1: URL Pattern Classification (fastest)
df_url = notnews.classify_by_url(df, url_col="url", region="us")
print(df_url[["url", "hard_news", "soft_news"]].head())
# Method 2: ML Model Prediction (most accurate)
df_ml = notnews.predict_soft_news(df, text_col="text", region="us")
print(df_ml[["text", "prob_soft_news_us"]].head())
# Method 3: LLM Classification (most flexible)
# Requires ANTHROPIC_API_KEY or OPENAI_API_KEY environment variable
df_llm = notnews.classify_with_llm(df, text_col="text", provider="claude")
print(df_llm[["text", "llm_category", "llm_confidence"]].head())
# Detailed Categories (US only)
df_categories = notnews.predict_news_category(df, text_col="text")
print(df_categories[["text", "pred_category", "prob_soft_news"]].head())
Command Line Interface
# Install the package
pip install notnews
# or with uv
uv add notnews
# URL pattern classification
notnews classify-urls articles.csv --region us --output results.csv
# ML model prediction
notnews predict-ml articles.csv --region uk --text-col content
# LLM classification
notnews classify-llm articles.csv --provider claude --api-key your_key
# Run all methods together
notnews classify-all articles.csv --region us
# Get help
notnews --help
notnews classify-urls --help
Installation
Standard Installation
pip install notnews
Fast Installation with UV
uv add notnews
Requirements
- Python: 3.11, 3.12, or 3.13
- Core: pandas, numpy, scikit-learn 1.3+, nltk
- Web: requests, beautifulsoup4
- CLI: click 8.0+
- Optional: anthropic, openai (for LLM classification)
LLM Setup
For LLM classification, set your API key:
# For Claude
export ANTHROPIC_API_KEY="your_key_here"
# For OpenAI
export OPENAI_API_KEY="your_key_here"
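Before starting a long LLM run, it can help to confirm which keys are actually set. The following is a minimal sketch using only the standard library; the helper `available_providers` is hypothetical and not part of notnews, and only the two environment variable names are taken from the setup instructions above.

```python
import os

# Environment variable names come from the setup instructions above;
# the helper itself is hypothetical and not part of notnews.
PROVIDER_ENV_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def available_providers(env=None):
    """List providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


# Example with an explicit mapping instead of the real environment.
example_env = {"ANTHROPIC_API_KEY": "your_key_here", "OPENAI_API_KEY": ""}
found = available_providers(example_env)
```

Calling `available_providers()` with no argument reads the real environment, so the result can be used to pick a value for the `provider` argument.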
API Reference
Core Functions
classify_by_url(df, url_col="url", region="us")
Classify articles using URL pattern matching.
Args:
- df: DataFrame with articles
- url_col: Column containing URLs
- region: "us" or "uk" for region-specific patterns
Returns: DataFrame with hard_news and soft_news columns
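To illustrate the general idea of URL pattern matching, here is a small standalone sketch. It does not reproduce the patterns notnews ships: the keyword lists, the `label_url` helper, and the use of 0/1 integer flags are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword lists for illustration only; the patterns that
# notnews actually uses are not reproduced here.
HARD_PATTERN = re.compile(r"\b(politics|economy|world|business)\b")
SOFT_PATTERN = re.compile(r"\b(sports?|entertainment|lifestyle|travel)\b")


def label_url(url: str) -> dict:
    """Return hard_news / soft_news flags based on the URL path."""
    path = urlparse(url).path.lower()
    return {
        "hard_news": int(bool(HARD_PATTERN.search(path))),
        "soft_news": int(bool(SOFT_PATTERN.search(path))),
    }


urls = [
    "https://example.com/politics/2024/budget-vote",
    "https://example.com/sports/football/match-report",
    "https://example.com/about-us",
]
labels = [label_url(u) for u in urls]
```

A URL that matches neither list gets zeros in both columns, which is why this kind of classifier is fast but leaves some articles unlabeled.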
predict_soft_news(df, text_col="text", region="us")
Predict soft news probability using trained ML models.
Args:
- df: DataFrame with articles
- text_col: Column containing article text
- region: "us" or "uk" for model selection
Returns: DataFrame with prob_soft_news_{region} column
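The returned probability often needs to be turned into a discrete label. The sketch below shows one way to do that in plain Python; the 0.5 cutoff, the `to_label` and `column_name` helpers, and the sample probabilities are assumptions, not something the library prescribes.

```python
def column_name(region: str) -> str:
    """Build the output column name documented as prob_soft_news_{region}."""
    return f"prob_soft_news_{region}"


def to_label(prob_soft: float, threshold: float = 0.5) -> str:
    """Map a soft-news probability to a label; the 0.5 cutoff is an assumption."""
    return "soft" if prob_soft >= threshold else "hard"


# Made-up probabilities standing in for model output.
rows = [
    {column_name("us"): 0.91},
    {column_name("us"): 0.12},
    {column_name("us"): 0.50},
]
labels = [to_label(row[column_name("us")]) for row in rows]
```

On a DataFrame the same rule is a single comparison, for example `df_ml["prob_soft_news_us"] >= 0.5`. A stricter threshold may be appropriate when false positives are costly.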
classify_with_llm(df, text_col="text", provider="claude", **kwargs)
Classify articles using Large Language Models.
Args:
- df: DataFrame with articles
- text_col: Column containing article text
- provider: "claude" or "openai"
- categories: Optional custom categories dict
- api_key: Optional API key (uses env var if not provided)
Returns: DataFrame with llm_category, llm_confidence, llm_reasoning columns
Advanced Usage
# Custom LLM categories
custom_categories = {
    "breaking": {"description": "Breaking news and urgent updates"},
    "analysis": {"description": "In-depth analysis and commentary"},
    "lifestyle": {"description": "Lifestyle and entertainment content"}
}
df_custom = notnews.classify_with_llm(
    df,
    provider="claude",
    categories=custom_categories
)
# Fetch content from URLs
content = notnews.fetch_web_content("https://example.com/article")
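Because a malformed categories dict would only surface as an error once API calls begin, a quick local check can save time. This is a minimal sketch: `validate_categories` is a hypothetical helper, and the assumption that every category needs a non-empty "description" key is inferred from the example above rather than from documented library behavior.

```python
def validate_categories(categories: dict) -> list:
    """Return a list of problems found in a custom categories dict."""
    if not categories:
        return ["no categories given"]
    problems = []
    for name, spec in categories.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: expected a dict")
        elif not str(spec.get("description", "")).strip():
            problems.append(f"{name}: missing description")
    return problems


good = {"breaking": {"description": "Breaking news and urgent updates"}}
bad = {"analysis": {}, "lifestyle": "Lifestyle content"}

good_problems = validate_categories(good)
bad_problems = validate_categories(bad)
```

An empty result means the dict has the same shape as the example; any other result lists the entries to fix.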
Model Information
URL Patterns
- US: Politics, economics, international affairs vs. sports, entertainment, lifestyle
- UK: Includes UK-specific patterns like "uk-news", "scottish-news"
ML Models
- US: NYT-based models trained on headline and content text
- UK: URL-based model trained on UK news outlets
- Compatible with scikit-learn 1.3-1.5 (models trained on 0.22+)
Performance
- URL Classification: ~1000 articles/second
- ML Prediction: ~100 articles/second
- LLM Classification: ~1-10 articles/second (API dependent)
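As a rough worked example, the throughput figures above can be turned into runtime estimates for a given corpus size. The `estimate_seconds` helper and the 50,000-article corpus are illustrative assumptions; actual speed will vary with hardware, text length, and API rate limits.

```python
# Approximate throughput (articles per second) from the list above,
# stored as (slowest, fastest) to allow for ranges.
THROUGHPUT = {
    "url": (1000, 1000),
    "ml": (100, 100),
    "llm": (1, 10),  # API dependent, so a range
}


def estimate_seconds(n_articles: int, method: str) -> tuple:
    """Return (best_case, worst_case) runtime in seconds."""
    slow, fast = THROUGHPUT[method]
    return (n_articles / fast, n_articles / slow)


estimates = {m: estimate_seconds(50_000, m) for m in THROUGHPUT}
```

For 50,000 articles this gives about 50 seconds for URL classification, about 500 seconds for ML prediction, and between 5,000 and 50,000 seconds (roughly 1.4 to 14 hours) for LLM classification.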
Data Sources
- US Model: Based on NYT data
- UK Model: Based on UK news analysis
Applications
Research using notnews:
Documentation
Full documentation: notnews.readthedocs.io
Contributing
We welcome contributions! Please see our Contributor Code of Conduct.
Development Setup
git clone https://github.com/notnews/notnews.git
cd notnews
uv sync --dev
uv run pytest
Authors
- Suriyan Laohaprapanon
- Gaurav Sood
License