Predict Soft News

notnews: Modern News Classification Library

A fast, modern Python library for classifying news articles as hard news or soft news using three approaches: URL patterns, machine learning models, and large language models (LLMs).

Features

🚀 Three Classification Methods:

  • URL Pattern Analysis - Lightning-fast classification using URL structure
  • ML Models - Trained scikit-learn models for US/UK news prediction
  • LLM Classification - Flexible categorization using Claude or OpenAI

🌍 Multi-Region Support:

  • US and UK news patterns and models
  • Easily extensible to other regions

Modern Architecture:

  • Unified API with consistent interface
  • Click-based CLI for command-line usage
  • Built with uv_build for 10-35x faster builds
  • Type hints and comprehensive error handling

Streamlit Demo: https://notnews-notnews-streamlitstreamlit-app-u8j3a6.streamlit.app/

Quick Start

Python API

import pandas as pd
import notnews

# Load your data
df = pd.read_csv("news_articles.csv")

# Method 1: URL Pattern Classification (fastest)
df_url = notnews.classify_by_url(df, url_col="url", region="us")
print(df_url[["url", "hard_news", "soft_news"]].head())

# Method 2: ML Model Prediction (most accurate)
df_ml = notnews.predict_soft_news(df, text_col="text", region="us")
print(df_ml[["text", "prob_soft_news_us"]].head())

# Method 3: LLM Classification (most flexible)
# Requires ANTHROPIC_API_KEY or OPENAI_API_KEY environment variable
df_llm = notnews.classify_with_llm(df, text_col="text", provider="claude")
print(df_llm[["text", "llm_category", "llm_confidence"]].head())

# Detailed Categories (US only)
df_categories = notnews.predict_news_category(df, text_col="text")
print(df_categories[["text", "pred_category", "prob_soft_news"]].head())

Command Line Interface

# Install the package
pip install notnews
# or with uv
uv add notnews

# URL pattern classification
notnews classify-urls articles.csv --region us --output results.csv

# ML model prediction  
notnews predict-ml articles.csv --region uk --text-col content

# LLM classification
notnews classify-llm articles.csv --provider claude --api-key your_key

# Run all methods together
notnews classify-all articles.csv --region us

# Get help
notnews --help
notnews classify-urls --help

Installation

Standard Installation

pip install notnews

Fast Installation with UV

uv add notnews

Requirements

  • Python: 3.11, 3.12, or 3.13
  • Core: pandas, numpy, scikit-learn 1.3+, nltk
  • Web: requests, beautifulsoup4
  • CLI: click 8.0+
  • Optional: anthropic, openai (for LLM classification)

LLM Setup

For LLM classification, set your API key:

# For Claude
export ANTHROPIC_API_KEY="your_key_here"

# For OpenAI  
export OPENAI_API_KEY="your_key_here"

API Reference

Core Functions

classify_by_url(df, url_col="url", region="us")

Classify articles using URL pattern matching.

Args:

  • df: DataFrame with articles
  • url_col: Column containing URLs
  • region: "us" or "uk" for region-specific patterns

Returns: DataFrame with hard_news and soft_news columns

predict_soft_news(df, text_col="text", region="us")

Predict soft news probability using trained ML models.

Args:

  • df: DataFrame with articles
  • text_col: Column containing article text
  • region: "us" or "uk" for model selection

Returns: DataFrame with prob_soft_news_{region} column
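Because predict_soft_news returns a probability rather than a label, a cut-off is usually needed downstream. The sketch below shows one way to turn such probabilities into labels. The helper name label_soft_news and the 0.5 threshold are assumptions made for illustration, not part of notnews.

```python
def label_soft_news(probabilities, threshold=0.5):
    """Map soft-news probabilities to "soft"/"hard" labels.

    `threshold` is an illustrative cut-off chosen here, not a value
    documented by notnews.
    """
    labels = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p!r}")
        labels.append("soft" if p >= threshold else "hard")
    return labels


# Example values standing in for df_ml["prob_soft_news_us"].tolist()
example_probs = [0.08, 0.50, 0.93]
print(label_soft_news(example_probs))
```

A stricter threshold (for example 0.8) trades recall for precision when only clearly soft articles should be flagged.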

classify_with_llm(df, text_col="text", provider="claude", **kwargs)

Classify articles using Large Language Models.

Args:

  • df: DataFrame with articles
  • text_col: Column containing article text
  • provider: "claude" or "openai"
  • categories: Optional custom categories dict
  • api_key: Optional API key (uses env var if not provided)

Returns: DataFrame with llm_category, llm_confidence, llm_reasoning columns
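The api_key fallback described above can be pictured roughly as follows. This is a minimal sketch of "explicit key first, environment variable second", not the library's actual code. resolve_api_key and its environ parameter are hypothetical names, and only the two environment variable names come from the LLM Setup section.

```python
import os

# Environment variable names are the ones given in the LLM Setup section.
ENV_VARS = {"claude": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_api_key(provider, api_key=None, environ=None):
    """Hypothetical helper: an explicit key wins, else the provider's env var."""
    environ = os.environ if environ is None else environ
    if provider not in ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    key = api_key or environ.get(ENV_VARS[provider])
    if not key:
        raise RuntimeError(f"set {ENV_VARS[provider]} or pass api_key=...")
    return key
```

Checking the key before a long run avoids discovering a missing credential only after the first API call fails.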

Advanced Usage

# Custom LLM categories
custom_categories = {
    "breaking": {"description": "Breaking news and urgent updates"},
    "analysis": {"description": "In-depth analysis and commentary"},
    "lifestyle": {"description": "Lifestyle and entertainment content"}
}

df_custom = notnews.classify_with_llm(
    df, 
    provider="claude",
    categories=custom_categories
)

# Fetch content from URLs
content = notnews.fetch_web_content("https://example.com/article")
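The custom categories dict above nests each description under a "description" key. When categories start out as a flat name-to-description mapping, a small helper can produce that shape. build_categories is a hypothetical convenience function written for this example and is not provided by notnews.

```python
def build_categories(descriptions):
    """Turn {name: description} into {name: {"description": description}}."""
    categories = {}
    for name, description in descriptions.items():
        # Reject empty names or descriptions early instead of sending
        # an unusable category to the LLM.
        if not name or not description.strip():
            raise ValueError(f"category {name!r} needs a non-empty description")
        categories[name] = {"description": description.strip()}
    return categories


custom_categories = build_categories({
    "breaking": "Breaking news and urgent updates",
    "analysis": "In-depth analysis and commentary",
})
print(custom_categories)
```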

Model Information

URL Patterns

  • US: Politics, economics, international affairs vs. sports, entertainment, lifestyle
  • UK: Includes UK-specific patterns like "uk-news", "scottish-news"
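To illustrate the idea behind URL pattern matching, here is a minimal standalone sketch. The pattern lists are assumptions for demonstration: only "uk-news" and "scottish-news" are taken from the text above, and the rest are guesses based on the listed topic areas. The function classify_url is hypothetical and does not reproduce the patterns shipped with notnews.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only. "uk-news" and "scottish-news" come from the
# text above; the others are assumptions based on the listed topic areas.
HARD_PATTERNS = {
    "us": r"politics|economy|world",
    "uk": r"politics|economy|world|uk-news|scottish-news",
}
SOFT_PATTERNS = {
    "us": r"sports?|entertainment|lifestyle",
    "uk": r"sports?|entertainment|lifestyle",
}


def classify_url(url, region="us"):
    """Flag a URL as hard and/or soft news from its path segments."""
    path = urlparse(url).path.lower()
    segments = [s for s in path.split("/") if s]
    hard = re.compile(HARD_PATTERNS[region])
    soft = re.compile(SOFT_PATTERNS[region])
    return {
        # A segment counts only if the whole segment matches a pattern.
        "hard_news": int(any(hard.fullmatch(s) for s in segments)),
        "soft_news": int(any(soft.fullmatch(s) for s in segments)),
    }


print(classify_url("https://example.com/politics/2024/budget-vote"))
print(classify_url("https://example.co.uk/uk-news/story", region="uk"))
```

Matching whole path segments rather than substrings keeps a slug such as "world-cup-final" from being counted as hard news just because it contains "world".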

ML Models

  • US: NYT-based models trained on headline and content text
  • UK: URL-based model trained on UK news outlets
  • Compatible with scikit-learn 1.3-1.5 (models trained on 0.22+)

Performance

  • URL Classification: ~1000 articles/second
  • ML Prediction: ~100 articles/second
  • LLM Classification: ~1-10 articles/second (API dependent)
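The throughput figures above translate directly into rough run-time estimates. The sketch below does that arithmetic. The rates are the approximate values quoted in the list, and estimate_seconds is a hypothetical helper, so treat the results as order-of-magnitude guides only.

```python
# Approximate throughput figures quoted above, in articles per second.
# Each entry is a (low, high) range; only the LLM range is not a single value.
RATES = {"url": (1000, 1000), "ml": (100, 100), "llm": (1, 10)}


def estimate_seconds(n_articles, method):
    """Return (best_case, worst_case) run time in seconds."""
    low_rate, high_rate = RATES[method]
    return n_articles / high_rate, n_articles / low_rate


for method in RATES:
    best, worst = estimate_seconds(50_000, method)
    print(f"{method}: {best:.0f}s to {worst:.0f}s")
```

For 50,000 articles this gives about 50 seconds for URL classification, about 500 seconds for ML prediction, and between 5,000 and 50,000 seconds for LLM classification.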

Data Sources

Applications

Research using notnews:

Documentation

Full documentation: notnews.readthedocs.io

Contributing

We welcome contributions! Please see our Contributor Code of Conduct.

Development Setup

git clone https://github.com/notnews/notnews.git
cd notnews
uv sync --dev
uv run pytest

Authors

  • Suriyan Laohaprapanon
  • Gaurav Sood

License

MIT License
