Predict Soft News

Project description

notnews: predict soft news using story text and URL structure

The package provides classifiers for soft news based on story text and URL structure for both US and UK news media. For US news media, it can also infer the 'kind' of news (Arts, Books, Science, Sports, Travel, etc.).
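
To make the idea of classifying by URL structure concrete, here is a minimal sketch that labels a URL from the section names in its path. The keyword lists, the `label_url` helper, and the example.com URLs are all hypothetical and chosen for illustration only; they are not the patterns notnews ships with.

```python
import re
from urllib.parse import urlparse

# Hypothetical section keywords, for illustration only.
# These are NOT the URL patterns used by notnews.
SOFT_SECTIONS = re.compile(r"/(sports?|arts|books|travel|style|entertainment)(/|$)")
HARD_SECTIONS = re.compile(r"/(politics|world|business|us)(/|$)")


def label_url(url):
    """Return soft/hard flags based on section names found in the URL path."""
    path = urlparse(url).path.lower()
    return {
        "soft_lab": 1 if SOFT_SECTIONS.search(path) else None,
        "hard_lab": 1 if HARD_SECTIONS.search(path) else None,
    }


hard_example = label_url("http://www.example.com/2017/02/11/us/politics/story.html")
soft_example = label_url("http://www.example.com/sports/2016/game.html")
no_match = label_url("http://www.example.com/302/redirect")
```

A URL whose path contains no known section name gets neither flag, which is why a URL-only approach leaves some stories unlabeled.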

Modern Features:

  • Traditional ML classifiers - Fast, offline classification using trained models
  • LLM-based classification - Flexible classification using Claude and OpenAI with custom categories
  • Web content fetching - Automatically fetch and classify content from URLs

Streamlit App: https://notnews-notnews-streamlitstreamlit-app-u8j3a6.streamlit.app/

Quick Start

>>> import pandas as pd
>>> from notnews import *

>>> # Get help
>>> help(soft_news_url_cat_us)

Help on method soft_news_url_cat in module notnews.soft_news_url_cat:

soft_news_url_cat(df, col='url') method of builtins.type instance
    Soft News Categorize by URL pattern.

    Using the URL pattern to categorize the soft/hard news of the input
    DataFrame.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the URL
            column.
        col (str or int): Column's name or location of the URL in
            DataFrame (default: url).

    Returns:
        DataFrame: Pandas DataFrame with additional columns:
            - `soft_lab` set to 1 if URL match with soft news URL pattern.
            - `hard_lab` set to 1 if URL match with hard news URL pattern.

>>> # Load data
>>> df = pd.read_csv('./tests/sample_us.csv')
>>> df
            src                                                url                                               text
0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...
1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...
2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON  In releasing a far more so...
3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...
4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...
5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP  Pittsburgh Symphony Orchestra ...
6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...
7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...
8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...
9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...
>>>
>>> # Get the Soft News URL category
>>> df_soft_news_url_cat_us  = soft_news_url_cat_us(df, col='url')
>>> df_soft_news_url_cat_us
            src                                                url                                               text  soft_lab  hard_lab
0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...       NaN       1.0
1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...       NaN       NaN
2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON  In releasing a far more so...       NaN       1.0
3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...       NaN       1.0
4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...       NaN       1.0
5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP  Pittsburgh Symphony Orchestra ...       1.0       NaN
6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...       NaN       1.0
7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...       NaN       NaN
8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...       NaN       1.0
9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...       NaN       NaN
>>>
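
The `soft_lab` and `hard_lab` columns are set to 1 on a match and left as NaN otherwise, so a row can match neither pattern. As a sketch of one way to post-process them, the snippet below collapses the two columns into a single label. The `summarize_url_labels` helper and the `url_label` column are assumptions made for this example and are not part of notnews; the input frame is a hand-built stand-in that copies only the label columns from the output shown above.

```python
import numpy as np
import pandas as pd


def summarize_url_labels(df):
    """Collapse soft_lab/hard_lab into one label column (hypothetical helper)."""
    out = df.copy()
    # NaN compares unequal to 1, so unmatched rows become False here.
    soft = out["soft_lab"].eq(1).to_numpy()
    hard = out["hard_lab"].eq(1).to_numpy()
    out["url_label"] = np.select(
        [soft & hard, soft, hard],
        ["ambiguous", "soft", "hard"],
        default="unmatched",
    )
    return out


# Stand-in for the Quick Start result: same label values, other columns omitted.
nan = float("nan")
labels = pd.DataFrame({
    "soft_lab": [nan, nan, nan, nan, nan, 1.0, nan, nan, nan, nan],
    "hard_lab": [1.0, nan, 1.0, 1.0, 1.0, nan, 1.0, nan, 1.0, nan],
})
summary = summarize_url_labels(labels)
counts = summary["url_label"].value_counts().to_dict()
print(counts)
```

Rows labeled `unmatched` here are the ones where the URL alone is not informative, which is where a text-based classifier is the natural fallback.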

Installation

Install the package from PyPI with pip:

pip install notnews

For faster installation using uv:

uv add notnews

Requirements

  • Python 3.11, 3.12, or 3.13
  • scikit-learn 1.3+ (models trained with sklearn 0.22+ are automatically compatible)
  • pandas, numpy, nltk, and other standard scientific Python packages
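
The Python requirement above can be checked before installing. This is a small sketch using only the standard library; the `python_supported` helper is a name made up for this example.

```python
import sys

# Python versions listed under Requirements.
SUPPORTED_PYTHONS = {(3, 11), (3, 12), (3, 13)}


def python_supported(version_info=None):
    """Return True if the (major, minor) version is one of the listed ones."""
    info = sys.version_info if version_info is None else version_info
    return (info[0], info[1]) in SUPPORTED_PYTHONS


print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Listed as supported:", python_supported())
```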

Compatibility

This package includes automatic compatibility layers to ensure models trained with older scikit-learn versions (0.22+) work seamlessly with modern scikit-learn versions (1.3-1.5). Version warnings from scikit-learn are expected and harmless.
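
To see whether the installed scikit-learn falls inside the 1.3-1.5 range mentioned above, a check along these lines may help. It is a sketch that relies only on the standard library; `in_tested_range` and `installed_sklearn_version` are hypothetical helper names, and treating 1.3-1.5 as an inclusive range of minor versions is an assumption.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Range quoted in the Compatibility section, read as inclusive minor versions.
TESTED_RANGE = ((1, 3), (1, 5))


def major_minor(version_string):
    """Extract (major, minor) from a version string such as '1.4.2'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def in_tested_range(version_string):
    """Return True if the version lies within TESTED_RANGE."""
    low, high = TESTED_RANGE
    return low <= major_minor(version_string) <= high


def installed_sklearn_version():
    """Return the installed scikit-learn version, or None if it is absent."""
    try:
        return version("scikit-learn")
    except PackageNotFoundError:
        return None


found = installed_sklearn_version()
if found is None:
    print("scikit-learn is not installed")
else:
    print(f"scikit-learn {found}; within 1.3-1.5: {in_tested_range(found)}")
```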

API

For detailed API documentation covering all 6 functions (soft_news_url_cat_us, pred_soft_news_us, pred_what_news_us, soft_news_url_cat_uk, pred_soft_news_uk, llm_classify_news), command line usage, and examples, see the project documentation.

Underlying Data

  • For more information about how to get the underlying data for the UK model, see here. For information about the data underlying the US model, see here.

Applications

We use the model to estimate the supply of not news in the US and the UK.

Documentation

For more information, see the project documentation.

Authors

Suriyan Laohaprapanon and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.
