Predict Soft News
notnews: predict soft news using story text and the url structure
The package provides classifiers for soft news based on story text and URL structure for both US and UK news media. We also provide ways to infer the 'kind' of news (Arts, Books, Science, Sports, Travel, etc.) for US news media.
Modern Features:
- Traditional ML classifiers - Fast, offline classification using trained models
- LLM-based classification - Flexible classification using Claude and OpenAI with custom categories
- Web content fetching - Automatically fetch and classify content from URLs
Streamlit App: https://notnews-notnews-streamlitstreamlit-app-u8j3a6.streamlit.app/
Quick Start
>>> import pandas as pd
>>> from notnews import *
>>> # Get help
>>> help(soft_news_url_cat_us)
Help on method soft_news_url_cat in module notnews.soft_news_url_cat:

soft_news_url_cat(df, col='url') method of builtins.type instance
    Soft News Categorize by URL pattern.

    Using the URL pattern to categorize the soft/hard news of the input
    DataFrame.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the URL
            column.
        col (str or int): Column's name or location of the URL in
            DataFrame (default: url).

    Returns:
        DataFrame: Pandas DataFrame with additional columns:
            - `soft_lab` set to 1 if URL match with soft news URL pattern.
            - `hard_lab` set to 1 if URL match with hard news URL pattern.

>>> # Load data
>>> df = pd.read_csv('./tests/sample_us.csv')
>>> df
src url text
0 nyt http://www.nytimes.com/2017/02/11/us/politics/... Mr. Kushner on something of a crash course in ...
1 huffingtonpost http://grvrdr.huffingtonpost.com/302/redirect?... Authorities are still searching for a man susp...
2 nyt http://www.nytimes.com/2016/09/19/us/politics/... Photo WASHINGTON — In releasing a far more so...
3 google http://www.foxnews.com/world/2016/07/17/turkey... The Turkish government on Sunday ratcheted up ...
4 nyt http://www.nytimes.com/interactive/2016/08/29/... NYTimes.com no longer supports Internet Explor...
5 yahoo https://www.yahoo.com/news/pittsburgh-symphony... PITTSBURGH AP — Pittsburgh Symphony Orchestra ...
6 foxnews http://www.foxnews.com/politics/2016/08/13/cli... Hillary Clintons campaign is questioning a rep...
7 foxnews http://www.foxnews.com/us/2017/04/15/april-gir... April the giraffe has given birth at a New Yor...
8 foxnews http://www.foxnews.com/politics/2017/05/03/hil... Want FOX News Halftime Report in your inbox ev...
9 nyt http://www.nytimes.com/2016/09/06/obituaries/p... Shes an extremely liberated woman Ms. DeCrow s...
>>>
>>> # Get the Soft News URL category
>>> df_soft_news_url_cat_us = soft_news_url_cat_us(df, col='url')
>>> df_soft_news_url_cat_us
src url text soft_lab hard_lab
0 nyt http://www.nytimes.com/2017/02/11/us/politics/... Mr. Kushner on something of a crash course in ... NaN 1.0
1 huffingtonpost http://grvrdr.huffingtonpost.com/302/redirect?... Authorities are still searching for a man susp... NaN NaN
2 nyt http://www.nytimes.com/2016/09/19/us/politics/... Photo WASHINGTON — In releasing a far more so... NaN 1.0
3 google http://www.foxnews.com/world/2016/07/17/turkey... The Turkish government on Sunday ratcheted up ... NaN 1.0
4 nyt http://www.nytimes.com/interactive/2016/08/29/... NYTimes.com no longer supports Internet Explor... NaN 1.0
5 yahoo https://www.yahoo.com/news/pittsburgh-symphony... PITTSBURGH AP — Pittsburgh Symphony Orchestra ... 1.0 NaN
6 foxnews http://www.foxnews.com/politics/2016/08/13/cli... Hillary Clintons campaign is questioning a rep... NaN 1.0
7 foxnews http://www.foxnews.com/us/2017/04/15/april-gir... April the giraffe has given birth at a New Yor... NaN NaN
8 foxnews http://www.foxnews.com/politics/2017/05/03/hil... Want FOX News Halftime Report in your inbox ev... NaN 1.0
9 nyt http://www.nytimes.com/2016/09/06/obituaries/p... Shes an extremely liberated woman Ms. DeCrow s... NaN NaN
>>>
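To make the idea behind the `soft_lab` and `hard_lab` columns concrete, here is a minimal sketch of URL-pattern labeling using only the standard library. The two regular expressions and the `label_url` helper are hypothetical and chosen purely for illustration; they are not the patterns or the code that notnews ships, so the labels they produce will not match the package's output.

```python
import re
from urllib.parse import urlparse

# Hypothetical section patterns, for illustration only.
# These are NOT the patterns used inside notnews.
SOFT_PATTERN = re.compile(r"/(sports?|arts|entertainment|travel|style|food)(/|$)")
HARD_PATTERN = re.compile(r"/(politics|world|business|economy)(/|$)")


def label_url(url):
    """Return (soft_lab, hard_lab): 1 where a pattern matches, None otherwise."""
    # Only the path is inspected, so the host name never triggers a match.
    path = urlparse(url).path.lower()
    soft = 1 if SOFT_PATTERN.search(path) else None
    hard = 1 if HARD_PATTERN.search(path) else None
    return soft, hard


if __name__ == "__main__":
    for example in (
        "http://www.foxnews.com/politics/2016/08/13/story.html",
        "https://example.com/sports/game-recap",
        "https://example.com/2017/04/15/about",
    ):
        print(example, label_url(example))
```

As in the output above, a URL that matches neither pattern gets no label in either column, which is why some rows show `NaN` for both `soft_lab` and `hard_lab`.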
Installation
Installation is as easy as typing in:
pip install notnews
For faster installation using UV:
uv add notnews
Requirements
- Python 3.11, 3.12, or 3.13
- scikit-learn 1.3+ (models trained with sklearn 0.22+ are automatically compatible)
- pandas, numpy, nltk, and other standard scientific Python packages
Compatibility
This package includes automatic compatibility layers to ensure models trained with older scikit-learn versions (0.22+) work seamlessly with modern scikit-learn versions (1.3-1.5). Version warnings from scikit-learn are expected and harmless.
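If the version warnings clutter your logs, one option is to silence them locally with the standard `warnings` module. The sketch below is an assumption-laden example, not part of notnews: `quiet_version_warnings` is a hypothetical helper, and the filter assumes the warning text begins with "Trying to unpickle estimator", which is the wording recent scikit-learn releases use. Adjust the pattern if your version words it differently.

```python
import warnings
from contextlib import contextmanager

# Assumption: scikit-learn's version-mismatch warning starts with this phrase.
SKLEARN_VERSION_MSG = r"Trying to unpickle estimator"


@contextmanager
def quiet_version_warnings():
    """Hypothetical helper: ignore version-mismatch warnings inside the block only."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message=SKLEARN_VERSION_MSG)
        yield


if __name__ == "__main__":
    with quiet_version_warnings():
        # Call the notnews prediction functions here; other warnings still surface.
        warnings.warn("Trying to unpickle estimator X from version 0.22")
    print("done")
```

Scoping the filter to a context manager keeps unrelated warnings visible in the rest of your program.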
API
For detailed API documentation covering all six functions (soft_news_url_cat_us, pred_soft_news_us, pred_what_news_us, soft_news_url_cat_uk, pred_soft_news_uk, llm_classify_news), command-line usage, and examples, please see the project documentation.
Underlying Data
- For more information about how to get the data underlying the UK model, see here. For information about the data underlying the US model, see here.
Applications
We use the model to estimate the supply of not news in the US and the UK.
Documentation
For more information, please see the project documentation.
Authors
Suriyan Laohaprapanon and Gaurav Sood
Contributor Code of Conduct
The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.
License
The package is released under the MIT License.