Multi-source financial news library with quality scoring and ticker extraction

newsquant

The financial news data library for Python.

pip install newsquant


Fetch, extract, and score financial news articles from 13 sources in one call. Returns clean Article objects with full text, extracted tickers, and a quality score — no database required.

What gap does this fill? Existing options are single-source (newsapi-python, feedparser), return no article body (yfinance), or lack a quality layer. newsquant covers multiple sources, runs full-text extraction via trafilatura, deduplicates across sources, and scores every article on a 0–1 quality scale.


Install

pip install newsquant

Python 3.9+ required. For PostgreSQL persistence add the optional extra:

pip install "newsquant[postgres]"

Quickstart

from newsquant import Scraper

scraper = Scraper(sources=["yahoofinance", "cnbc"])
articles = scraper.fetch(days_back=1)

for a in articles:
    print(a.title, a.tickers, a.quality_score)

Examples

Filter by ticker

scraper = Scraper(sources=["yahoofinance", "benzinga", "cnbc"])
articles = scraper.fetch(tickers=["AAPL", "MSFT"], days_back=7)

for a in articles:
    print(f"[{', '.join(a.tickers)}] {a.title}")
    print(f"  source={a.source_name}  quality={a.quality_score:.2f}  words={a.word_count}")
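
Once fetch() has returned, the list can be regrouped with ordinary Python. The sketch below indexes articles by ticker and keeps the highest-scoring article per symbol. ArticleStub and best_per_ticker are hypothetical names used only for illustration; the sketch assumes nothing beyond the title, tickers, and quality_score fields listed under The Article object, so real Article objects should behave the same way.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class ArticleStub:
    """Hypothetical stand-in exposing three fields of newsquant's Article."""
    title: str
    tickers: List[str] = field(default_factory=list)
    quality_score: float = 0.0


def best_per_ticker(articles: Iterable[ArticleStub]) -> Dict[str, ArticleStub]:
    """Map each ticker to the highest-scoring article that mentions it."""
    best: Dict[str, ArticleStub] = {}
    for article in articles:
        for ticker in article.tickers:
            current = best.get(ticker)
            # Keep the first article seen on ties; replace only on a strictly higher score.
            if current is None or article.quality_score > current.quality_score:
                best[ticker] = article
    return best
```

An article that mentions several tickers can appear under more than one key, which is usually what a per-symbol view wants.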

Use API sources

scraper = Scraper(
    sources=["newsapi", "finnhub"],
    newsapi_key="YOUR_KEY",
    finnhub_api_key="YOUR_KEY",
)
articles = scraper.fetch(tickers=["NVDA"], days_back=3, min_quality=0.8)

API keys can also be set via environment variables — see Configuration.

Persist to a database

# SQLite
articles = scraper.fetch(save_to="sqlite:///./financial_news.db")

# PostgreSQL
articles = scraper.fetch(save_to="postgresql://user:pw@localhost/mydb")

Tables are created automatically if they don't exist. fetch() returns the Article list whether or not save_to is set.
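
The table and column names that save_to creates are not documented on this page, so a safe first step is to inspect the database with the standard-library sqlite3 module instead of guessing a schema. A minimal sketch; list_tables is a hypothetical helper, not part of newsquant.

```python
import sqlite3
from typing import List


def list_tables(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all user tables visible on a SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Usage (path assumed from the sqlite:///./financial_news.db URL above):
#   conn = sqlite3.connect("./financial_news.db")
#   print(list_tables(conn))
#   conn.close()
```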

Custom source

Subclass BaseFetcher to add any source the pipeline doesn't cover.

from newsquant import Scraper, BaseFetcher, SourceConfig
from scraper.models.article import RawArticle
from datetime import datetime, timezone

class MySource(BaseFetcher):
    def __init__(self, api_key: str):
        super().__init__(SourceConfig(name="my_source", type="api", rate_limit_rps=2.0))
        self._api_key = api_key

    def fetch(self, from_dt=None, to_dt=None, ticker=None, **kwargs) -> list[RawArticle]:
        resp = self._get(
            "https://api.example.com/news",
            params={"key": self._api_key, "symbol": ticker},
        )
        return [
            RawArticle(
                url=item["url"],
                title=item["title"],
                published_at=datetime.fromisoformat(item["published_at"]),
                source_name=self.config.name,
                summary=item.get("summary"),
            )
            for item in resp.json()["articles"]
        ]

# Mix custom and built-in sources freely
scraper = Scraper(sources=[MySource(api_key="secret"), "cnbc", "yahoofinance"])
articles = scraper.fetch(tickers=["AAPL"])
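
The example parses published_at with datetime.fromisoformat, which before Python 3.11 rejects a trailing "Z" and which returns a naive datetime when the string carries no offset. Because Article.published_at is documented as UTC, a custom fetcher may want to normalize timestamps first. The sketch below assumes that offset-less timestamps from the upstream API are already in UTC; parse_utc is a hypothetical helper, not part of newsquant.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and return an aware datetime in UTC."""
    # Python < 3.11 does not accept the "Z" suffix, so rewrite it first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        return parsed.replace(tzinfo=timezone.utc)
    # Aware timestamps are converted, so "+02:00" inputs shift by two hours.
    return parsed.astimezone(timezone.utc)
```

Inside fetch(), the RawArticle field would then read published_at=parse_utc(item["published_at"]).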

BaseFetcher provides _get() and _post() with automatic rate limiting and exponential-backoff retries. The rest of the pipeline (extraction, dedup, quality scoring, ticker extraction) runs automatically.
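
To make the retry behaviour concrete, here is a generic exponential-backoff sketch. It is not newsquant's implementation: retry_with_backoff, its parameters, and the doubling schedule are assumptions chosen for illustration (the default of 3 retries mirrors MAX_RETRIES in Configuration).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
    # Only reachable if max_retries is negative.
    raise ValueError("max_retries must be >= 0")
```

The sleep parameter is injectable so the schedule can be checked in tests without waiting; production code would leave it at time.sleep and usually catch only transient network errors, not every Exception.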


Built-in sources

Name Type Full text Notes
yahoofinance RSS
cnbc RSS
motleyfool RSS
benzinga RSS
businessinsider RSS
fortune RSS
prnewswire RSS Financial press releases only
bloomberg RSS Title + summary (paywalled)
wsj RSS Title + summary (paywalled)
ft RSS Title + summary (paywalled)
seekingalpha RSS Title + summary (paywalled)
newsapi API Key required
finnhub API Key required

Scraper() with no sources argument defaults to all RSS sources.


The Article object

Every item returned by fetch() is a Pydantic model with these fields:

Field              Type        Description
title              str         Article headline
body               str         Full extracted text
summary            str         Lead paragraph or RSS summary
url                str         Canonical URL
source_name        str         Source identifier (e.g. "cnbc_rss")
author             str | None  Byline if available
published_at       datetime    Publication time (UTC)
tickers            list[str]   Extracted ticker symbols
quality_score      float       0–1 composite quality score
quality_flags      list[str]   Flags that reduced the score
word_count         int         Body word count
language           str         Detected language code
is_paywall         bool        Paywall detected
is_duplicate       bool        Exact duplicate (URL or body hash)
is_near_duplicate  bool        Near-duplicate (SimHash)
is_metadata_only   bool        Full-text extraction skipped
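
The is_near_duplicate field is attributed to SimHash. For readers unfamiliar with it, the sketch below shows the general technique: hash each token, let every token vote on each bit of a fixed-width fingerprint, and compare fingerprints by Hamming distance. The tokenizer, the MD5-based token hash, and the function names are illustrative assumptions; newsquant's actual parameters and distance threshold are not documented here.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint (bits must be a multiple of 8, at most 128)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts would be treated as near-duplicates when hamming_distance(simhash(x), simhash(y)) falls under a chosen threshold; small values such as 3 are common for 64-bit fingerprints, but the right cutoff depends on the corpus.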

Configuration

Set these in a .env file or as environment variables. Only the API keys for sources you actually use are required.

# Required for Scraper(sources=["newsapi"])
NEWSAPI_KEY=

# Required for Scraper(sources=["finnhub"])
FINNHUB_API_KEY=

# Optional — defaults shown
DATABASE_URL=sqlite:///./financial_news.db
MIN_WORD_COUNT=150
LANGUAGE_CONFIDENCE_THRESHOLD=0.95
REQUEST_TIMEOUT_SECONDS=30
MAX_RETRIES=3
LOG_LEVEL=INFO

Copy .env.example to get started:

cp .env.example .env
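
How newsquant parses these variables internally is not shown on this page. For application code that wants to read the same settings, a sketch along these lines would work. The Settings dataclass and load_settings are hypothetical names; the variable names and defaults are taken from the listing above. Note that this reads the process environment only, so loading a .env file needs a separate step (for example the python-dotenv package).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view of the environment variables listed above."""
    newsapi_key: Optional[str]
    finnhub_api_key: Optional[str]
    database_url: str
    min_word_count: int
    language_confidence_threshold: float
    request_timeout_seconds: int
    max_retries: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from env, falling back to the documented defaults."""
    return Settings(
        # Empty strings (as in a freshly copied .env) are treated as unset.
        newsapi_key=env.get("NEWSAPI_KEY") or None,
        finnhub_api_key=env.get("FINNHUB_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///./financial_news.db"),
        min_word_count=int(env.get("MIN_WORD_COUNT", "150")),
        language_confidence_threshold=float(
            env.get("LANGUAGE_CONFIDENCE_THRESHOLD", "0.95")
        ),
        request_timeout_seconds=int(env.get("REQUEST_TIMEOUT_SECONDS", "30")),
        max_retries=int(env.get("MAX_RETRIES", "3")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```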

CLI

A command-line interface ships alongside the Python API for ops tasks:

# One-time setup
scraper db init

# Run sources
scraper scrape --all
scraper scrape --source cnbc_rss

# Historical backfill (GDELT and Wayback Machine)
scraper backfill --source gdelt --start 2020-01-01 --end 2025-01-01 --workers 4

# Query stored articles
scraper query --ticker AAPL --min-quality 0.8 --format csv

# Real-time daemon
scraper scheduler start --daemon

Contributing

Bug reports and pull requests are welcome. For major changes, open an issue first to discuss what you'd like to change.

git clone https://github.com/your-username/newsquant
cd newsquant
pip install -e ".[dev]"
pytest

License

MIT
