et-scraper-safe

A polite, RSS-first Python library for collecting public Economic Times news, deduplicating it, scoring sentiment + market impact, mapping headlines to NSE stock symbols, and persisting everything to SQLite + CSV — built for swing-trading and market research pipelines.

PyPI: https://pypi.org/project/et-scraper-safe/


Why "safe"?

This library is designed to be a good citizen of the web:

  • ✅ Respects robots.txt before fetching any HTML page
  • ✅ Prefers RSS feeds over HTML scraping
  • ✅ Configurable delay between HTML requests
  • ✅ Shared requests.Session with retry, exponential backoff, and 429/5xx handling
  • ✅ Identifiable User-Agent
  • ❌ Does not bypass logins, paywalls, captchas, Cloudflare, or rate limits

If robots.txt disallows a URL, the request is skipped — full stop.
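
For example, the documented can_fetch helper (listed under the lower-level API below) can be used to check a URL yourself before any manual fetch (the URL here is illustrative):

from et_scraper.robots_checker import can_fetch

url = "https://economictimes.indiatimes.com/markets"  # illustrative URL
if can_fetch(url, user_agent="*"):
    print("Allowed by robots.txt; fetch politely")
else:
    print("Disallowed by robots.txt; skipping")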


What's new in 0.2.0

  • 🔁 Reliable HTTP — requests.Session with retry, exponential backoff, 429/5xx handling
  • 🗃️ SQLite store — news_raw, news_clean, stock_news_map, sentiment_scores, scrape_logs
  • 📈 Stock mapping — built-in NSE large-cap list, extensible via DataFrame
  • 🧠 Structured sentiment — {sentiment, impact, confidence, reason, score} payload
  • 🪪 URL hashing + dedup — every row keyed by the SHA-1 of its link (see the sketch after this list)
  • 🕒 Published date parsing — UTC ISO published_at column
  • 📜 Logging — proper logging setup instead of print
  • 🧾 Run logs — every CLI run records stats and status in scrape_logs
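
Because every row is keyed by the SHA-1 of its link, the dedup key is reproducible outside the library. A minimal sketch, assuming url_hash is a plain SHA-1 hex digest of the URL string with no extra normalization:

import hashlib
from et_scraper import url_hash

link = "https://example.com/some-article"
print(url_hash(link))
# Assumption: matches a plain SHA-1 hex digest of the raw URL string
print(hashlib.sha1(link.encode("utf-8")).hexdigest())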

Install

pip install --upgrade et-scraper-safe

Requires Python 3.9+.


Quick start

1. As a command-line tool

et-scraper-safe

This runs the full pipeline:

RSS Collector → Dedup → Sentiment + Impact → NSE Stock Mapping → CSV + SQLite + run log

Outputs:

  • data/raw/et_news_raw_*.csv
  • data/clean/et_news_clean_*.csv
  • data/news.db (SQLite, 5 tables)

2. As a Python library

from et_scraper import (
    fetch_all_rss_news,
    analyze_sentiment,
    map_news_to_stock,
    load_default_stocks,
    init_db,
    upsert_clean,
    upsert_sentiment,
    upsert_stock_map,
)
import pandas as pd

# 1. Fetch + dedup (returns a DataFrame keyed by url_hash)
df = fetch_all_rss_news()

# 2. Structured sentiment per headline
sent = pd.DataFrame(df["title"].apply(analyze_sentiment).tolist())
df = pd.concat([df, sent], axis=1)

# 3. Map headlines to NSE symbols
stocks = load_default_stocks()
df["symbols"] = df["title"].apply(lambda t: map_news_to_stock(t, stocks))

# 4. Persist to SQLite
init_db()
upsert_clean(df)
upsert_sentiment(df)
upsert_stock_map(df)  # news ↔ symbol mapping (signature per the Public API table)
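
From here the DataFrame is ordinary pandas; for example, to pull out high-impact bullish headlines (column names as documented in the DataFrame schema below):

# 5. Filter for actionable headlines
hot = df[(df["sentiment"] == "Bullish") & (df["impact"] == "High")]
print(hot[["title", "symbols", "confidence"]].head())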

Architecture

ET / Moneycontrol / NSE
        ↓
RSS Collector            (et_scraper.rss_collector)
        ↓
HTML Article Collector   (et_scraper.html_collector — robots-aware, retrying)
        ↓
Cleaner + Dedup          (et_scraper.dedup — SHA-1 url_hash)
        ↓
Stock Mapper             (et_scraper.stock_mapper)
        ↓
Sentiment Engine         (et_scraper.sentiment — Level 1: keyword)
        ↓
Impact Scorer            (et_scraper.sentiment.analyze_sentiment)
        ↓
Database                 (et_scraper.database — SQLite)
        ↓
Swing Trading Signal Engine  ← your code

Suggested swing-trading scoring

final_score = (
    technical_score      * 0.40 +
    news_sentiment_score * 0.25 +
    news_impact_score    * 0.15 +
    volume_score         * 0.10 +
    sector_score         * 0.10
)

et-scraper-safe provides the two news inputs; the rest comes from your technical + market data pipeline.
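
A minimal sketch of that combination, assuming every component is normalized to a common 0–100 scale; the mapping from the library's sentiment/impact labels to numbers below is illustrative, not part of the package:

# Illustrative label → number mappings (assumptions, tune to taste)
SENTIMENT_POINTS = {"Bullish": 100, "Neutral": 50, "Bearish": 0}
IMPACT_POINTS = {"High": 100, "Medium": 60, "Low": 20}

def news_scores(row):
    # row carries the documented sentiment / impact / confidence fields
    weight = row["confidence"] / 100  # damp by the engine's own confidence
    return (SENTIMENT_POINTS[row["sentiment"]] * weight,
            IMPACT_POINTS[row["impact"]] * weight)

def final_score(technical, news_sentiment, news_impact, volume, sector):
    # Weights as suggested above; all inputs on the same 0–100 scale
    return (technical * 0.40 + news_sentiment * 0.25 +
            news_impact * 0.15 + volume * 0.10 + sector * 0.10)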


Sentiment engine roadmap

Level  Engine               Status
1      Keyword lexicon      ✅ Built in (zero ML deps)
2      FinBERT              Planned (wrap the analyze_sentiment shape)
3      LLM sentiment        Planned (wrap the analyze_sentiment shape)
4      Market-impact model  Planned (wrap the analyze_sentiment shape)

The analyze_sentiment return shape is intentionally stable so you can swap engines without changing downstream code:

{
  "sentiment":  "Bullish" | "Bearish" | "Neutral",
  "impact":     "High" | "Medium" | "Low",
  "confidence": 0-100,
  "reason":     "Short explanation",
  "score":      int,
}
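
For example, a drop-in replacement engine only has to return those keys; a minimal stub (the actual FinBERT/LLM call is elided and faked with placeholder values):

def my_sentiment_engine(text: str) -> dict:
    # Replace this stub with a FinBERT or LLM call; only the return shape matters
    label, prob = "Bullish", 0.91  # placeholder model output
    return {
        "sentiment": label,
        "impact": "Medium",                 # your own impact heuristic
        "confidence": int(prob * 100),      # 0-100
        "reason": "placeholder engine",
        "score": {"Bullish": 1, "Bearish": -1}.get(label, 0),
    }

# Downstream code stays unchanged, e.g.:
# sent = pd.DataFrame(df["title"].apply(my_sentiment_engine).tolist())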

Categories collected

Category      Source
latest        Top stories RSS
markets       Markets RSS
stocks        Stocks RSS
economy       Economy RSS
business      Company / business RSS
ipo           IPO RSS
mutual_funds  Mutual funds RSS
commodities   Commodities RSS
forex         Forex RSS

Feed URLs live in et_scraper/config.py and can be extended.
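
The internal layout of config.py is not documented here, so treat this as a hypothetical sketch: it assumes the feeds live in a category → URL dict, and both the attribute name RSS_FEEDS and the feed URL are placeholders:

from et_scraper import config

# Hypothetical attribute name; check et_scraper/config.py for the real one
config.RSS_FEEDS["my_category"] = "https://example.com/extra-feed.rss"  # placeholder URL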


SQLite schema

init_db() creates these tables in data/news.db (path configurable):

Table             Key                    Purpose
news_raw          (none)                 Append-only raw fetched rows
news_clean        url_hash PK            Deduplicated, parsed news rows
stock_news_map    (url_hash, symbol) PK  Many-to-many news ↔ stock mapping
sentiment_scores  url_hash PK            sentiment, impact, confidence, score
scrape_logs       run_id PK              Per-run stats and status

Indexes on news_clean.published_at and stock_news_map.symbol.
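
With that schema in place, downstream consumers can query the store with plain sqlite3; for example, the latest high-impact rows for one symbol (table and column names as documented above; "RELIANCE" is just an illustrative NSE symbol):

import sqlite3

con = sqlite3.connect("data/news.db")
rows = con.execute(
    """
    SELECT c.published_at, c.title, s.sentiment, s.impact
    FROM news_clean c
    JOIN stock_news_map m ON m.url_hash = c.url_hash
    JOIN sentiment_scores s ON s.url_hash = c.url_hash
    WHERE m.symbol = ? AND s.impact = 'High'
    ORDER BY c.published_at DESC
    LIMIT 10
    """,
    ("RELIANCE",),
).fetchall()
con.close()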


DataFrame columns (after the full pipeline)

Column        Description
url_hash      SHA-1 of the article URL (dedup key)
source        Always "economic_times"
category      One of the categories above
title         Article headline
summary       RSS summary / description
link          Canonical article URL
published     Raw publish string from the feed
published_at  UTC ISO publish timestamp (parsed)
fetched_at    UTC ISO timestamp when the row was collected
sentiment     Bullish / Bearish / Neutral
impact        High / Medium / Low
confidence    0–100 heuristic confidence
reason        Short explanation of which signals fired
score         Integer = positive_word_count − negative_word_count
symbols       List of NSE symbols mentioned in the title
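
Since symbols is a list-valued column, a common next step is to explode it into one row per (url_hash, symbol) pair, which matches the shape of the stock_news_map table; continuing with the quick-start df:

# One row per (url_hash, symbol) pair; headlines with no symbols drop out
pairs = (
    df[["url_hash", "symbols"]]
    .explode("symbols")
    .dropna(subset=["symbols"])
    .rename(columns={"symbols": "symbol"})
)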

Public API

Symbol                                        Module
fetch_all_rss_news() -> DataFrame             et_scraper.rss_collector
analyze_sentiment(text) -> dict               et_scraper.sentiment
sentiment_score(text) -> int                  et_scraper.sentiment
sentiment_label(score) -> str                 et_scraper.sentiment
map_news_to_stock(title, stocks_df=None)      et_scraper.stock_mapper
load_default_stocks() -> DataFrame            et_scraper.stock_mapper
url_hash(url) -> str                          et_scraper.dedup
drop_duplicates(df, subset='url_hash')        et_scraper.dedup
create_session(...) -> requests.Session       et_scraper.http_session
init_db(db_path=...)                          et_scraper.database
save_to_sqlite(df, db_path=..., table=...)    et_scraper.database
upsert_clean(df) / upsert_sentiment(df)       et_scraper.database
upsert_stock_map(df) / log_scrape(...)        et_scraper.database
save_dataframe(df, folder, name)              et_scraper.storage
get_logger(name)                              et_scraper.logging_setup

Lower-level helpers:

Symbol                                        Module
can_fetch(url, user_agent="*") -> bool        et_scraper.robots_checker
fetch_public_page(url) -> BeautifulSoup|None  et_scraper.html_collector
extract_headlines(soup) -> list[str]          et_scraper.parser
extract_article_text(soup) -> str             et_scraper.parser
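
Putting those helpers together for a one-off, robots-aware page fetch (an illustrative sketch; the URL is a placeholder):

from et_scraper.robots_checker import can_fetch
from et_scraper.html_collector import fetch_public_page
from et_scraper.parser import extract_headlines

url = "https://economictimes.indiatimes.com/markets"  # placeholder URL
if can_fetch(url):
    soup = fetch_public_page(url)  # BeautifulSoup, or None on failure
    if soup is not None:
        for headline in extract_headlines(soup):
            print(headline)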

Project layout

et_scraper_safe/
├── pyproject.toml          # PyPI packaging metadata
├── LICENSE                 # MIT
├── README.md
├── requirements.txt        # For running from source
├── main.py                 # Convenience runner (same as the CLI)
├── et_scraper/
│   ├── __init__.py
│   ├── cli.py              # Entry point for `et-scraper-safe`
│   ├── config.py           # RSS feed URLs, headers, timeouts
│   ├── http_session.py     # Retry/backoff requests.Session
│   ├── logging_setup.py    # Centralized logger
│   ├── robots_checker.py   # robots.txt enforcement
│   ├── rss_collector.py    # RSS → DataFrame + dedup + date parsing
│   ├── html_collector.py   # Polite, robots-aware HTML fetcher
│   ├── parser.py           # Headline / article-text extraction
│   ├── sentiment.py        # Lexicon sentiment + impact + confidence
│   ├── stock_mapper.py     # Headline → NSE symbol mapping
│   ├── dedup.py            # url_hash + drop_duplicates
│   ├── storage.py          # Timestamped CSV writer
│   └── database.py         # SQLite schema + upserts + run logs
└── data/
    ├── raw/                # Raw scraped CSVs
    ├── clean/              # Cleaned + scored CSVs
    └── news.db             # SQLite store (created on first run)

Development

Run from source:

git clone <your-fork>
cd et_scraper_safe
pip install -r requirements.txt
python main.py

Build + publish a new version (maintainers only):

# 1. Bump version in pyproject.toml AND et_scraper/__init__.py
# 2. Build and upload
rm -rf dist build *.egg-info
python -m build
TWINE_USERNAME=__token__ TWINE_PASSWORD="$PYPI_API_TOKEN" python -m twine upload dist/*

Disclaimer

This library only collects data that the Economic Times publishes openly via RSS or pages allowed by their robots.txt. It is intended for personal research and educational use. You are responsible for complying with the Economic Times' Terms of Service and any applicable laws when using this library or the data it collects.


License

MIT — see LICENSE.
