# et-scraper-safe

A polite, RSS-first Python library for collecting public Economic Times news, deduplicating it, scoring sentiment + market impact, mapping headlines to NSE stock symbols, and persisting everything to SQLite + CSV — built for swing-trading and market research pipelines.
Why "safe"?
This library is designed to be a good citizen of the web:
- ✅ Respects
robots.txtbefore fetching any HTML page - ✅ Prefers RSS feeds over HTML scraping
- ✅ Configurable delay between HTML requests
- ✅ Shared
requests.Sessionwith retry, exponential backoff, and 429/5xx handling - ✅ Identifiable
User-Agent - ❌ Does not bypass logins, paywalls, captchas, Cloudflare, or rate limits
If robots.txt disallows a URL, the request is skipped — full stop.
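That gate needs nothing beyond the standard library. Here is a minimal sketch using `urllib.robotparser`; the packaged `robots_checker` may differ in detail:

```python
# Minimal robots.txt gate: a sketch, not the library's actual implementation.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def robots_allows(url: str, user_agent: str = "*") -> bool:
    root = urlparse(url)
    parser = RobotFileParser(f"{root.scheme}://{root.netloc}/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt
    return parser.can_fetch(user_agent, url)
```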
## What's new in 0.2.0

- 🔁 Reliable HTTP — `requests.Session` with retry, exponential backoff, and 429/5xx handling (a minimal sketch follows this list)
- 🗃️ SQLite store — `news_raw`, `news_clean`, `stock_news_map`, `sentiment_scores`, `scrape_logs`
- 📈 Stock mapping — built-in NSE large-cap list, extensible via DataFrame
- 🧠 Structured sentiment — `{sentiment, impact, confidence, reason, score}` payload
- 🪪 URL hashing + dedup — every row keyed by the SHA-1 of its link
- 🕒 Published-date parsing — UTC ISO `published_at` column
- 📜 Logging — proper `logging` setup instead of `print`
- 🧾 Run logs — every CLI run records stats and status in `scrape_logs`
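The retry behaviour is the standard `requests` + `urllib3` pattern. A minimal sketch, with illustrative parameter names; `create_session(...)` in `et_scraper.http_session` may expose different options:

```python
# Sketch of a retrying session with exponential backoff and 429/5xx handling.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff: float = 1.0) -> requests.Session:
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                      # waits ~1s, 2s, 4s, ...
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
        respect_retry_after_header=True,             # honor Retry-After on 429
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.headers["User-Agent"] = "et-scraper-safe"  # identifiable UA
    return session
```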
## Install

```bash
pip install --upgrade et-scraper-safe
```

Requires Python 3.9+.
## Quick start

### 1. As a command-line tool

```bash
et-scraper-safe
```

This runs the full pipeline:

```
RSS Collector → Dedup → Sentiment + Impact → NSE Stock Mapping → CSV + SQLite + run log
```

Outputs:

- `data/raw/et_news_raw_*.csv`
- `data/clean/et_news_clean_*.csv`
- `data/news.db` (SQLite, 5 tables)
### 2. As a Python library

```python
import pandas as pd

from et_scraper import (
    fetch_all_rss_news,
    analyze_sentiment,
    map_news_to_stock,
    load_default_stocks,
    init_db,
    upsert_clean,
    upsert_sentiment,
    upsert_stock_map,
)

# 1. Fetch + dedup (returns a DataFrame keyed by url_hash)
df = fetch_all_rss_news()

# 2. Structured sentiment per headline
sent = pd.DataFrame(df["title"].apply(analyze_sentiment).tolist())
df = pd.concat([df, sent], axis=1)

# 3. Map headlines to NSE symbols
stocks = load_default_stocks()
df["symbols"] = df["title"].apply(lambda t: map_news_to_stock(t, stocks))

# 4. Persist to SQLite
init_db()
upsert_clean(df)
upsert_sentiment(df)
upsert_stock_map(df)  # also persist the news ↔ symbol mapping
```
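Because `map_news_to_stock` accepts any stocks DataFrame, the built-in large-cap list can be extended. A sketch; the `symbol`/`name` column names are an assumption, so check the output of `load_default_stocks()` for the actual schema:

```python
# Extend the built-in NSE list with extra symbols (column names assumed).
extra = pd.DataFrame([{"symbol": "ZOMATO", "name": "Zomato"}])
stocks = pd.concat([load_default_stocks(), extra], ignore_index=True)
df["symbols"] = df["title"].apply(lambda t: map_news_to_stock(t, stocks))
```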
## Architecture

```
ET / Moneycontrol / NSE
        ↓
RSS Collector (et_scraper.rss_collector)
        ↓
HTML Article Collector (et_scraper.html_collector — robots-aware, retrying)
        ↓
Cleaner + Dedup (et_scraper.dedup — SHA-1 url_hash)
        ↓
Stock Mapper (et_scraper.stock_mapper)
        ↓
Sentiment Engine (et_scraper.sentiment — Level 1: keyword)
        ↓
Impact Scorer (et_scraper.sentiment.analyze_sentiment)
        ↓
Database (et_scraper.database — SQLite)
        ↓
Swing Trading Signal Engine ← your code
```
## Suggested swing-trading scoring

```python
final_score = (
    technical_score * 0.40
    + news_sentiment_score * 0.25
    + news_impact_score * 0.15
    + volume_score * 0.10
    + sector_score * 0.10
)
```

et-scraper-safe provides the two news inputs (sentiment and impact); the rest comes from your technical + market data pipeline.
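A toy illustration of the weighting; every component value here is made up and would come from your own pipeline:

```python
def final_score(technical, news_sentiment, news_impact, volume, sector):
    """Combine normalized (0-1) component scores with the weights above."""
    return (technical * 0.40 + news_sentiment * 0.25 + news_impact * 0.15
            + volume * 0.10 + sector * 0.10)

# e.g. strong chart, mildly bullish high-impact news, average volume/sector
print(round(final_score(0.8, 0.6, 1.0, 0.5, 0.7), 2))  # 0.74
```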
## Sentiment engine roadmap

| Level | Engine | Status |
|---|---|---|
| 1 | Keyword lexicon | ✅ Built in (zero ML deps) |
| 2 | FinBERT | Wrap the `analyze_sentiment` shape |
| 3 | LLM sentiment | Wrap the `analyze_sentiment` shape |
| 4 | Market-impact model | Wrap the `analyze_sentiment` shape |

The `analyze_sentiment` return shape is intentionally stable so you can swap engines without changing downstream code:

```
{
    "sentiment": "Bullish" | "Bearish" | "Neutral",
    "impact": "High" | "Medium" | "Low",
    "confidence": 0-100,
    "reason": "Short explanation",
    "score": int,
}
```
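A Level-2 engine, for example, only has to return the same keys. A hedged sketch of a FinBERT wrapper follows; the model name, its label set, and the impact/score heuristics below are assumptions, not part of this library:

```python
from transformers import pipeline

# ProsusAI/finbert emits labels: positive / negative / neutral
_finbert = pipeline("text-classification", model="ProsusAI/finbert")
_LABELS = {"positive": "Bullish", "negative": "Bearish", "neutral": "Neutral"}

def analyze_sentiment_finbert(text: str) -> dict:
    out = _finbert(text)[0]                 # {"label": ..., "score": 0.0-1.0}
    confidence = round(out["score"] * 100)
    sentiment = _LABELS.get(out["label"], "Neutral")
    signed = {"Bullish": 1, "Bearish": -1, "Neutral": 0}[sentiment]
    return {
        "sentiment": sentiment,
        "impact": "High" if confidence >= 80 else "Medium" if confidence >= 50 else "Low",
        "confidence": confidence,
        "reason": f"FinBERT label={out['label']} p={out['score']:.2f}",
        "score": signed * confidence,       # stand-in for the lexicon word count
    }
```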
## Categories collected

| Category | Source |
|---|---|
| `latest` | Top stories RSS |
| `markets` | Markets RSS |
| `stocks` | Stocks RSS |
| `economy` | Economy RSS |
| `business` | Company / business RSS |
| `ipo` | IPO RSS |
| `mutual_funds` | Mutual funds RSS |
| `commodities` | Commodities RSS |
| `forex` | Forex RSS |

Feed URLs live in `et_scraper/config.py` and can be extended.
## SQLite schema

`init_db()` creates these tables in `data/news.db` (path configurable):

| Table | Key | Purpose |
|---|---|---|
| `news_raw` | — | Append-only raw fetched rows |
| `news_clean` | `url_hash` PK | Deduplicated, parsed news rows |
| `stock_news_map` | `(url_hash, symbol)` PK | Many-to-many news ↔ stock mapping |
| `sentiment_scores` | `url_hash` PK | Sentiment, impact, confidence, score |
| `scrape_logs` | `run_id` PK | Per-run stats and status |

Indexes on `news_clean.published_at` and `stock_news_map.symbol`.
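With that schema, downstream queries are plain SQL. For example, recent high-impact bullish headlines joined to their mapped symbols (column names as documented above):

```python
import sqlite3

con = sqlite3.connect("data/news.db")
rows = con.execute("""
    SELECT c.published_at, c.title, m.symbol, s.confidence
    FROM news_clean        c
    JOIN sentiment_scores  s USING (url_hash)
    JOIN stock_news_map    m USING (url_hash)
    WHERE s.sentiment = 'Bullish' AND s.impact = 'High'
    ORDER BY c.published_at DESC
    LIMIT 20
""").fetchall()
con.close()
```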
## DataFrame columns (after the full pipeline)

| Column | Description |
|---|---|
| `url_hash` | SHA-1 of the article URL (dedup key) |
| `source` | Always `"economic_times"` |
| `category` | One of the categories above |
| `title` | Article headline |
| `summary` | RSS summary / description |
| `link` | Canonical article URL |
| `published` | Raw publish string from the feed |
| `published_at` | UTC ISO publish timestamp (parsed) |
| `fetched_at` | UTC ISO timestamp when the row was collected |
| `sentiment` | Bullish / Bearish / Neutral |
| `impact` | High / Medium / Low |
| `confidence` | 0–100 heuristic confidence |
| `reason` | Short explanation of which signals fired |
| `score` | Integer = `positive_word_count` − `negative_word_count` |
| `symbols` | List of NSE symbols mentioned in the title |
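These columns make post-filtering a one-liner in pandas. For example, keeping only high-impact bullish rows that mapped to at least one symbol:

```python
signals = df[
    (df["sentiment"] == "Bullish")
    & (df["impact"] == "High")
    & (df["symbols"].str.len() > 0)   # drop rows with an empty symbol list
].sort_values("published_at", ascending=False)
```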
## Public API

| Symbol | Module |
|---|---|
| `fetch_all_rss_news() -> DataFrame` | `et_scraper.rss_collector` |
| `analyze_sentiment(text) -> dict` | `et_scraper.sentiment` |
| `sentiment_score(text) -> int` | `et_scraper.sentiment` |
| `sentiment_label(score) -> str` | `et_scraper.sentiment` |
| `map_news_to_stock(title, stocks_df=None)` | `et_scraper.stock_mapper` |
| `load_default_stocks() -> DataFrame` | `et_scraper.stock_mapper` |
| `url_hash(url) -> str` | `et_scraper.dedup` |
| `drop_duplicates(df, subset='url_hash')` | `et_scraper.dedup` |
| `create_session(...) -> requests.Session` | `et_scraper.http_session` |
| `init_db(db_path=...)` | `et_scraper.database` |
| `save_to_sqlite(df, db_path=..., table=...)` | `et_scraper.database` |
| `upsert_clean(df)` / `upsert_sentiment(df)` | `et_scraper.database` |
| `upsert_stock_map(df)` / `log_scrape(...)` | `et_scraper.database` |
| `save_dataframe(df, folder, name)` | `et_scraper.storage` |
| `get_logger(name)` | `et_scraper.logging_setup` |

Lower-level helpers:

| Symbol | Module |
|---|---|
| `can_fetch(url, user_agent="*") -> bool` | `et_scraper.robots_checker` |
| `fetch_public_page(url) -> BeautifulSoup \| None` | `et_scraper.html_collector` |
| `extract_headlines(soup) -> list[str]` | `et_scraper.parser` |
| `extract_article_text(soup) -> str` | `et_scraper.parser` |
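Putting the helpers together: a sketch of a one-off, robots-respecting HTML fetch, using only the symbols in the table above (the URL is just an example):

```python
from et_scraper.robots_checker import can_fetch
from et_scraper.html_collector import fetch_public_page
from et_scraper.parser import extract_headlines

url = "https://economictimes.indiatimes.com/markets"
if can_fetch(url):
    soup = fetch_public_page(url)        # BeautifulSoup, or None on failure
    if soup is not None:
        for headline in extract_headlines(soup):
            print(headline)
```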
## Project layout

```
et_scraper_safe/
├── pyproject.toml        # PyPI packaging metadata
├── LICENSE               # MIT
├── README.md
├── requirements.txt      # For running from source
├── main.py               # Convenience runner (same as the CLI)
├── et_scraper/
│   ├── __init__.py
│   ├── cli.py            # Entry point for `et-scraper-safe`
│   ├── config.py         # RSS feed URLs, headers, timeouts
│   ├── http_session.py   # Retry/backoff requests.Session
│   ├── logging_setup.py  # Centralized logger
│   ├── robots_checker.py # robots.txt enforcement
│   ├── rss_collector.py  # RSS → DataFrame + dedup + date parsing
│   ├── html_collector.py # Polite, robots-aware HTML fetcher
│   ├── parser.py         # Headline / article-text extraction
│   ├── sentiment.py      # Lexicon sentiment + impact + confidence
│   ├── stock_mapper.py   # Headline → NSE symbol mapping
│   ├── dedup.py          # url_hash + drop_duplicates
│   ├── storage.py        # Timestamped CSV writer
│   └── database.py       # SQLite schema + upserts + run logs
└── data/
    ├── raw/              # Raw scraped CSVs
    ├── clean/            # Cleaned + scored CSVs
    └── news.db           # SQLite store (created on first run)
```
## Development

Run from source:

```bash
git clone <your-fork>
cd et_scraper_safe
pip install -r requirements.txt
python main.py
```

Build + publish a new version (maintainers only):

```bash
# 1. Bump the version in pyproject.toml AND et_scraper/__init__.py
# 2. Build and upload
rm -rf dist build *.egg-info
python -m build
TWINE_USERNAME=__token__ TWINE_PASSWORD="$PYPI_API_TOKEN" python -m twine upload dist/*
```
## Disclaimer

This library only collects data that the Economic Times publishes openly via RSS or on pages allowed by their robots.txt. It is intended for personal research and educational use. You are responsible for complying with the Economic Times' Terms of Service and any applicable laws when using this library or the data it collects.

## License

MIT — see `LICENSE`.