et-scraper-safe

A polite, RSS-first Python library for collecting public Economic Times news, deduplicating it, scoring sentiment + market impact, mapping headlines to NSE stock symbols, and persisting everything to SQLite + CSV — built for swing-trading and market research pipelines.

PyPI: https://pypi.org/project/et-scraper-safe/


Why "safe"?

This library is designed to be a good citizen of the web:

  • ✅ Respects robots.txt before fetching any HTML page
  • ✅ Prefers RSS feeds over HTML scraping
  • ✅ Configurable delay between HTML requests
  • ✅ Shared requests.Session with retry, exponential backoff, and 429/5xx handling
  • ✅ Identifiable User-Agent
  • ❌ Does not bypass logins, paywalls, captchas, Cloudflare, or rate limits

If robots.txt disallows a URL, the request is skipped — full stop.
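
For example, the documented can_fetch helper (listed under the lower-level API below) can be used to check a URL yourself before any manual fetch (the URL here is illustrative):

from et_scraper.robots_checker import can_fetch

url = "https://economictimes.indiatimes.com/markets"  # illustrative URL
if can_fetch(url, user_agent="*"):
    print("Allowed by robots.txt; fetch politely")
else:
    print("Disallowed by robots.txt; skipping")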


What's new in 0.2.0

  • 🔁 Reliable HTTP — requests.Session with retry, exponential backoff, 429/5xx handling
  • 🗃️ SQLite store — news_raw, news_clean, stock_news_map, sentiment_scores, scrape_logs
  • 📈 Stock mapping — built-in NSE large-cap list, extensible via DataFrame
  • 🧠 Structured sentiment — {sentiment, impact, confidence, reason, score} payload
  • 🪪 URL hashing + dedup — every row keyed by the SHA-1 of its link (see the sketch after this list)
  • 🕒 Published date parsing — UTC ISO published_at column
  • 📜 Logging — proper logging setup instead of print
  • 🧾 Run logs — every CLI run records stats and status in scrape_logs
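
Because every row is keyed by the SHA-1 of its link, the dedup key is reproducible outside the library. A minimal sketch, assuming url_hash is a plain SHA-1 hex digest of the URL string with no extra normalization:

import hashlib
from et_scraper import url_hash

link = "https://example.com/some-article"
print(url_hash(link))
# Assumption: matches a plain SHA-1 hex digest of the raw URL string
print(hashlib.sha1(link.encode("utf-8")).hexdigest())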

Install

pip install --upgrade et-scraper-safe

Requires Python 3.9+.


Quick start

1. As a command-line tool

et-scraper-safe

This runs the full pipeline:

RSS Collector → Dedup → Sentiment + Impact → NSE Stock Mapping → CSV + SQLite + run log

Outputs:

  • data/raw/et_news_raw_*.csv
  • data/clean/et_news_clean_*.csv
  • data/news.db (SQLite, 5 tables)

2. As a Python library

from et_scraper import (
    fetch_all_rss_news,
    analyze_sentiment,
    map_news_to_stock,
    load_default_stocks,
    init_db,
    upsert_clean,
    upsert_sentiment,
    upsert_stock_map,
)
import pandas as pd

# 1. Fetch + dedup (returns a DataFrame keyed by url_hash)
df = fetch_all_rss_news()

# 2. Structured sentiment per headline
sent = pd.DataFrame(df["title"].apply(analyze_sentiment).tolist())
df = pd.concat([df, sent], axis=1)

# 3. Map headlines to NSE symbols
stocks = load_default_stocks()
df["symbols"] = df["title"].apply(lambda t: map_news_to_stock(t, stocks))

# 4. Persist to SQLite
init_db()
upsert_clean(df)
upsert_sentiment(df)
upsert_stock_map(df)  # news ↔ symbol mapping (signature per the Public API table)
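
From here the DataFrame is ordinary pandas; for example, to pull out high-impact bullish headlines (column names as documented in the DataFrame schema below):

# 5. Filter for actionable headlines
hot = df[(df["sentiment"] == "Bullish") & (df["impact"] == "High")]
print(hot[["title", "symbols", "confidence"]].head())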

Architecture

ET / Moneycontrol / NSE
        ↓
RSS Collector            (et_scraper.rss_collector)
        ↓
HTML Article Collector   (et_scraper.html_collector — robots-aware, retrying)
        ↓
Cleaner + Dedup          (et_scraper.dedup — SHA-1 url_hash)
        ↓
Stock Mapper             (et_scraper.stock_mapper)
        ↓
Sentiment Engine         (et_scraper.sentiment — Level 1: keyword)
        ↓
Impact Scorer            (et_scraper.sentiment.analyze_sentiment)
        ↓
Database                 (et_scraper.database — SQLite)
        ↓
Swing Trading Signal Engine  ← your code

Suggested swing-trading scoring

final_score = (
    technical_score      * 0.40 +
    news_sentiment_score * 0.25 +
    news_impact_score    * 0.15 +
    volume_score         * 0.10 +
    sector_score         * 0.10
)

et-scraper-safe provides the two news inputs; the rest comes from your technical + market data pipeline.
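
A minimal sketch of that combination, assuming every component is normalized to a common 0–100 scale; the mapping from the library's sentiment/impact labels to numbers below is illustrative, not part of the package:

# Illustrative label → number mappings (assumptions, tune to taste)
SENTIMENT_POINTS = {"Bullish": 100, "Neutral": 50, "Bearish": 0}
IMPACT_POINTS = {"High": 100, "Medium": 60, "Low": 20}

def news_scores(row):
    # row carries the documented sentiment / impact / confidence fields
    weight = row["confidence"] / 100  # damp by the engine's own confidence
    return (SENTIMENT_POINTS[row["sentiment"]] * weight,
            IMPACT_POINTS[row["impact"]] * weight)

def final_score(technical, news_sentiment, news_impact, volume, sector):
    # Weights as suggested above; all inputs on the same 0–100 scale
    return (technical * 0.40 + news_sentiment * 0.25 +
            news_impact * 0.15 + volume * 0.10 + sector * 0.10)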


Sentiment engine roadmap

Level  Engine               Status
1      Keyword lexicon      ✅ Built in (zero ML deps)
2      FinBERT              Planned (wrap the analyze_sentiment shape)
3      LLM sentiment        Planned (wrap the analyze_sentiment shape)
4      Market-impact model  Planned (wrap the analyze_sentiment shape)

The analyze_sentiment return shape is intentionally stable so you can swap engines without changing downstream code:

{
  "sentiment":  "Bullish" | "Bearish" | "Neutral",
  "impact":     "High" | "Medium" | "Low",
  "confidence": 0-100,
  "reason":     "Short explanation",
  "score":      int,
}
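
For example, a drop-in replacement engine only has to return those keys; a minimal stub (the actual FinBERT/LLM call is elided and faked with placeholder values):

def my_sentiment_engine(text: str) -> dict:
    # Replace this stub with a FinBERT or LLM call; only the return shape matters
    label, prob = "Bullish", 0.91  # placeholder model output
    return {
        "sentiment": label,
        "impact": "Medium",                 # your own impact heuristic
        "confidence": int(prob * 100),      # 0-100
        "reason": "placeholder engine",
        "score": {"Bullish": 1, "Bearish": -1}.get(label, 0),
    }

# Downstream code stays unchanged, e.g.:
# sent = pd.DataFrame(df["title"].apply(my_sentiment_engine).tolist())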

Categories collected

Category      Source
latest        Top stories RSS
markets       Markets RSS
stocks        Stocks RSS
economy       Economy RSS
business      Company / business RSS
ipo           IPO RSS
mutual_funds  Mutual funds RSS
commodities   Commodities RSS
forex         Forex RSS

Feed URLs live in et_scraper/config.py and can be extended.
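
The internal layout of config.py is not documented here, so treat this as a hypothetical sketch: it assumes the feeds live in a category → URL dict, and both the attribute name RSS_FEEDS and the feed URL are placeholders:

from et_scraper import config

# Hypothetical attribute name; check et_scraper/config.py for the real one
config.RSS_FEEDS["my_category"] = "https://example.com/extra-feed.rss"  # placeholder URL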


SQLite schema

init_db() creates these tables in data/news.db (path configurable):

Table             Key                    Purpose
news_raw          (none)                 Append-only raw fetched rows
news_clean        url_hash PK            Deduplicated, parsed news rows
stock_news_map    (url_hash, symbol) PK  Many-to-many news ↔ stock mapping
sentiment_scores  url_hash PK            sentiment, impact, confidence, score
scrape_logs       run_id PK              Per-run stats and status

Indexes on news_clean.published_at and stock_news_map.symbol.
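
With that schema in place, downstream consumers can query the store with plain sqlite3; for example, the latest high-impact rows for one symbol (table and column names as documented above; "RELIANCE" is just an illustrative NSE symbol):

import sqlite3

con = sqlite3.connect("data/news.db")
rows = con.execute(
    """
    SELECT c.published_at, c.title, s.sentiment, s.impact
    FROM news_clean c
    JOIN stock_news_map m ON m.url_hash = c.url_hash
    JOIN sentiment_scores s ON s.url_hash = c.url_hash
    WHERE m.symbol = ? AND s.impact = 'High'
    ORDER BY c.published_at DESC
    LIMIT 10
    """,
    ("RELIANCE",),
).fetchall()
con.close()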


DataFrame columns (after the full pipeline)

Column        Description
url_hash      SHA-1 of the article URL (dedup key)
source        Always "economic_times"
category      One of the categories above
title         Article headline
summary       RSS summary / description
link          Canonical article URL
published     Raw publish string from the feed
published_at  UTC ISO publish timestamp (parsed)
fetched_at    UTC ISO timestamp when the row was collected
sentiment     Bullish / Bearish / Neutral
impact        High / Medium / Low
confidence    0–100 heuristic confidence
reason        Short explanation of which signals fired
score         Integer = positive_word_count − negative_word_count
symbols       List of NSE symbols mentioned in the title
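
Since symbols is a list-valued column, a common next step is to explode it into one row per (url_hash, symbol) pair, which matches the shape of the stock_news_map table; continuing with the quick-start df:

# One row per (url_hash, symbol) pair; headlines with no symbols drop out
pairs = (
    df[["url_hash", "symbols"]]
    .explode("symbols")
    .dropna(subset=["symbols"])
    .rename(columns={"symbols": "symbol"})
)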

Public API

Symbol                                        Module
fetch_all_rss_news() -> DataFrame             et_scraper.rss_collector
analyze_sentiment(text) -> dict               et_scraper.sentiment
sentiment_score(text) -> int                  et_scraper.sentiment
sentiment_label(score) -> str                 et_scraper.sentiment
map_news_to_stock(title, stocks_df=None)      et_scraper.stock_mapper
load_default_stocks() -> DataFrame            et_scraper.stock_mapper
url_hash(url) -> str                          et_scraper.dedup
drop_duplicates(df, subset='url_hash')        et_scraper.dedup
create_session(...) -> requests.Session       et_scraper.http_session
init_db(db_path=...)                          et_scraper.database
save_to_sqlite(df, db_path=..., table=...)    et_scraper.database
upsert_clean(df) / upsert_sentiment(df)       et_scraper.database
upsert_stock_map(df) / log_scrape(...)        et_scraper.database
save_dataframe(df, folder, name)              et_scraper.storage
get_logger(name)                              et_scraper.logging_setup

Lower-level helpers:

Symbol                                        Module
can_fetch(url, user_agent="*") -> bool        et_scraper.robots_checker
fetch_public_page(url) -> BeautifulSoup|None  et_scraper.html_collector
extract_headlines(soup) -> list[str]          et_scraper.parser
extract_article_text(soup) -> str             et_scraper.parser
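
Putting those helpers together for a one-off, robots-aware page fetch (an illustrative sketch; the URL is a placeholder):

from et_scraper.robots_checker import can_fetch
from et_scraper.html_collector import fetch_public_page
from et_scraper.parser import extract_headlines

url = "https://economictimes.indiatimes.com/markets"  # placeholder URL
if can_fetch(url):
    soup = fetch_public_page(url)  # BeautifulSoup, or None on failure
    if soup is not None:
        for headline in extract_headlines(soup):
            print(headline)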

Project layout

et_scraper_safe/
├── pyproject.toml          # PyPI packaging metadata
├── LICENSE                 # MIT
├── README.md
├── requirements.txt        # For running from source
├── main.py                 # Convenience runner (same as the CLI)
├── et_scraper/
│   ├── __init__.py
│   ├── cli.py              # Entry point for `et-scraper-safe`
│   ├── config.py           # RSS feed URLs, headers, timeouts
│   ├── http_session.py     # Retry/backoff requests.Session
│   ├── logging_setup.py    # Centralized logger
│   ├── robots_checker.py   # robots.txt enforcement
│   ├── rss_collector.py    # RSS → DataFrame + dedup + date parsing
│   ├── html_collector.py   # Polite, robots-aware HTML fetcher
│   ├── parser.py           # Headline / article-text extraction
│   ├── sentiment.py        # Lexicon sentiment + impact + confidence
│   ├── stock_mapper.py     # Headline → NSE symbol mapping
│   ├── dedup.py            # url_hash + drop_duplicates
│   ├── storage.py          # Timestamped CSV writer
│   └── database.py         # SQLite schema + upserts + run logs
└── data/
    ├── raw/                # Raw scraped CSVs
    ├── clean/              # Cleaned + scored CSVs
    └── news.db             # SQLite store (created on first run)

Development

Run from source:

git clone <your-fork>
cd et_scraper_safe
pip install -r requirements.txt
python main.py

Build + publish a new version (maintainers only):

# 1. Bump version in pyproject.toml AND et_scraper/__init__.py
# 2. Build and upload
rm -rf dist build *.egg-info
python -m build
TWINE_USERNAME=__token__ TWINE_PASSWORD="$PYPI_API_TOKEN" python -m twine upload dist/*

Disclaimer

This library only collects data that the Economic Times publishes openly via RSS or pages allowed by their robots.txt. It is intended for personal research and educational use. You are responsible for complying with the Economic Times' Terms of Service and any applicable laws when using this library or the data it collects.


License

MIT — see LICENSE.
