
et-scraper-safe


A polite, RSS-first Python library for collecting public Economic Times news headlines and tagging them with simple sentiment — built for market and swing-trading research pipelines.

PyPI: https://pypi.org/project/et-scraper-safe/


Why "safe"?

This library is designed to be a good citizen of the web:

  • ✅ Respects robots.txt before fetching any HTML page
  • ✅ Prefers RSS feeds over HTML scraping
  • ✅ Adds a configurable delay between HTML requests
  • ✅ Sends a clear, identifiable User-Agent
  • ❌ Does not bypass logins, paywalls, captchas, Cloudflare, or rate limits
  • ❌ Does not scrape any content the publisher has restricted

If robots.txt disallows a URL, the request is skipped — full stop.
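The robots.txt gate can be illustrated with the standard library's urllib.robotparser. This is a sketch of the idea only, not the actual et_scraper.robots_checker implementation, and the sample robots.txt rules below are invented:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the real rules come from the site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def can_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True only if the (sample) robots rules allow this URL."""
    parser = RobotFileParser()
    parser.parse(SAMPLE_ROBOTS.splitlines())
    return parser.can_fetch(user_agent, url)

print(can_fetch("https://example.com/markets/news"))   # True  -> fetch allowed
print(can_fetch("https://example.com/private/page"))   # False -> request skipped
```

Anything that returns False is simply never requested.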


Install

pip install et-scraper-safe

Requires Python 3.9+.


Quick start

1. As a command-line tool

After install, a console command is available:

et-scraper-safe

This will:

  1. Fetch all configured Economic Times RSS feeds.
  2. Score each headline's sentiment.
  3. Save raw + cleaned CSVs into ./data/raw/ and ./data/clean/.
  4. Print a summary of bullish / bearish / neutral counts.

2. As a Python library

from et_scraper import (
    fetch_all_rss_news,
    sentiment_score,
    sentiment_label,
    save_dataframe,
)

df = fetch_all_rss_news()
df["sentiment_score"] = df["title"].apply(sentiment_score)
df["sentiment_label"] = df["sentiment_score"].apply(sentiment_label)

print(df.head())
save_dataframe(df, folder="data/clean", name="et_news_clean")
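In spirit, save_dataframe writes rows to a CSV whose filename carries a timestamp. A sketch of that idea using the stdlib csv module instead of pandas (the helper name save_rows and the filename pattern are assumptions, not the library's API):

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_rows(rows: list[dict], folder: str, name: str) -> Path:
    """Write rows to <folder>/<name>_<UTC timestamp>.csv and return the path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = Path(folder) / f"{name}_{stamp}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path

out = save_rows([{"title": "Sensex ends higher", "sentiment_score": 1}],
                tempfile.mkdtemp(), "et_news_clean")
print(out.name)  # e.g. et_news_clean_20240101_100000.csv
```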

Categories collected

Category       Source
latest         Top stories RSS
markets        Markets RSS
stocks         Stocks RSS
economy        Economy RSS
business       Company / business RSS
ipo            IPO RSS
mutual_funds   Mutual funds RSS
commodities    Commodities RSS
forex          Forex RSS

Feed URLs are defined in et_scraper/config.py and can be extended.
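Turning a feed into rows of this shape is plain RSS parsing. A minimal offline sketch with the standard library (the feed XML is a made-up sample; the real rss_collector may use a dedicated feed parser):

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>Sensex ends higher</title>
    <description>Benchmark indices gained.</description>
    <link>https://example.com/a</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 +0530</pubDate>
  </item>
</channel></rss>"""

def parse_feed(xml_text: str, category: str) -> list[dict]:
    """Extract one dict per <item>, shaped like the library's output schema."""
    rows = []
    for item in ET.fromstring(xml_text).iter("item"):
        rows.append({
            "source": "economic_times",
            "category": category,
            "title": item.findtext("title", ""),
            "summary": item.findtext("description", ""),
            "link": item.findtext("link", ""),
            "published": item.findtext("pubDate", ""),
        })
    return rows

print(parse_feed(SAMPLE_FEED, "markets")[0]["title"])  # Sensex ends higher
```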


Output schema

Each row of the returned pandas.DataFrame has:

Column           Description
source           Always "economic_times"
category         One of the categories above
title            Article headline
summary          RSS summary / description
link             Canonical article URL
published        Publish timestamp from the feed
fetched_at       UTC ISO timestamp when the row was collected
sentiment_score  Integer = positive_word_count − negative_word_count
sentiment_label  "Bullish", "Bearish", or "Neutral"

Example:

source,category,title,summary,link,published,fetched_at,sentiment_score,sentiment_label
economic_times,stocks,Tata Motors shares rally...,...,link,...,...,2,Bullish
economic_times,economy,Rupee falls against dollar...,...,link,...,...,-1,Bearish

Public API

Symbol                              What it does
fetch_all_rss_news() -> DataFrame   Fetch all configured RSS feeds into a DataFrame.
sentiment_score(text: str) -> int   Lexicon-based score: positive − negative word counts.
sentiment_label(score: int) -> str  Map a score to "Bullish" / "Bearish" / "Neutral".
save_dataframe(df, folder, name)    Save a DataFrame to a timestamped CSV; returns the path.
__version__                         Library version string.
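The scoring scheme is simple enough to sketch directly. The word lists below are illustrative only, not the library's actual lexicon:

```python
# Illustrative lexicons; the real sentiment.py ships its own word lists.
POSITIVE = {"rally", "gains", "surge", "jumps", "record", "beats"}
NEGATIVE = {"falls", "drops", "slump", "losses", "weak", "misses"}

def sentiment_score(text: str) -> int:
    """positive_word_count - negative_word_count over whitespace tokens."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_label(score: int) -> str:
    """Map a score to Bullish / Bearish / Neutral."""
    if score > 0:
        return "Bullish"
    if score < 0:
        return "Bearish"
    return "Neutral"

print(sentiment_score("Tata Motors shares rally on strong gains"))  # 2
print(sentiment_label(-1))  # Bearish
```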

Lower-level helpers (use only if you really need raw HTML):

Symbol                                         Module
can_fetch(url, user_agent="*") -> bool         et_scraper.robots_checker
fetch_public_page(url) -> BeautifulSoup|None   et_scraper.html_collector
extract_headlines(soup) -> list[str]           et_scraper.parser
extract_article_text(soup) -> str              et_scraper.parser
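If you do work with raw HTML, headline extraction boils down to collecting text from heading tags. A stdlib-only sketch (the real parser uses BeautifulSoup, and the choice of h2 here is an assumption):

```python
from html.parser import HTMLParser

class HeadlineExtractor(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headlines: list[str] = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.headlines.append(data.strip())

extractor = HeadlineExtractor()
extractor.feed("<h1>ET</h1><h2>Rupee falls</h2><h2>IPO opens today</h2>")
print(extractor.headlines)  # ['Rupee falls', 'IPO opens today']
```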

Use in a swing trading pipeline

Economic Times News (this library)
        ↓
Headline Sentiment (this library)
        ↓
Stock Symbol Mapping (your code)
        ↓
Technical Indicators (your code)
        ↓
Final Swing Score (your code)
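The "Stock Symbol Mapping (your code)" step can be as simple as keyword matching against headlines. A toy sketch (the company-to-ticker map is invented and would need to cover your own universe):

```python
# Hypothetical company-name -> NSE ticker map; extend for your watchlist.
SYMBOL_MAP = {
    "tata motors": "TATAMOTORS",
    "infosys": "INFY",
    "reliance": "RELIANCE",
}

def map_symbols(headline: str) -> list[str]:
    """Return every ticker whose company name appears in the headline."""
    text = headline.lower()
    return [ticker for name, ticker in SYMBOL_MAP.items() if name in text]

print(map_symbols("Tata Motors shares rally after strong results"))  # ['TATAMOTORS']
```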

Project layout

et_scraper_safe/
├── pyproject.toml          # PyPI packaging metadata
├── LICENSE                 # MIT
├── README.md
├── requirements.txt        # For running from source
├── main.py                 # Convenience runner (same as the CLI)
├── et_scraper/
│   ├── __init__.py
│   ├── cli.py              # Entry point for `et-scraper-safe` console command
│   ├── config.py           # RSS feed URLs, headers, timeouts
│   ├── robots_checker.py   # robots.txt enforcement
│   ├── rss_collector.py    # RSS → DataFrame
│   ├── html_collector.py   # Polite, robots-aware HTML fetcher
│   ├── parser.py           # Headline / article-text extraction
│   ├── sentiment.py        # Lexicon-based sentiment
│   └── storage.py          # Timestamped CSV writer
└── data/
    ├── raw/                # Raw scraped CSVs
    └── clean/              # Cleaned + scored CSVs

Development

Run from source:

git clone <your-fork>
cd et_scraper_safe
pip install -r requirements.txt
python main.py

Build + publish a new version (maintainers only):

# 1. Bump version in pyproject.toml and et_scraper/__init__.py
# 2. Build and upload
rm -rf dist build *.egg-info
python -m build
TWINE_USERNAME=__token__ TWINE_PASSWORD="$PYPI_API_TOKEN" python -m twine upload dist/*

Disclaimer

This library only collects data that the Economic Times publishes openly via RSS or pages allowed by their robots.txt. It is intended for personal research and educational use. You are responsible for complying with the Economic Times' Terms of Service and any applicable laws when using this library or the data it collects.


License

MIT — see LICENSE.
