news-watch: Indonesia's top news websites scraper

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research.

⚠️ Ethical Considerations & Disclaimer ⚠️

Purpose: For educational and research purposes only. Not designed for commercial use that could be detrimental to news source providers.

User Responsibility: Users must comply with each website's Terms of Service and robots.txt. Aggressive scraping may lead to IP blocking. Scrape responsibly and respect server limitations.
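
One way to act on the robots.txt part of this responsibility is to check a URL against a site's rules before requesting it. The sketch below uses Python's standard urllib.robotparser module. The rules text, the user agent name, and the example.com URLs are made up for illustration; they are not taken from any supported source, and this is not how news-watch itself handles robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would load the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/search?q=ihsg",
        "https://example.com/news/today",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "research-bot", url))
```

Note that the standard-library parser matches rules by path prefix, so wildcard patterns in a real robots.txt may need a more capable parser.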

Installation

Using pip (standard)

pip install news-watch
playwright install chromium

Development setup: see https://okky.dev/news-watch/getting-started/

Performance Notes

⚠️ Works best locally. Cloud environments (Google Colab, servers) may experience degraded performance or blocking due to anti-bot measures.

Some scrapers may work on a local machine but fail on remote servers, Linux CI, or GitHub Actions. This usually happens because of anti-bot protection, rate limits, geolocation differences, JavaScript rendering differences, or sudden source-side changes.

Usage

To run the scraper from the command line:

newswatch --method <search|latest> -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v

Command-Line Arguments

Argument                Description
--method                Retrieval method: search (default) or latest
-k, --keywords          Comma-separated keywords to scrape (required for search, optional for latest)
-sd, --start_date       Start date in YYYY-MM-DD format (required for search, ignored in latest)
-s, --scrapers          Scrapers to use: specific names (e.g., "kompas,viva"), "auto" (default, platform-appropriate), or "all" (force all, may fail)
-of, --output_format    Output format: csv, xlsx, or json (default: csv)
-o, --output_path       Custom output file path (optional)
-v, --verbose           Show detailed logging output (default: silent)
--list_scrapers         List all supported scrapers and exit
--health-report         Run source health probes and print status table. JSON/CSV via --output_path
--limit                 Maximum number of articles to collect in latest mode
--max-pages             Maximum pages to fetch per scraper in latest mode
--scraper-timeout       Per-scraper timeout in seconds
--progress              Print per-scraper progress lines
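
When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only flags from the table above; build_newswatch_command is a hypothetical helper written for this example and is not part of news-watch. It only builds the command and does not run it.

```python
import shlex


def build_newswatch_command(method="search", keywords=None, start_date=None,
                            scrapers=None, output_format=None, verbose=False):
    """Assemble an argument list for the newswatch CLI (hypothetical helper)."""
    if method not in ("search", "latest"):
        raise ValueError("method must be 'search' or 'latest'")
    # The table marks keywords and start date as required for search.
    if method == "search" and not (keywords and start_date):
        raise ValueError("search requires keywords and a start date")

    cmd = ["newswatch", "--method", method]
    if keywords:
        cmd += ["--keywords", ",".join(keywords)]
    if start_date and method == "search":  # start date is ignored in latest mode
        cmd += ["--start_date", start_date]
    if scrapers:
        cmd += ["--scrapers", ",".join(scrapers)]
    if output_format:
        cmd += ["--output_format", output_format]
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    cmd = build_newswatch_command(
        keywords=["ihsg", "bank"],
        start_date="2025-01-01",
        scrapers=["tempo"],
        output_format="xlsx",
        verbose=True,
    )
    # Print the shell-quoted command; subprocess.run(cmd, check=True) would execute it.
    print(shlex.join(cmd))
```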

Examples

# Basic usage
newswatch --keywords ihsg --start_date 2025-01-01

# Latest monitoring mode
newswatch --method latest --scrapers "antaranews,kompas,viva"

# Multiple keywords with specific scraper
newswatch -k "ihsg,bank" -s "tempo" --output_format xlsx -v

# List available scrapers
newswatch --list_scrapers

Python API Usage

import newswatch as nw

# Basic scraping - returns list of article dictionaries
articles = nw.scrape("ekonomi,politik", "2025-01-01")
print(f"Found {len(articles)} articles")

# Get results as pandas DataFrame for analysis
df = nw.scrape_to_dataframe("teknologi,startup", "2025-01-01")
print(df['source'].value_counts())

# Latest monitoring
latest = nw.latest_to_dataframe(scrapers="antaranews,kompas,viva")
print(latest[["source", "title"]].head())

# Save directly to file
nw.scrape_to_file(
    keywords="bank,ihsg", 
    start_date="2025-01-01",
    output_path="financial_news.xlsx"
)

# Quick recent news
recent_news = nw.quick_scrape("politik", days_back=3)

# Get available news sources
sources = nw.list_scrapers()
print("Available sources:", sources)

See the comprehensive guide for detailed usage examples and advanced patterns. For interactive examples, see the API reference notebook.

Run on Google Colab

You can run news-watch on Google Colab.

Output

Scraped articles are saved as a CSV, XLSX, or JSON file in the current working directory, with a file name in the format news-watch-{keywords}-YYYYMMDD_HH.

The output file contains the following columns:

  • title
  • publish_date
  • author
  • content
  • keyword
  • category
  • source
  • link
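
The column list above can double as a quick sanity check on results. The sketch below assumes that each article is a dictionary whose keys match these column names, which this README does not state explicitly. The sample records are invented for illustration, and missing_columns and count_by_source are hypothetical helpers, not part of news-watch.

```python
from collections import Counter

# Column names as listed in the Output section.
EXPECTED_COLUMNS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def missing_columns(record: dict) -> list:
    """Return the expected columns that are absent from one article record."""
    return [column for column in EXPECTED_COLUMNS if column not in record]


def count_by_source(records: list) -> Counter:
    """Count article records per source."""
    return Counter(record["source"] for record in records)


if __name__ == "__main__":
    # Invented sample records; real values come from the scrapers.
    base = dict.fromkeys(EXPECTED_COLUMNS, "")
    sample = [
        {**base, "title": "Sample headline A", "source": "kompas"},
        {**base, "title": "Sample headline B", "source": "tempo"},
        {**base, "title": "Sample headline C", "source": "kompas"},
    ]
    for record in sample:
        assert not missing_columns(record)
    print(count_by_source(sample).most_common())
```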

Retrieval Methods

  • search is the default and runs the keyword- and date-based research workflow.
  • latest is intended for latest-news monitoring and does not require keywords.
  • Latest mode currently starts with a smaller subset of sources than the full search catalog.

Supported Websites (63)

Antara News, AP News, Al Jazeera, Bali Post, BBC News, Berita Jatim, BeritaSatu, Bisnis.com, Bloomberg Technoz, CNA Indonesia, CNBC Indonesia, CNN Indonesia, DailySocial, Detik, Fajar, Galamedia, Gatra, Grid, Harian Jogja, Hipwee, IDN Times, iNews, Investor Daily, Jakarta Globe, Jakarta Post, Jakarta Selaras, Jawapos, JPNN, Kaltim Post, Katadata, KBR, Kompas, Kontan, Kumparan, Liputan6, Media Indonesia, Merdeka, Metro TV News, Niaga.Asia, Mojok, Mongabay Indonesia, Okezone, Pantau.com, Pikiran Rakyat, Poskota, Project Multatuli, Republika, RM.ID, RRI, RMOL, SINDOnews, Suara, Suara Merdeka, Surabaya Pagi, SWA, Tempo, Tirto, Tribunnews, TVOne, TVRI News, VOA Indonesia, VOI.id, Viva

Notes:

  • 63 total sources: 60 with keyword search, all 63 with latest mode.
  • AP News uses topic hub pages with keyword-in-title filtering (its robots.txt disallows /search?q=*).
  • Al Jazeera is latest-only via RSS feed (its search page is JS-rendered).
  • Reuters is skipped (blocked by a WAF).
  • Use -s all to force-run all scrapers (may cause errors/timeouts).
  • Some sources are environment-sensitive and may fail on remote servers even if they work locally.
  • Limitation: the Kontan scraper fetches a maximum of 50 pages.
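
The keyword-in-title filtering mentioned for AP News can be pictured as a case-insensitive substring match on article titles. This is a minimal sketch of that idea, not the package's actual implementation; title_matches and filter_by_title are hypothetical names, and the sample titles are invented.

```python
def title_matches(title: str, keywords: list) -> bool:
    """Return True if any non-empty keyword occurs in the title, ignoring case."""
    lowered = title.casefold()
    return any(
        keyword.strip().casefold() in lowered
        for keyword in keywords
        if keyword.strip()
    )


def filter_by_title(articles: list, keywords: list) -> list:
    """Keep only the articles whose title contains at least one keyword."""
    return [
        article for article in articles
        if title_matches(article.get("title", ""), keywords)
    ]


if __name__ == "__main__":
    # Invented sample titles for illustration only.
    articles = [
        {"title": "IHSG closes higher after bank rally"},
        {"title": "Weather update for the weekend"},
    ]
    keywords = "ihsg,bank".split(",")  # same comma-separated form the CLI accepts
    for article in filter_by_title(articles, keywords):
        print(article["title"])
```

A plain substring match also hits partial words (for example, "bank" inside "banking"), which may or may not be wanted for a given research question.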

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.

Citation

@software{mabruri_newswatch,
  author = {Okky Mabruri},
  title = {news-watch},
  year = {2025},
  doi = {10.5281/zenodo.14908389}
}
