news-watch: Indonesia's top news websites scraper

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, with keyword and date filters for targeted research.

⚠️ Ethical Considerations & Disclaimer ⚠️

Purpose: For educational and research purposes only. Not designed for commercial use that could be detrimental to news source providers.

User Responsibility: Users must comply with each website's Terms of Service and robots.txt. Aggressive scraping may lead to IP blocking. Scrape responsibly and respect server limitations.
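
As one way to act on the robots.txt guidance above, the sketch below uses Python's standard urllib.robotparser to check whether a URL may be fetched. It is not part of news-watch; the sample rules, the "research-bot" user agent, and the example.com URLs are hypothetical and only keep the example runnable offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "research-bot", "https://example.com/news/article-1"))
print(is_allowed(parser, "research-bot", "https://example.com/search?q=ihsg"))
```

For a live site, call parser.set_url() with the site's robots.txt address followed by parser.read() instead of parsing a string.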

Installation

Using pip (standard)

pip install news-watch
playwright install chromium

# Development version
pip install git+https://github.com/okkymabruri/news-watch.git@dev

Using UV (recommended for development)

# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/okkymabruri/news-watch.git
cd news-watch
uv sync --all-extras
uv run playwright install chromium

Performance Notes

⚠️ Works best locally. Cloud environments (Google Colab, servers) may experience degraded performance or blocking due to anti-bot measures.

Usage

To run the scraper from the command line:

newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v

Command-Line Arguments

| Argument | Description |
| --- | --- |
| -k, --keywords | Required. Comma-separated keywords to scrape (e.g., "ojk,bank,npl") |
| -sd, --start_date | Required. Start date in YYYY-MM-DD format (e.g., 2025-01-01) |
| -s, --scrapers | Scrapers to use: specific names (e.g., "kompas,viva"), "auto" (default, platform-appropriate), or "all" (force all, may fail) |
| -of, --output_format | Output format: csv, xlsx, or json (default: csv) |
| -v, --verbose | Show detailed logging output (default: silent) |
| --list_scrapers | List all supported scrapers and exit |
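
To catch malformed input before starting a long scrape, the two required values can be checked up front. The helpers below are a minimal sketch and are not part of news-watch: parse_start_date and split_keywords are hypothetical names, and the package may validate its arguments differently.

```python
from datetime import date, datetime


def parse_start_date(text: str) -> date:
    """Parse a year-month-day string such as 2025-01-01.

    Raises ValueError if the text does not parse as a date in that layout.
    """
    return datetime.strptime(text, "%Y-%m-%d").date()


def split_keywords(text: str) -> list[str]:
    """Split a comma-separated keyword string, dropping blanks and extra spaces."""
    return [part.strip() for part in text.split(",") if part.strip()]


print(parse_start_date("2025-01-01"))
print(split_keywords("ojk, bank,npl"))
```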

Examples

# Basic usage
newswatch --keywords ihsg --start_date 2025-01-01

# Multiple keywords with specific scraper
newswatch -k "ihsg,bank" -sd 2025-01-01 -s "detik" --output_format xlsx -v

# List available scrapers
newswatch --list_scrapers
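
When the CLI has to be driven from a script, the command can be assembled as an argument list and passed to subprocess. This is a minimal sketch that uses only the flags documented above; build_newswatch_command is a hypothetical helper, not something the package provides.

```python
import shlex
import subprocess


def build_newswatch_command(keywords, start_date, scrapers=None,
                            output_format=None, verbose=False):
    """Build the argument list for one newswatch CLI invocation."""
    cmd = [
        "newswatch",
        "--keywords", ",".join(keywords),
        "--start_date", start_date,
    ]
    if scrapers:
        cmd += ["--scrapers", ",".join(scrapers)]
    if output_format:
        cmd += ["--output_format", output_format]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_newswatch_command(
    ["ihsg", "bank"],
    "2025-01-01",
    scrapers=["detik"],
    output_format="xlsx",
    verbose=True,
)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it (requires news-watch to be installed):
# subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when keywords contain spaces.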

Python API Usage

import newswatch as nw

# Basic scraping - returns list of article dictionaries
articles = nw.scrape("ekonomi,politik", "2025-01-01")
print(f"Found {len(articles)} articles")

# Get results as pandas DataFrame for analysis
df = nw.scrape_to_dataframe("teknologi,startup", "2025-01-01")
print(df['source'].value_counts())

# Save directly to file
nw.scrape_to_file(
    keywords="bank,ihsg", 
    start_date="2025-01-01",
    output_path="financial_news.xlsx"
)

# Quick recent news
recent_news = nw.quick_scrape("politik", days_back=3)

# Get available news sources
sources = nw.list_scrapers()
print("Available sources:", sources)
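
Because several keywords can match the same article, it may help to deduplicate results before analysis. The sketch below assumes each article dictionary carries a "link" key, mirroring the output columns; that key name and the sample records are assumptions, and deduplicate_by_link is a hypothetical helper rather than part of the package.

```python
def deduplicate_by_link(articles):
    """Keep the first article seen for each link; keep articles without a link."""
    seen = set()
    unique = []
    for article in articles:
        link = article.get("link")
        if link is None:
            # Nothing to compare on, so keep the record as-is.
            unique.append(article)
            continue
        if link in seen:
            continue
        seen.add(link)
        unique.append(article)
    return unique


# Hypothetical records standing in for the result of a scrape.
sample_articles = [
    {"title": "Sample A", "keyword": "bank", "link": "https://example.com/a"},
    {"title": "Sample A", "keyword": "ihsg", "link": "https://example.com/a"},
    {"title": "Sample B", "keyword": "ihsg", "link": "https://example.com/b"},
]

unique_articles = deduplicate_by_link(sample_articles)
print(len(unique_articles))
```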

See the comprehensive guide for detailed usage examples and advanced patterns. For interactive examples, see the API reference notebook.

Run on Google Colab

You can run news-watch on Google Colab.

Output

The scraped articles are saved as a CSV, XLSX, or JSON file in the current working directory. The file name follows the format news-watch-{keywords}-YYYYMMDD_HH.

The output file contains the following columns:

  • title
  • publish_date
  • author
  • content
  • keyword
  • category
  • source
  • link
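
As a quick sanity check on a CSV output file, the sketch below counts articles per source using only the standard library. It assumes the file has a header row with the column names listed above; count_by_source is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def count_by_source(csv_file):
    """Count rows per source in an open CSV file with the expected columns."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["source"] for row in reader)


# Hypothetical rows standing in for a real output file.
sample = io.StringIO(
    "title,publish_date,author,content,keyword,category,source,link\n"
    "Sample A,2025-01-02,Author A,Body A,ihsg,market,detik,https://example.com/a\n"
    "Sample B,2025-01-03,Author B,Body B,bank,finance,kompas,https://example.com/b\n"
    "Sample C,2025-01-04,Author C,Body C,bank,finance,detik,https://example.com/c\n"
)

counts = count_by_source(sample)
print(counts.most_common())
```

For a real file, open it with open(path, newline="", encoding="utf-8") and pass the file object to count_by_source.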

Supported Websites

Note:

  • On Linux, the Kontan, Jawapos, and Katadata scrapers are automatically excluded due to compatibility issues. Use -s all to force them (may cause errors).
  • Limitation: the Kontan scraper covers a maximum of 50 pages.

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.

Citation

@software{mabruri_newswatch,
  author = {Okky Mabruri},
  title = {news-watch},
  year = {2025},
  doi = {10.5281/zenodo.14908389}
}
