news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research.
news-watch: Indonesia's top news websites scraper
⚠️ Ethical Considerations & Disclaimer ⚠️
Purpose: For educational and research purposes only. Not designed for commercial use that could be detrimental to news source providers.
User Responsibility: Users must comply with each website's Terms of Service and robots.txt. Aggressive scraping may lead to IP blocking. Scrape responsibly and respect server limitations.
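One way to act on the robots.txt requirement is to check a URL against a site's rules before fetching it. The sketch below uses Python's standard `urllib.robotparser`. The helper name `is_allowed` and the sample rules are hypothetical and are not part of news-watch, and real sites publish their own rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/news/article-1"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

In practice the rules would be downloaded from the target site's `/robots.txt` rather than hard-coded. The site's Terms of Service still need to be read separately.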
Installation
```bash
pip install news-watch
playwright install chromium

# Development version
pip install git+https://github.com/okkymabruri/news-watch.git@dev
```
Performance Notes
⚠️ Works best locally. Cloud environments (Google Colab, servers) may experience degraded performance or blocking due to anti-bot measures.
Usage
To run the scraper from the command line:
```bash
newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v
```
Command-Line Arguments
| Argument | Description |
|---|---|
| `-k, --keywords` | Required. Comma-separated keywords to scrape (e.g., "ojk,bank,npl") |
| `-sd, --start_date` | Required. Start date in YYYY-MM-DD format (e.g., 2025-01-01) |
| `-s, --scrapers` | Scrapers to use: specific names (e.g., "kompas,viva"), "auto" (default, platform-appropriate), or "all" (force all, may fail) |
| `-of, --output_format` | Output format: csv or xlsx (default: xlsx) |
| `-v, --verbose` | Show detailed logging output (default: silent) |
| `--list_scrapers` | List all supported scrapers and exit |
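When the CLI is driven from a script, it can help to validate the inputs and assemble the argument list before launching anything. The following is a minimal sketch that uses only the flags documented in the table above. The helper `build_newswatch_command` is hypothetical and is not provided by news-watch.

```python
from datetime import datetime


def build_newswatch_command(keywords, start_date, scrapers="auto",
                            output_format="xlsx", verbose=False):
    """Assemble a newswatch argument list from validated inputs."""
    # Raises ValueError if start_date is not a valid YYYY-MM-DD date;
    # the parsed date is re-rendered zero-padded.
    start = datetime.strptime(start_date, "%Y-%m-%d").strftime("%Y-%m-%d")

    if output_format not in ("csv", "xlsx"):
        raise ValueError("output_format must be 'csv' or 'xlsx'")

    # Drop surrounding whitespace and empty entries before joining
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("at least one keyword is required")

    command = [
        "newswatch",
        "-k", ",".join(cleaned),
        "-sd", start,
        "-s", scrapers,
        "-of", output_format,
    ]
    if verbose:
        command.append("-v")
    return command


print(build_newswatch_command(["ihsg", " bank "], "2025-01-01", scrapers="detik"))
```

With news-watch installed, the resulting list can be passed to `subprocess.run(command, check=True)`.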
Examples
```bash
# Basic usage
newswatch --keywords ihsg --start_date 2025-01-01

# Multiple keywords with specific scraper
newswatch -k "ihsg,bank" -s "detik" --output_format xlsx -v

# List available scrapers
newswatch --list_scrapers
```
Python API Usage
```python
import newswatch as nw

# Basic scraping - returns list of article dictionaries
articles = nw.scrape("ekonomi,politik", "2025-01-01")
print(f"Found {len(articles)} articles")

# Get results as pandas DataFrame for analysis
df = nw.scrape_to_dataframe("teknologi,startup", "2025-01-01")
print(df['source'].value_counts())

# Save directly to file
nw.scrape_to_file(
    keywords="bank,ihsg",
    start_date="2025-01-01",
    output_path="financial_news.xlsx"
)

# Quick recent news
recent_news = nw.quick_scrape("politik", days_back=3)

# Get available news sources
sources = nw.list_scrapers()
print("Available sources:", sources)
```
See the comprehensive guide for detailed usage examples and advanced patterns. For interactive examples, see the API reference notebook.
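Because `nw.scrape` returns a list of article dictionaries, the results can be post-processed with plain Python. The sketch below assumes each dictionary has `link` and `source` keys matching the documented output columns, and that the same article may appear more than once (for example, when it matches several keywords). Neither assumption is confirmed here. The sample data and the helper `dedupe_by_link` are hypothetical.

```python
from collections import Counter

# Hypothetical sample shaped like the documented output columns
articles = [
    {"title": "A", "source": "kompas", "keyword": "ekonomi", "link": "https://example.com/a"},
    {"title": "A", "source": "kompas", "keyword": "politik", "link": "https://example.com/a"},
    {"title": "B", "source": "detik", "keyword": "ekonomi", "link": "https://example.com/b"},
]


def dedupe_by_link(items):
    """Keep the first article seen for each link, preserving order."""
    seen = set()
    unique = []
    for item in items:
        link = item.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(item)
    return unique


unique_articles = dedupe_by_link(articles)

# Count the remaining articles per news source
per_source = Counter(article["source"] for article in unique_articles)
print(per_source)
```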
Run on Google Colab
You can run news-watch on Google Colab
Output
The scraped articles are saved as a CSV or XLSX file in the current working directory with the format news-watch-{keywords}-YYYYMMDD_HH.
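To locate or sort output files programmatically, the naming pattern above can be parsed back into its parts. This sketch assumes the file carries a `.csv` or `.xlsx` extension and that the timestamp is the final `YYYYMMDD_HH` segment. How multiple keywords are rendered inside the name is not specified, so the keywords portion is returned as-is. The helper `parse_output_name` and the sample filename are hypothetical.

```python
import re
from datetime import datetime

# news-watch-{keywords}-YYYYMMDD_HH followed by an assumed file extension
OUTPUT_NAME = re.compile(
    r"^news-watch-(?P<keywords>.+)-(?P<stamp>\d{8}_\d{2})\.(?P<ext>csv|xlsx)$"
)


def parse_output_name(filename):
    """Return (keywords, timestamp, extension), or None if the name does not match."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        return None
    created = datetime.strptime(match["stamp"], "%Y%m%d_%H")
    return match["keywords"], created, match["ext"]


print(parse_output_name("news-watch-ihsg-20250101_09.xlsx"))
```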
The output file contains the following columns:
- title
- publish_date
- author
- content
- keyword
- category
- source
- link
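A quick sanity check on a CSV export is to confirm that its header contains the expected columns before further analysis. The sketch below uses the standard `csv` module on an in-memory sample. The helper `missing_columns` and the sample row are hypothetical, and the column names are those read from the list above.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def missing_columns(csv_text):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [column for column in EXPECTED_COLUMNS if column not in header]


# Hypothetical one-row export, for illustration only
sample = (
    "title,publish_date,author,content,keyword,category,source,link\n"
    "Example title,2025-01-01,Example author,Example body,ihsg,market,detik,https://example.com/a\n"
)

print(missing_columns(sample))
```

For a file on disk, the text can be read with `open(path, newline="", encoding="utf-8")` and passed to the same check.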
Supported Websites
- Antaranews.com
- Bisnis.com
- Bloomberg Technoz
- CNBC Indonesia
- Detik.com
- Jawapos.com
- Katadata.co.id
- Kompas.com
- Kontan.co.id
- Media Indonesia
- Metrotvnews.com
- Okezone.com
- Tempo.co
- Viva.co.id
Contributing
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
License
This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.
Citation
```bibtex
@software{mabruri_newswatch,
  author = {Okky Mabruri},
  title = {news-watch},
  year = {2025},
  doi = {10.5281/zenodo.14908389}
}
```