A scraper for Indonesian news websites.
news-watch
news-watch is a Python package that allows you to scrape news articles from various Indonesian news websites based on specific keywords and date ranges.
Installation
You can install news-watch via pip:
pip install news-watch
Usage
To run the scraper from the command line:
newswatch -k <keywords> -sd <start_date> [-s <scrapers>] [-v]
Command-Line Arguments
--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").
--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2023-01-01).
--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers are used.
--verbose, -v: Optional. Increase the verbosity level (e.g., -v, -vv, -vvv).
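If you drive the scraper from a Python script, the arguments above can be assembled programmatically. The following is a minimal sketch that uses only the flags documented here; `build_newswatch_command` is a hypothetical helper for illustration, not part of the news-watch package.

```python
import shlex


def build_newswatch_command(keywords, start_date, scrapers=None, verbosity=0):
    """Return an argv list for the newswatch CLI (hypothetical helper).

    Only the flags documented above are used: -k, -sd, -s and -v.
    """
    # Keywords and scrapers are passed as comma-separated lists.
    cmd = ["newswatch", "-k", ",".join(keywords), "-sd", start_date]
    if scrapers:
        cmd += ["-s", ",".join(scrapers)]
    if verbosity > 0:
        # Verbosity is expressed by repeating "v": -v, -vv, -vvv.
        cmd.append("-" + "v" * verbosity)
    return cmd


cmd = build_newswatch_command(
    ["ihsg", "bank"], "2024-10-28", scrapers=["kompas", "viva"], verbosity=2
)
# Print the command as a shell-safe string for inspection.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once news-watch is installed.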
Examples
Scrape articles related to "ihsg" from October 28, 2024:
newswatch -k ihsg -sd 2024-10-28
Scrape articles for multiple keywords and increase verbosity:
newswatch -k "ihsg,bank,keuangan" -sd 2024-10-28 -vv
Output
The scraped articles are saved as a CSV file in the current working directory, named in the format news-watch-YYYYMMDD_HH.csv.
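To pick out result files in a directory, the file name pattern can be matched with a regular expression. This sketch assumes that HH in the pattern is the two-digit hour on a 24-hour clock, which the text above does not state explicitly; `parse_output_timestamp` is a hypothetical helper, not part of the package.

```python
import re
from datetime import datetime

# Matches names such as news-watch-20241028_09.csv.
# Assumption: HH is the two-digit hour (00-23).
OUTPUT_NAME = re.compile(r"^news-watch-(\d{8}_\d{2})\.csv$")


def parse_output_timestamp(filename):
    """Return the datetime encoded in an output file name, or None."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d_%H")


print(parse_output_timestamp("news-watch-20241028_09.csv"))
```

Names that do not follow the pattern yield `None`, so the helper can be used to filter a directory listing.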
The CSV file contains the following fields:
title
publish_date
author
content
keyword
category
source
link
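The output can be read with Python's standard csv module. The sketch below assumes the file has a header row containing the field names listed above; the sample rows are made-up placeholders, not real scraped data, and `count_by_source` is a hypothetical helper.

```python
import csv
import io

FIELDS = ["title", "publish_date", "author", "content",
          "keyword", "category", "source", "link"]


def count_by_source(csv_file):
    """Count articles per source in an open CSV file object."""
    counts = {}
    for row in csv.DictReader(csv_file):
        counts[row["source"]] = counts.get(row["source"], 0) + 1
    return counts


# Build a small in-memory CSV with placeholder rows for demonstration.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
for title, source in [("Example A", "kompas"), ("Example B", "viva"),
                      ("Example C", "kompas")]:
    writer.writerow({
        "title": title, "publish_date": "2024-10-28", "author": "n/a",
        "content": "placeholder text", "keyword": "ihsg",
        "category": "n/a", "source": source,
        "link": "https://example.com/" + title[-1].lower(),
    })
buffer.seek(0)

counts = count_by_source(buffer)
print(counts)
```

For a real run, replace the in-memory buffer with `open(path, newline="", encoding="utf-8")`, where `path` points to the generated CSV file.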
Supported Websites
- Bisnis Indonesia
- CNBC Indonesia
- Detik
- Kompas
- Kontan
  Note: Running this on the cloud currently leads to errors due to Cloudflare restrictions.
  Limitation: The scraper can process a maximum of 50 pages.
- Viva
Contributing
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
Running Tests
To run the test suite:
pytest tests/
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.