A scraper for Indonesian news websites.

Project description

news-watch

news-watch is a Python package that allows you to scrape news articles from various Indonesian news websites based on specific keywords and date ranges.

Installation

You can install news-watch via pip:

pip install news-watch

Usage

To run the scraper from the command line:

newswatch -k <keywords> -sd <start_date> [-s <scrapers>] [-v]

Command-Line Arguments

--keywords, -k: Required. A comma-separated list of keywords to search for (e.g., -k "ojk,bank,npl").

--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2023-01-01).

--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers will be used by default.

--verbose, -v: Optional. Increase verbosity level (e.g., -v, -vv, -vvv).
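The flags above can be mirrored in a small argparse sketch, which may help when validating arguments before calling the tool from a script. This is an illustration of the documented interface only, not the package's actual parser; the helper names parse_date, split_csv, and build_parser are hypothetical.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> str:
    """Reject anything that is not a valid YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")
    return value


def split_csv(value: str) -> list[str]:
    """Split a comma-separated string and drop empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Illustrative only: mirrors the documented flags, not the package source.
    parser = argparse.ArgumentParser(prog="newswatch")
    parser.add_argument("-k", "--keywords", required=True, type=split_csv)
    parser.add_argument("-sd", "--start_date", required=True, type=parse_date)
    # None stands for "use all scrapers", matching the documented default.
    parser.add_argument("-s", "--scrapers", type=split_csv, default=None)
    # action="count" is what lets -v, -vv, and -vvv map to 1, 2, and 3.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


args = build_parser().parse_args(["-k", "ojk,bank,npl", "-sd", "2023-01-01", "-vv"])
print(args.keywords, args.start_date, args.scrapers, args.verbose)
```

Converting the comma-separated values at parse time keeps the rest of a script working with plain lists.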

Examples

Scrape articles related to "ihsg", starting from October 28, 2024:

newswatch -k ihsg -sd 2024-10-28

Scrape articles for multiple keywords and increase verbosity:

newswatch -k "ihsg,bank,keuangan" -sd 2024-10-28 -vv
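If the scraper needs to be triggered from another Python program, one option is to assemble the same command line and hand it to subprocess. The sketch below assumes the newswatch executable is on PATH after installation; build_command and run_newswatch are hypothetical helper names, not part of the package.

```python
import shutil
import subprocess


def build_command(keywords, start_date, scrapers=None, verbosity=0):
    """Assemble the argument list for the newswatch CLI."""
    cmd = ["newswatch", "-k", ",".join(keywords), "-sd", start_date]
    if scrapers:
        # Omitting -s leaves the default in place (all scrapers).
        cmd += ["-s", ",".join(scrapers)]
    if verbosity > 0:
        # verbosity=2 becomes "-vv", as in the example above.
        cmd.append("-" + "v" * verbosity)
    return cmd


def run_newswatch(keywords, start_date, scrapers=None, verbosity=0):
    """Run the scraper; raises CalledProcessError on a non-zero exit code."""
    if shutil.which("newswatch") is None:
        raise RuntimeError("newswatch is not on PATH; install it with pip first")
    cmd = build_command(keywords, start_date, scrapers, verbosity)
    return subprocess.run(cmd, check=True)


print(build_command(["ihsg", "bank", "keuangan"], "2024-10-28", verbosity=2))
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated values.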

Output

The scraped articles are saved to a CSV file in the current working directory, named using the pattern news-watch-YYYYMMDD_HH.csv.

The CSV file contains the following fields:

  • title
  • publish_date
  • author
  • content
  • keyword
  • category
  • source
  • link
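As a rough sketch of working with the output, the snippet below picks the most recent output file and counts articles per source using only the standard library. It assumes the CSV has a header row with exactly the field names listed above, is UTF-8 encoded, and that the zero-padded timestamp in the file name sorts chronologically; latest_output and count_by_source are hypothetical helper names.

```python
import csv
from collections import Counter
from pathlib import Path

# Field names as documented; a header row with these names is an assumption.
FIELDS = ["title", "publish_date", "author", "content",
          "keyword", "category", "source", "link"]


def latest_output(directory="."):
    """Return the news-watch-*.csv file that sorts last by name, or None."""
    files = sorted(Path(directory).glob("news-watch-*.csv"))
    return files[-1] if files else None


def count_by_source(csv_path):
    """Count scraped articles per source website."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = set(FIELDS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return Counter(row["source"] for row in reader)


if __name__ == "__main__":
    path = latest_output()
    if path is not None:
        for source, total in count_by_source(path).most_common():
            print(f"{source}: {total}")
```

The same pattern works for the keyword or category columns by changing the key passed to Counter.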

Supported Websites

  • Bisnis Indonesia

  • CNBC Indonesia

  • Detik

  • Kompas

  • Kontan

    Note: Running the Kontan scraper on cloud infrastructure currently fails because of Cloudflare restrictions.

    Limitation: The Kontan scraper can process a maximum of 50 pages.

  • Viva

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

Running Tests

To run the test suite:

pytest tests/

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Download files

Source Distribution

news_watch-0.1.5.tar.gz (23.0 kB)

Uploaded Source

Built Distribution

news_watch-0.1.5-py3-none-any.whl (27.3 kB)

Uploaded Python 3

File details

Details for the file news_watch-0.1.5.tar.gz.

File metadata

  • Download URL: news_watch-0.1.5.tar.gz
  • Upload date:
  • Size: 23.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for news_watch-0.1.5.tar.gz

  • SHA256: fce8456f4bc50fc6ca38adc89cfe19f2a80451a234c6b24b6a9daaca5609997a
  • MD5: 2b9f3c514f5c4fbc54caaecd081dd622
  • BLAKE2b-256: c4e8ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd

File details

Details for the file news_watch-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: news_watch-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 27.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for news_watch-0.1.5-py3-none-any.whl

  • SHA256: 9969cd04ea7e4e2263b363384a52eda1eb3996d023b78c49b4ad67ca16cb1f62
  • MD5: 942a503a4a1ccb21f960aab013cdf052
  • BLAKE2b-256: 485892e3696146fe28b3c4a156fd01133ea8f171d270d2a6595116194db09009
