
Google News Scraper CLI for OSINT and research

Project description

NewsCrap

NewsCrap is a Google News scraping tool with a command-line interface (CLI), designed for research, investigations, and OSINT data collection. With features such as proxy rotation, automatic scheduling, and multi-format export, it makes collecting news data efficient and reliable.


Key Features

  • Multi-keyword Support - Search several keywords at once
  • Pagination Scraping - Collect articles across multiple result pages
  • Multi-format Export - CSV, JSON, SQLite database
  • Proxy & User-Agent Rotation - Avoid detection and blocking
  • Scheduler Mode - Run automatically at a set interval
  • Deduplication & Filtering - Skip duplicates and filter by domain
  • Report Generation - Export Markdown/HTML reports
  • Verbose Logging - Follow the scraping process in detail
  • Error Handling - Keeps running even when errors occur
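The deduplication step can be pictured with a short sketch. This is an illustration of the general technique (keeping the first article per normalized URL), not NewsCrap's actual code; the function names and the `url` field are assumptions.

```python
# Hypothetical sketch of URL-based deduplication, not NewsCrap's actual code:
# normalize each article URL and keep only the first occurrence.
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Lowercase the host and drop query/fragment so tracking params don't create duplicates."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), "", ""))

def deduplicate(articles):
    """Keep the first article per normalized URL; `articles` are dicts with a 'url' key."""
    seen = set()
    unique = []
    for art in articles:
        key = normalize_url(art["url"])
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique
```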

Installation

  1. Clone the repository:
git clone https://github.com/opsysdebug/NewsCrap.git
cd NewsCrap
  2. Install dependencies:
pip install -r requirements.txt
Usage Examples

# Scrape news for a single keyword
python news_scrap.py "artificial intelligence"

# Multiple keywords
python news_scrap.py "AI" "machine learning" "deep learning"

# Limit the number of articles
python news_scrap.py "cybersecurity" --max-articles 50

# Export to JSON with a domain filter
python news_scrap.py "technology" --output-format json --domain-filter bbc.com

# With proxy and user-agent rotation
python news_scrap.py "news" --proxy-file proxies.txt --user-agent-file user_agents.txt

# Scheduled mode (every 2 hours)
python news_scrap.py "cryptocurrency" --schedule 2h --verbose

# Generate an HTML report
python news_scrap.py "politics" --report-format html --max-articles 30
usage: news_scrap.py [-h] [--max-articles MAX_ARTICLES] [--output-format {csv,json,sqlite,all}]
                    [--output-dir OUTPUT_DIR] [--report-format {markdown,html,both}]
                    [--proxy-file PROXY_FILE] [--user-agent-file USER_AGENT_FILE]
                    [--domain-filter DOMAIN_FILTER] [--schedule SCHEDULE] [--verbose]
                    keywords [keywords ...]

Google News Scraper CLI

positional arguments:
  keywords              Keywords to search for

optional arguments:
  -h, --help            show this help message and exit
  --max-articles MAX_ARTICLES
                        Maximum articles per keyword (default: 10)
  --output-format {csv,json,sqlite,all}
                        Output format (default: csv)
  --output-dir OUTPUT_DIR
                        Output directory (default: output)
  --report-format {markdown,html,both}
                        Generate report in specified format
  --proxy-file PROXY_FILE
                        File containing list of proxies (one per line)
  --user-agent-file USER_AGENT_FILE
                        File containing list of user agents (one per line)
  --domain-filter DOMAIN_FILTER
                        Filter results by domain (e.g., bbc.com)
  --schedule SCHEDULE   Run on schedule (e.g., "1h" for hourly, "30m" for every 30 minutes)
  --verbose, -v         Verbose output
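The `--schedule` flag accepts interval strings like "1h" or "30m". A parser for that format could look like the sketch below; the flag and its accepted values come from the help text above, but this helper is an illustration, not NewsCrap's actual implementation.

```python
# Hypothetical parser turning a schedule string like "2h" or "30m" into seconds.
import re

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_schedule(spec):
    """Turn '30m' into 1800 and '2h' into 7200; raise ValueError on bad input."""
    match = re.fullmatch(r"(\d+)([smhd])", spec.strip().lower())
    if not match:
        raise ValueError(f"invalid schedule: {spec!r}")
    value, unit = match.groups()
    return int(value) * UNITS[unit]
```

The returned number of seconds would then feed a sleep loop or a scheduling library between scraping runs.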

Configuration Files

proxies.txt

http://proxy1.example.com:8080
http://proxy2.example.com:3128
socks5://proxy3.example.com:1080

user_agents.txt

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36

Example Use Cases

For Academic Research

python news_scrap.py "climate change" "global warming" --max-articles 100 --output-format json

For OSINT Investigations

python news_scrap.py "company name" --domain-filter reuters.com --proxy-file proxies.txt --schedule 6h

For News Monitoring

python news_scrap.py "breaking news" --schedule 30m --report-format both --verbose
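A `--schedule 30m` run boils down to a loop that invokes the scraper and sleeps between passes. A minimal sketch of that monitoring loop, with `run_scrape` standing in as a hypothetical placeholder for one scraping pass:

```python
# Illustrative scheduling loop; `run_scrape` is a placeholder, not a real NewsCrap function.
import time

def run_on_schedule(run_scrape, interval_seconds, max_runs=None):
    """Call `run_scrape()` every `interval_seconds`; `max_runs=None` loops forever."""
    runs = 0
    while max_runs is None or runs < max_runs:
        run_scrape()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs
```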

Important

  • Use this tool responsibly and comply with Google's Terms of Service
  • Respect robots.txt and rate limiting
  • Using proxies is recommended to avoid IP blocking
  • This tool is intended for legal educational and research purposes

Troubleshooting

Error: 429 Too Many Requests (don't forget to use proxies)

python news_scrap.py "keyword" --proxy-file proxies.txt

Error: Connection Issues

# Make sure your internet connection is stable
# Try a different user agent

⭐ Support

Your sponsorship would mean a great deal to me for this research project and for my further open-source work. Thank you very much.

Disclaimer: This tool was built for educational and research purposes. Users bear full responsibility for using it in compliance with applicable law and with the policies of the websites being scraped.

Download files

Download the file for your platform.

Source Distribution

newscrap-0.1.0.tar.gz (9.8 kB)

Uploaded Source

Built Distribution


newscrap-0.1.0-py3-none-any.whl (9.6 kB)

Uploaded Python 3

File details

Details for the file newscrap-0.1.0.tar.gz.

File metadata

  • Download URL: newscrap-0.1.0.tar.gz
  • Upload date:
  • Size: 9.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for newscrap-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4f68d874eace0f7cf0e4b0344999beeca8b53c99533d34f2b22416aec3b67e1f
MD5 2d38864dc98222ea1065f5f97a0260d0
BLAKE2b-256 fe8448e4e18ebe808625e241910f5fa1f5d882e7fb51ccec4036bd96c13423e1


File details

Details for the file newscrap-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: newscrap-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for newscrap-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d13bf1275dc7678d5609b73edacc83e31be17855450189c15c378d5e4e095367
MD5 f42fab055321a911dd937e69cd7e881f
BLAKE2b-256 9726a4cc21bc12f5af1bafda2a1190019bfdc35b641e1128cbf6307f06ebece2

