
News scraping application


  1. Overview
  2. Installation
  3. Quick Start
  4. Library Usage
  5. Architecture
  6. Configuration
  7. Plugin Development
  8. API Reference
  9. Troubleshooting

Overview

NewsLookout is a comprehensive, multi-threaded web scraping framework designed for extracting news articles and data from various online sources. It features a plugin-based architecture for extensibility and supports concurrent processing across multiple news sources.

  • Multi-threaded Architecture: Concurrent URL discovery, content fetching, and data processing
  • Plugin-Based Design: Easy to extend with custom scrapers for different news sources
  • Session Management: Tracks completed URLs to avoid duplicate processing
  • Data Processing Pipeline: Built-in support for deduplication, classification, and keyword extraction
  • Graceful Shutdown: Handles interrupts cleanly without data loss
  • Library Interface: Can be used as a Python library in your own applications
  • Configurable Timeouts: Prevents indefinite hangs with configurable timeout mechanisms
Key improvements:

  1. Timeout Management: URL gathering operations now have configurable timeouts (default: 10 minutes)
  2. Dedicated Database Thread: All database operations handled by single thread to prevent lock conflicts
  3. Improved Recursion: Iterative link extraction with strict depth limiting (max 4 levels)
  4. Better Interrupt Handling: Graceful shutdown on Ctrl+C with proper cleanup
  5. Queue-Based URL Streaming: URLs processed as discovered, not in batches
  6. Library Interface: Can be imported and used programmatically
Installation

Install the latest release from PyPI:

pip install newslookout

When installed via pip, NewsLookout stores all user-writable files outside the Python package directory so that package upgrades never overwrite your data or configuration.

  • Linux: config file ~/.config/newslookout/newslookout.conf, log / PID files ~/.local/state/newslookout/, data & archive ~/.local/share/newslookout/data/
  • macOS: config file ~/Library/Preferences/newslookout/newslookout.conf, log / PID files ~/Library/Logs/newslookout/, data & archive ~/Library/Application Support/newslookout/data/
  • Windows: config file %APPDATA%\newslookout\newslookout.conf, log / PID files %APPDATA%\newslookout\, data & archive %APPDATA%\newslookout\data\

Tip: You can override any path in the config file. Set the data_dir, log_file, and archive_base_path keys under [environment] to any absolute path you prefer.
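For example, a minimal override block might look like this (the section and key names come from the tip above; the paths are placeholders):

[environment]
data_dir = /srv/newslookout/data
log_file = /srv/newslookout/logs/newslookout.log
archive_base_path = /srv/newslookout/archive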

The first time you run newslookout without specifying a config file, it will:

  1. Create the default configuration at the platform-appropriate path shown above.
  2. Print the path and exit so you can review it before scraping begins.
newslookout          # first run: creates config and exits
newslookout -d 2024-03-22

You can also point to a custom config explicitly:

newslookout -c /path/to/my.conf -d 2024-03-22
Alternatively, to install from source:

git clone https://github.com/sandeep-sandhu/newslookout.git
cd newslookout
pip install -e .

NewsLookout requires Python 3.8+ and will install the following dependencies:

  • beautifulsoup4 - HTML parsing
  • newspaper3k - Article extraction
  • nltk - Natural language processing
  • requests - HTTP requests
  • pandas - Data manipulation
  • enlighten - Progress bars
  • spacy - Advanced NLP (optional, for deduplication)
  • torch - Deep learning (optional, for classification)

After installation, download the required NLP model data:

python -m spacy download en_core_web_lg

python - <<'EOF'
import nltk
for pkg in ['punkt', 'punkt_tab', 'maxent_treebank_pos_tagger',
            'reuters', 'universal_treebanks_v20']:
    nltk.download(pkg)
EOF

If NLTK data is stored in a non-standard location, set the NLTK_DATA environment variable to its path. See https://www.nltk.org/data.html for details.
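For example, assuming the data was unpacked under /opt/nltk_data:

export NLTK_DATA=/opt/nltk_data

or, equivalently, from Python before the data is needed:

import nltk
nltk.data.path.append('/opt/nltk_data')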

Alternatively, you can download these packages manually from https://github.com/nltk/nltk_data/tree/gh-pages/packages (see https://www.nltk.org/data.html for general instructions). The following NLTK datasets are required:

  1. reuters
  2. universal_treebanks_v20
  3. maxent_treebank_pos_tagger
  4. punkt
Quick Start

Run a scrape for a specific date from the command line:

newslookout -c config.conf -d 2025-12-21

Add --log-level DEBUG for verbose output:

newslookout -c config.conf -d 2025-12-21 --log-level DEBUG
You can also drive the scraper from your own Python code:

from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='config.conf')
stats = app.run(run_date='2025-12-21', max_runtime=3600)

print(f"Processed {stats['urls_processed']} URLs in {stats['duration']:.1f} seconds")
The application can also be used as a context manager:

from newslookout import NewsLookoutApp

with NewsLookoutApp('config.conf') as app:
    app.start()  # Run in background
    # ... do other work while the scrape runs ...
    app.stop()
For one-off jobs, the scrape() convenience function runs the whole job in a single call:

from newslookout import scrape

stats = scrape('config.conf', run_date='2025-12-21', max_runtime=3600)
Library Usage

Basic usage, printing the full set of run statistics:

from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='path/to/config.conf')

stats = app.run(run_date='2025-12-21')

print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs processed: {stats['urls_processed']}")
print(f"Data processed: {stats['data_processed']}")
print(f"Duration: {stats['duration']:.1f} seconds")
Run in the background and poll progress periodically:

from newslookout import NewsLookoutApp
import time

app = NewsLookoutApp('config.conf')

app.start()

while app.is_running:
    stats = app.get_statistics()
    print(f"Progress: {stats['urls_processed']} URLs processed")
    time.sleep(10)

app.wait_for_completion()

final_stats = app.get_statistics()
Limit the maximum runtime and stop explicitly if the limit is reached:

from newslookout import NewsLookoutApp

app = NewsLookoutApp('config.conf')

stats = app.run(max_runtime=3600)

if app.is_running:
    print("Timeout reached, stopping...")
    app.stop()
Check the status of loaded plugins:

app = NewsLookoutApp('config.conf')
app.start()

plugin_status = app.get_plugin_status()
for plugin_name, state in plugin_status.items():
    print(f"{plugin_name}: {state}")

The application status is also visible from the monitoring dashboard, which uses the REST API to publish the status and progress of scraping activity. It is accessible at http://localhost:8080/dashboard.html

Monitoring Dashboard
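As a quick check that the dashboard is being served, you can fetch the page at the default address shown above (this snippet is not part of the NewsLookout API):

import requests

# A 200 status code means the embedded web server is up and serving the dashboard.
response = requests.get('http://localhost:8080/dashboard.html', timeout=5)
print(response.status_code)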

Architecture

┌─────────────────────────────────────────────────────┐
│                  NewsLookoutApp                      │
│              (Library Interface)                     │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                 QueueManager                         │
│          (Orchestrates all workers)                  │
└─────┬────────────┬────────────┬────────────┬────────┘
      │            │            │            │
      ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│   URL    │ │ Content  │ │   Data   │ │ Progress │
│Discovery │ │ Fetching │ │Processing│ │ Watcher  │
│ Workers  │ │ Workers  │ │ Workers  │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
      │            │            │            │
      └────────────┴────────────┴────────────┘
                          │
                  ┌───────▼────────┐
                  │   Database     │
                  │     Worker     │
                  │  (Dedicated)   │
                  └────────────────┘
Worker threads:

  1. URL Discovery Workers: One per plugin; each discovers URLs to scrape
  2. Content Fetch Workers: Multiple workers that download and parse content
  3. Data Processing Workers: Process scraped data through plugins
  4. Database Worker: Single thread handling all database operations
  5. Progress Watcher: Monitors progress and updates the UI

Queues:

  • URL Discovery Queue: New URLs streamed here as discovered
  • Fetch Queue: URLs pending content download
  • Processing Queue: Downloaded content pending processing
  • Database Queue: Database operations to be executed
  • Completed Queue: Finished items
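As an illustration of the queue-based hand-off between discovery and fetch workers, here is a simplified sketch (not the actual NewsLookout source; the URL list and the print call stand in for plugin discovery and content fetching):

import queue
import threading

fetch_queue = queue.Queue()   # URLs waiting for content download
SENTINEL = None               # marks the end of URL discovery

def discovery_worker(urls_found):
    # Stream each URL into the fetch queue as soon as it is discovered,
    # instead of collecting everything into one large batch.
    for url in urls_found:
        fetch_queue.put(url)
    fetch_queue.put(SENTINEL)

def fetch_worker():
    while True:
        url = fetch_queue.get()
        if url is SENTINEL:
            break
        print(f"fetching {url}")   # a real worker would download and parse here

producer = threading.Thread(target=discovery_worker,
                            args=(['https://example.com/a', 'https://example.com/b'],))
consumer = threading.Thread(target=fetch_worker)
producer.start(); consumer.start()
producer.join(); consumer.join()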
Configuration

A sample configuration file:

[installation]
prefix = /opt/newslookout
data_dir = /var/cache/newslookout_data
plugins_dir = /opt/newslookout/plugins
log_file = /var/log/newslookout/app.log
pid_file = /tmp/newslookout.pid

[operation]
url_gathering_timeout = 600

recursion_level = 2

user_agent = Mozilla/5.0 ...
fetch_timeout = 60
connect_timeout = 3
retry_count = 3

proxy_url_http = http://proxy.example.com:8080
proxy_url_https = https://proxy.example.com:8080

[logging]
log_level = INFO
max_logfile_size = 10485760
logfile_backup_count = 30

[plugins]
plugin1 = mod_en_in_ecotimes|10
plugin2 = mod_en_in_timesofindia|20
plugin3 = mod_dedupe|100
  • url_gathering_timeout: Maximum seconds for URL discovery (default: 600)

  • recursion_level: Depth of link extraction (1-4, default: 2)

  • fetch_timeout: Timeout for downloading content (seconds)

  • connect_timeout: Timeout for establishing connection (seconds)

  • retry_count: Number of retry attempts

  • user_agent: User agent string for requests

  • completed_urls_datafile: SQLite database for session history

  • log_level: DEBUG, INFO, WARNING, ERROR

  • max_logfile_size: Maximum log file size before rotation

  • logfile_backup_count: Number of rotated logs to keep

Plugin Development

To scrape a new news source, create a plugin class derived from BasePlugin. A skeleton content-scraper plugin:

from base_plugin import BasePlugin
from data_structs import PluginTypes

class mod_my_news_site(BasePlugin):
    """
    Plugin for scraping MyNewsSite.com
    """

    pluginType = PluginTypes.MODULE_NEWS_CONTENT
    mainURL = 'https://www.mynewssite.com'
    allowedDomains = ['www.mynewssite.com']

    validURLStringsToCheck = ['mynewssite.com/article/']
    invalidURLSubStrings = ['mynewssite.com/ads/', '/video/']

    def __init__(self):
        super().__init__()

    def getURLsListForDate(self, runDate, sessionHistoryDB):
        """Discover URLs for the given date."""
        urls = []
        # Populate urls with article links published on runDate
        return urls

    def extractArticleBody(self, htmlContent):
        """Extract article text from HTML."""
        text = ''
        # Parse htmlContent and return the article body text
        return text

    def extractUniqueIDFromURL(self, url):
        """Extract a unique identifier from the URL."""
        unique_id = ''
        # Derive a stable identifier (e.g. the article slug) from the URL
        return unique_id
A data-processing plugin follows the same pattern:

from base_plugin import BasePlugin
from data_structs import PluginTypes

class mod_my_processor(BasePlugin):
    """
    Plugin for processing scraped data
    """

    pluginType = PluginTypes.MODULE_DATA_PROCESSOR

    def __init__(self):
        super().__init__()

    def additionalConfig(self, sessionHistoryObj):
        """Additional configuration."""
        pass

    def processDataObj(self, newsEventObj):
        """Process a news event object."""
        # Transform the article text here; processed_text stands in for
        # whatever output this plugin produces.
        processed_text = ...
        newsEventObj.setText(processed_text)

        filename = newsEventObj.getFileName().replace('.json', '')
        newsEventObj.writeFiles(filename, '', saveHTMLFile=False)
Plugin types:

  • MODULE_NEWS_CONTENT: Scrapes news articles
  • MODULE_NEWS_AGGREGATOR: Aggregates URLs from multiple sources
  • MODULE_DATA_CONTENT: Scrapes structured data
  • MODULE_DATA_PROCESSOR: Post-processes scraped data
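To activate a custom plugin, register it in the [plugins] section of the configuration file using the same name|priority format shown earlier; the key plugin4 below is simply the next unused slot and the priority 30 is an arbitrary example:

[plugins]
plugin4 = mod_my_news_site|30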
API Reference

NewsLookoutApp(config_file: str, run_date: Optional[str] = None)

Parameters:

  • config_file (str): Path to configuration file
  • run_date (str, optional): Date in 'YYYY-MM-DD' format

Raises:

  • FileNotFoundError: If config file doesn't exist
  • ValueError: If configuration is invalid
run(run_date: Optional[str] = None,
    max_runtime: Optional[int] = None,
    blocking: bool = True) -> Dict[str, Any]

Run the scraping process.

Parameters:

  • run_date (str, optional): Override run date
  • max_runtime (int, optional): Maximum runtime in seconds
  • blocking (bool): If True, wait for completion

Returns:

  • dict: Statistics dictionary
start()

Start application in background mode.

stop(timeout: int = 30)

Stop the running application gracefully.

Parameters:

  • timeout (int): Maximum seconds to wait for shutdown
get_statistics() -> Dict[str, Any]

Get current or last run statistics.

Returns:

  • dict: Statistics including:
      • urls_discovered: Total URLs found
      • urls_processed: URLs successfully scraped
      • data_processed: Items processed
      • start_time: Execution start time
      • end_time: Execution end time
      • duration: Runtime in seconds
      • is_running: Current status
get_plugin_status() -> Dict[str, str]

Get status of all loaded plugins.

Returns:

  • dict: Map of plugin names to states
wait_for_completion(timeout: Optional[int] = None) -> bool

Wait for background execution to complete.

Parameters:

  • timeout (int, optional): Maximum seconds to wait

Returns:

  • bool: True if completed, False if timeout
scrape(config_file: str,
       run_date: Optional[str] = None,
       max_runtime: Optional[int] = None) -> Dict[str, Any]

Convenience function to run a scraping job.

Troubleshooting

Symptom: Application hangs during URL discovery

Solution: Increase url_gathering_timeout in configuration:

[operation]
url_gathering_timeout = 1200  # 20 minutes

Symptom: database is locked errors in logs

Solution: All database operations now go through a dedicated thread. If the issue persists:

  • Check no other process is accessing the database
  • Remove -journal files if present
  • Increase timeout in session_hist.py

Symptom: Ctrl+C doesn't stop the application

Solution: The updated code includes periodic shutdown checks. Ensure that:

  • you are running the latest version
  • the application is not stuck in a long-running external call
  • network timeouts are set to reasonable values

Symptom: Memory exhaustion from excessive URLs

Solution:

  • Reduce recursion_level in configuration
  • Improve URL filtering in plugins
  • Use more restrictive validURLStringsToCheck

Symptom: Specific plugin never completes

Solution:

  • Check plugin's is_stopped flag periodically
  • Ensure network operations have timeouts
  • Review getURLsListForDate() implementation
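The pattern below is a hedged sketch of what a well-behaved getURLsListForDate() loop can look like inside a plugin; self.sectionURLs is a hypothetical attribute, and the points being illustrated are the self.is_stopped check and the per-request timeout:

import requests

def getURLsListForDate(self, runDate, sessionHistoryDB):
    """Gather URLs, but bail out promptly when a shutdown is requested."""
    urls = []
    for section_url in getattr(self, 'sectionURLs', []):
        if self.is_stopped:
            break                      # honour shutdown requests between iterations
        try:
            # Never call the network without a timeout, or the worker can hang forever
            response = requests.get(section_url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue                   # skip this section rather than aborting the run
        # ... parse response.text and append discovered article URLs to urls ...
    return urls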

Enable detailed logging:

[logging]
log_level = DEBUG

Or programmatically:

import logging
logging.getLogger('').setLevel(logging.DEBUG)
To speed up fetching, tighten the network settings:

[operation]
fetch_timeout = 30  # Reduce if sites are fast
retry_count = 2     # Reduce retries

To increase data-processing parallelism, modify the worker count in code (queue_manager is the application's QueueManager instance):

queue_manager.dataproc_threads = 10  # Increase for more parallelism

To keep memory usage and the number of discovered URLs down, use the minimum recursion depth:

[operation]
recursion_level = 1  # Minimum recursion
Best practices:

  • Use separate configs for different environments

  • Version control your configuration files

  • Document custom settings

  • Always check self.is_stopped in loops

  • Use timeouts for all network operations

  • Handle exceptions gracefully

  • Log progress at regular intervals

  • Monitor disk space for data directory

  • Rotate logs regularly

  • Clean up old session data periodically

  • Use systemd or supervisor for service management

  • Set up log rotation

  • Monitor application health

  • Configure appropriate timeouts

  • Use separate database for each instance

  • Review logs after each run

  • Set up alerts for critical errors

  • Test plugins with edge cases

  • Handle malformed HTML gracefully

Log messages, their causes, and fixes:

  • Log message: can't compare offset-naive and offset-aware datetimes
    Cause: the news site returns a timezone-aware publication date
    Fix: apply Patch 2 to base_plugin.py
  • Log message: 'NoneType' object has no attribute 'getURL' in mod_keywordflags
    Cause: the JSON article file for a previously scraped URL no longer exists on disk
    Fix: apply Patch 4 to worker.py; also verify your data_dir path in the config
  • Log message: Invalid article_id: None / Falling back to legacy file storage
    Cause: the URL did not match any urlMatchPatterns in the plugin
    Fix: apply Patch 3 to base_plugin.py
  • Log message: Error fetching status: TypeError: can't access property "textContent" … is null
    Cause: dashboard JS runs before the DOM is ready
    Fix: apply Patch 5 to dashboard.html
  • Log message: Request for font "Ubuntu Sans" blocked at visibility level 2
    Cause: browser privacy policy blocks Google Fonts
    Fix: apply Patch 5a to dashboard.html
  • Symptom: installed package appears under src/newslookout instead of newslookout
    Cause: missing src-layout config in setup.cfg / pyproject.toml
    Fix: apply Patches 9 and 10

This software is provided "AS IS" without warranty. See LICENSE file for details.



