
News scraping application


  1. Overview
  2. Installation
  3. Quick Start
  4. Library Usage
  5. Architecture
  6. Configuration
  7. Plugin Development
  8. API Reference
  9. Troubleshooting

Overview

NewsLookout is a comprehensive, multi-threaded web scraping framework designed for extracting news articles and data from various online sources. It features a plugin-based architecture for extensibility and supports concurrent processing across multiple news sources.

  • Multi-threaded Architecture: Concurrent URL discovery, content fetching, and data processing
  • Plugin-Based Design: Easy to extend with custom scrapers for different news sources
  • Session Management: Tracks completed URLs to avoid duplicate processing
  • Data Processing Pipeline: Built-in support for deduplication, classification, and keyword extraction
  • Graceful Shutdown: Handles interrupts cleanly without data loss
  • Library Interface: Can be used as a Python library in your own applications
  • Configurable Timeouts: Prevents indefinite hangs with configurable timeout mechanisms
Key improvements:

  1. Timeout Management: URL gathering operations now have configurable timeouts (default: 10 minutes)
  2. Dedicated Database Thread: All database operations handled by single thread to prevent lock conflicts
  3. Improved Recursion: Iterative link extraction with strict depth limiting (max 4 levels)
  4. Better Interrupt Handling: Graceful shutdown on Ctrl+C with proper cleanup
  5. Queue-Based URL Streaming: URLs processed as discovered, not in batches
  6. Library Interface: Can be imported and used programmatically
Installation

Install the latest release from PyPI:

pip install newslookout

When installed via pip, NewsLookout stores all user-writable files outside the Python package directory so that package upgrades never overwrite your data or configuration.

  • Linux: config file ~/.config/newslookout/newslookout.conf, log / PID files ~/.local/state/newslookout/, data & archive ~/.local/share/newslookout/data/
  • macOS: config file ~/Library/Preferences/newslookout/newslookout.conf, log / PID files ~/Library/Logs/newslookout/, data & archive ~/Library/Application Support/newslookout/data/
  • Windows: config file %APPDATA%\newslookout\newslookout.conf, log / PID files %APPDATA%\newslookout\, data & archive %APPDATA%\newslookout\data\

Tip: You can override any path in the config file. Set the data_dir, log_file, and archive_base_path keys under [environment] to any absolute path you prefer.
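For example, a minimal override block might look like this (the section and key names come from the tip above; the paths are placeholders):

[environment]
data_dir = /srv/newslookout/data
log_file = /srv/newslookout/logs/newslookout.log
archive_base_path = /srv/newslookout/archive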

The first time you run newslookout without specifying a config file, it will:

  1. Create the default configuration at the platform-appropriate path shown above.
  2. Print the path and exit so you can review it before scraping begins.
newslookout          # first run: creates config and exits
newslookout -d 2024-03-22

You can also point to a custom config explicitly:

newslookout -c /path/to/my.conf -d 2024-03-22
Alternatively, to install from source:

git clone https://github.com/sandeep-sandhu/newslookout.git
cd newslookout
pip install -e .

NewsLookout requires Python 3.8+ and will install the following dependencies:

  • beautifulsoup4 - HTML parsing
  • newspaper3k - Article extraction
  • nltk - Natural language processing
  • requests - HTTP requests
  • pandas - Data manipulation
  • enlighten - Progress bars
  • spacy - Advanced NLP (optional, for deduplication)
  • torch - Deep learning (optional, for classification)

After installation, download the required NLP model data:

python -m spacy download en_core_web_lg

python - <<'EOF'
import nltk
for pkg in ['punkt', 'punkt_tab', 'maxent_treebank_pos_tagger',
            'reuters', 'universal_treebanks_v20']:
    nltk.download(pkg)
EOF

If NLTK data is stored in a non-standard location, set the NLTK_DATA environment variable to its path. See https://www.nltk.org/data.html for details.
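For example, assuming the data was unpacked under /opt/nltk_data:

export NLTK_DATA=/opt/nltk_data

or, equivalently, from Python before the data is needed:

import nltk
nltk.data.path.append('/opt/nltk_data')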

Alternatively, you can download these packages manually from https://github.com/nltk/nltk_data/tree/gh-pages/packages (see https://www.nltk.org/data.html for general instructions). The following NLTK datasets are required:

  1. reuters
  2. universal_treebanks_v20
  3. maxent_treebank_pos_tagger
  4. punkt
Quick Start

Run a scrape for a specific date from the command line:

newslookout -c config.conf -d 2025-12-21

Add --log-level DEBUG for verbose output:

newslookout -c config.conf -d 2025-12-21 --log-level DEBUG
You can also drive the scraper from your own Python code:

from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='config.conf')
stats = app.run(run_date='2025-12-21', max_runtime=3600)

print(f"Processed {stats['urls_processed']} URLs in {stats['duration']:.1f} seconds")
The application can also be used as a context manager:

from newslookout import NewsLookoutApp

with NewsLookoutApp('config.conf') as app:
    app.start()  # Run in background
    # ... do other work while the scrape runs ...
    app.stop()
For one-off jobs, the scrape() convenience function runs the whole job in a single call:

from newslookout import scrape

stats = scrape('config.conf', run_date='2025-12-21', max_runtime=3600)
Library Usage

Basic usage, printing the full set of run statistics:

from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='path/to/config.conf')

stats = app.run(run_date='2025-12-21')

print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs processed: {stats['urls_processed']}")
print(f"Data processed: {stats['data_processed']}")
print(f"Duration: {stats['duration']:.1f} seconds")
Run in the background and poll progress periodically:

from newslookout import NewsLookoutApp
import time

app = NewsLookoutApp('config.conf')

app.start()

while app.is_running:
    stats = app.get_statistics()
    print(f"Progress: {stats['urls_processed']} URLs processed")
    time.sleep(10)

app.wait_for_completion()

final_stats = app.get_statistics()
Limit the maximum runtime and stop explicitly if the limit is reached:

from newslookout import NewsLookoutApp

app = NewsLookoutApp('config.conf')

stats = app.run(max_runtime=3600)

if app.is_running:
    print("Timeout reached, stopping...")
    app.stop()
Check the status of loaded plugins:

app = NewsLookoutApp('config.conf')
app.start()

plugin_status = app.get_plugin_status()
for plugin_name, state in plugin_status.items():
    print(f"{plugin_name}: {state}")

The application status is also visible from the monitoring dashboard, which uses the REST API to publish the status and progress of scraping activity. It is accessible at http://localhost:8080/dashboard.html

Monitoring Dashboard
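As a quick check that the dashboard is being served, you can fetch the page at the default address shown above (this snippet is not part of the NewsLookout API):

import requests

# A 200 status code means the embedded web server is up and serving the dashboard.
response = requests.get('http://localhost:8080/dashboard.html', timeout=5)
print(response.status_code)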

Architecture

┌─────────────────────────────────────────────────────┐
│                  NewsLookoutApp                      │
│              (Library Interface)                     │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                 QueueManager                         │
│          (Orchestrates all workers)                  │
└─────┬────────────┬────────────┬────────────┬────────┘
      │            │            │            │
      ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│   URL    │ │ Content  │ │   Data   │ │ Progress │
│Discovery │ │ Fetching │ │Processing│ │ Watcher  │
│ Workers  │ │ Workers  │ │ Workers  │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
      │            │            │            │
      └────────────┴────────────┴────────────┘
                          │
                  ┌───────▼────────┐
                  │   Database     │
                  │     Worker     │
                  │  (Dedicated)   │
                  └────────────────┘
Worker threads:

  1. URL Discovery Workers: One per plugin; each discovers URLs to scrape
  2. Content Fetch Workers: Multiple workers that download and parse content
  3. Data Processing Workers: Process scraped data through plugins
  4. Database Worker: Single thread handling all database operations
  5. Progress Watcher: Monitors progress and updates the UI

Queues:

  • URL Discovery Queue: New URLs streamed here as discovered
  • Fetch Queue: URLs pending content download
  • Processing Queue: Downloaded content pending processing
  • Database Queue: Database operations to be executed
  • Completed Queue: Finished items
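As an illustration of the queue-based hand-off between discovery and fetch workers, here is a simplified sketch (not the actual NewsLookout source; the URL list and the print call stand in for plugin discovery and content fetching):

import queue
import threading

fetch_queue = queue.Queue()   # URLs waiting for content download
SENTINEL = None               # marks the end of URL discovery

def discovery_worker(urls_found):
    # Stream each URL into the fetch queue as soon as it is discovered,
    # instead of collecting everything into one large batch.
    for url in urls_found:
        fetch_queue.put(url)
    fetch_queue.put(SENTINEL)

def fetch_worker():
    while True:
        url = fetch_queue.get()
        if url is SENTINEL:
            break
        print(f"fetching {url}")   # a real worker would download and parse here

producer = threading.Thread(target=discovery_worker,
                            args=(['https://example.com/a', 'https://example.com/b'],))
consumer = threading.Thread(target=fetch_worker)
producer.start(); consumer.start()
producer.join(); consumer.join()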
Configuration

A sample configuration file:

[installation]
prefix = /opt/newslookout
data_dir = /var/cache/newslookout_data
plugins_dir = /opt/newslookout/plugins
log_file = /var/log/newslookout/app.log
pid_file = /tmp/newslookout.pid

[operation]
url_gathering_timeout = 600

recursion_level = 2

user_agent = Mozilla/5.0 ...
fetch_timeout = 60
connect_timeout = 3
retry_count = 3

proxy_url_http = http://proxy.example.com:8080
proxy_url_https = https://proxy.example.com:8080

[logging]
log_level = INFO
max_logfile_size = 10485760
logfile_backup_count = 30

[plugins]
plugin1 = mod_en_in_ecotimes|10
plugin2 = mod_en_in_timesofindia|20
plugin3 = mod_dedupe|100
  • url_gathering_timeout: Maximum seconds for URL discovery (default: 600)

  • recursion_level: Depth of link extraction (1-4, default: 2)

  • fetch_timeout: Timeout for downloading content (seconds)

  • connect_timeout: Timeout for establishing connection (seconds)

  • retry_count: Number of retry attempts

  • user_agent: User agent string for requests

  • completed_urls_datafile: SQLite database for session history

  • log_level: DEBUG, INFO, WARNING, ERROR

  • max_logfile_size: Maximum log file size before rotation

  • logfile_backup_count: Number of rotated logs to keep

Plugin Development

To scrape a new news source, create a plugin class derived from BasePlugin. A skeleton content-scraper plugin:

from base_plugin import BasePlugin
from data_structs import PluginTypes

class mod_my_news_site(BasePlugin):
    """
    Plugin for scraping MyNewsSite.com
    """

    pluginType = PluginTypes.MODULE_NEWS_CONTENT
    mainURL = 'https://www.mynewssite.com'
    allowedDomains = ['www.mynewssite.com']

    validURLStringsToCheck = ['mynewssite.com/article/']
    invalidURLSubStrings = ['mynewssite.com/ads/', '/video/']

    def __init__(self):
        super().__init__()

    def getURLsListForDate(self, runDate, sessionHistoryDB):
        """Discover URLs for the given date."""
        urls = []
        # Populate urls with article links published on runDate
        return urls

    def extractArticleBody(self, htmlContent):
        """Extract article text from HTML."""
        text = ''
        # Parse htmlContent and return the article body text
        return text

    def extractUniqueIDFromURL(self, url):
        """Extract a unique identifier from the URL."""
        unique_id = ''
        # Derive a stable identifier (e.g. the article slug) from the URL
        return unique_id
A data-processing plugin follows the same pattern:

from base_plugin import BasePlugin
from data_structs import PluginTypes

class mod_my_processor(BasePlugin):
    """
    Plugin for processing scraped data
    """

    pluginType = PluginTypes.MODULE_DATA_PROCESSOR

    def __init__(self):
        super().__init__()

    def additionalConfig(self, sessionHistoryObj):
        """Additional configuration."""
        pass

    def processDataObj(self, newsEventObj):
        """Process a news event object."""
        # Transform the article text here; processed_text stands in for
        # whatever output this plugin produces.
        processed_text = ...
        newsEventObj.setText(processed_text)

        filename = newsEventObj.getFileName().replace('.json', '')
        newsEventObj.writeFiles(filename, '', saveHTMLFile=False)
Plugin types:

  • MODULE_NEWS_CONTENT: Scrapes news articles
  • MODULE_NEWS_AGGREGATOR: Aggregates URLs from multiple sources
  • MODULE_DATA_CONTENT: Scrapes structured data
  • MODULE_DATA_PROCESSOR: Post-processes scraped data
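To activate a custom plugin, register it in the [plugins] section of the configuration file using the same name|priority format shown earlier; the key plugin4 below is simply the next unused slot and the priority 30 is an arbitrary example:

[plugins]
plugin4 = mod_my_news_site|30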
API Reference

NewsLookoutApp(config_file: str, run_date: Optional[str] = None)

Parameters:

  • config_file (str): Path to configuration file
  • run_date (str, optional): Date in 'YYYY-MM-DD' format

Raises:

  • FileNotFoundError: If config file doesn't exist
  • ValueError: If configuration is invalid
run(run_date: Optional[str] = None,
    max_runtime: Optional[int] = None,
    blocking: bool = True) -> Dict[str, Any]

Run the scraping process.

Parameters:

  • run_date (str, optional): Override run date
  • max_runtime (int, optional): Maximum runtime in seconds
  • blocking (bool): If True, wait for completion

Returns:

  • dict: Statistics dictionary
start()

Start application in background mode.

stop(timeout: int = 30)

Stop the running application gracefully.

Parameters:

  • timeout (int): Maximum seconds to wait for shutdown
get_statistics() -> Dict[str, Any]

Get current or last run statistics.

Returns:

  • dict: Statistics including:
      • urls_discovered: Total URLs found
      • urls_processed: URLs successfully scraped
      • data_processed: Items processed
      • start_time: Execution start time
      • end_time: Execution end time
      • duration: Runtime in seconds
      • is_running: Current status
get_plugin_status() -> Dict[str, str]

Get status of all loaded plugins.

Returns:

  • dict: Map of plugin names to states
wait_for_completion(timeout: Optional[int] = None) -> bool

Wait for background execution to complete.

Parameters:

  • timeout (int, optional): Maximum seconds to wait

Returns:

  • bool: True if completed, False if timeout
scrape(config_file: str,
       run_date: Optional[str] = None,
       max_runtime: Optional[int] = None) -> Dict[str, Any]

Convenience function to run a scraping job.

Troubleshooting

Symptom: Application hangs during URL discovery

Solution: Increase url_gathering_timeout in configuration:

[operation]
url_gathering_timeout = 1200  # 20 minutes

Symptom: database is locked errors in logs

Solution: All database operations now go through a dedicated thread. If the issue persists:

  • Check no other process is accessing the database
  • Remove -journal files if present
  • Increase timeout in session_hist.py

Symptom: Ctrl+C doesn't stop the application

Solution: The updated code includes periodic shutdown checks. Ensure that:

  • you are running the latest version
  • the application is not stuck in a long-running external call
  • network timeouts are set to reasonable values

Symptom: Memory exhaustion from excessive URLs

Solution:

  • Reduce recursion_level in configuration
  • Improve URL filtering in plugins
  • Use more restrictive validURLStringsToCheck

Symptom: Specific plugin never completes

Solution:

  • Check plugin's is_stopped flag periodically
  • Ensure network operations have timeouts
  • Review getURLsListForDate() implementation
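The pattern below is a hedged sketch of what a well-behaved getURLsListForDate() loop can look like inside a plugin; self.sectionURLs is a hypothetical attribute, and the points being illustrated are the self.is_stopped check and the per-request timeout:

import requests

def getURLsListForDate(self, runDate, sessionHistoryDB):
    """Gather URLs, but bail out promptly when a shutdown is requested."""
    urls = []
    for section_url in getattr(self, 'sectionURLs', []):
        if self.is_stopped:
            break                      # honour shutdown requests between iterations
        try:
            # Never call the network without a timeout, or the worker can hang forever
            response = requests.get(section_url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue                   # skip this section rather than aborting the run
        # ... parse response.text and append discovered article URLs to urls ...
    return urls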

Enable detailed logging:

[logging]
log_level = DEBUG

Or programmatically:

import logging
logging.getLogger('').setLevel(logging.DEBUG)
To speed up fetching, tighten the network settings:

[operation]
fetch_timeout = 30  # Reduce if sites are fast
retry_count = 2     # Reduce retries

To increase data-processing parallelism, modify the worker count in code (queue_manager is the application's QueueManager instance):

queue_manager.dataproc_threads = 10  # Increase for more parallelism

To keep memory usage and the number of discovered URLs down, use the minimum recursion depth:

[operation]
recursion_level = 1  # Minimum recursion
Best practices:

  • Use separate configs for different environments

  • Version control your configuration files

  • Document custom settings

  • Always check self.is_stopped in loops

  • Use timeouts for all network operations

  • Handle exceptions gracefully

  • Log progress at regular intervals

  • Monitor disk space for data directory

  • Rotate logs regularly

  • Clean up old session data periodically

  • Use systemd or supervisor for service management

  • Set up log rotation

  • Monitor application health

  • Configure appropriate timeouts

  • Use separate database for each instance

  • Review logs after each run

  • Set up alerts for critical errors

  • Test plugins with edge cases

  • Handle malformed HTML gracefully

Log messages, their causes, and fixes:

  • Log message: can't compare offset-naive and offset-aware datetimes
    Cause: the news site returns a timezone-aware publication date
    Fix: apply Patch 2 to base_plugin.py
  • Log message: 'NoneType' object has no attribute 'getURL' in mod_keywordflags
    Cause: the JSON article file for a previously scraped URL no longer exists on disk
    Fix: apply Patch 4 to worker.py; also verify your data_dir path in the config
  • Log message: Invalid article_id: None / Falling back to legacy file storage
    Cause: the URL did not match any urlMatchPatterns in the plugin
    Fix: apply Patch 3 to base_plugin.py
  • Log message: Error fetching status: TypeError: can't access property "textContent" … is null
    Cause: dashboard JS runs before the DOM is ready
    Fix: apply Patch 5 to dashboard.html
  • Log message: Request for font "Ubuntu Sans" blocked at visibility level 2
    Cause: browser privacy policy blocks Google Fonts
    Fix: apply Patch 5a to dashboard.html
  • Symptom: installed package appears under src/newslookout instead of newslookout
    Cause: missing src-layout config in setup.cfg / pyproject.toml
    Fix: apply Patches 9 and 10

This software is provided "AS IS" without warranty. See LICENSE file for details.



