# NewsLookout

A news scraping application.
- Overview
- Installation
- Quick Start
- Library Usage
- Architecture
- Configuration
- Plugin Development
- API Reference
- Troubleshooting
## Overview

NewsLookout is a comprehensive, multi-threaded web scraping framework designed for extracting news articles and data from various online sources. It features a plugin-based architecture for extensibility and supports concurrent processing across multiple news sources.
### Key Features

- Multi-threaded Architecture: Concurrent URL discovery, content fetching, and data processing
- Plugin-Based Design: Easy to extend with custom scrapers for different news sources
- Session Management: Tracks completed URLs to avoid duplicate processing
- Data Processing Pipeline: Built-in support for deduplication, classification, and keyword extraction
- Graceful Shutdown: Handles interrupts cleanly without data loss
- Library Interface: Can be used as a Python library in your own applications
- Configurable Timeouts: Prevents indefinite hangs with configurable timeout mechanisms
### Recent Improvements

- Timeout Management: URL gathering operations now have configurable timeouts (default: 10 minutes)
- Dedicated Database Thread: All database operations handled by single thread to prevent lock conflicts
- Improved Recursion: Iterative link extraction with strict depth limiting (max 4 levels)
- Better Interrupt Handling: Graceful shutdown on Ctrl+C with proper cleanup
- Queue-Based URL Streaming: URLs processed as discovered, not in batches
- Library Interface: Can be imported and used programmatically
## Installation

```
pip install newslookout
```
When installed via pip, NewsLookout stores all user-writable files outside the Python package directory so that package upgrades never overwrite your data or configuration.
| Platform | Config file | Log / PID files | Data & archive |
|---|---|---|---|
| Linux | `~/.config/newslookout/newslookout.conf` | `~/.local/state/newslookout/` | `~/.local/share/newslookout/data/` |
| macOS | `~/Library/Preferences/newslookout/newslookout.conf` | `~/Library/Logs/newslookout/` | `~/Library/Application Support/newslookout/data/` |
| Windows | `%APPDATA%\newslookout\newslookout.conf` | `%APPDATA%\newslookout\` | `%APPDATA%\newslookout\data\` |
Tip: You can override any path in the config file. Set the `data_dir`, `log_file`, and `archive_base_path` keys under `[environment]` to any absolute path you prefer.
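For example, a minimal override block might look like this (the paths shown are placeholders, not defaults):

```ini
[environment]
# All three keys are optional; unset keys keep the platform defaults above.
data_dir = /srv/newslookout/data
log_file = /srv/newslookout/logs/newslookout.log
archive_base_path = /srv/newslookout/archive
```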
The first time you run `newslookout` without specifying a config file, it will:

- Create the default configuration at the platform-appropriate path shown above.
- Print the path and exit so you can review it before scraping begins.

```
newslookout              # first run: creates config and exits
newslookout -d 2024-03-22
```

You can also point to a custom config explicitly:

```
newslookout -c /path/to/my.conf -d 2024-03-22
```
To install from source:

```
git clone https://github.com/sandeep-sandhu/newslookout.git
cd newslookout
pip install -e .
```
NewsLookout requires Python 3.8+ and will install the following dependencies:

- `beautifulsoup4`: HTML parsing
- `newspaper3k`: Article extraction
- `nltk`: Natural language processing
- `requests`: HTTP requests
- `pandas`: Data manipulation
- `enlighten`: Progress bars
- `spacy`: Advanced NLP (optional, for deduplication)
- `torch`: Deep learning (optional, for classification)
After installation, download the required NLP model data:
```
python -m spacy download en_core_web_lg

python - <<'EOF'
import nltk
for pkg in ['punkt', 'punkt_tab', 'maxent_treebank_pos_tagger',
            'reuters', 'universal_treebanks_v20']:
    nltk.download(pkg)
EOF
```
If NLTK data is stored in a non-standard location, set the `NLTK_DATA` environment variable to its path. See https://www.nltk.org/data.html for details.
Alternatively, you can download the packages listed in the snippet above manually from https://github.com/nltk/nltk_data/tree/gh-pages/packages.
## Quick Start

Run a scrape for a specific date:

```
newslookout -c config.conf -d 2025-12-21
```

To enable verbose logging:

```
newslookout -c config.conf -d 2025-12-21 --log-level DEBUG
```
## Library Usage

Basic blocking run:

```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='config.conf')
stats = app.run(run_date='2025-12-21', max_runtime=3600)
print(f"Processed {stats['urls_processed']} URLs in {stats['duration']:.1f} seconds")
```
Using the context manager for automatic cleanup:

```python
from newslookout import NewsLookoutApp

with NewsLookoutApp('config.conf') as app:
    app.start()  # run in the background
    # ... do other work while scraping proceeds ...
    app.stop()
```
Or use the `scrape()` convenience function:

```python
from newslookout import scrape

stats = scrape('config.conf', run_date='2025-12-21', max_runtime=3600)
```
Run a complete scraping session and inspect the statistics:

```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp(config_file='path/to/config.conf')
stats = app.run(run_date='2025-12-21')
print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs processed: {stats['urls_processed']}")
print(f"Data processed: {stats['data_processed']}")
print(f"Duration: {stats['duration']:.1f} seconds")
```
Run in the background and poll progress:

```python
from newslookout import NewsLookoutApp
import time

app = NewsLookoutApp('config.conf')
app.start()

while app.is_running:
    stats = app.get_statistics()
    print(f"Progress: {stats['urls_processed']} URLs processed")
    time.sleep(10)

app.wait_for_completion()
final_stats = app.get_statistics()
```
Enforce a maximum runtime:

```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp('config.conf')
stats = app.run(max_runtime=3600)  # stop after one hour

if app.is_running:
    print("Timeout reached, stopping...")
    app.stop()
```
Inspect plugin states while the application is running:

```python
from newslookout import NewsLookoutApp

app = NewsLookoutApp('config.conf')
app.start()

plugin_status = app.get_plugin_status()
for plugin_name, state in plugin_status.items():
    print(f"{plugin_name}: {state}")
```
The application's status is also visible on the monitoring dashboard, which uses the REST API to publish the status of scraping activity and progress. It is accessible at http://localhost:8080/dashboard.html.
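If you want the same status information without the dashboard, you can poll the REST API directly. The route below is an assumption for illustration (only dashboard.html is documented); check your running instance for the actual endpoint:

```python
import requests

# Hypothetical endpoint on the same server that serves dashboard.html.
STATUS_URL = "http://localhost:8080/api/status"  # assumed route, verify against your install

resp = requests.get(STATUS_URL, timeout=5)
resp.raise_for_status()
print(resp.json())  # scraping activity and progress, as shown on the dashboard
```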
## Architecture

```
┌─────────────────────────────────────────────────────┐
│                   NewsLookoutApp                    │
│                 (Library Interface)                 │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                    QueueManager                     │
│             (Orchestrates all workers)              │
└─────┬────────────┬────────────┬────────────┬────────┘
      │            │            │            │
      ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│   URL    │ │ Content  │ │   Data   │ │ Progress │
│Discovery │ │ Fetching │ │Processing│ │ Watcher  │
│ Workers  │ │ Workers  │ │ Workers  │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
      │            │            │            │
      └────────────┴────────────┴────────────┘
                          │
                  ┌───────▼────────┐
                  │    Database    │
                  │     Worker     │
                  │  (Dedicated)   │
                  └────────────────┘
```
### Worker Threads

- URL Discovery Workers: One per plugin, discovers URLs to scrape
- Content Fetch Workers: Multiple workers that download and parse content
- Data Processing Workers: Process scraped data through plugins
- Database Worker: Single thread handling all database operations
- Progress Watcher: Monitors progress and updates UI
### Queues

- URL Discovery Queue: New URLs streamed here as discovered
- Fetch Queue: URLs pending content download
- Processing Queue: Downloaded content pending processing
- Database Queue: Database operations to be executed
- Completed Queue: Finished items
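To illustrate the streaming model, here is a minimal, self-contained sketch (not NewsLookout's actual classes) of one discovery thread feeding a fetch queue that a worker drains as items arrive, rather than waiting for a complete batch:

```python
import queue
import threading

fetch_queue = queue.Queue()
SENTINEL = None  # marks the end of discovery

def discover_urls():
    # URLs are enqueued as soon as they are found, so fetch workers
    # start immediately instead of waiting for the full list.
    for i in range(5):
        fetch_queue.put(f"https://example.com/article/{i}")
    fetch_queue.put(SENTINEL)

def fetch_worker():
    while True:
        url = fetch_queue.get()
        if url is SENTINEL:
            break
        print(f"fetching {url}")  # real workers would download and parse here

threading.Thread(target=discover_urls, daemon=True).start()
fetch_worker()
```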
## Configuration

```ini
[installation]
prefix = /opt/newslookout
data_dir = /var/cache/newslookout_data
plugins_dir = /opt/newslookout/plugins
log_file = /var/log/newslookout/app.log
pid_file = /tmp/newslookout.pid

[operation]
url_gathering_timeout = 600
recursion_level = 2
user_agent = Mozilla/5.0 ...
fetch_timeout = 60
connect_timeout = 3
retry_count = 3
proxy_url_http = http://proxy.example.com:8080
proxy_url_https = https://proxy.example.com:8080

[logging]
log_level = INFO
max_logfile_size = 10485760
logfile_backup_count = 30

[plugins]
plugin1 = mod_en_in_ecotimes|10
plugin2 = mod_en_in_timesofindia|20
plugin3 = mod_dedupe|100
```
Key settings:

- `url_gathering_timeout`: Maximum seconds for URL discovery (default: 600)
- `recursion_level`: Depth of link extraction (1-4, default: 2)
- `fetch_timeout`: Timeout for downloading content (seconds)
- `connect_timeout`: Timeout for establishing a connection (seconds)
- `retry_count`: Number of retry attempts
- `user_agent`: User agent string for requests
- `completed_urls_datafile`: SQLite database for session history
- `log_level`: DEBUG, INFO, WARNING, or ERROR
- `max_logfile_size`: Maximum log file size before rotation
- `logfile_backup_count`: Number of rotated logs to keep
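Since the file is standard INI syntax, custom code (for example, a plugin that needs its own settings) can read it with Python's configparser; a minimal sketch, with the config path assumed:

```python
import configparser

config = configparser.ConfigParser()
config.read("config.conf")  # assumed path; use your actual config file

# Fall back to the documented defaults when a key is absent.
url_timeout = config.getint("operation", "url_gathering_timeout", fallback=600)
recursion = config.getint("operation", "recursion_level", fallback=2)
log_level = config.get("logging", "log_level", fallback="INFO")

print(url_timeout, recursion, log_level)
```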
## Plugin Development

To scrape a new source, create a content plugin in your configured `plugins_dir`:

```python
from base_plugin import BasePlugin
from data_structs import PluginTypes


class mod_my_news_site(BasePlugin):
    """Plugin for scraping MyNewsSite.com."""

    pluginType = PluginTypes.MODULE_NEWS_CONTENT
    mainURL = 'https://www.mynewssite.com'
    allowedDomains = ['www.mynewssite.com']
    validURLStringsToCheck = ['mynewssite.com/article/']
    invalidURLSubStrings = ['mynewssite.com/ads/', '/video/']

    def __init__(self):
        super().__init__()

    def getURLsListForDate(self, runDate, sessionHistoryDB):
        """Discover URLs for the given date."""
        urls = []
        # TODO: populate urls from the site's archive, sitemap, or RSS feed
        return urls

    def extractArticleBody(self, htmlContent):
        """Extract article text from HTML."""
        text = ''  # TODO: parse htmlContent and return the article body
        return text

    def extractUniqueIDFromURL(self, url):
        """Extract a unique identifier from the URL."""
        unique_id = url  # TODO: derive a stable ID, e.g. the article's numeric slug
        return unique_id
```
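As a purely illustrative sketch of what `getURLsListForDate()` might do, the following fetches a hypothetical date-based archive page and filters its links against the plugin's own `validURLStringsToCheck` list. The archive URL scheme, and the assumption that `runDate` is a `datetime.date`, are made up for this example:

```python
import requests
from bs4 import BeautifulSoup

def getURLsListForDate(self, runDate, sessionHistoryDB):
    """Hypothetical: collect article links from a date-based archive page."""
    archive_url = f"{self.mainURL}/archive/{runDate.strftime('%Y-%m-%d')}"  # assumed scheme
    resp = requests.get(archive_url, timeout=30)  # always bound network calls
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    urls = []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        # Reuse the plugin's allow-list to keep only article URLs.
        if any(pattern in href for pattern in self.validURLStringsToCheck):
            urls.append(href)
    return urls
```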
Data processing plugins post-process the scraped articles:

```python
from base_plugin import BasePlugin
from data_structs import PluginTypes


class mod_my_processor(BasePlugin):
    """Plugin for processing scraped data."""

    pluginType = PluginTypes.MODULE_DATA_PROCESSOR

    def __init__(self):
        super().__init__()

    def additionalConfig(self, sessionHistoryObj):
        """Perform any additional configuration."""
        pass

    def processDataObj(self, newsEventObj):
        """Process a news event object."""
        # Derive the processed text; the getText() accessor is assumed here
        # to mirror setText() and may differ in your version.
        processed_text = newsEventObj.getText()
        newsEventObj.setText(processed_text)
        filename = newsEventObj.getFileName().replace('.json', '')
        newsEventObj.writeFiles(filename, '', saveHTMLFile=False)
```
Available plugin types:

- `MODULE_NEWS_CONTENT`: Scrapes news articles
- `MODULE_NEWS_AGGREGATOR`: Aggregates URLs from multiple sources
- `MODULE_DATA_CONTENT`: Scrapes structured data
- `MODULE_DATA_PROCESSOR`: Post-processes scraped data
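To activate a custom plugin, register it in the `[plugins]` section of the configuration, following the pattern of the sample configuration above. The numeric suffix after the `|` appears to be an ordering or priority value; that interpretation is an assumption based on the sample, not documented behavior:

```ini
[plugins]
# Hypothetical entries for the two example plugins defined above.
plugin4 = mod_my_news_site|30
plugin5 = mod_my_processor|110
```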
## API Reference

### NewsLookoutApp

```python
NewsLookoutApp(config_file: str, run_date: Optional[str] = None)
```

Parameters:

- `config_file` (str): Path to the configuration file
- `run_date` (str, optional): Date in 'YYYY-MM-DD' format

Raises:

- `FileNotFoundError`: If the config file doesn't exist
- `ValueError`: If the configuration is invalid

#### run()

```python
run(run_date: Optional[str] = None,
    max_runtime: Optional[int] = None,
    blocking: bool = True) -> Dict[str, Any]
```

Run the scraping process.

Parameters:

- `run_date` (str, optional): Override the run date
- `max_runtime` (int, optional): Maximum runtime in seconds
- `blocking` (bool): If True, wait for completion

Returns:

- dict: Statistics dictionary

#### start()

Start the application in background mode.

#### stop()

```python
stop(timeout: int = 30)
```

Stop the running application gracefully.

Parameters:

- `timeout` (int): Maximum seconds to wait for shutdown

#### get_statistics()

```python
get_statistics() -> Dict[str, Any]
```

Get current or last-run statistics.

Returns a dict including:

- `urls_discovered`: Total URLs found
- `urls_processed`: URLs successfully scraped
- `data_processed`: Items processed
- `start_time`: Execution start time
- `end_time`: Execution end time
- `duration`: Runtime in seconds
- `is_running`: Current status

#### get_plugin_status()

```python
get_plugin_status() -> Dict[str, str]
```

Get the status of all loaded plugins.

Returns:

- dict: Map of plugin names to states

#### wait_for_completion()

```python
wait_for_completion(timeout: Optional[int] = None) -> bool
```

Wait for background execution to complete.

Parameters:

- `timeout` (int, optional): Maximum seconds to wait

Returns:

- bool: True if completed, False on timeout

### scrape()

```python
scrape(config_file: str,
       run_date: Optional[str] = None,
       max_runtime: Optional[int] = None) -> Dict[str, Any]
```

Convenience function to run a scraping job.
## Troubleshooting

Symptom: Application hangs during URL discovery.

Solution: Increase `url_gathering_timeout` in the configuration:

```ini
[operation]
url_gathering_timeout = 1200  # 20 minutes
```
Symptom: `database is locked` errors in the logs.

Solution: All database operations now go through a dedicated thread. If the issue persists:

- Check that no other process is accessing the database
- Remove `-journal` files if present
- Increase the timeout in `session_hist.py`
Symptom: Ctrl+C doesn't stop the application.

Solution: Updated code includes periodic shutdown checks. Ensure that you are:

- Using the latest version
- Not stuck in a long-running external call
- Using reasonable network timeouts
Symptom: Memory exhaustion from excessive URLs.

Solution:

- Reduce `recursion_level` in the configuration
- Improve URL filtering in plugins
- Use a more restrictive `validURLStringsToCheck`
Symptom: A specific plugin never completes.

Solution:

- Check the plugin's `is_stopped` flag periodically
- Ensure network operations have timeouts
- Review the `getURLsListForDate()` implementation
### Debugging

Enable detailed logging:

```ini
[logging]
log_level = DEBUG
```

Or programmatically:

```python
import logging

logging.getLogger('').setLevel(logging.DEBUG)
```
### Performance Tuning

Adjust fetch behavior in the configuration:

```ini
[operation]
fetch_timeout = 30  # reduce if sites are fast
retry_count = 2     # reduce retries
```

Increase data-processing parallelism by modifying the code:

```python
queue_manager.dataproc_threads = 10  # increase for more parallelism
```

Limit link recursion:

```ini
[operation]
recursion_level = 1  # minimum recursion
```
### Best Practices

Configuration:

- Use separate configs for different environments
- Version control your configuration files
- Document custom settings

Plugin development:

- Always check `self.is_stopped` in loops (see the sketch after this list)
- Use timeouts for all network operations
- Handle exceptions gracefully
- Log progress at regular intervals

Maintenance:

- Monitor disk space for the data directory
- Rotate logs regularly
- Clean up old session data periodically

Production deployment:

- Use systemd or supervisor for service management
- Set up log rotation
- Monitor application health
- Configure appropriate timeouts
- Use a separate database for each instance

Monitoring and testing:

- Review logs after each run
- Set up alerts for critical errors
- Test plugins with edge cases
- Handle malformed HTML gracefully
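A minimal sketch of the cooperative `self.is_stopped` check referenced above, assuming the flag is a plain boolean attribute that the framework sets during graceful shutdown; the class here is standalone for illustration, whereas a real plugin subclasses `BasePlugin`:

```python
class ExamplePlugin:
    """Standalone illustration of a cooperative shutdown check."""

    def __init__(self):
        self.is_stopped = False  # the framework flips this during graceful shutdown

    def gather(self, candidate_urls):
        results = []
        for url in candidate_urls:
            if self.is_stopped:
                break  # exit promptly rather than finishing the whole batch
            results.append(url.lower())  # placeholder for real per-URL work
        return results
```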
### Common Log Messages

| Log message | Cause | Fix |
|---|---|---|
| `can't compare offset-naive and offset-aware datetimes` | The news site returns a timezone-aware publication date | Apply Patch 2 to `base_plugin.py` |
| `'NoneType' object has no attribute 'getURL'` in `mod_keywordflags` | The JSON article file for a previously scraped URL no longer exists on disk | Apply Patch 4 to `worker.py`; also verify your `data_dir` path in the config |
| `Invalid article_id: None` / `Falling back to legacy file storage` | The URL did not match any `urlMatchPatterns` in the plugin | Apply Patch 3 to `base_plugin.py` |
| `Error fetching status: TypeError: can't access property "textContent" … is null` | Dashboard JS runs before the DOM is ready | Apply Patch 5 to `dashboard.html` |
| `Request for font "Ubuntu Sans" blocked at visibility level 2` | Browser privacy policy blocks Google Fonts | Apply Patch 5a to `dashboard.html` |
| Installed package appears under `src/newslookout` instead of `newslookout` | Missing src-layout config in `setup.cfg` / `pyproject.toml` | Apply Patches 9 and 10 |
## Support

- Documentation: https://github.com/sandeep-sandhu/newslookout
- Issues: Report bugs on GitHub Issues
- Contributing: Pull requests welcome

## License

This software is provided "AS IS" without warranty. See the LICENSE file for details.