News scraping application

Project description

NewsLookout is a web scraping application for financial events. It is a scalable, fault-tolerant, modular and configurable multi-threaded Python console application. It is enterprise ready: it can run behind a proxy server and be launched by automated schedulers.

The application is readily extended through its ‘plugin’ architecture: custom modules can add news sources, custom data pre-processing and NLP-based news text analytics (e.g. entity recognition, negative event classification, economy trends, industry trends, etc.). For more details, refer to https://github.com/sandeep-sandhu/NewsLookout

The application runs with its default parameters without any special configuration, but the parameters in the default config file should be customized, especially the file and folder locations for data, the config file, the log file, the PID file, etc. Most importantly, model data for the NLTK and spaCy NLP libraries must be downloaded as part of installation.

For spaCy, run the following command:

> python -m spacy download en_core_web_lg

For NLTK, run the following commands within the Python shell:

> import nltk
> nltk.download()

You can extend the application to scrape any additional website by copying the template file template_for_plugin.py and customising it. Give your custom plugin file the same name as the plugin class it defines. Place the file in the plugins_contrib folder and add the plugin's name to the configuration file. It will be picked up automatically on the next application run. For examples of how a plugin can be written, look at the code of one of the already implemented plugins.
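The "file name equals class name" convention makes plugin discovery straightforward. The following is a minimal sketch of how such discovery could work, not NewsLookout's actual loader: the functions load_plugin_class and discover_plugins are hypothetical names introduced here, and the list of enabled names stands in for whatever the configuration file provides.

```python
import importlib.util
from pathlib import Path


def load_plugin_class(plugin_file):
    """Load a plugin file and return the class named after the file.

    Hypothetical helper: illustrates the naming convention only.
    """
    path = Path(plugin_file)
    class_name = path.stem  # convention: file name without .py == class name
    spec = importlib.util.spec_from_file_location(class_name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(
            f"{path.name} does not define a class named {class_name!r}"
        ) from None


def discover_plugins(folder, enabled_names):
    """Return {name: class} for each enabled name with a matching file in folder."""
    found = {}
    for name in enabled_names:
        candidate = Path(folder) / f"{name}.py"
        if candidate.is_file():
            found[name] = load_plugin_class(candidate)
    return found
```

Under these assumptions, a file plugins_contrib/mod_example_news.py that defines class mod_example_news would be returned by discover_plugins("plugins_contrib", ["mod_example_news"]), while names without a matching file are skipped.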

There already exist a number of Python libraries for web scraping, so why should you consider this application for scraping news? The reason is that it has been built specifically for sourcing news and has several useful features. Some of the notable ones are:

  • Built-in NLP models for keyword extraction

  • Text de-duplication using a deep learning NLP model

  • Text tone classification using a deep learning NLP model to indicate positive, neutral or negative news

  • Extensible data processing plugins to customize the data processing required after web scraping

  • Multi-threaded for scraping several news sites in parallel

  • Includes a data processing pipeline, configured by defining the execution order of the data-processing plugins

  • Processes multiple news items in parallel to speed up processing of thousands of news items

  • Extensible with custom plugins that can be rapidly written with minimal additional code to support additional news sources. Writing a new plugin does not need writing low level code to handle network traffic and HTTP protocols.

  • Rigorously tested for the specific websites enabled in the plugins; handles several quirks and formatting problems caused by inconsistent and non-standard HTML code.

  • Rigorous text cleaning tested for each of the sites implemented

  • Reduces network traffic, and consequently web server load, by pausing between network requests. High traffic loads are usually detected and blocked by web servers; pausing between requests avoids overloading the news web servers.

  • Keeps track of failures and the history of sites scraped to avoid re-visiting them

  • Completely configurable functionality

  • Works with proxy servers

  • Enterprise ready functionality - configurable event logging, segregation of data storage locations vs. program executables, minimum permissions to run the executable, etc.

  • Runnable without a frontend, as a daemon.

  • Enables web-scraping news archives to get news from previous dates for establishing history for analysis

  • Saves the current session state and resumes downloading unfinished URLs if the application is shut down midway through web scraping
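Three of the features listed above (pausing between requests, keeping a history of scraped and failed URLs, and resuming an interrupted session) can be illustrated together. The sketch below is an assumption-laden illustration, not the application's implementation: the class ScrapeSession, its JSON state file layout and the injected fetch callable are all hypothetical.

```python
import json
import time
from pathlib import Path


class ScrapeSession:
    """Hypothetical sketch: tracks pending, done and failed URLs in a JSON file."""

    def __init__(self, state_file, pause_seconds=2.0):
        self.state_file = Path(state_file)
        self.pause_seconds = pause_seconds
        self.pending = []
        self.done = set()
        self.failed = set()
        # Resume from an earlier session if a state file is present.
        if self.state_file.exists():
            saved = json.loads(self.state_file.read_text(encoding="utf-8"))
            self.pending = saved["pending"]
            self.done = set(saved["done"])
            self.failed = set(saved["failed"])

    def add(self, url):
        """Queue a URL unless it is already queued or was handled before."""
        if url in self.done or url in self.failed or url in self.pending:
            return
        self.pending.append(url)

    def save(self):
        """Write the current session state to disk."""
        state = {
            "pending": self.pending,
            "done": sorted(self.done),
            "failed": sorted(self.failed),
        }
        self.state_file.write_text(json.dumps(state), encoding="utf-8")

    def run(self, fetch, max_items=None):
        """Fetch queued URLs one at a time, pausing between requests."""
        processed = 0
        while self.pending and (max_items is None or processed < max_items):
            if processed:
                time.sleep(self.pause_seconds)  # be gentle on the web server
            url = self.pending.pop(0)
            try:
                fetch(url)
                self.done.add(url)
            except Exception:
                self.failed.add(url)  # remembered so it is not retried blindly
            self.save()  # state survives a shutdown after any item
            processed += 1
        return processed
```

Because the state file is rewritten after every URL, a new ScrapeSession created with the same file continues with whatever was still pending, and add() ignores URLs already recorded as done or failed. The fetch argument is any callable that takes a URL, which keeps the sketch independent of a particular HTTP library.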

Source Distribution

NewsLookout-2.1.0.tar.gz (262.4 kB view details)

Uploaded Source

Built Distribution

NewsLookout-2.1.0-py3-none-any.whl (161.6 kB view details)

Uploaded Python 3

File details

Details for the file NewsLookout-2.1.0.tar.gz.

File metadata

  • Download URL: NewsLookout-2.1.0.tar.gz
  • Upload date:
  • Size: 262.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for NewsLookout-2.1.0.tar.gz
Algorithm Hash digest
SHA256 904c719316cf0c3d60ee99f2afd118eaa774a3a331394f443b6b857c659183b1
MD5 44b1b710165fceaf1a3edf8b9a19bd23
BLAKE2b-256 a38bc66d1ed267ba6721f54528853b10b07df94ce7813ff6784284679301bc88

File details

Details for the file NewsLookout-2.1.0-py3-none-any.whl.

File metadata

  • Download URL: NewsLookout-2.1.0-py3-none-any.whl
  • Upload date:
  • Size: 161.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for NewsLookout-2.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0c9820ea6a99deb08e86bb0da2694992b47999ddd0ef9314b876c80ff7f33bfe
MD5 2ab7b45632c6233ca2baa725714d1a33
BLAKE2b-256 e00449ae535dba20cfb3aeaf1a6d461a4274717aefd0461a15b33d4201ff2436
