Monitor changes to webpages in RSS feeds

diffengine is a utility for watching RSS feeds to see when story content changes. When new content is found a snapshot is saved at the Internet Archive, and a diff is generated for sending to social media. The hope is that it can help draw attention to the way news is being shaped on the web. It also creates a database of changes over time that can be useful for research purposes.
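To make the idea of a "diff" between two versions of a story concrete, here is a minimal sketch using Python's standard difflib module. It is only an illustration, not diffengine's own diffing code; the helper name `summarize_change` and the sample story text are hypothetical.

```python
import difflib


def summarize_change(old_text, new_text):
    """Return unified-diff lines between two snapshots (hypothetical helper)."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Made-up story text, used only to show what a change looks like.
old = "Mayor says budget is balanced.\nVote is set for Tuesday."
new = "Mayor says budget is nearly balanced.\nVote is set for Tuesday."

diff_lines = summarize_change(old, new)
print("\n".join(diff_lines))
```

Lines starting with `-` were removed and lines starting with `+` were added; identical snapshots produce no diff lines at all.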

diffengine draws heavily on NYTDiff and NewsDiffs, which almost did what we wanted. NYTDiff is able to create presentable diff images and tweet them, but was designed to work specifically with the NYTimes API. NewsDiffs provides a comprehensive framework for watching changes on multiple sites (Washington Post, New York Times, CNN, BBC, etc.), but you need to be a programmer to add a parser module for a website that you want to monitor. It is also a full-on website, which involves some commitment to install and run.

With the help of feedparser, diffengine takes a different approach by working with any site that publishes an RSS feed of changes. This covers many news organizations, but also personal blogs and organizational websites that put out regular updates. And with the readability module, diffengine is able to automatically extract the primary content of pages, without requiring special parsing to remove boilerplate material. And like NYTDiff, instead of creating another website for people to watch, diffengine pushes updates out to social media where people are already, while also building a local database of diffs that can be used for research purposes.
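As a rough picture of why an RSS feed is enough to discover pages to watch, the sketch below lists the title and link of each entry in a feed. It uses the standard library's xml.etree.ElementTree as a stand-in for feedparser, handles only simple RSS 2.0, and does not show diffengine's actual code; `SAMPLE_RSS` and `entry_links` are hypothetical names with made-up data.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.org/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.org/second</link>
    </item>
  </channel>
</rss>
"""


def entry_links(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


for title, link in entry_links(SAMPLE_RSS):
    print(title, link)
```

Each link is a page whose content could then be fetched and compared against the previous snapshot.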

Install

  1. install GeckoDriver
  2. install Python 3
  3. pip3 install --process-dependency-links diffengine

Run

To run diffengine you need to pick a directory where it can store its configuration, database and diffs. For example, I use a directory in my home directory (/home/ed/.diffengine), but any location works as long as you can write to it.

The first time you run diffengine it will prompt you to enter an RSS or Atom feed URL to monitor and will authenticate with Twitter.

% diffengine /home/ed/.diffengine 

What RSS/Atom feed would you like to monitor? https://inkdroid.org/feed.xml

Would you like to set up tweeting edits? [Y/n] Y

Go to https://apps.twitter.com and create an application.

What is the consumer key? <TWITTER_APP_KEY>

What is the consumer secret? <TWITTER_APP_SECRET>

Log in to https://twitter.com as the user you want to tweet as and hit enter.

Visit https://api.twitter.com/oauth/authorize?oauth_token=NRW9BQAAAAAAyqBnAAXXYYlCL8g

What is your PIN: 8675309

Saved your configuration in /home/ed/.diffengine/config.yaml

Fetching initial set of entries.

Done!

After that you just need to put diffengine in your crontab to have it run regularly, or you can run it manually at your own intervals if you want. Here's my crontab to run every 30 minutes to look for new content.

0,30 * * * * /usr/local/bin/diffengine /home/ed/.diffengine

You can examine your config file at any time and add/remove feeds as needed. It is the config.yaml file that is stored relative to the storage directory you chose, so in my case /home/ed/.diffengine/config.yaml.

Logs can be found in diffengine.log in the storage directory, for example /home/ed/.diffengine/diffengine.log.
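If you script around diffengine, the two file locations described above can be derived from the storage directory. This is a small sketch under that assumption; `storage_files` is a hypothetical helper, not part of diffengine.

```python
from pathlib import PurePosixPath


def storage_files(storage_dir):
    """Map a storage directory to the config and log paths named above."""
    base = PurePosixPath(storage_dir)
    return {
        "config": base / "config.yaml",
        "log": base / "diffengine.log",
    }


files = storage_files("/home/ed/.diffengine")
print(files["config"])
print(files["log"])
```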

Examples

Check out Ryan Baumann's "diffengine" Twitter list for known diffengine Twitter accounts.

Multiple Accounts & Feed Implementation Example

If you are setting up multiple accounts and multiple feeds, it may be helpful to set up a directory for each account. For example:

  • Toronto Sun /home/nruest/.torontosun
  • Toronto Star /home/nruest/.torontostar
  • Globe & Mail /home/nruest/.globemail
  • Canadaland /home/nruest/.canadaland
  • CBC /home/nruest/.cbc

Then you will configure a cron entry for each account:

0,15,30,45 * * * * /usr/bin/flock -xn /tmp/globemail.lock -c "/usr/local/bin/diffengine /home/nruest/.globemail"
0,15,30,45 * * * * /usr/bin/flock -xn /tmp/torontosun.lock -c "/usr/local/bin/diffengine /home/nruest/.torontosun"
0,15,30,45 * * * * /usr/bin/flock -xn /tmp/cbc.lock -c "/usr/local/bin/diffengine /home/nruest/.cbc"
0,15,30,45 * * * * /usr/bin/flock -xn /tmp/lapresse.lock -c "/usr/local/bin/diffengine /home/nruest/.lapresse"
0,15,30,45 * * * * /usr/bin/flock -xn /tmp/calgaryherald.lock -c "/usr/local/bin/diffengine /home/nruest/.calgaryherald"
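Each entry wraps diffengine in flock with `-x` (exclusive lock) and `-n` (do not wait), so a run is skipped if the previous run for that account is still holding the lock. With many accounts, the entries can be generated rather than typed by hand. The sketch below reproduces the format shown above; `cron_line` is a hypothetical helper, and the binary paths are taken from the example and may differ on your system.

```python
def cron_line(storage_dir, minutes=(0, 15, 30, 45),
              diffengine="/usr/local/bin/diffengine",
              flock="/usr/bin/flock"):
    """Build one crontab entry that runs diffengine under an flock lock."""
    # Derive the lock name from the directory: /home/nruest/.cbc -> cbc
    name = storage_dir.rstrip("/").rsplit("/", 1)[-1].lstrip(".")
    schedule = ",".join(str(m) for m in minutes)
    return (
        f"{schedule} * * * * {flock} -xn /tmp/{name}.lock "
        f'-c "{diffengine} {storage_dir}"'
    )


for directory in ("/home/nruest/.globemail", "/home/nruest/.cbc"):
    print(cron_line(directory))
```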

If there are multiple feeds for an account, you can set up the config.yaml like so:

feeds:
- name: The Globe and Mail - Report on Business
  twitter:
    access_token: ACCESS_TOKEN
    access_token_secret: ACCESS_TOKEN_SECRET
  url: http://www.theglobeandmail.com/report-on-business/?service=rss
- name: The Globe and Mail - Opinion
  twitter:
    access_token: ACCESS_TOKEN
    access_token_secret: ACCESS_TOKEN_SECRET
  url: http://www.theglobeandmail.com/opinion/?service=rss
- name: The Globe and Mail - News
  twitter:
    access_token: ACCESS_TOKEN
    access_token_secret: ACCESS_TOKEN_SECRET
  url: http://www.theglobeandmail.com/news/?service=rss
twitter:
  consumer_key: CONSUMER_KEY
  consumer_secret: CONSUMER_SECRET
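Because every feed carries its own access tokens while the consumer key and secret are shared, it is easy to leave one out when editing by hand. The sketch below checks a config that has already been loaded into a Python dict. It assumes the feed list sits under a top-level `feeds` key and that every feed is meant to tweet; `check_config` is a hypothetical helper, not part of diffengine.

```python
def check_config(config):
    """Return a list of problems found in a diffengine-style config dict."""
    problems = []

    # Shared application credentials live at the top level.
    app = config.get("twitter", {})
    for key in ("consumer_key", "consumer_secret"):
        if not app.get(key):
            problems.append(f"twitter.{key} is missing")

    # Each feed needs a URL and its own access tokens (assumed here).
    for i, feed in enumerate(config.get("feeds", [])):
        label = feed.get("name") or f"feed #{i}"
        if not feed.get("url"):
            problems.append(f"{label}: url is missing")
        tokens = feed.get("twitter", {})
        for key in ("access_token", "access_token_secret"):
            if not tokens.get(key):
                problems.append(f"{label}: twitter.{key} is missing")
    return problems


example = {
    "feeds": [
        {
            "name": "The Globe and Mail - News",
            "url": "http://www.theglobeandmail.com/news/?service=rss",
            "twitter": {
                "access_token": "ACCESS_TOKEN",
                "access_token_secret": "ACCESS_TOKEN_SECRET",
            },
        }
    ],
    "twitter": {
        "consumer_key": "CONSUMER_KEY",
        "consumer_secret": "CONSUMER_SECRET",
    },
}

print(check_config(example))
```

An empty list means nothing obvious is missing; it does not verify that the credentials are valid.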

Develop

Here's how to get started hacking on diffengine with pipenv:

% git clone https://github.com/docnow/diffengine 
% cd diffengine
% pipenv install
% pytest
============================= test session starts ==============================
platform linux -- Python 3.5.2, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /home/ed/Projects/diffengine, inifile:
collected 5 items

test_diffengine.py .....

=========================== 5 passed in 8.09 seconds ===========================

