collective.linkcheck

Add-on for Plone that provides link validity checking and reporting.

Environment
- Web Environment
Framework
License
- OSI Approved :: GNU General Public License v2 (GPLv2)
Operating System
- OS Independent
Programming Language
- Python
- Python :: 2.7

Project description

The system is an integrated solution, with a headless instance processing items in the background.

ZEO or other shared storage is required.

Compatibility: Plone 4+, Plone 5

Setup

Add the package to your buildout, then install the add-on from inside Plone.

Next, set up an instance to run the link checking processor. This can be an existing instance, or a separate one:

$ bin/instance linkcheck

This process should always be running, but may be stopped and started at any time without data loss.

Control panel

Once the system is up and running, interaction happens through the “Link validity” control panel:

http://localhost:8080/site/@@linkcheck-controlpanel

It’s available from Plone’s control panel overview.

Reporting

The report tab lists current problems.

Notification

An alert system is provided in the form of an RSS-feed:

http://localhost:8080/site/++linkcheck++feed.rss

Note that this view requires the “Manage portal” permission. To allow integration with other services, a self-authenticating link is available from the report screen:

[Image: RSS-feed available. Click the orange icon]

To set up e-mail notification, configure an RSS-driven newsletter with this feed and adjust the frequency to match the update interval (e.g. every morning). Many newsletter services support this; MailChimp makes it very easy.
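
If you would rather consume the feed from your own script than from a newsletter service, the following is a minimal sketch of reading feed entries with the Python 3 standard library. It assumes the feed follows the usual RSS 2.0 layout (rss/channel/item with title and link elements); the function name and the sample document are made up for illustration and are not part of the add-on.

```python
import xml.etree.ElementTree as ET


def parse_feed_items(rss_text):
    """Return a list of (title, link) tuples from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    # Assumption: entries live at rss/channel/item, as in plain RSS 2.0.
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title.strip(), link.strip()))
    return items


# Invented sample input; a real feed would be fetched through the
# self-authenticating link shown on the report screen.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Link validity</title>
    <item>
      <title>Broken link</title>
      <link>http://example.com/missing</link>
    </item>
  </channel>
</rss>"""

items = parse_feed_items(SAMPLE)
```

Each tuple can then be passed on to whatever notification channel you use.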

Settings

The settings tab on the control panel provides configuration for concurrency level, checking interval and link expiration, as well as statistics about the number of links that are currently active and the queue size.

There is also a setting available that lets the processor use the publisher to test internal link validity (at the cost of additional system resources). If this mode is enabled, the processor will attempt to publish the internal link and check that the response is good.

From the control panel you can also crawl the entire site for broken links. You can constrain the content that is checked by type and workflow state. Beware that this can take a very long time!

Export

You can export the report about broken links in various formats. Call @@linkcheck-export?export_type=csv for the export. Supported formats are: csv, xlsx, xls, tsv, yaml, html and json.
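
As a small client-side convenience, here is a sketch of building the export URL in Python 3 and rejecting formats that are not in the list above. The helper name export_url and the site URL are assumptions for illustration; only the view name, the export_type parameter and the format list come from this document.

```python
from urllib.parse import urlencode

# Formats listed in this README.
SUPPORTED_FORMATS = ("csv", "xlsx", "xls", "tsv", "yaml", "html", "json")


def export_url(site_url, export_type="csv"):
    """Build the @@linkcheck-export URL for a site (hypothetical helper)."""
    if export_type not in SUPPORTED_FORMATS:
        raise ValueError("unsupported export type: %r" % (export_type,))
    query = urlencode({"export_type": export_type})
    return "%s/@@linkcheck-export?%s" % (site_url.rstrip("/"), query)


url = export_url("http://localhost:8080/site", "json")
```

Note that the request still has to be made by an authenticated user with access to the report.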

How does it work?

When the add-on is installed, Plone will pass each HTML response through a collection step that keeps track of:

- The status code of outgoing HTML responses;
- The hyperlinks which appear in the response body, if available.

This happens very quickly. The lxml library is used to parse and search the response document for links.
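
To give a rough idea of what the collection step records, here is a simplified sketch. It is not the add-on's code: it uses the standard library's html.parser instead of lxml so that it runs without extra dependencies, and the names LinkCollector and collect are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without an href (e.g. named anchors) are skipped.
            if name == "href" and value:
                self.links.append(value)


def collect(status_code, body):
    """Return (status_code, links) for one rendered HTML response."""
    parser = LinkCollector()
    parser.feed(body)
    parser.close()
    return status_code, parser.links


result = collect(
    200,
    '<p><a href="http://example.com/">x</a> <a name="top">y</a> '
    '<a href="/about">z</a></p>',
)
```

The point of the design is that both pieces of information are already at hand when a page is rendered, so recording them costs very little.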

The benefit of the approach is that we don’t need to spend additional resources to check the validity of pages that we’ve already rendered.

There’s an assumption here that the site is visited regularly and exhaustively by a search robot or other crawling service. This is typically true for a public site.

Link status

A good status is either 200 OK or 302 Moved Temporarily. A neutral status applies to a link that was good but has turned bad, or that has not been checked yet. A bad status is everything else, including 301 Moved Permanently.

In any case, the status of an external link is updated at most once per configured interval (24 hours by default).
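
The rules above can be sketched as follows. This is one reading of the description, not the add-on's implementation: the function names, the was_good flag and the way a neutral state is derived are assumptions made for illustration.

```python
from datetime import datetime, timedelta

GOOD_CODES = frozenset({200, 302})


def classify(status_code, was_good=False):
    """Return 'good', 'neutral' or 'bad' for a link (hypothetical helper)."""
    if status_code is None:
        return "neutral"  # not checked yet
    if status_code in GOOD_CODES:
        return "good"
    if was_good:
        return "neutral"  # assumption: a good link that has just turned bad
    return "bad"  # everything else, including 301


def is_due(last_checked, now, interval=timedelta(hours=24)):
    """Tell whether a link may be checked again after the given interval."""
    if last_checked is None:
        return True
    return now - last_checked >= interval


status = classify(301)
due = is_due(datetime(2017, 1, 1, 12), datetime(2017, 1, 2, 0))
```

How long a link stays neutral before it is reported as bad is not covered here; that depends on the settings described earlier.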

History

Link validity checking was previously core functionality in Plone, but starting from the 4.x series there is no such capability. It has been proposed to bring it back into the core (see PLIP #10987), but the idea has since been retired.

There’s a third-party product available, gocept.linkchecker, which relies on a separate process written in the Grok framework to perform external link checking. It communicates with Plone via XML-RPC. There’s a Plone 4 compatibility branch available. This product demands significantly more resources (both CPU and memory) because it publishes all internal links at a regular interval.

Performance

In the default configuration, the system should not incur significant overhead.

That said, we’ve put the data into a Zope 2 tool to allow easily mounting it into a separate database.

Keeping a separate database for updates

Using the plone.recipe.zope2instance recipe for buildout, this is how you would configure a mount point for a Plone site located at /site:

zope-conf-additional =
    <zodb_db linkcheck>
       mount-point /site/portal_linkcheck
       container-class collective.linkcheck.tool.LinkCheckTool
       <zeoclient>
         server ${zeo:zeo-address}
         storage linkcheck
       </zeoclient>
    </zodb_db>

This should match a plone.recipe.zeoserver part:

zeo-conf-additional =
    <filestorage linkcheck>
      path ${buildout:directory}/var/filestorage/linkcheck.fs
    </filestorage>

Note that you must add the mount point using the ZMI before installing the add-on for it to work.

License

GPLv3 (http://www.gnu.org/licenses/gpl.html).

Author

Malthe Borch <mborch@gmail.com>

Contributors

Malthe Borch, mborch@gmail.com
Philip Bauer, bauer@starzel.de
Jörg Kubaile
lewicki
petschki
Toni Fischer

Changes

1.5 (2017-10-10)

Update German translations [pbauer]
Clear before crawling site [pbauer]
Catch error when auth is empty [pbauer]
Add fallbacks when trying to remove empty entries [pbauer]

1.4.dev1_gww (2017-03-22)

Nothing changed yet.

1.3.dev1_gww (2017-03-22)

Allow exporting the report about broken links in various formats. Call @@linkcheck-export?export_type=json for the view. Supported formats are csv, xlsx, xls, tsv, yaml, html and json. [pbauer]
Add a setting to select workflow-states to check on crawl and update. [pbauer]
Add timeout setting. [pbauer]
Allow recent versions of Requests. [pbauer]
Add a setting to select portal_types to check on crawl and update. [lewicki]
Add a view @@linkcheck to check links from object context [lewicki]
Add setting to disable event triggering on each request. [lewicki]
Handle mailto links [Jörg Kubaile]
Handle relative urls [Toni Fischer]
Add link to remove entry from the report list [Jörg Kubaile]
Added German translations [petschki]
Added .gitignore [petschki]
Add upgrade step for new registry entry [petschki]

1.2 (2012-11-22)

Fixed an issue where URLs containing unquotable characters would cause the control panel to break.
Discard anchor (#) and query string (?) URL components.
Resolve links with parent pointers (“../”) to avoid duplicate indexing.
Always enter run loop and routinely poll for new sites.
Fixed issue where the composite queue implementation would be used incorrectly.

1.1 (2012-06-25)

Don’t store path (location in document) information; it’s useless and it takes up too much disk space.
Added option to limit number of referers to store (default: 5).
Data structure optimization: use bucket-based data types when possible, and avoid copying strings (instead using an integer-based lookup table).

Note: Migration required. Please run the upgrade step.

1.0.2 (2012-06-15)

Add whitelist (ignore) option. This is a list of regular expressions that match on links to prevent them from being recorded by the tool.
Make report sortable.

1.0.1 (2012-05-10)

Quote URLs passed to the “Enqueue” action.
Added support for HEAD request.
Use gzip library to correctly read and decompress zlib-compressed responses.

1.0 (2012-05-10)

Initial public release.

Release history

1.5 (this version): Oct 10, 2017
1.2: Nov 22, 2012
1.1: Jun 25, 2012
1.0.2: Jun 15, 2012
1.0.1: May 10, 2012
1.0: May 8, 2012

Download files

Source Distribution

collective.linkcheck-1.5.tar.gz (39.7 kB)

Uploaded Oct 10, 2017 Source

File details

Details for the file collective.linkcheck-1.5.tar.gz.

File metadata

Download URL: collective.linkcheck-1.5.tar.gz
Upload date: Oct 10, 2017
Size: 39.7 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for collective.linkcheck-1.5.tar.gz
Algorithm	Hash digest
SHA256	`9abac892e88797e58a5bf4ad40d0960e00922c661ed8299342858c7b0aa95491`
MD5	`88bb3d51fa96959328db520b6fd8627d`
BLAKE2b-256	`f1847b52f9d4406048b5016650cad384cfc1406e765c715a2c8f41870ff55d5d`
