Project Description

This package facilitates the clustering of similar URLs of a website.

Live demo: http://urlclustering.com

General information

You give a list of URLs (preferably long and complete) as input, e.g.:

urls = [
    'http://example.com',
    'http://example.com/about',
    'http://example.com/contact',

    'http://example.com/cat/sports',
    'http://example.com/cat/tech',
    'http://example.com/cat/life',
    'http://example.com/cat/politics',

    'http://example.com/tag/623/tag1',
    'http://example.com/tag/335/tag2',
    'http://example.com/tag/671/tag3',

    'http://example.com/article/?id=1',
    'http://example.com/article/?id=2',
    'http://example.com/article/?id=3',
]

You get a list of clusters as a result. For each cluster you get:

  • a REGEX that matches all cluster URLs
  • a HUMAN readable string representing the cluster
  • a list with all matched cluster URLs

So for our example the result is:

REGEX: http://example.com/cat/([^/]+)
HUMAN: http://example.com/cat/[...]
URLS:
    http://example.com/cat/sports
    http://example.com/cat/tech
    http://example.com/cat/life
    http://example.com/cat/politics

REGEX: http://example.com/tag/(\d+)/([^/]+)
HUMAN: http://example.com/tag/[NUMBER]/[...]
URLS:
    http://example.com/tag/623/tag1
    http://example.com/tag/335/tag2
    http://example.com/tag/671/tag3

REGEX: http://example.com/article/?\?id=(\d+)
HUMAN: http://example.com/article?id=[NUMBER]
URLS:
    http://example.com/article/?id=1
    http://example.com/article/?id=2
    http://example.com/article/?id=3

UNCLUSTERED URLS:
    http://example.com
    http://example.com/about
    http://example.com/contact
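
As a quick sanity check of the output above, the reported regexes can be applied back to the input list with Python's standard `re` module. This is only a sketch: the helper name `assign_to_clusters` is hypothetical and not part of the package, and it assumes each REGEX is meant to match a whole URL.

```python
import re

# Regexes copied from the example output above.
CLUSTER_REGEXES = [
    r'http://example.com/cat/([^/]+)',
    r'http://example.com/tag/(\d+)/([^/]+)',
    r'http://example.com/article/?\?id=(\d+)',
]


def assign_to_clusters(urls, regexes):
    """Return ({regex: [urls]}, [unmatched urls]) using whole-string matches."""
    clusters = {regex: [] for regex in regexes}
    unclustered = []
    for url in urls:
        for regex in regexes:
            if re.fullmatch(regex, url):
                clusters[regex].append(url)
                break
        else:
            # No regex matched this URL.
            unclustered.append(url)
    return clusters, unclustered


urls = [
    'http://example.com',
    'http://example.com/about',
    'http://example.com/contact',
    'http://example.com/cat/sports',
    'http://example.com/cat/tech',
    'http://example.com/cat/life',
    'http://example.com/cat/politics',
    'http://example.com/tag/623/tag1',
    'http://example.com/tag/335/tag2',
    'http://example.com/tag/671/tag3',
    'http://example.com/article/?id=1',
    'http://example.com/article/?id=2',
    'http://example.com/article/?id=3',
]

clusters, unclustered = assign_to_clusters(urls, CLUSTER_REGEXES)
for regex, matched in clusters.items():
    print(regex, len(matched))
print('unclustered:', unclustered)
```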

When to use

This is most useful for website analysis tools that report findings to the user. For example, a service that crawls your website and reports page loading times may find that 10,000 pages take more than 2 seconds to load. Rather than listing all 10,000 URLs, it is better to cluster them, so that the end user sees something like:

Slow pages (>2 secs):
- http://example.com/                             (1 URL)
- http://example.com/sitemap                      (1 URL)
- http://example.com/search?q=[...]               (578 URLs)
- http://example.com/tags?tag1=[...]&tag2=[...]   (409 URLs)
- http://example.com/article?id=[NUMBER]          (7209 URLs)
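
A report like the one above could be produced from the HUMAN strings and the size of each cluster. The following is a minimal sketch; `format_report` and its input shape (a list of `(human_pattern, url_count)` pairs) are assumptions made for illustration, not something the package provides.

```python
def format_report(title, clusters):
    """Build a plain-text report from (human_pattern, url_count) pairs."""
    # Pad every pattern to the longest one so the counts line up.
    width = max((len(pattern) for pattern, _ in clusters), default=0)
    lines = [title]
    for pattern, count in clusters:
        noun = 'URL' if count == 1 else 'URLs'
        lines.append('- {}   ({} {})'.format(pattern.ljust(width), count, noun))
    return '\n'.join(lines)


report = format_report('Slow pages (>2 secs):', [
    ('http://example.com/', 1),
    ('http://example.com/sitemap', 1),
    ('http://example.com/search?q=[...]', 578),
    ('http://example.com/tags?tag1=[...]&tag2=[...]', 409),
    ('http://example.com/article?id=[NUMBER]', 7209),
])
print(report)
```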

How it works

URLs are first grouped by domain; only URLs from the same domain are clustered together.

URLs are then grouped by a signature: the number of path elements and the number of query string (QS) parameters and values the URL has.

Examples:
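
The two grouping steps described above can be sketched with the standard library. The function names `signature` and `group_urls` are hypothetical, and the counting rules are assumptions: empty path segments are ignored, each QS parameter counts as one part, and its value counts as another only when an `=` is present.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def signature(url):
    """Return (number of path elements, number of QS parameters and values)."""
    parts = urlsplit(url)
    # Assumption: empty segments (leading/trailing slashes) are not counted.
    path_elements = [p for p in parts.path.split('/') if p]
    qs_count = 0
    for pair in parts.query.split('&'):
        if not pair:
            continue
        _key, sep, _value = pair.partition('=')
        # One part for the parameter, one more if it carries a value.
        qs_count += 2 if sep else 1
    return len(path_elements), qs_count


def group_urls(urls):
    """Group URLs by domain first, then by signature."""
    groups = defaultdict(lambda: defaultdict(list))
    for url in urls:
        groups[urlsplit(url).netloc][signature(url)].append(url)
    return groups


print(signature('http://ex.com/about'))                # (1, 0)
print(signature('http://ex.com/article?123'))          # (1, 1)
print(signature('http://example.com/article/?id=1'))   # (1, 2)
print(signature('http://example.com/tag/623/tag1'))    # (3, 0)
```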

URLs with the same signature are inserted into a tree structure. For each part (path element, QS parameter, or QS value), two nodes are created:

  • One with the verbatim part.
  • One with the reduced part i.e. a regex that could replace the part.

Leaf nodes hold the number of URLs that match and the number of reductions.

E.g. inserting URL http://ex.com/article?123 will create 2 top nodes:

root 1: `article`
root 2: `[^/]+`

And each top node will have two children:

child 1: `123`
child 2: `\d+`
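
The node creation above amounts to taking, for every part, either the verbatim text or its reduction. A minimal sketch of that enumeration follows; `reduce_part` and `candidate_paths` are hypothetical names, and the reduction rules (digits become `\d+`, anything else becomes `[^/]+`) are assumed from the examples in this description.

```python
import re
from itertools import product


def reduce_part(part):
    """Return a regex that could replace the part (assumed rules)."""
    return r'\d+' if re.fullmatch(r'\d+', part) else r'[^/]+'


def candidate_paths(parts):
    """Yield (path, reductions) for every verbatim/reduced combination."""
    # For each part: (verbatim node, 0 reductions) or (reduced node, 1 reduction).
    choices = [((part, 0), (reduce_part(part), 1)) for part in parts]
    for combo in product(*choices):
        path = tuple(node for node, _ in combo)
        reductions = sum(flag for _, flag in combo)
        yield path, reductions


paths = dict(candidate_paths(['article', '123']))
for path, reductions in paths.items():
    print(' -> '.join(path), reductions)
```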

Inserting 3 URLs of the form /article/[0-9]+ would lead to a tree like this:

       `article`                        `[^/]+`
  /    /      \     \             /    /      \     \
`123`  `456`  `789`  `\d+`      `123`  `456`  `789`  `\d+`
1 URL  1 URL  1 URL  3 URLs     1 URL  1 URL  1 URL  3 URLs
0 re   0 re   0 re   1 re       1 re   1 re   1 re   2 re

The final step is to choose the best leaves. In this case article -> \d+ is best because it matches all 3 URLs with 1 reduction, so the cluster returned is http://ex.com/article/[NUMBER]
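
The leaf statistics and the final choice can be sketched as follows. This is not the package's implementation: the names `leaf_stats` and `best_leaf` are hypothetical, the tree is flattened into a dictionary keyed by leaf path, and the selection rule (most matching URLs first, then fewest reductions) is an assumption that happens to reproduce the example above.

```python
import re
from collections import Counter
from itertools import product


def reduce_part(part):
    """Return a regex that could replace the part (assumed rules)."""
    return r'\d+' if re.fullmatch(r'\d+', part) else r'[^/]+'


def leaf_stats(urls_parts):
    """Map each leaf path to (matching URL count, number of reductions)."""
    counts = Counter()
    reductions = {}
    for parts in urls_parts:
        options = [((p, 0), (reduce_part(p), 1)) for p in parts]
        for combo in product(*options):
            path = tuple(node for node, _ in combo)
            counts[path] += 1
            reductions[path] = sum(r for _, r in combo)
    return {path: (counts[path], reductions[path]) for path in counts}


def best_leaf(stats):
    """Assumed rule: most matching URLs first, then fewest reductions."""
    return max(stats, key=lambda path: (stats[path][0], -stats[path][1]))


# The three URLs of the form /article/[0-9]+ from the tree above.
stats = leaf_stats([['article', '123'], ['article', '456'], ['article', '789']])
best = best_leaf(stats)
print(best, stats[best])
```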

License

Copyright (c) 2015 Dimitris Giannitsaros.

Licensed under the MIT License.

Release History

  • 0.4.1 (this version)
  • 0.4
  • 0.3
  • 0.2
  • 0.1

Download Files

urlclustering-0.4.1.tar.gz (6.5 kB), Source, uploaded Oct 19, 2015
