Facilitate clustering of similar URLs of a website
Project description
This package facilitates the clustering of similar URLs of a website.
Live demo: http://urlclustering.com
General information
You give a (preferably long and complete) list of URLs as input e.g.:
urls = [ 'http://example.com', 'http://example.com/about', 'http://example.com/contact', 'http://example.com/cat/sports', 'http://example.com/cat/tech', 'http://example.com/cat/life', 'http://example.com/cat/politics', 'http://example.com/tag/623/tag1', 'http://example.com/tag/335/tag2', 'http://example.com/tag/671/tag3', 'http://example.com/article/?id=1', 'http://example.com/article/?id=2', 'http://example.com/article/?id=3', ]
You get a list of clusters as a result. For each cluster you get:
a REGEX that matches all cluster URLs
a HUMAN readable string representing the cluster
a list with all matched cluster URLs
So for our example the result is:
REGEX: http://example.com/cat/([^/]+) HUMAN: http://example.com/cat/[...] URLS: http://example.com/cat/sports http://example.com/cat/tech http://example.com/cat/life http://example.com/cat/politics REGEX: http://example.com/tag/(\d+)/([^/]+) HUMAN: http://example.com/tag/[NUMBER]/[...] URLS: http://example.com/tag/623/tag1 http://example.com/tag/335/tag2 http://example.com/tag/671/tag3 REGEX: http://example.com/article/?\?id=(\d+) HUMAN: http://example.com/article?id=[NUMBER] URLS: http://example.com/article/?id=1 http://example.com/article/?id=2 http://example.com/article/?id=3 UNCLUSTERED URLS: http://example.com http://example.com/about http://example.com/contact
When to use
This is most useful for website analysis tools that report findings to the user. E.g. a service that crawls your website and reports page loading time may find that 10,000 pages take >2 seconds to load. Instead of listing 10,000 URLs it’s better to cluster them. So the end user will see something like:
Slow pages (>2 secs): - http://example.com/ (1 URL) - http://example.com/sitemap (1 URL) - http://example.com/search?q=[...] (578 URLs) - http://example.com/tags?tag1=[...]&tag2=[...] (409 URLs) - http://example.com/article?id=[NUMBER] (7209 URLs)
How it works:
URLs are grouped by domain. Only same domain URLs are clustered.
URLs are then grouped by a signature which is the number of path elements and the number of QueryString parameters & values the URL has.
Examples:
http://ex.com/about has a signature of (1, 0)
http://ex.com/article?123 has a signature of (1, 1)
http://ex.com/path/to/file?par1=val1&par2=val2 has a signature of (3,4)
URLs with the same signature are inserted in a tree structure. For each part (path element or QS parameter or QS value) two nodes are created:
One with the verbatim part.
One with the reduced part i.e. a regex that could replace the part.
Leaf nodes hold the number of URLs that match and the number of reductions.
E.g. inserting URL http://ex.com/article?123 will create 2 top nodes:
root 1: `article` root 2: `[^/]+`
And each top node will have two children:
child 1: `123` child 2: `\d+`
Inserting 3 URLs of the form /article/[0-9]+ would lead to a tree like this:
`article` `[^/]+` / / \ \ / / \ \ `123` `456` `789` `\d+` `123` `456` `789` `\d+` 1 URL 1 URL 1 URL 3 URLs 1 URL 1 URL 1 URL 3 URLs 0 re 0 re 0 re 1 re 1 re 1 re 1 re 2 re
The final step is to choose the best leafs. In this case article -> \d+ is best because it macthes all 3 URLs with 1 reduction so the cluster returned is http://ex.com/article/[NUMBER]
License
Copyright (c) 2015 Dimitris Giannitsaros.
Licensed under the MIT License.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file urlclustering-0.4.1.tar.gz
.
File metadata
- Download URL: urlclustering-0.4.1.tar.gz
- Upload date:
- Size: 6.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 792270c0e10aca65d21f94343396c2719ff5c99ee7e24c3e1b7a8e991ebaafd3 |
|
MD5 | 20c52c9acd773a12c9d6763310d35a8c |
|
BLAKE2b-256 | 8e0d3dc97f51b1b043575ec613784076f71de430b7a72e8d97d9105f665d7a57 |