Cluster URLs and generate URL patterns automatically.
os-urlpattern
This package performs unsupervised clustering of URLs and generates a URL pattern (regex) for each cluster.
PyPy can also be used for better performance (4x-8x). Command-line tools are provided for standalone clustering and matching, and the APIs are also convenient. Several extra packages can be installed for additional features. Under CPython on one CPU, clustering 100 thousand URLs takes about 1 minute and 200 MB of memory. The built-in matching strategy is efficient enough for most use cases (about 3k URLs/s, depending on pattern complexity).
$ pip install -U os-urlpattern
$ wget -qO- 'https://git.io/f4QlP' | pattern-make
/[0-9]{2}[\.]html
http://example.com/01.html
http://example.com/02.html
http://example.com/03.html
/[0-9]{3}/test[0-9]{2}[\.]html
http://example.com/123/test01.html
http://example.com/456/test02.html
http://example.com/789/test03.html
Acknowledgement
Similar URLs
URLs with the same URL structure.
Components of the parsed URLs at the same position are in the same character space.
Corresponding components of different URLs have the same character space order.
URL structure
Typically, a URL can be parsed into 6 components:
<scheme>://<netloc>/<path>;<params>?<query>#<fragment>
Because different sites may have similar URL structures and <params> is rare, <scheme>, <netloc> and <params> are ignored; <path>, <query> and <fragment> are used to define the URL structure.
If URLs have the same number of path levels, the same query keys (in the same order) and the same fragment existence, their URL structures are the same.
http://example.com/p1/p2?k1=v1&k2=v2#pos
URL structure:
  path levels: 2
  query keys: k1, k2
  has fragment: True
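The structure of the example above can be sketched with the standard library; `url_structure` below is a hypothetical helper for illustration, not this package's API:

```python
from urllib.parse import parse_qsl, urlparse

def url_structure(url):
    # Hypothetical helper: summarize a URL as
    # (path levels, query keys in order, fragment existence).
    parsed = urlparse(url)
    path_levels = len([p for p in parsed.path.split('/') if p])
    query_keys = tuple(k for k, _ in parse_qsl(parsed.query, keep_blank_values=True))
    has_fragment = parsed.fragment != ''
    return (path_levels, query_keys, has_fragment)

# Two URLs from different sites share the same URL structure:
a = url_structure('http://example.com/p1/p2?k1=v1&k2=v2#pos')
b = url_structure('http://other.org/x/y?k1=9&k2=8#top')
assert a == b == (2, ('k1', 'k2'), True)
```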
Character space
Considering RFC 3986 (Section 2: Characters), a URL with the following characters is legal:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=
There are three major character spaces: lower-case letters (a-z), upper-case letters (A-Z) and digits (0-9). Every other symbol is its own character space.
HeLlOwoRd233!
character space: a-z A-Z 0-9 !
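The classification above can be sketched in a few lines (`char_space` is a hypothetical helper, not part of the package):

```python
def char_space(ch):
    # Hypothetical helper: map a character to its character space.
    if ch.islower():
        return 'a-z'
    if ch.isupper():
        return 'A-Z'
    if ch.isdigit():
        return '0-9'
    return ch  # every other symbol is its own character space

spaces = []
for ch in 'HeLlOwoRd233!':
    s = char_space(ch)
    if s not in spaces:  # keep first-seen order, no duplicates
        spaces.append(s)

assert spaces == ['A-Z', 'a-z', '0-9', '!']
```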
Character space order
Split a string by character space: consecutive characters from the same space form one segment.
HELLOword233!
split into: HELLO word 233 !
character space order: A-Z a-z 0-9 !
Consecutive segments from the major character spaces can be joined.
HellWorld233!
split into: H ell W orld 233 !
major join: HellWorld233 !
character space order: A-Za-z0-9 !
Because of URL percent-encoding, '%' can also be joined with the major character spaces.
%E4%BD%A0%E5%A5%BD!
split into: % E 4 % BD % A 0 % E 5 % A 5 % BD !
major join: %E4%BD%A0%E5%A5%BD !
character space order: A-Z0-9% !
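The splitting and joining rules above can be sketched like this (`split_runs` and `major_join` are hypothetical helpers, not the package's API):

```python
from itertools import groupby

MAJOR = set('abcdefghijklmnopqrstuvwxyz'
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            '0123456789%')  # '%' joins because of URL percent-encoding

def _space(ch):
    # Character space of a single character.
    if ch.islower():
        return 'a-z'
    if ch.isupper():
        return 'A-Z'
    if ch.isdigit():
        return '0-9'
    return ch

def split_runs(s):
    # Split into maximal runs of the same character space.
    return [''.join(g) for _, g in groupby(s, key=_space)]

def major_join(s):
    # Join consecutive runs whose characters are in major spaces (or '%').
    joined = []
    for run in split_runs(s):
        if joined and run[0] in MAJOR and joined[-1][-1] in MAJOR:
            joined[-1] += run
        else:
            joined.append(run)
    return joined

assert split_runs('HELLOword233!') == ['HELLO', 'word', '233', '!']
assert major_join('HellWorld233!') == ['HellWorld233', '!']
assert major_join('%E4%BD%A0%E5%A5%BD!') == ['%E4%BD%A0%E5%A5%BD', '!']
```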
URL Pattern
URL Pattern is used to express each cluster. It is a normal regex string; every URL in the cluster matches its pattern.
pattern examples:

/news/[0-9]{8}/[a-z]+[\.]html
/newsShow[\.]asp[\?]dataID=[0-9]+
/thread[\-][0-9]+[\-][0-9][\-]1[\.]html
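Since each pattern is a normal regex string, it can be checked with Python's re directly (the paths below are illustrative):

```python
import re

# One of the example patterns above, matched with plain re.
pattern = r'/news/[0-9]{8}/[a-z]+[\.]html'

assert re.fullmatch(pattern, '/news/20180712/abc.html')
assert re.fullmatch(pattern, '/news/2018/abc.html') is None  # only 4 digits
```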
The built-in matching strategy is strict: it does not tolerate incomplete matching.

string: helloword
pattern01: [a-z0-9]+  # no match: the string contains no digit
pattern02: [a-z]+     # match
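This strictness differs from plain regex semantics, where [a-z0-9]+ would match helloword. A minimal sketch of the idea (`strict_match` is a hypothetical helper, not the package's matcher):

```python
def strict_match(string, spaces):
    # Sketch of the strict idea (hypothetical helper): the string must
    # use *every* declared character space, not merely stay inside them
    # as plain regex semantics would allow.
    used = set()
    for ch in string:
        if ch.islower():
            used.add('a-z')
        elif ch.isupper():
            used.add('A-Z')
        elif ch.isdigit():
            used.add('0-9')
        else:
            used.add(ch)
    return used == set(spaces)

assert strict_match('helloword', ['a-z', '0-9']) is False  # no digit present
assert strict_match('helloword', ['a-z']) is True
```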
Install
Install with pip
$ pip install os-urlpattern
Install extra packages
subpackage | install command | enables
---|---|---
memory | pip install os-urlpattern[memory] | Show memory usage
ete-tree | pip install os-urlpattern[ete_tree] | Enable ete pattern tree formatter
Usage
Command line
pattern-make
Load URLs, cluster them and dump patterns.
$ pattern-make -h
usage: pattern-make [-h] [-f FILE [FILE ...]]
                    [-L {NOTSET,DEBUG,INFO,WARN,ERROR,FATAL}]
                    [-c CONFIG [CONFIG ...]]
                    [-F {PATTERN,CLUSTER,JSON,ETE,INLINE,NULL}]

optional arguments:
  -h, --help            show this help message and exit
  -f FILE [FILE ...], --file FILE [FILE ...]
                        file to be processed (default: stdin)
  -L {NOTSET,DEBUG,INFO,WARN,ERROR,FATAL}, --loglevel {NOTSET,DEBUG,INFO,WARN,ERROR,FATAL}
                        log level (default: NOTSET)
  -c CONFIG [CONFIG ...], --config CONFIG [CONFIG ...]
                        config file
  -F {PATTERN,CLUSTER,JSON,ETE,INLINE,NULL}, --formatter {PATTERN,CLUSTER,JSON,ETE,INLINE,NULL}
                        output formatter (default: CLUSTER)
Dump clustered URLs with patterns:
$ cat urls.txt | pattern-make -L debug > clustered.txt
Only generate URL Pattern:
$ cat urls.txt | pattern-make -L debug -F pattern > patterns.txt
Generate a pattern tree from URLs (requires the ete extra):
$ cat urls.txt | pattern-make -L debug -F ete
pattern-match
Load patterns, match URLs and dump the results.
$ pattern-match -h
usage: pattern-match [-h] [-f FILE [FILE ...]]
                     [-L {NOTSET,DEBUG,INFO,WARN,ERROR,FATAL}]
                     -p PATTERN_FILE [PATTERN_FILE ...]

optional arguments:
  -h, --help            show this help message and exit
  -f FILE [FILE ...], --file FILE [FILE ...]
                        file to be processed (default: stdin)
  -L {NOTSET,DEBUG,INFO,WARN,ERROR,FATAL}, --loglevel {NOTSET,DEBUG,INFO,WARN,ERROR,FATAL}
                        log level (default: NOTSET)
  -p PATTERN_FILE [PATTERN_FILE ...], --pattern-file PATTERN_FILE [PATTERN_FILE ...]
                        pattern file to be loaded
Match URLs:
$ cat urls.txt | pattern-match -L debug -p patterns.txt
APIs
Cluster and generate URL Pattern:
from os_urlpattern.formatter import pformat
from os_urlpattern.pattern_maker import PatternMaker

pattern_maker = PatternMaker()

# load URLs (unicode)
for url in urls:
    pattern_maker.load(url)

# cluster and print patterns
for url_meta, clustered in pattern_maker.make():
    for pattern in pformat('pattern', url_meta, clustered):
        print(pattern)
Match URLs:
from os_urlpattern.pattern_matcher import PatternMatcher

pattern_matcher = PatternMatcher()

# load URL patterns (unicode)
for url_pattern in url_patterns:
    # meta will be bound to the matched result
    pattern_matcher.load(url_pattern, meta=url_pattern)

# match URLs (unicode)
for url in urls:
    matched_results = pattern_matcher.match(url)
    # the best matched result:
    # sorted(matched_results, reverse=True)[0]
    patterns = [n.meta for n in matched_results]
Low-level APIs:
Low-level APIs are necessary for customizing the processing procedure, especially for parallel computing or running on a distributed cluster (e.g. Hadoop).
Key point: pieces with the same fuzzy digest go to the same maker and the same matcher.
Use os_urlpattern.parser.fuzzy_digest to get the fuzzy digest of a URL, a URL pattern, or parsed URLMeta and parsed pieces.
from os_urlpattern.formatter import pformat
from os_urlpattern.parser import fuzzy_digest, parse
from os_urlpattern.pattern_maker import Maker

makers = {}

# load URLs (unicode)
for url in urls:
    url_meta, parsed_pieces = parse(url)
    # same digest, same maker
    digest = fuzzy_digest(url_meta, parsed_pieces)
    if digest not in makers:
        makers[digest] = Maker(url_meta)  # not PatternMaker
    makers[digest].load(parsed_pieces)

# iterate over the makers, cluster and print patterns
for maker in makers.values():
    for clustered in maker.make():
        for pattern in pformat('pattern', maker.url_meta, clustered):
            print(pattern)
Unit Tests
$ tox
License
MIT licensed.