URL canonicalization library for Python and Java

A URL canonicalization (normalization) library for Python and Java.

It currently provides:

  • A URL parser which preserves the input bytes exactly

  • A precanned canonicalization ruleset that tries to match the normalization implicit in the parsing rules used by browsers

  • An alternative URL serialization suitable for sorting and prefix-matching: SSURT.
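To illustrate why the SSURT form in the last item lends itself to sorting and prefix-matching, here is a minimal sketch that builds an SSURT-like key using only the Python standard library. The function name `ssurt_like` is hypothetical and is not part of urlcanon. The sketch assumes a plain `scheme://host/path` URL, works on `str` rather than bytes, and ignores ports, userinfo, query strings, fragments, and IP-address hosts, all of which the real serialization has to handle.

```python
from urllib.parse import urlsplit


def ssurt_like(url: str) -> str:
    """Build a simplified SSURT-like key (hypothetical helper, not urlcanon's code).

    Only the scheme, host, and path are used; everything else is ignored.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Reverse the host labels so the most significant label (the TLD) comes
    # first, and end with a comma so "example" never prefix-matches "examples".
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{reversed_host}//:{parts.scheme}{parts.path}"


urls = [
    "http://b.example.com/x",
    "http://example.org/",
    "http://a.example.com/y",
]

# Sorting the keys groups URLs from the same domain next to each other.
keys = sorted(ssurt_like(u) for u in urls)
for key in keys:
    print(key)

# A plain string prefix now selects a whole domain, subdomains included.
print([k for k in keys if k.startswith("com,example,")])
```

Because the host labels are reversed, every URL under `example.com` shares the prefix `com,example,`, which is the property that makes a simple prefix comparison usable for scoping rules.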

Status: Stable, and in production use for some time, but there are no API or output stability guarantees yet. The Java and Python versions differ in features.

Examples

Python

>>> import urlcanon
>>> input_url = "http://///EXAMPLE.com:80/foo/../bar"
>>> parsed_url = urlcanon.parse_url(input_url)
>>> print(parsed_url)
http://///EXAMPLE.com:80/foo/../bar
>>> urlcanon.whatwg(parsed_url)
<urlcanon.parse.ParsedUrl object at 0x10eb13a58>
>>> print(parsed_url)
http://example.com/bar
>>> print(parsed_url.ssurt())
b'com,example,//:http/bar'
>>>
>>> rule = urlcanon.MatchRule(ssurt=b'com,example,//:http/bar')
>>> urlcanon.whatwg.rule_applies(rule, b'https://example..com/bar/baz')
False
>>> urlcanon.whatwg.rule_applies(rule, b'HTtp:////eXAMple.Com/bar//baz//..///quu')
True
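The session above turns `http://///EXAMPLE.com:80/foo/../bar` into `http://example.com/bar`. As a rough illustration of the kind of steps involved, the following standard-library sketch lowercases the scheme and host, skips extra slashes after the scheme, drops a default port, and resolves dot segments. It is not urlcanon's implementation: the names `canonicalize_sketch`, `resolve_dot_segments`, `URL_PATTERN`, and `DEFAULT_PORTS` are hypothetical, and the sketch leaves out percent-encoding, IDNA, userinfo, IPv6 hosts, backslashes, and everything else a browser-compatible canonicalizer must handle.

```python
import re

# Default ports that are dropped for their scheme (a small subset, for illustration).
DEFAULT_PORTS = {"http": "80", "https": "443"}

# scheme, ":", any run of slashes, authority, path, then untouched query/fragment.
URL_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):/*([^/?#]*)([^?#]*)(.*)$")


def resolve_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments, keeping empty segments as they are."""
    segments = path.split("/")[1:]
    resolved = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == "..":
            if resolved:
                resolved.pop()
            if is_last:
                resolved.append("")  # a trailing ".." leaves a trailing slash
        elif segment == ".":
            if is_last:
                resolved.append("")  # a trailing "." leaves a trailing slash
        else:
            resolved.append(segment)
    return "/" + "/".join(resolved)


def canonicalize_sketch(url: str) -> str:
    """Apply a few simplified normalization steps (illustrative only)."""
    match = URL_PATTERN.match(url)
    if match is None:
        raise ValueError(f"unsupported URL: {url!r}")
    scheme, authority, path, rest = match.groups()
    scheme = scheme.lower()
    # Assumes "host" or "host:port"; userinfo and IPv6 literals are not handled.
    host, _, port = authority.partition(":")
    host = host.lower()
    if port == DEFAULT_PORTS.get(scheme):
        port = ""
    netloc = f"{host}:{port}" if port else host
    return f"{scheme}://{netloc}{resolve_dot_segments(path)}{rest}"


print(canonicalize_sketch("http://///EXAMPLE.com:80/foo/../bar"))
```

For real use, prefer `urlcanon.whatwg` as shown in the session above; the sketch only mirrors the handful of transformations visible in that one example.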

Python releases are available on PyPI:

pip install urlcanon

Java

String inputUrl = "http://///EXAMPLE.com:80/foo/../bar";
ParsedUrl parsedUrl = ParsedUrl.parseUrl(inputUrl);

System.out.println(parsedUrl);
// http://///EXAMPLE.com:80/foo/../bar

Canonicalizer.WHATWG.canonicalize(parsedUrl);

System.out.println(parsedUrl);
// http://example.com/bar

System.out.println(parsedUrl.ssurt());
// "com,example,//:http/bar"

Java releases are available in the Maven Central repository:

<dependency>
    <groupId>org.netpreserve</groupId>
    <artifactId>urlcanon</artifactId>
    <version>0.1.1</version>
</dependency>

License

  • Copyright (C) 2016-2018 Internet Archive

  • Copyright (C) 2016-2017 National Library of Australia

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
