Fastest URL parser in the world
The fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-part suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
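The suffix handling described in the features above can be pictured with a small sketch. This is not the library's implementation: `KNOWN_SUFFIXES` is a tiny hypothetical stand-in for the public suffix list, and `split_host` and its result keys (`subdomain`, `name`, `suffix`) are names made up for illustration that may not match the library's properties exactly.

```python
# Hypothetical, tiny stand-in for the public suffix list (the real list is far larger).
KNOWN_SUFFIXES = {"ir", "ac.ir", "com", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, name, and suffix (illustrative only)."""
    labels = host.lower().strip(".").split(".")

    # Try the longest candidate suffix first so "ac.ir" wins over "ir".
    # Starting at index 1 keeps at least one label in front of the suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            suffix_start = i
            break
    else:
        # Unknown suffix: fall back to treating the last label as the suffix.
        suffix_start = len(labels) - 1

    return {
        "subdomain": ".".join(labels[: max(suffix_start - 1, 0)]),
        "name": labels[suffix_start - 1] if suffix_start > 0 else "",
        "suffix": ".".join(labels[suffix_start:]),
    }


print(split_host("ee.aut.ac.ir"))
print(split_host("google.comm"))
```

With this sketch, "ee.aut.ac.ir" is split on the known suffix "ac.ir", while "google.comm" falls back to "comm" because no entry in the set matches.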
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
With pip from PyPI
pip install liburlparser
If you want to use psl.update to update the public suffix list, you must install the online version:
pip install "liburlparser[online]"
Or
With pip from Git
pip install git+https://github.com/mohammadraziei/liburlparser
Or
Manually
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON
Python
liburlparser is intuitive to use, and every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
# Parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
# Parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# Each of these returns a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
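To make the Url property names more concrete, here is a rough comparison using only Python's standard library. It is a sketch for orientation, not liburlparser's behavior: `url_parts` is a hypothetical helper, the mapping from `urllib.parse` fields to the property names is an assumption, and `params` is left out because its exact meaning in liburlparser is not specified here.

```python
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Map standard-library URL fields onto the property names used above (assumed mapping)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        # Everything before the last "@" in the network location, if any.
        "userinfo": parts.netloc.rpartition("@")[0],
        "host": parts.hostname,
        "port": parts.port,  # int, or None when the URL has no explicit port
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(url_parts("https://user:pw@ee.aut.ac.ir:8080/about?x=1#id"))
```

The standard library stops at the host string; splitting that host into subdomain, domain, and suffix is the part that needs the public suffix list.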
There are also some helper APIs that give better performance for small tasks:
# If you only need the host of a URL as a string, without any further parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
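Pulling out the host as a plain string can be cheap because it needs no suffix lookup, only a few substring searches. The function below is a minimal pure-Python sketch of that idea, assuming a simple URL shape; it is not how liburlparser implements `Url.extract_host`, the name `extract_host_sketch` is hypothetical, and it does not handle IPv6 literals.

```python
def extract_host_sketch(url: str) -> str:
    """Return the host part of a URL using only string slicing (illustrative only)."""
    # Accept "://" as a scheme separator only if it comes before any "/", "?" or "#".
    scheme_end = url.find("://")
    if scheme_end != -1 and not any(c in url[:scheme_end] for c in "/?#"):
        start = scheme_end + 3
    else:
        start = 0

    # The authority ends at the first "/", "?" or "#" after the scheme.
    end = len(url)
    for sep in "/?#":
        pos = url.find(sep, start)
        if pos != -1:
            end = min(end, pos)
    authority = url[start:end]

    # Drop "user:password@" if present, then drop ":port" (not IPv6-aware).
    authority = authority.rpartition("@")[2]
    return authority.partition(":")[0]


print(extract_host_sketch("https://ee.aut.ac.ir/about"))
```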
If you are a fan of pydomainextractor, liburlparser provides a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# both libraries expose the same API
C++
There are more examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host = url.host();
// or
TLD::Host host = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, every method available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13, 2022)
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (Mar. 13, 2022)
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.10s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
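Timings like the ones in these tables can be reproduced in spirit with a small harness. The sketch below is not the benchmark script used for the tables: `time_extractor` and `naive_suffix` are hypothetical names, the input hosts are generated rather than read from a file, and the stand-in function would be swapped for the call you actually want to measure.

```python
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every input."""
    items = list(inputs)  # materialise first so input generation is not timed
    start = time.perf_counter()
    for item in items:
        extract(item)
    return time.perf_counter() - start


def naive_suffix(host: str) -> str:
    """Stand-in extractor: returns the last label of the host."""
    return host.rpartition(".")[2]


hosts = [f"sub{i}.example{i % 100}.com" for i in range(10_000)]
elapsed = time_extractor(naive_suffix, hosts)
print(f"{len(hosts)} hosts in {elapsed:.4f}s")
```

Passing each library's function through the same harness keeps the comparison on equal footing, since only the per-item call is timed.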
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: