Fastest Url parser in the world
Project description
A domain extractor library written in C++ with Python bindings; it was the fastest of the libraries benchmarked below.
The first complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Multiple programming languages supported, such as Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
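The suffix handling described in the feature list (known multi-label suffixes such as "ac.ir", plus a fallback for unknown ones such as "comm") can be illustrated with a minimal Python sketch. This is not liburlparser's implementation: `KNOWN_SUFFIXES`, `HostParts`, and `split_host` are hypothetical names, and the tiny suffix set only stands in for the real public suffix list, whose wildcard and exception rules are ignored here.

```python
from typing import NamedTuple

# Hand-picked stand-in for the public suffix list (assumption: the real list
# is far larger and also contains wildcard and exception rules).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


class HostParts(NamedTuple):
    subdomain: str
    name: str    # the label directly to the left of the suffix
    suffix: str


def split_host(host: str) -> HostParts:
    """Split a host on the longest known suffix, else on its last label."""
    labels = host.lower().strip(".").split(".")
    if len(labels) < 2:
        # A single label has no suffix to split off.
        return HostParts("", labels[0], "")

    # Default for unknown suffixes: treat the last label as the suffix.
    cut = len(labels) - 1
    # Try the longest candidate first; index 0 is skipped so that at least
    # one label always remains in front of the suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            cut = i
            break

    return HostParts(
        subdomain=".".join(labels[: cut - 1]),
        name=labels[cut - 1],
        suffix=".".join(labels[cut:]),
    )


print(split_host("ee.aut.ac.ir"))  # known multi-label suffix "ac.ir"
print(split_host("google.comm"))   # unknown suffix, falls back to "comm"
```

Trying the longest candidate first is what lets "ac.ir" win over the shorter "ir" when both are in the set.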
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] make documents:
make docs
# [Optional] Run examples:
./example
# Make install
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip from PyPI
pip install liburlparser
To use psl.update to update the public suffix list, install the online version:
pip install "liburlparser[online]"
Or
pip from Git
pip install git+https://github.com/mohammadraziei/liburlparser
Or
manually
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON
Python
liburlparser is designed to be intuitive to use.
Every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
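For readers who want to see what the Url property names refer to without installing the package, the following sketch uses only the standard library's `urllib.parse`. It is an illustration, not liburlparser's code: the mapping from the README's property names to `urlparse` fields is an assumption (in particular, "params" is taken here to mean the `;`-delimited path parameters), and liburlparser's actual output may differ.

```python
from urllib.parse import urlparse

# Example URL chosen to contain every component; it is made up for this sketch.
parts = urlparse("https://user:pw@mail.google.com:8080/about;lang=en?q=1#top")

# Assumed correspondence between the README's property names and urlparse fields.
components = {
    "protocol": parts.scheme,
    "userinfo": parts.netloc.rpartition("@")[0],  # empty string if no "@"
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value!r}")
```

Note that `urlparse` stops at the host string: it does not split the host into subdomain, domain, and suffix, which is the part that needs a public suffix list.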
There are also some helper APIs that give better performance for small tasks:
# if you need the host of a URL as a string, without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
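To show why pulling out the host as a plain string can be cheaper than a full parse, here is a minimal sketch that does it with string slicing only. `extract_host_sketch` is a hypothetical name and is not how `Url.extract_host` is implemented; it also assumes no bracketed IPv6 literals, which a real parser would have to handle.

```python
def extract_host_sketch(url: str) -> str:
    """Return the host part of `url` using plain string slicing."""
    # Skip the scheme separator if one is present.
    start = url.find("://")
    start = 0 if start == -1 else start + 3

    # The authority ends at the first "/", "?" or "#" after the scheme.
    end = len(url)
    for sep in "/?#":
        pos = url.find(sep, start)
        if pos != -1:
            end = min(end, pos)
    authority = url[start:end]

    # Drop userinfo ("user:pw@") and the port (":8080") if present.
    host = authority.rpartition("@")[2]
    return host.partition(":")[0]


print(extract_host_sketch("https://ee.aut.ac.ir/about"))
```

No suffix lookup or splitting into labels happens here, which is the work a full Host parse adds on top.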
If you are a fan of pydomainextractor, there is a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host = url.host();
// or
TLD::Host host = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, every method available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (March 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (March 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.10s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
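The README does not include the benchmark script, so the following is only a sketch of how timings like those above could be collected. `time_extractor` and `last_label` are hypothetical names, and the stand-in extractor and the small in-memory input list are assumptions made so the example runs without any third-party package.

```python
import time
from typing import Callable, Iterable, Tuple


def time_extractor(extract: Callable[[str], object],
                   inputs: Iterable[str]) -> Tuple[int, float]:
    """Call `extract` on every input and return (count, elapsed seconds)."""
    count = 0
    start = time.perf_counter()
    for item in inputs:
        extract(item)
        count += 1
    return count, time.perf_counter() - start


def last_label(host: str) -> str:
    # Stand-in workload: a real run would pass the library function under
    # test here, for example a Host constructor.
    return host.rsplit(".", 1)[-1]


hosts = ["ee.aut.ac.ir", "mail.google.com"] * 1000
count, seconds = time_extractor(last_label, hosts)
print(f"{count} hosts in {seconds:.4f}s")
```

For a fair comparison, each library would be timed on the same input, with the input loaded into memory before the timer starts so that file reading is not counted.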
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: