Fastest Url parser in the world
The fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs in C++, in Python, and on the command line.
About The Project
Features
- Multiple programming languages supported, such as Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
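The suffix handling described in the feature list can be illustrated with a small standard-library sketch. It is not liburlparser's implementation: `KNOWN_SUFFIXES` is a hypothetical, hand-written stand-in for the public suffix list, `split_host` is an illustrative name, and reading `domain` as name plus suffix (with `domain_name` as the bare label) is an assumption.

```python
# Hypothetical, hand-written subset of the public suffix list (illustration only).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain_name, domain and suffix.

    The longest known suffix wins. If no known suffix matches, the last
    label is treated as the suffix, so "google.comm" yields "comm".
    """
    labels = host.lower().strip(".").split(".")
    suffix_len = 1  # fallback: an unknown suffix is just the last label
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            # Scanning from the left, the first hit is the longest match.
            suffix_len = len(labels) - i
            break
    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]  # everything to the left of the suffix
    domain_name = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    # Assumption: "domain" means the registrable domain (name + suffix).
    domain = f"{domain_name}.{suffix}" if domain_name else suffix
    return {
        "subdomain": subdomain,
        "domain": domain,
        "domain_name": domain_name,
        "suffix": suffix,
    }


print(split_host("ee.aut.ac.ir"))  # known multi-label suffix "ac.ir"
print(split_host("google.comm"))   # unknown suffix falls back to "comm"
```

Matching the longest suffix first is what keeps "ac.ir" from being reported as the plain suffix "ir".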
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run the tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run the examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip from PyPI
pip install liburlparser
If you want to use psl.update to update the public suffix list, you must install the online version:
pip install "liburlparser[online]"
Or
pip from Git
pip install git+https://github.com/mohammadraziei/liburlparser
Or
manually
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
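Before or after installing, the two prerequisites above (a recent enough interpreter and an importable package) can be checked from Python. This is a minimal standard-library sketch; `check_environment` is a hypothetical helper name and is not part of liburlparser.

```python
import importlib.util
import sys


def check_environment(package: str = "liburlparser",
                      min_version: tuple = (3, 8)) -> dict:
    """Report whether the interpreter is new enough and the package is importable."""
    return {
        # sys.version_info compares element-wise against a plain tuple.
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when a top-level package cannot be found.
        "installed": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```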
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON
Python
liburlparser is intuitive to use, and every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
# parse a url
url = Url("https://ee.aut.ac.ir/#id") # parses all parts of the url
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
# parse a host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a url
# all of these return a Host object with the host part of the url already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
There are also some helper APIs that give better performance for small tasks:
# if you only need the host of a url as a string, without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
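To show why pulling out the host as a plain string can be cheaper than a full parse, here is a standard-library sketch that uses only string slicing. It is an illustration under assumptions, not the code behind `Url.extract_host`; the function below is a hypothetical stand-in and does not cover every edge case (for example, a scheme-less input whose query contains "://").

```python
def extract_host(url: str) -> str:
    """Return the host part of a URL using plain string slicing."""
    # Skip the scheme ("https://") if one is present.
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    # The authority ends at the first "/", "?" or "#" after the scheme.
    end = len(url)
    for sep in "/?#":
        pos = url.find(sep, start)
        if pos != -1:
            end = min(end, pos)
    authority = url[start:end]
    # Drop userinfo ("user:pass@") if present.
    authority = authority.rsplit("@", 1)[-1]
    # Keep bracketed IPv6 literals intact, otherwise cut off the port.
    if authority.startswith("["):
        return authority[: authority.find("]") + 1]
    return authority.split(":", 1)[0]


print(extract_host("https://ee.aut.ac.ir/about"))
print(extract_host("https://user:pw@mail.google.com:8080/a?b#c"))
```

No suffix lookup or object construction happens here, which is where the saving over a full parse comes from.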
If you are a fan of pydomainextractor, there is a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder:
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host = url.host();
// or
TLD::Host host = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, all the methods available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (March 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (March 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.10s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
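Timings like those in the tables above can be collected with a small harness. The sketch below is not the benchmark script behind the published numbers; `time_extractor` and `naive_suffix` are hypothetical names, and the stand-in extractor exists only so the example runs without third-party packages. Any callable that takes a string can be passed in its place.

```python
import time
from typing import Callable, Iterable, Tuple


def time_extractor(extract: Callable[[str], object],
                   inputs: Iterable[str]) -> Tuple[int, float]:
    """Call extract once per input and return (item count, elapsed seconds)."""
    items = list(inputs)  # materialize first so reading input is not timed
    start = time.perf_counter()
    for item in items:
        extract(item)
    elapsed = time.perf_counter() - start
    return len(items), elapsed


def naive_suffix(host: str) -> str:
    """Stand-in extractor (hypothetical): returns the last label only."""
    return host.rsplit(".", 1)[-1]


hosts = ["ee.aut.ac.ir", "mail.google.com", "example.org"] * 1000
count, seconds = time_extractor(naive_suffix, hosts)
print(f"{count} hosts in {seconds:.4f}s")
```

Loading the inputs before starting the clock keeps file I/O out of the measurement, so the result reflects extraction time only.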
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: