Fastest URL parser in the world
A domain extractor library written in C++ with Python bindings; in the benchmarks below it is faster than the other libraries tested.
A complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
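The suffix handling described above can be sketched in a few lines of pure Python. This is only an illustration of the idea (longest known suffix wins, and an unknown last label is still treated as the suffix), not the library's implementation. The function `split_host` and the tiny `KNOWN_SUFFIXES` set are hypothetical stand-ins for the real public suffix list.

```python
# Illustrative only: a tiny hard-coded set stands in for the real public suffix list.
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> tuple:
    """Split a host into (subdomain, name, suffix). Hypothetical helper."""
    labels = host.lower().strip(".").split(".")
    # Fallback: treat the last label as the suffix even if it is unknown.
    suffix_len = 1
    # Prefer the longest known suffix, so "ac.ir" wins over "ir".
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in KNOWN_SUFFIXES:
            suffix_len = n
            break
    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]
    name = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return subdomain, name, suffix


if __name__ == "__main__":
    print(split_host("ee.aut.ac.ir"))
    print(split_host("google.comm"))
```

How these three parts map onto the library's `domain` and `domain_name` properties is not stated here, so the tuple fields are deliberately generic.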
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] make documents:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON
Python
liburlparser is intuitive to use, and every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
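To make the Url property names concrete, here is how the Python standard library's `urllib.parse.urlsplit` breaks a URL into comparable pieces. The example URL is made up for illustration, and the mapping from standard-library attribute names to liburlparser's property names (for instance `scheme` to `protocol`) is an assumption, not something taken from the library's documentation.

```python
from urllib.parse import urlsplit

# Made-up URL that exercises most components.
parts = urlsplit("https://user@ee.aut.ac.ir:8080/about?lang=en#id")

# Keys follow the property names listed for Url; the mapping is an assumption.
components = {
    "protocol": parts.scheme,
    "userinfo": parts.username,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

if __name__ == "__main__":
    for key, value in components.items():
        print(f"{key}: {value}")
```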
There are also some helper APIs that give better performance for small tasks:
# if you need the host of a URL as a string, without any further parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
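As a rough picture of why pulling out only the host string can be cheap, the sketch below does it with plain string slicing and no suffix lookup. The function `extract_host_str` is hypothetical and is not how `Url.extract_host` is implemented; it ignores edge cases such as bracketed IPv6 addresses.

```python
def extract_host_str(url: str) -> str:
    """Return the host portion of a URL using only string slicing. Hypothetical helper."""
    start = 0
    scheme_end = url.find("://")
    # Treat "://" as a scheme separator only if nothing path-like precedes it.
    if scheme_end != -1 and not any(c in url[:scheme_end] for c in "/?#"):
        start = scheme_end + 3
    # The authority ends at the first '/', '?' or '#' after the scheme.
    end = len(url)
    for sep in "/?#":
        pos = url.find(sep, start)
        if pos != -1:
            end = min(end, pos)
    authority = url[start:end]
    # Drop userinfo (before '@') and port (after ':').
    authority = authority.rsplit("@", 1)[-1]
    return authority.split(":", 1)[0]


if __name__ == "__main__":
    print(extract_host_str("https://ee.aut.ac.ir/about"))
```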
If you are a fan of pydomainextractor, liburlparser offers a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host
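TLD::Host host("ee.aut.ac.ir");
// or, from an already parsed Url (distinct names so the snippet compiles in one scope)
TLD::Host hostFromUrlObject = url.host();
// or, directly from a URL string
TLD::Host hostFromUrlString = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");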
As you can see, every method available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.20s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
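The benchmark script behind these tables is not shown here. One simple way to take comparable measurements yourself is a small wall-clock harness like the sketch below. The function `time_extractor` is hypothetical, and the stand-in extractor and sample data are placeholders: swap in a real callable such as `Host.from_url` and your own input file.

```python
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every input."""
    items = list(inputs)  # materialize first so reading the data is not timed
    start = time.perf_counter()
    for item in items:
        extract(item)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in extractor and data, used only so the sketch runs on its own.
    sample = ["ee.aut.ac.ir", "mail.google.com"] * 1000
    elapsed = time_extractor(lambda h: h.rsplit(".", 1)[-1], sample)
    print(f"{len(sample)} hosts in {elapsed:.4f}s")
```

Running each library's function through the same harness on the same input keeps the comparison fair, although absolute times will differ from the tables above depending on hardware and data.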
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser