Fastest URL parser in the world
A fast domain extractor library written in C++ with Python bindings.
A complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
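To make the property names above concrete, here is a minimal sketch that uses only Python's standard library. It is not liburlparser's implementation: `split_host`, `split_url`, the three-entry `KNOWN_SUFFIXES` set, and the `name` key are hypothetical stand-ins, and the real library consults the full public suffix list. The sketch illustrates the longest-known-suffix rule (so "ac.ir" wins over "ir") and the fallback that treats the last label of an unknown suffix such as "comm" as the suffix. Mapping Python's `params` attribute to the library's `params` property is also an assumption.

```python
from urllib.parse import urlparse

# Hypothetical, tiny stand-in for the real public suffix list.
KNOWN_SUFFIXES = {"ir", "ac.ir", "com"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, name and suffix (illustrative only)."""
    labels = host.lower().split(".")
    # Unknown suffix: fall back to treating the last label as the suffix.
    suffix_start = len(labels) - 1
    # Scan from the left so the longest known suffix is found first.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            suffix_start = i
            break
    has_name = suffix_start >= 1
    return {
        "subdomain": ".".join(labels[: suffix_start - 1]) if has_name else "",
        "name": labels[suffix_start - 1] if has_name else "",
        "suffix": ".".join(labels[suffix_start:]),
    }


def split_url(url: str) -> dict:
    """Split a URL into the parts named in the feature list (illustrative only)."""
    parsed = urlparse(url)
    return {
        "protocol": parsed.scheme,
        # Everything before the last "@" in the authority, or "" if absent.
        "userinfo": parsed.netloc.rpartition("@")[0],
        "host": split_host(parsed.hostname or ""),
        "port": parsed.port,
        "path": parsed.path,
        # Assumption: "params" means the ";params" part of the last path segment.
        "params": parsed.params,
        "query": parsed.query,
        "fragment": parsed.fragment,
    }


if __name__ == "__main__":
    print(split_host("ee.aut.ac.ir"))  # known multi-label suffix "ac.ir"
    print(split_host("google.comm"))   # unknown suffix falls back to "comm"
    print(split_url("https://ee.aut.ac.ir/#id"))
```

With this toy suffix set, "ee.aut.ac.ir" splits into subdomain "ee", name "aut", and suffix "ac.ir", while "google.comm" yields the suffix "comm".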
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
From PyPI with pip:
pip install liburlparser
Or
from Git with pip:
pip install git+https://github.com/mohammadraziei/liburlparser
Or
manually:
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # returns JSON
python -m liburlparser --host "mail.google.com" | jq # returns JSON
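Because the commands above print JSON, their output can also be consumed from a script. The following is a minimal sketch under stated assumptions: `build_command` and `parse_with_cli` are hypothetical helper names, only the `--url` and `--host` flags shown above are used, and nothing is assumed about the keys of the returned JSON object.

```python
import json
import subprocess
import sys


def build_command(flag: str, value: str) -> list:
    """Build the argument list for `python -m liburlparser <flag> <value>`."""
    if flag not in ("--url", "--host"):
        raise ValueError("only --url and --host are used in this sketch")
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "liburlparser", flag, value]


def parse_with_cli(flag: str, value: str):
    """Run the CLI and decode its JSON output (requires liburlparser installed)."""
    completed = subprocess.run(
        build_command(flag, value),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(parse_with_cli("--url", "https://mail.google.com/about"))
```

Passing the arguments as a list avoids shell quoting issues when a URL contains characters such as "&" or "?".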
Python
liburlparser is designed to be intuitive to use.
All classes have a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parses all parts of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
There are also some helper APIs that give better performance for small tasks:
# if you only need the host of a URL as a string, without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
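To illustrate why pulling out the host as a plain string can be cheaper than a full parse, here is a minimal sketch based on string slicing. It is not how `Url.extract_host` is implemented: `extract_host_str` is a hypothetical name, and the sketch ignores cases such as bracketed IPv6 literals.

```python
def extract_host_str(url: str) -> str:
    """Return the host of a URL as a string, without splitting it further."""
    # Drop the scheme ("https://"), if there is one.
    _, sep, rest = url.partition("://")
    if not sep:
        rest = url
    # The authority ends at the first "/", "?" or "#".
    end = len(rest)
    for ch in "/?#":
        pos = rest.find(ch)
        if pos != -1:
            end = min(end, pos)
    authority = rest[:end]
    # Strip optional userinfo ("user:pw@") and port (":8080").
    host = authority.rpartition("@")[2]
    return host.partition(":")[0]


if __name__ == "__main__":
    print(extract_host_str("https://ee.aut.ac.ir/about"))  # ee.aut.ac.ir
```

No suffix lookup happens here, which is the point: the work is a handful of substring searches over the input.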
If you are a fan of pydomainextractor, liburlparser offers a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host = url.host();
// or
TLD::Host host = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, the methods available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.10s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
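The timings above come from running each extractor over a large input file. A minimal sketch of such a timing harness follows. It is not the script used for the tables: `time_extractor` is a hypothetical name, the sample list is made up, and `urlsplit` from the standard library stands in for the extractor so the sketch runs without extra packages.

```python
import time
from typing import Callable, Iterable
from urllib.parse import urlsplit


def time_extractor(extract: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every input."""
    # Materialize the inputs first so file or generator overhead is not timed.
    items = list(inputs)
    start = time.perf_counter()
    for item in items:
        extract(item)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Made-up sample; a real run would read URLs from a file instead.
    sample = ["https://ee.aut.ac.ir/about"] * 100_000
    elapsed = time_extractor(lambda u: urlsplit(u).hostname, sample)
    print(f"urlsplit stand-in: {elapsed:.3f}s for {len(sample)} URLs")
```

To compare libraries, the same function can be called once per extractor with the same input list, for example with `Host.from_url` in place of the `urlsplit` lambda.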
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: