Fastest Url parser in the world
A domain extractor library written in C++ with Python bindings. In the benchmarks below it is the fastest of the libraries compared.
It aims to be a complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
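To make the host properties above more concrete, here is a minimal pure-Python sketch of suffix-based host splitting. It is not liburlparser's implementation: `split_host` and the tiny `KNOWN_SUFFIXES` set are hypothetical stand-ins for the real public suffix list, and treating `domain` as the label immediately left of the suffix is an assumption about the library's naming.

```python
# Hypothetical stand-in for the public suffix list; the real list is far larger.
KNOWN_SUFFIXES = {"ir", "ac.ir", "com", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain, and suffix (illustrative only)."""
    labels = host.lower().strip(".").split(".")

    # Fallback for unknown suffixes: treat the last label as the suffix,
    # so "google.comm" yields the suffix "comm".
    suffix_len = 1

    # Try the longest candidate first so that "ac.ir" wins over "ir".
    for start in range(len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            suffix_len = len(labels) - start
            break

    rest = labels[:-suffix_len]
    return {
        "subdomain": ".".join(rest[:-1]),
        "domain": rest[-1] if rest else "",
        "suffix": ".".join(labels[-suffix_len:]),
    }


if __name__ == "__main__":
    print(split_host("ee.aut.ac.ir"))
    print(split_host("google.comm"))
```

The longest-match loop is what lets a multi-label suffix such as "ac.ir" take precedence over the shorter "ir", while the fallback covers suffixes that are not in the list at all.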
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run the tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run the examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show the help section
python -m liburlparser --version # show the version
python -m liburlparser --url "https://mail.google.com/about" | jq # returns JSON
python -m liburlparser --host "mail.google.com" | jq # returns JSON
Python
liburlparser is designed to be intuitive to use.
Every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
# Parse a URL:
url = Url("https://ee.aut.ac.ir/#id")  # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
# Parse a host:
host = url.host  # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id")  # the fastest way to parse the host from a URL
# Each of these returns a Host object with the host part already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
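For readers who want to see what the URL components listed under Features refer to, the following sketch uses only the standard library's `urllib.parse.urlsplit`, not liburlparser. The example URL is made up, and the mapping from liburlparser's property names to the standard-library fields is an assumption made for illustration. `params` is left out because `urlsplit` does not separate it, and the standard library has no notion of public suffixes.

```python
from urllib.parse import urlsplit

# Made-up URL that exercises every component shown below.
parts = urlsplit("https://user:pw@ee.aut.ac.ir:8080/about?x=1#id")

# Keys follow the property names from the Features list; the mapping to
# urlsplit's fields is an assumption, not liburlparser's documented behavior.
components = {
    "protocol": parts.scheme,
    "userinfo": f"{parts.username}:{parts.password}",
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

print(components)
```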
There are also some helper APIs that give better performance for small tasks:
# If you only need the host of a URL as a string, without any parsing:
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
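As a rough idea of why pulling out the host as a plain string can be cheap, here is a sketch that uses only string slicing. `extract_host_str` is a hypothetical helper, not the library's `Url.extract_host`, and it assumes well-formed input without IPv6 literals.

```python
def extract_host_str(url: str) -> str:
    """Return the host portion of a URL using plain string slicing."""
    # Treat "://" as a scheme separator only if nothing path-like precedes it.
    scheme_end = url.find("://")
    if scheme_end != -1 and not any(c in "/?#" for c in url[:scheme_end]):
        rest = url[scheme_end + 3:]
    else:
        rest = url

    # The authority part ends at the first "/", "?" or "#".
    for i, ch in enumerate(rest):
        if ch in "/?#":
            rest = rest[:i]
            break

    # Drop userinfo ("user:pw@") and the port (":8080") if present.
    rest = rest.rsplit("@", 1)[-1]
    return rest.split(":", 1)[0]


if __name__ == "__main__":
    print(extract_host_str("https://ee.aut.ac.ir/about"))
```

No suffix lookup happens here, which is why this kind of helper suits cases where only the raw host string is needed.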
If you are a fan of pydomainextractor, a similar interface is available:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from a host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from a URL
# Alternatively, you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from a host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from a URL
# As you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// Parse a URL
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // likewise for subdomain, port, params, ...
/// Parse a host (three alternatives, given distinct names so they compile together)
TLD::Host host1("ee.aut.ac.ir");
// or
TLD::Host host2 = url.host();
// or
TLD::Host host3 = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, the methods available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.10s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
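The benchmark script itself is not shown here. A minimal timing harness along the following lines could be used to run a comparable measurement on your own data. The `time_extractor` helper and the sample inputs are hypothetical, and the numbers will depend on hardware and input.

```python
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every input."""
    items = list(inputs)  # materialize first so reading the data is not timed
    start = time.perf_counter()
    for item in items:
        extract(item)
    return time.perf_counter() - start


if __name__ == "__main__":
    sample = ["mail.google.com", "ee.aut.ac.ir"] * 1000
    # Stand-in extractor; swap in e.g. liburlparser.Host to time the library.
    elapsed = time_extractor(lambda h: h.rsplit(".", 1), sample)
    print(f"{elapsed:.4f}s for {len(sample)} hosts")
```

Passing the extractor as a callable keeps the loop identical for every library, so only the extraction call itself differs between runs.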
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser