Fastest Url parser in the world
Fastest domain extractor library written in C++ with Python bindings.
First complete library for parsing URLs in C++, Python, and on the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-part suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
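To make the suffix handling described above concrete, here is a minimal sketch of how a host can be split into subdomain, domain, and suffix. It is not the library's implementation: `KNOWN_SUFFIXES` and `split_host` are hypothetical names, the suffix set is a tiny hand-written stand-in for the public suffix list, and wildcard or exception rules from the real list are ignored. The field names in the returned dict are chosen for illustration and may not match the library's exact semantics.

```python
# Tiny hand-written stand-in for the public suffix list (assumption for this sketch).
KNOWN_SUFFIXES = {"ir", "ac.ir", "com", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain label, and suffix (illustrative only)."""
    labels = host.lower().strip(".").split(".")
    if len(labels) == 1:
        # A bare name such as "localhost" has no suffix at all.
        return {"subdomain": "", "domain": labels[0], "suffix": ""}

    # Fallback for unknown suffixes: treat the last label as the suffix.
    suffix_start = len(labels) - 1
    # Try the longest candidate first so that "ac.ir" wins over "ir".
    # Starting at index 1 always leaves one label for the domain.
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            suffix_start = start
            break

    return {
        "subdomain": ".".join(labels[: suffix_start - 1]),
        "domain": labels[suffix_start - 1],
        "suffix": ".".join(labels[suffix_start:]),
    }


if __name__ == "__main__":
    print(split_host("ee.aut.ac.ir"))  # known multi-part suffix
    print(split_host("google.comm"))   # unknown suffix falls back to the last label
```

The longest-match loop is what lets a multi-part suffix such as "ac.ir" take precedence over the shorter "ir", and the fallback covers hosts whose suffix is not in the list.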
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON
Python
liburlparser is intuitive to use.
Every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse url:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# all of these return a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
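For readers who want to cross-check the URL-level fields without installing anything, the following sketch builds a similar dictionary with the standard library's `urllib.parse`. The helper `url_parts` is hypothetical, and the mapping from liburlparser's property names to `urlparse` fields is an assumption; in particular, what liburlparser reports as `params` may differ from the path parameters that `urlparse` returns.

```python
from urllib.parse import urlparse


def url_parts(url: str) -> dict:
    """Break a URL into named components using only the standard library."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        # Everything before the last "@" in the authority, e.g. "user:pw".
        "userinfo": parts.netloc.rpartition("@")[0],
        "host": parts.hostname,
        "port": parts.port,          # int, or None when absent
        "path": parts.path,
        "params": parts.params,      # ";key=value" on the last path segment
        "query": parts.query,
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    print(url_parts("https://user:pw@ee.aut.ac.ir:8080/about;v=1?q=1#id"))
```

Unlike liburlparser, this stops at the host string: it does not split the host into subdomain, domain, and suffix.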
There are also some helper APIs that give better performance for small tasks:
# if you need to extract the host of a URL as a string without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
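Extracting the host as a plain string can be cheap because it needs no suffix lookup, only a few substring searches. The sketch below shows one way to do that in pure Python; `extract_host_sketch` is a hypothetical name, it is not how the library implements `Url.extract_host`, and it deliberately ignores bracketed IPv6 literals.

```python
def extract_host_sketch(url: str) -> str:
    """Return the host substring of a URL using only string searches."""
    # Skip past the scheme separator if there is one.
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0

    # The authority ends at the first "/", "?" or "#" after the scheme.
    end = len(url)
    for sep in "/?#":
        pos = url.find(sep, start)
        if pos != -1:
            end = min(end, pos)
    authority = url[start:end]

    # Drop userinfo ("user:pw@") if present.
    authority = authority.rpartition("@")[2]
    # Drop a trailing numeric port (":8080") if present.
    host, sep, port = authority.rpartition(":")
    if sep and port.isdigit():
        return host
    return authority


if __name__ == "__main__":
    print(extract_host_sketch("https://ee.aut.ac.ir/about"))
```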
If you are a fan of pydomainextractor, there is a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host = url.host();
// or
TLD::Host host = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, all the methods available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
The test was conducted on a file containing 1 million random URLs (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.20s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
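The benchmark script itself is not shown here, so the following is only a sketch of how timings like those in the tables could be collected. `time_calls` is a hypothetical helper, the URLs are synthetic rather than the original data file, and `urlsplit` from the standard library is used as a stand-in for the function under test.

```python
import time
from urllib.parse import urlsplit


def time_calls(func, inputs, repeats=3):
    """Return the best wall-clock time in seconds for calling func on every input."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        for item in inputs:
            func(item)
        best = min(best, time.perf_counter() - started)
    return best


if __name__ == "__main__":
    # Synthetic URLs; the original benchmark used a file of random URLs.
    urls = [f"https://sub{i}.example{i % 100}.com/path?q={i}" for i in range(10_000)]
    elapsed = time_calls(lambda u: urlsplit(u).hostname, urls)
    print(f"urlsplit: {elapsed:.3f}s for {len(urls)} URLs")
```

To compare libraries, pass each library's extraction function as `func` over the same input list. Taking the best of several repeats reduces noise from other processes.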
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser