Fastest Url parser in the world
The fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, for cleaner code
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
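The suffix handling described in the feature list (a known multi-label suffix such as "ac.ir", and a fallback for an unknown suffix such as "comm") can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: `KNOWN_SUFFIXES`, `HostParts`, and `split_host` are hypothetical names, the suffix set is a tiny hand-picked stand-in for the real public suffix list, and the field names are deliberately neutral rather than a claim about what the library's `domain` and `domain_name` properties return.

```python
from typing import NamedTuple

# Assumption: a tiny stand-in for the public suffix list. The real list is
# far larger and also contains wildcard and exception rules.
KNOWN_SUFFIXES = frozenset({"com", "ir", "ac.ir", "uk", "co.uk"})


class HostParts(NamedTuple):
    subdomain: str
    name: str
    suffix: str


def split_host(host: str) -> HostParts:
    """Split a host into subdomain, registrable name, and suffix."""
    labels = host.lower().strip(".").split(".")
    # Fallback: treat the last label as the suffix even if it is unknown.
    suffix_start = len(labels) - 1
    # Try the longest candidate first so "ac.ir" wins over "ir".
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            suffix_start = start
            break
    suffix = ".".join(labels[suffix_start:])
    name = labels[suffix_start - 1] if suffix_start >= 1 else ""
    subdomain = ".".join(labels[: max(suffix_start - 1, 0)])
    return HostParts(subdomain, name, suffix)


print(split_host("ee.aut.ac.ir"))
print(split_host("google.comm"))
```

Checking candidates from longest to shortest is what lets a two-label suffix take precedence over its one-label tail.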
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run the tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run the examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show the help section
python -m liburlparser --version # show the version
python -m liburlparser --url "https://mail.google.com/about" | jq # returns JSON
python -m liburlparser --host "mail.google.com" | jq # returns JSON
Python
liburlparser is designed to be intuitive to use.
Every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
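For orientation, most of the Url properties listed under Features have a rough counterpart in Python's standard `urllib.parse`. The sketch below uses only the standard library, not liburlparser, and `url_parts` is a hypothetical helper name. It leaves out `params` and the suffix-aware host properties, since the standard library has no direct equivalent of the latter and the source does not say how liburlparser defines the former.

```python
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Return the generic URL components using only the standard library."""
    parts = urlsplit(url)
    # Everything before the last "@" in the authority is the userinfo.
    userinfo = parts.netloc.rpartition("@")[0]
    return {
        "protocol": parts.scheme,
        "userinfo": userinfo,
        "host": parts.hostname or "",
        "port": parts.port,  # None when the URL has no explicit port
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(url_parts("https://user:pw@ee.aut.ac.ir:8080/about?x=1#id"))
```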
There are also some helper APIs that give better performance for small tasks:
# if you only need the host of a URL as a string, without any further parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
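To show why pulling out the host string can be cheaper than a full parse, here is a minimal sketch of the idea using plain string operations. It is not the library's implementation: `quick_host` is a hypothetical name, and the sketch assumes a well-formed URL and does not handle bracketed IPv6 hosts.

```python
def quick_host(url: str) -> str:
    """Return the host of a URL as a string, without splitting it into parts."""
    # Drop the scheme, if there is one.
    rest = url.split("://", 1)[-1]
    # The authority ends at the first '/', '?' or '#'.
    for sep in "/?#":
        rest = rest.split(sep, 1)[0]
    # Drop the userinfo (before the last '@') and the port (after ':').
    rest = rest.rpartition("@")[2]
    return rest.split(":", 1)[0]


print(quick_host("https://ee.aut.ac.ir/about"))
```

No suffix lookup is involved here, which is the step that makes a full host parse more expensive.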
If you are a fan of pydomainextractor, liburlparser offers a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing a URL
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // likewise for subdomain, port, params, ...
/// for parsing a host (three alternatives, named differently so they can share a scope)
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host hostFromUrlObject = url.host();
// or
TLD::Host hostFromUrlString = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, every method available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
The test was run on a file containing 1 million random URLs (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.20s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
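The source does not describe how these timings were taken. For readers who want to run a comparable measurement on their own data, a minimal timing harness might look like the sketch below. `time_extractor` and `last_two_labels` are hypothetical names, and `last_two_labels` is a trivial stand-in for whichever extractor is being measured (for example `Host` or `Host.from_url`).

```python
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every input."""
    items = list(inputs)  # materialize first so reading the input is not timed
    start = time.perf_counter()
    for item in items:
        extract(item)
    return time.perf_counter() - start


def last_two_labels(host: str) -> str:
    """Stand-in extractor: naive and not suffix-aware, used only for the demo."""
    return ".".join(host.split(".")[-2:])


hosts = ["mail.google.com", "ee.aut.ac.ir"] * 1000
elapsed = time_extractor(last_two_labels, hosts)
print(f"{len(hosts)} hosts in {elapsed:.4f}s")
```

Passing the extractor as a callable keeps the loop identical for every library, so the differences measured come from the extractors themselves.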
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser