Fastest Url parser in the world
The fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Multiple programming languages supported, such as Python, C++, and Shell
- Intuitive interface, identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
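The suffix handling described in the feature list can be illustrated with a minimal sketch. It is not liburlparser's implementation: `KNOWN_SUFFIXES` is a tiny hand-written stand-in for the public suffix list, `split_host` is a hypothetical helper, and the returned keys are chosen for illustration only. The real list also has wildcard and exception rules, which this sketch ignores.

```python
# Tiny stand-in for the public suffix list (assumption: the real list is far larger).
KNOWN_SUFFIXES = {"ir", "ac.ir", "com", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain and suffix (illustrative only)."""
    labels = host.lower().strip(".").split(".")
    # Fallback: treat the last label as the suffix, even if it is unknown.
    suffix_start = max(len(labels) - 1, 0)
    # Try the longest candidate first, so "ac.ir" wins over "ir".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            suffix_start = i
            break
    suffix = ".".join(labels[suffix_start:])
    domain = labels[suffix_start - 1] if suffix_start >= 1 else ""
    subdomain = ".".join(labels[:suffix_start - 1]) if suffix_start >= 1 else ""
    return {"subdomain": subdomain, "domain": domain, "suffix": suffix}


print(split_host("ee.aut.ac.ir"))  # known multi-label suffix "ac.ir"
print(split_host("google.comm"))   # unknown suffix falls back to the last label
```

Checking the longest candidate first is what lets a multi-label suffix such as "ac.ir" take precedence over the shorter "ir".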
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run the tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run the examples:
./example
# Install:
sudo make install
Python and Command Line:
Note that Python >= 3.8 is required.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON
Python
liburlparser is intuitive to use.
Every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
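For readers unsure which part of a URL each Url property refers to, the following sketch uses only the standard library's `urllib.parse.urlparse`, not liburlparser. The mapping from the property names in the feature list to the standard-library fields is an assumption made for illustration, and the example URL is made up.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component.
parts = urlparse("https://user:pw@mail.google.com:8080/about;v=1?q=url#top")

# Keys follow the property names in the feature list; the values come from
# the standard library, so liburlparser's own results may differ in detail.
components = {
    "protocol": parts.scheme,
    "userinfo": f"{parts.username}:{parts.password}",
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```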
There are also some helper APIs that give better performance for small tasks:
# extract the host of a URL as a string, without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
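To show why pulling out the host string can be cheaper than a full parse, here is a minimal pure-Python sketch of the idea. It is an assumption about the general technique, not the code behind `Url.extract_host`; the function below is a hypothetical stand-in, and it does not handle bracketed IPv6 literals.

```python
def extract_host(url: str) -> str:
    """Slice the host out of a URL string without building a parsed object."""
    scheme, sep, rest = url.partition("://")
    # Only accept the prefix as a scheme if it looks like one; otherwise the
    # "://" belongs to a later part of the URL (for example the query).
    looks_like_scheme = scheme.replace("+", "").replace("-", "").replace(".", "").isalnum()
    if not sep or not looks_like_scheme:
        rest = url
    # The authority ends at the first "/", "?" or "#".
    end = len(rest)
    for delimiter in "/?#":
        pos = rest.find(delimiter)
        if pos != -1:
            end = min(end, pos)
    authority = rest[:end]
    # Drop userinfo ("user:pw@") and the port (":8080").
    host = authority.rpartition("@")[2]
    return host.partition(":")[0]


print(extract_host("https://ee.aut.ac.ir/about"))
print(extract_host("https://user:pw@mail.google.com:8080/a?b#c"))
```

No suffix lookup happens here, which is the work a full Host parse would add on top.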
If you are a fan of pydomainextractor, there is a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// Parse a URL
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // likewise for subdomain, port, params, ...
/// Parse a host
TLD::Host host("ee.aut.ac.ir");
// or, from an already parsed Url object
TLD::Host host_from_url_object = url.host();
// or, directly from a URL string
TLD::Host host_from_url_string = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, every method available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
Tests were run on a file containing 1 million random URLs (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.20s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
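The tables above do not describe the measurement harness, so the following is only a sketch of how such timings could be reproduced, under stated assumptions. `time_extractor` and `naive_suffix` are hypothetical helpers, and the input list is a small synthetic sample rather than the original test file.

```python
import time


def time_extractor(extract, inputs, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            extract(item)
        best = min(best, time.perf_counter() - start)
    return best


def naive_suffix(host):
    # Stand-in extractor so the sketch runs without third-party packages.
    return host.rsplit(".", 1)[-1]


hosts = ["mail.google.com", "ee.aut.ac.ir"] * 50_000
elapsed = time_extractor(naive_suffix, hosts)
print(f"{len(hosts)} hosts in {elapsed:.3f}s")
```

To compare libraries, pass a different callable as `extract`, for example `liburlparser.Host` or the `extract` method of a `pydomainextractor.DomainExtractor()` instance, and keep the input list identical across runs.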
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser