Fastest URL parser in the world
The fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple languages: Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-part suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
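To make the Url property names above concrete, here is a minimal sketch that splits a URL into similarly named parts using only Python's standard `urllib.parse`. It does not use liburlparser, and the mapping of standard-library fields to the property names (for example `scheme` to `protocol`, or the meaning of `params`) is an assumption made for illustration; the helper `url_parts` is hypothetical.

```python
from urllib.parse import urlparse


def url_parts(url: str) -> dict:
    """Split a URL into named components with the standard library.

    The keys mirror the property names listed above; whether liburlparser
    assigns exactly the same values is an assumption, not a documented fact.
    """
    parsed = urlparse(url)
    # Everything before the last "@" in the authority is the userinfo.
    userinfo = parsed.netloc.rpartition("@")[0]
    return {
        "protocol": parsed.scheme,
        "userinfo": userinfo,
        "host": parsed.hostname or "",
        "port": parsed.port,          # None when the URL has no explicit port
        "path": parsed.path,
        "params": parsed.params,      # the ";..." part of the last path segment
        "query": parsed.query,
        "fragment": parsed.fragment,
    }


if __name__ == "__main__":
    for key, value in url_parts(
        "https://user:pw@mail.google.com:8443/about;v=1?q=x#top"
    ).items():
        print(f"{key:9s} {value!r}")
```

The host value can then be split further into subdomain, domain, and suffix, which is the part that needs the public suffix list.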
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install
Python and Command Line:
Note that Python >= 3.8 is required.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # returns JSON
python -m liburlparser --host "mail.google.com" | jq # returns JSON
Python
liburlparser is intuitive to use.
Every class has a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part of the URL already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
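The `suffix` and `domain` values printed above depend on matching the host against a suffix list, with a fallback for unknown suffixes such as "comm". The following is a minimal sketch of that idea in plain Python, not the library's implementation: `KNOWN_SUFFIXES` is a tiny hypothetical stand-in for the real public suffix list, `split_host` is a hypothetical helper, and the dictionary keys are illustrative rather than a claim about the exact semantics of the library's `domain` and `domain_name` properties.

```python
# Hypothetical, tiny stand-in for the public suffix list (the real list is far larger).
KNOWN_SUFFIXES = {"ir", "ac.ir", "com", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain label, and suffix.

    The longest known suffix wins. If no known suffix matches, the last
    label is treated as the suffix (the "unknown suffix" case).
    """
    labels = host.lower().strip(".").split(".")
    suffix_len = 1  # fallback: the last label is the suffix
    # Try the longest candidate first, always leaving one label for the domain.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in KNOWN_SUFFIXES:
            suffix_len = n
            break
    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]
    return {
        "subdomain": ".".join(rest[:-1]),
        "domain": rest[-1] if rest else "",
        "suffix": suffix,
    }


if __name__ == "__main__":
    print(split_host("ee.aut.ac.ir"))
    print(split_host("google.comm"))
```

Trying the longest candidate first is what lets a two-label suffix such as "ac.ir" take precedence over the shorter "ir".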
There are also some helper APIs that give better performance for small tasks:
# if you only need the host of a URL as a string, without any further parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
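Pulling out only the host string can be cheap because it needs no suffix lookup at all. As a rough illustration, assuming nothing about how `Url.extract_host` is implemented, the same result can be obtained with plain string slicing. The function name `extract_host_str` is hypothetical, and IPv6 literals are deliberately not handled.

```python
def extract_host_str(url: str) -> str:
    """Return the host of a URL using only string operations (no suffix lookup)."""
    rest = url
    scheme_end = url.find("://")
    # Only treat "://" as a scheme separator if it precedes any "/", "?" or "#".
    if scheme_end != -1 and not any(c in url[:scheme_end] for c in "/?#"):
        rest = url[scheme_end + 3:]
    # The authority ends at the first "/", "?" or "#".
    end = len(rest)
    for sep in "/?#":
        pos = rest.find(sep)
        if pos != -1:
            end = min(end, pos)
    authority = rest[:end]
    # Drop userinfo ("user:pw@") and port (":8080"); IPv6 hosts are out of scope here.
    host = authority.rsplit("@", 1)[-1]
    return host.split(":", 1)[0]


if __name__ == "__main__":
    print(extract_host_str("https://ee.aut.ac.ir/about"))
```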
If you are a fan of pydomainextractor, a similar interface is available:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host (three alternatives)
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host hostFromUrlObject = url.host();
// or
TLD::Host hostFromUrlString = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, the methods available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
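To read the table as relative speed and throughput, the arithmetic can be sketched as follows. The timings are copied from the table above; the helper names `relative_slowdown` and `throughput` are hypothetical and exist only for this illustration.

```python
# Timings in seconds, copied from the "Extract From Host" table (10 million domains).
HOST_TIMES = {
    "liburlparser": 1.12,
    "PyDomainExtractor": 1.50,
    "publicsuffix2": 9.92,
    "tldextract": 29.23,
    "tld": 34.48,
}
N_DOMAINS = 10_000_000


def relative_slowdown(times: dict, baseline: str = "liburlparser") -> dict:
    """Return each library's time divided by the baseline library's time."""
    base = times[baseline]
    return {name: seconds / base for name, seconds in times.items()}


def throughput(times: dict, n_items: int) -> dict:
    """Return items processed per second for each library."""
    return {name: n_items / seconds for name, seconds in times.items()}


if __name__ == "__main__":
    ratios = relative_slowdown(HOST_TIMES)
    rates = throughput(HOST_TIMES, N_DOMAINS)
    for name in HOST_TIMES:
        print(f"{name:18s} {ratios[name]:6.2f}x  {rates[name]:>12,.0f} hosts/s")
```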
Extract From URL
The test was conducted on a file containing 1 million random URLs (Mar. 13th, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.20s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
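The benchmark script itself is not shown here. As a minimal sketch of how timings like these could be collected, assuming one call per input and wall-clock measurement, a small harness might look like the following. `time_extractor` and `naive_host` are hypothetical names, and `naive_host` is only a stand-in for a real extractor such as `Host.from_url`.

```python
import time
from typing import Callable, Iterable


def time_extractor(
    extract: Callable[[str], object],
    inputs: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    items = list(inputs)  # materialize once so file reading is not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            extract(item)
        best = min(best, time.perf_counter() - start)
    return best


def naive_host(url: str) -> str:
    """Stand-in extractor: take the text between '://' and the next '/'."""
    return url.split("://", 1)[-1].split("/", 1)[0]


if __name__ == "__main__":
    urls = ["https://ee.aut.ac.ir/about"] * 100_000
    print(f"{time_extractor(naive_host, urls):.3f}s")
```

Taking the best of several passes reduces noise from other processes; swapping `naive_host` for another callable compares libraries on the same input.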
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser