Fastest Url parser in the world
The fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs in C++, Python, and the command line.
About The Project
Features
- Multiple programming languages supported, such as Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
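The suffix handling described above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the library's implementation: `KNOWN_SUFFIXES` and `split_host` are hypothetical names, the suffix set is a tiny hand-written stand-in for the real public suffix list, and how the three returned parts map onto the library's `domain` and `domain_name` properties is not asserted here.

```python
from typing import Tuple

# Hypothetical stand-in for the public suffix list (assumption, not real data).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "co.uk"}


def split_host(host: str) -> Tuple[str, str, str]:
    """Split a host into (subdomain, name, suffix)."""
    labels = host.lower().strip(".").split(".")
    # Fallback for unknown suffixes: treat the last label as the suffix.
    suffix_start = len(labels) - 1
    # Try the longest candidate first so that "ac.ir" wins over "ir".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            suffix_start = i
            break
    suffix = ".".join(labels[suffix_start:])
    name = labels[suffix_start - 1] if suffix_start > 0 else ""
    subdomain = ".".join(labels[: max(suffix_start - 1, 0)])
    return subdomain, name, suffix


print(split_host("ee.aut.ac.ir"))  # ('ee', 'aut', 'ac.ir')
print(split_host("google.comm"))   # ('', 'google', 'comm')
```

The second call shows the unknown-suffix case: "comm" is not in the set, so the last label is used as the suffix.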
Setup
C++:
build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] make documents:
make docs
# [Optional] Run examples:
./example
# Make install
sudo make install
Python and Command Line:
Note that Python >= 3.8 is required.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as json
python -m liburlparser --host "mail.google.com" | jq # return as json
Python
You can use liburlparser intuitively.
All classes have a help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse url:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the url
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a url
# all of these return a Host object with the host part of the url already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
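For readers who know the standard library, the Url property names listed under Features can be compared with what `urllib.parse.urlparse` returns. The mapping below is a sketch under an assumption: the keys on the left are the property names from the feature list, and whether liburlparser assigns exactly the same substrings to each of them is not confirmed here. The sample URL is made up for illustration.

```python
from urllib.parse import urlparse

# Made-up URL that exercises every component.
parts = urlparse("https://user:pass@ee.aut.ac.ir:8080/about;v=1?q=1#id")

# Keys follow the feature list; values come from the standard library only.
components = {
    "protocol": parts.scheme,
    "userinfo": parts.netloc.rpartition("@")[0],  # text before "@", if any
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,  # ";v=1" style parameters on the last path segment
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name}: {value}")
```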
There are also some helper APIs that give better performance for small tasks:
# if you need to extract the host of url as a string without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
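To show why pulling out the host as a plain string can be cheap, here is a minimal pure-Python sketch that uses only string slicing. It is not how `Url.extract_host` is implemented; the function name `extract_host_sketch` is hypothetical, and the sketch assumes that any "://" in the input is the scheme separator and that the host is not an IPv6 literal.

```python
def extract_host_sketch(url: str) -> str:
    """Return the host part of a URL using plain string slicing."""
    # Skip past the scheme separator if there is one.
    start = url.find("://")
    start = 0 if start == -1 else start + 3
    # The authority ends at the first "/", "?" or "#".
    end = len(url)
    for ch in "/?#":
        pos = url.find(ch, start)
        if pos != -1:
            end = min(end, pos)
    authority = url[start:end]
    # Drop userinfo ("user:pass@") and a trailing ":port".
    host_and_port = authority.rpartition("@")[2]
    return host_and_port.partition(":")[0]


print(extract_host_sketch("https://ee.aut.ac.ir/about"))  # ee.aut.ac.ir
```

No suffix lookup happens here, which is the reason this kind of helper suits tasks that only need the raw host string.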
If you are a fan of pydomainextractor, there is a similar interface:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host (three alternatives, named differently so they compile together)
TLD::Host host1("ee.aut.ac.ir");
// or
TLD::Host host2 = url.host();
// or
TLD::Host host3 = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, every method available in Python can be used in C++ just as easily.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
The test was run on a file containing 1 million random URLs (Mar. 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.20s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
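The tables above report total wall-clock time per library. The exact benchmark script is not shown in this document, so the following is only a sketch of how such a measurement could be taken: `time_extractor` is a hypothetical helper, and the lambda is a stand-in workload to be replaced by the extractor under test (for example `Host.from_url`).

```python
import time
from typing import Callable, Iterable


def time_extractor(func: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling func on every input."""
    items = list(inputs)  # materialise first so reading the data is not timed
    start = time.perf_counter()
    for item in items:
        func(item)
    return time.perf_counter() - start


# Stand-in data and workload (assumptions): swap in real hosts and a real extractor.
sample_hosts = ["ee.aut.ac.ir", "mail.google.com"] * 1000
elapsed = time_extractor(lambda h: h.rsplit(".", 1)[-1], sample_hosts)
print(f"{len(sample_hosts)} hosts in {elapsed:.4f}s")
```

Running every library through the same helper on the same input file keeps the comparison consistent.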
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser