Fastest Url parser in the world
Fastest domain extractor library, written in C++ with Python bindings.
The first complete library for parsing URLs from C++, Python, and the command line.
About The Project
Features
- Supports multiple programming languages: Python, C++, and Shell
- Intuitive interface that is identical in C++ and Python
- Provides two separate classes, Url and Host, to keep code clean
- Supports the public_suffix_list for known multi-label suffixes such as "ac.ir"
- Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
- Updates the public_suffix_list automatically before each build and deploy
- Host properties:
  - subdomain
  - domain
  - domain_name
  - suffix
- Url properties:
  - protocol
  - userinfo
  - host (and all the host properties)
  - port
  - path
  - query
  - params
  - fragment
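To make the Url property names above concrete, here is a minimal sketch that uses only Python's standard `urllib.parse`, not liburlparser itself. It shows which slice of a URL each name usually refers to. The sample URL is made up for illustration, and liburlparser's own attribute values may be formatted differently.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen so that every component is non-empty.
parts = urlparse("https://user:pw@mail.google.com:8080/about;v=1?q=test#top")

# Map the property names from the list above onto the standard-library fields.
components = {
    "protocol": parts.scheme,
    "userinfo": f"{parts.username}:{parts.password}",
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```

The standard library stops at the host string; splitting that host into subdomain, domain, and suffix is the part that needs the public suffix list.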
Setup
C++:
Build steps:
git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run the tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run the examples:
./example
# Install:
sudo make install
Python and Command Line:
Requires Python >= 3.8.
Installation
pip install liburlparser
Or
pip install git+https://github.com/mohammadraziei/liburlparser
Or
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
Usage
Command Line
python -m liburlparser --help # show help
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # prints JSON (jq pretty-prints it)
python -m liburlparser --host "mail.google.com" | jq # prints JSON (jq pretty-prints it)
Python
The Python API is intuitive, and every class has a built-in help section:
import liburlparser
help(liburlparser)
print(liburlparser.__version__)
from liburlparser import Url, Host
help(Url)
help(Host)
Parse a URL and a host:
from liburlparser import Url, Host
## parse url:
url = Url("https://ee.aut.ac.ir/#id") # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a URL
# each of these returns a Host object with the host part already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
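The suffix handling described in the feature list (known multi-label suffixes such as "ac.ir", plus a fallback for unknown ones such as "comm") can be sketched in pure Python as longest-suffix matching. This is an illustration only, not liburlparser's implementation: `KNOWN_SUFFIXES` is a tiny hypothetical stand-in for the public suffix list, `split_host` is a made-up helper, and the returned keys follow a simple subdomain/domain/suffix split that may not match how liburlparser defines `domain` and `domain_name`.

```python
# Hypothetical, tiny stand-in for the public suffix list (the real list is far larger).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain, and suffix (illustrative sketch)."""
    labels = host.lower().strip(".").split(".")
    if len(labels) == 1:
        # Design choice of this sketch: a bare label is a domain with no suffix.
        return {"subdomain": "", "domain": labels[0], "suffix": ""}

    # Fallback: treat the last label as the suffix, which covers unknown ones like "comm".
    suffix_len = 1
    # Try the longest candidate suffix first, always leaving at least one label for the domain.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in KNOWN_SUFFIXES:
            suffix_len = n
            break

    rest = labels[:-suffix_len]
    return {
        "subdomain": ".".join(rest[:-1]),
        "domain": rest[-1],
        "suffix": ".".join(labels[-suffix_len:]),
    }


print(split_host("ee.aut.ac.ir"))
print(split_host("google.comm"))
```

Matching the longest suffix first is what keeps "ac.ir" from being read as domain "ac" under suffix "ir".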
A few helper APIs give better performance for small tasks:
# if you only need the host of a URL as a string, without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
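To show why pulling out the host as a plain string can be much cheaper than a full parse, here is a minimal pure-Python sketch based on string slicing. The function name `extract_host_str` is hypothetical and this is not how liburlparser implements `Url.extract_host`. It assumes a well-formed URL and does not handle bracketed IPv6 hosts or "://" appearing later in the URL.

```python
def extract_host_str(url: str) -> str:
    """Return the host substring of a URL by slicing, with no validation."""
    # Skip the scheme separator ("https://") if there is one.
    start = url.find("://")
    rest = url[start + 3:] if start != -1 else url

    # The authority ends at the first "/", "?" or "#".
    end = len(rest)
    for sep in "/?#":
        pos = rest.find(sep)
        if pos != -1:
            end = min(end, pos)
    authority = rest[:end]

    # Drop userinfo ("user:pw@") and a trailing ":port".
    host = authority.rsplit("@", 1)[-1]
    return host.split(":", 1)[0]


print(extract_host_str("https://ee.aut.ac.ir/about"))
```

No suffix lookup or object construction happens here, which is the kind of work a string-only helper avoids.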
If you are a fan of pydomainextractor, a similar interface is available:
import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url
# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same
C++
There are some examples in the examples folder.
#include "liburlparser"
...
/// for parsing a url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // likewise for subdomain, port, params, ...
/// for parsing a host
TLD::Host host("ee.aut.ac.ir");
// or, from an already parsed url
TLD::Host host_of_url = url.host();
// or, directly from a url string
TLD::Host host_from_url = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");
As you can see, every method available in Python can be used just as easily in C++.
Performance
Extract From Host
Tests were run on a file containing 10 million random domains from various top-level domains (March 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host | 1.12s |
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
Extract From URL
The test was run on a file containing 1 million random URLs (March 13, 2022).
Library | Function | Time |
---|---|---|
liburlparser | liburlparser.Host.from_url | 2.10s |
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
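The two tables are easier to compare as ratios. The short sketch below only does arithmetic on the timings listed above; it does not re-run any benchmark, and the helper name `relative_to` is made up for illustration.

```python
# Timings in seconds, copied from the two tables above.
host_times = {
    "liburlparser": 1.12,
    "PyDomainExtractor": 1.50,
    "publicsuffix2": 9.92,
    "tldextract": 29.23,
    "tld": 34.48,
}
url_times = {
    "liburlparser": 2.10,
    "PyDomainExtractor": 2.24,
    "publicsuffix2": 10.84,
    "tldextract": 36.04,
    "tld": 57.87,
}


def relative_to(times: dict, baseline: str = "liburlparser") -> dict:
    """Express each timing as a multiple of the baseline library's timing."""
    base = times[baseline]
    return {name: round(seconds / base, 1) for name, seconds in times.items()}


for label, times in (("host", host_times), ("url", url_times)):
    for name, ratio in relative_to(times).items():
        print(f"{label:5} {name:18} {ratio}x")
```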
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/mohammadraziei/liburlparser