Highly optimized Domain Name Extraction library written in C++

About The Project

PyDomainExtractor is a library for quickly parsing domain names into their parts. It is written in C++ to achieve the highest performance possible.

Performance

Extract From Domain

Measured on a file containing 10 million random domains from various TLDs (Sep. 24th 2020).

Library            Function                    Time
PyDomainExtractor  pydomainextractor.extract    2.30s
publicsuffix2      publicsuffix2.get_sld       25.77s
tldextract         __call__                    34.22s
tld                tld.parse_tld               36.64s

Extract From URL

Measured on a file containing 1 million random URLs (Sep. 24th 2020).

Library            Function                    Time
PyDomainExtractor  pydomainextractor.extract    2.76s
publicsuffix2      publicsuffix2.get_sld       14.33s
tldextract         __call__                    44.34s
tld                tld.parse_tld               79.13s
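
To run a comparable measurement on your own data, a small timing harness along these lines may help. This is only a sketch and not the script used for the numbers above; `time_extractor` is a hypothetical helper name, and it accepts any callable, so the same loop can time each library.

```python
import time


def time_extractor(extract, inputs):
    """Return the wall-clock seconds spent calling extract() on every input.

    Hypothetical helper: `extract` can be any callable that takes one string,
    for example the bound `extract` method of a DomainExtractor instance.
    """
    start = time.perf_counter()
    for item in inputs:
        extract(item)
    return time.perf_counter() - start
```

Typical use would be `time_extractor(domain_extractor.extract, domains)`, where `domains` is a list of strings already loaded into memory so that file I/O is not included in the measurement.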

Prerequisites

To compile this package, you need GCC, libidn2, and the Python development package installed.

  • Fedora
sudo dnf install python3-devel libidn2-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev libidn2-dev g++-9

Installation

pip3 install PyDomainExtractor

Usage

Extraction

import pydomainextractor


# Load the Public Suffix List version bundled with the package. No data is downloaded.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('google.com')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'com'
# }

# Load custom suffix list data. It should follow the Public Suffix List format.
domain_extractor = pydomainextractor.DomainExtractor(
    'tld\n'
    'custom.tld\n'
)

# 'com' is not in the custom list, so no suffix is recognized.
domain_extractor.extract('google.com')
# {
#     'subdomain': 'google',
#     'domain': 'com',
#     'suffix': ''
# }

domain_extractor.extract('google.custom.tld')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'custom.tld'
# }
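
A common follow-up is to rebuild the registered domain from the returned parts. The helper below is a minimal sketch that assumes only the dictionary shape shown above (`subdomain`, `domain`, `suffix`); `registered_domain` is a hypothetical name and not part of the library.

```python
def registered_domain(parts):
    """Join the 'domain' and 'suffix' fields of an extract() result.

    Empty fields are skipped, so a result with no recognized suffix
    yields just the domain label.
    """
    return '.'.join(p for p in (parts['domain'], parts['suffix']) if p)


# Works on the dictionaries shown in the examples above.
result = {'subdomain': 'www', 'domain': 'google', 'suffix': 'com'}
print(registered_domain(result))  # google.com
```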

URL Extraction

import pydomainextractor


# Load the Public Suffix List version bundled with the package. No data is downloaded.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('http://google.com/')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'com'
# }

Validation

import pydomainextractor


# Load the Public Suffix List version bundled with the package. No data is downloaded.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.is_valid_domain('google.com')
# True

domain_extractor.is_valid_domain('domain.اتصالات')
# True

domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
# True

domain_extractor.is_valid_domain('domain-.com')
# False

domain_extractor.is_valid_domain('-sub.domain.com')
# False

domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
# False
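
The ASCII results above are consistent with the usual hostname label rule: letters, digits, and hyphens only, with no hyphen at either end of a label. The sketch below approximates that rule to show why `domain-.com` and `-sub.domain.com` are rejected. It is not the library's implementation, `looks_like_hostname` is a hypothetical name, and it is ASCII-only, so internationalized names would need converting to their `xn--` form first.

```python
import re

# One label: 1 to 63 characters, alphanumeric at both ends, hyphens allowed inside.
LABEL = re.compile(r'[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?', re.IGNORECASE)


def looks_like_hostname(name):
    """Approximate ASCII hostname check; does not consult any suffix list."""
    labels = name.split('.')
    return len(name) <= 253 and all(LABEL.fullmatch(label) for label in labels)
```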

TLDs List

import pydomainextractor


# Load the Public Suffix List version bundled with the package. No data is downloaded.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.get_tld_list()
# [
#     'bostik',
#     'backyards.banzaicloud.io',
#     'biz.bb',
#     ...
# ]
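
To illustrate how such a list can be used, here is a naive longest-match lookup over a set of suffixes. This is a simplified sketch and not how the library works internally: `longest_suffix` is a hypothetical helper, and it ignores the wildcard and exception rules that the real Public Suffix List format supports.

```python
def longest_suffix(hostname, suffixes):
    """Return the longest entry of `suffixes` that hostname ends with, or ''."""
    labels = hostname.lower().split('.')
    # Candidates go from the full name down to the last label,
    # so the first hit is the longest matching suffix.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in suffixes:
            return candidate
    return ''


suffixes = {'tld', 'custom.tld'}
print(longest_suffix('google.custom.tld', suffixes))  # custom.tld
```

Passing `set(domain_extractor.get_tld_list())` as `suffixes` gives constant-time membership checks for each candidate.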

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/Intsights/PyDomainExtractor
