Highly optimized Domain Name Extraction library written in C++

About The Project

PyDomainExtractor is a library for quickly parsing domain names into their parts: subdomain, domain, and suffix. It is written in C++ to achieve the highest performance possible.

Performance

The benchmark was run on a file containing 10 million random domains from various TLDs. The improvement factor is relative to tldextract.

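| Library           | Function                  | Time  | Improvement Factor |
|-------------------|---------------------------|-------|--------------------|
| tldextract        | __call__                  | 67.0s | 1.0x               |
| publicsuffix2     | publicsuffix2.get_tld     | 25.8s | 2.6x               |
| PyDomainExtractor | pydomainextractor.extract | 2.76s | 24.3x              |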
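For readers who want to run a comparable measurement, the following is a minimal sketch of a timing harness. The helper names `time_extractor` and `improvement_factor` are hypothetical and are not part of any of the libraries above; the original benchmark script may have been structured differently.

```python
import time


def time_extractor(extract, domains):
    """Return the seconds spent calling `extract` once per domain.

    `extract` is any callable that takes a domain string; this helper
    is a hypothetical illustration, not part of PyDomainExtractor.
    """
    start = time.perf_counter()
    for domain in domains:
        extract(domain)
    return time.perf_counter() - start


def improvement_factor(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Reproduce the factors in the table from the reported timings.
print(round(improvement_factor(67.0, 25.8), 1))
print(round(improvement_factor(67.0, 2.76), 1))
```

To time PyDomainExtractor itself, pass `pydomainextractor.DomainExtractor().extract` as the `extract` argument along with your own list of domains.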

Prerequisites

To compile this package, you need GCC, libidn2, and the Python development package installed.

  • Fedora
sudo dnf install python3-devel libidn2-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev libidn2-dev g++-9

Installation

pip3 install PyDomainExtractor

Usage

Extraction

import pydomainextractor


# Loads the version of the Public Suffix List bundled with the package. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('google.com')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'com'
# }

# Loads custom suffix list data, which should follow the Public Suffix List format.
domain_extractor = pydomainextractor.DomainExtractor(
    'tld\n'
    'custom.tld\n'
)

# 'com' is not in the custom list, so no suffix is matched.
domain_extractor.extract('google.com')
# {
#     'subdomain': 'google',
#     'domain': 'com',
#     'suffix': ''
# }

domain_extractor.extract('google.custom.tld')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'custom.tld'
# }
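
The results above, including the empty suffix for 'google.com' under the custom list, are consistent with longest-suffix matching. The pure-Python sketch below illustrates that idea only. It is not the library's C++ implementation, the function name `split_domain` is hypothetical, and it ignores the wildcard and exception rules that the Public Suffix List format also allows.

```python
def split_domain(name: str, suffixes: set) -> dict:
    """Split `name` into subdomain, domain, and suffix.

    Simplified illustration: the suffix is the longest run of trailing
    labels found in `suffixes`; wildcard and exception rules are ignored.
    """
    labels = name.split('.')
    suffix_start = len(labels)  # default: no suffix matched

    # Candidates run from longest to shortest, so the first hit is the longest.
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            suffix_start = i
            break

    suffix = '.'.join(labels[suffix_start:])
    domain = labels[suffix_start - 1] if suffix_start > 0 else ''
    subdomain = '.'.join(labels[:suffix_start - 1]) if suffix_start > 1 else ''
    return {'subdomain': subdomain, 'domain': domain, 'suffix': suffix}


custom_suffixes = {'tld', 'custom.tld'}
print(split_domain('google.com', custom_suffixes))
print(split_domain('google.custom.tld', custom_suffixes))
```

With the custom list, neither 'google.com' nor 'com' is a known suffix, so the last label falls through to the domain slot and the remaining label becomes the subdomain.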

Validation

import pydomainextractor


# Loads the version of the Public Suffix List bundled with the package. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.is_valid_domain('google.com')
# True

# Internationalized domain names are accepted in Unicode form...
domain_extractor.is_valid_domain('domain.اتصالات')
# True

# ...and in Punycode (xn--) form.
domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
# True

domain_extractor.is_valid_domain('domain-.com')
# False

domain_extractor.is_valid_domain('-sub.domain.com')
# False

domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
# False
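
The validation examples accept both Unicode and `xn--` forms of internationalized names. As a rough illustration of how the two forms relate, the sketch below converts between them with Python's built-in `idna` codec. This is an assumption-laden aside: the codec implements the older IDNA 2003 rules, whereas PyDomainExtractor links against libidn2, so results may differ for some inputs. The helper names `to_ascii` and `to_unicode` are hypothetical.

```python
def to_ascii(domain: str) -> str:
    """Convert each label of a Unicode domain to its ASCII (Punycode) form.

    Uses Python's built-in 'idna' codec (IDNA 2003), not libidn2.
    """
    return domain.encode('idna').decode('ascii')


def to_unicode(domain: str) -> str:
    """Convert an ASCII (Punycode) domain back to its Unicode form."""
    return domain.encode('ascii').decode('idna')


ascii_form = to_ascii('domain.اتصالات')
print(ascii_form)
print(to_unicode(ascii_form))
```

Either form can then be passed to `is_valid_domain`, as in the examples above.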

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/Intsights/PyDomainExtractor
