
Predict categories based on domain names and their content


The package infers the kind of content hosted by a domain using the domain name, the textual content, and the screenshot of the homepage.

We use domain category labels from Shallalist and build our own training dataset by scraping each homepage and taking a screenshot of it. The final dataset used to train the model is posted on the Harvard Dataverse. Python notebooks used to build the models can be found here, and the model files can be found here.

Installation

We strongly recommend installing piedomains inside a Python virtual environment (see venv documentation).

pip install piedomains

General API

  1. domain.pred_shalla_cat_with_text(input)

  • What it does:

  • Predicts the kind of content hosted by a domain based on the domain name and the HTML of the homepage.

  • The function can use locally stored HTML files or fetch fresh HTML files.

  • If you specify a local folder, the function will look for HTML files corresponding to the domain.

  • The HTML files must be stored as domainname.html.

  • The function returns a pandas dataframe with predicted labels and corresponding probabilities.

  • Inputs:

  • input: list of domains. Either input or html_path must be specified.

  • html_path: path to the folder where the HTMLs are stored. Either input or html_path must be specified.

  • latest: use the latest model. The default is True.

  • Note: By default, the function looks for an html folder on the same level as the model files.

  • Output:

  • Returns a pandas dataframe with the predicted labels and probabilities

  • Sample usage:

    from piedomains import domain
    domains = [
        "forbes.com",
        "xvideos.com",
        "last.fm",
        "facebook.com",
        "bellesa.co",
        "marketwatch.com"
    ]
    # with only domains
    result = domain.pred_shalla_cat_with_text(domains)
    # with html path where htmls are stored (offline mode)
    result = domain.pred_shalla_cat_with_text(html_path="path/to/htmls")
    # with domains and html path, html_path will be used to store htmls
    result = domain.pred_shalla_cat_with_text(domains, html_path="path/to/htmls")
    print(result)
  • Sample output:

                domain  text_label  text_prob  \
    0      xvideos.com        porn   0.918919
    1  marketwatch.com     finance   0.627119
    2       forbes.com        news   0.575000
    3       bellesa.co        porn   0.962932
    4     facebook.com  recreation   0.200815
    5          last.fm       music   0.229545
    
                                      text_domain_probs  used_domain_text  \
    0  {'adv': 0.001249639527059502, 'aggressive': 9....              True
    1  {'adv': 0.001249639527059502, 'aggressive': 9....              True
    2  {'adv': 0.010590500641848523, 'aggressive': 0....              True
    3  {'adv': 0.00021545223423966907, 'aggressive': ...              True
    4  {'adv': 0.006381039197812215, 'aggressive': 0....              True
    5  {'adv': 0.002181818181818182, 'aggressive': 0....              True
    
                                          extracted_text
    0  xvideos furry ass history mature rough redhead...
    1  marketwatch gold stocks video chrome economy v...
    2  forbes featured leadership watch money breakin...
    3  bellesa audio vixen sensual passionate orgy ki...
    4    facebook watch messenger portal bulletin oculus
    5  last twitter music reset company back merchand...
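
Because local files must follow the domainname.html naming rule, it is possible to check ahead of time which domains a local folder already covers. The following is a minimal sketch using only the standard library. `split_cached` is a hypothetical helper, not part of piedomains, and it assumes the file name is exactly the domain followed by the suffix.

```python
from pathlib import Path


def split_cached(domains, folder, suffix=".html"):
    """Split domains into those with and without a local <domain><suffix> file.

    Hypothetical helper, not part of piedomains.
    """
    folder = Path(folder)
    cached, missing = [], []
    for name in domains:
        # Assumed naming rule from the text above: domainname.html
        if (folder / f"{name}{suffix}").is_file():
            cached.append(name)
        else:
            missing.append(name)
    return cached, missing
```

For example, `cached, missing = split_cached(domains, "path/to/htmls")` lists the domains that would still need a fresh fetch. Passing `suffix=".jpg"` applies the same check to a screenshot folder.
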
  2. domain.pred_shalla_cat_with_images(input)

  • What it does:

  • Predicts the kind of content hosted by a domain based on a screenshot of the homepage.

  • The function can use locally stored screenshot files or fetch fresh screenshots of the homepage.

  • If you specify a local folder, the function will look for JPEG files corresponding to the domain.

  • The screenshots must be stored as domainname.jpg.

  • The function returns a pandas dataframe with the predicted labels and corresponding probabilities.

  • Inputs:

  • input: list of domains. Either input or image_path must be specified.

  • image_path: path to the folder where the screenshots are stored. Either input or image_path must be specified.

  • latest: use the latest model. The default is True.

  • Note: By default, the function looks for an images folder on the same level as the model files.

  • Output:

  • Returns a pandas dataframe with the predicted labels and probabilities

  • Sample usage:

    from piedomains import domain
    domains = [
        "forbes.com",
        "xvideos.com",
        "last.fm",
        "facebook.com",
        "bellesa.co",
        "marketwatch.com"
    ]
    # with only domains
    result = domain.pred_shalla_cat_with_images(domains)
    # with image path where images are stored (offline mode)
    result = domain.pred_shalla_cat_with_images(image_path="path/to/images")
    # with domains and image path, image_path will be used to store images
    result = domain.pred_shalla_cat_with_images(domains, image_path="path/to/images")
    print(result)
  • Sample output:

                domain image_label  image_prob  \
    0       bellesa.co    shopping    0.366663
    1     facebook.com        porn    0.284601
    2  marketwatch.com  recreation    0.367953
    3      xvideos.com        porn    0.916550
    4       forbes.com  recreation    0.415165
    5          last.fm    shopping    0.303097
    
                                      image_domain_probs  used_domain_screenshot
    0  {'adv': 0.0009261096129193902, 'aggressive': 3...                    True
    1  {'adv': 0.030470917001366615, 'aggressive': 0....                    True
    2  {'adv': 0.006861348636448383, 'aggressive': 0....                    True
    3  {'adv': 0.0004964823601767421, 'aggressive': 0...                    True
    4  {'adv': 0.0016061498317867517, 'aggressive': 8...                    True
    5  {'adv': 0.007956285960972309, 'aggressive': 0....                    True
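
The image_domain_probs column appears to hold one dictionary per domain, mapping each category to its probability. When the top label has a low probability, as in several rows above, it can help to inspect the runners-up as well. The following is a minimal sketch. `top_k` is a hypothetical helper, not part of piedomains, and the dictionary uses made-up values for illustration.

```python
import heapq


def top_k(probs, k=3):
    """Return the k (category, probability) pairs with the highest probability."""
    return heapq.nlargest(k, probs.items(), key=lambda item: item[1])


# Made-up probabilities, not real model output.
example = {"adv": 0.03, "porn": 0.28, "recreation": 0.25, "shopping": 0.20}
print(top_k(example, k=2))  # [('porn', 0.28), ('recreation', 0.25)]
```

The same helper would work on a text_domain_probs dictionary, since it has the same category-to-probability shape.
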
  3. domain.pred_shalla_cat(input)

  • What it does:

  • Predicts the kind of content hosted by a domain based on the domain name, the HTML of the homepage, and a screenshot of the homepage.

  • The function can use locally stored screenshots and HTMLs or fetch fresh data.

  • If you specify local folders, the function will look for JPEG and HTML files corresponding to the domain.

  • The screenshots must be stored as domainname.jpg.

  • The HTML files must be stored as domainname.html.

  • The function returns a pandas dataframe with the predicted labels and corresponding probabilities.

  • Inputs:

  • input: list of domains. Either input, image_path, or html_path must be specified.

  • html_path: path to the folder where the HTMLs are stored. Either input, image_path, or html_path must be specified.

  • image_path: path to the folder where the screenshots are stored. Either input, image_path, or html_path must be specified.

  • latest: use the latest model. The default is True.

  • Note: By default, the function looks for an html folder on the same level as the model files.

  • Note: By default, the function looks for an images folder on the same level as the model files.

  • Output:

  • Returns a pandas dataframe with the predicted labels and probabilities

  • Sample usage:

    from piedomains import domain
    domains = [
        "forbes.com",
        "xvideos.com",
        "last.fm",
        "facebook.com",
        "bellesa.co",
        "marketwatch.com"
    ]
    # with only domains
    result = domain.pred_shalla_cat(domains)
    # with html path where htmls are stored (offline mode)
    result = domain.pred_shalla_cat(html_path="path/to/htmls")
    # with image path where images are stored (offline mode)
    result = domain.pred_shalla_cat(image_path="path/to/images")
    print(result)
  • Sample output:

                domain  text_label  text_prob  \
    0      xvideos.com        porn   0.918919
    1  marketwatch.com     finance   0.627119
    2       forbes.com        news   0.575000
    3       bellesa.co        porn   0.962932
    4     facebook.com  recreation   0.200815
    5          last.fm       music   0.229545
    
                                      text_domain_probs  used_domain_text  \
    0  {'adv': 0.001249639527059502, 'aggressive': 9....              True
    1  {'adv': 0.001249639527059502, 'aggressive': 9....              True
    2  {'adv': 0.010590500641848523, 'aggressive': 0....              True
    3  {'adv': 0.00021545223423966907, 'aggressive': ...              True
    4  {'adv': 0.006381039197812215, 'aggressive': 0....              True
    5  {'adv': 0.002181818181818182, 'aggressive': 0....              True
    
                                          extracted_text image_label  image_prob  \
    0  xvideos furry ass history mature rough redhead...        porn    0.916550
    1  marketwatch gold stocks video chrome economy v...  recreation    0.370665
    2  forbes featured leadership watch money breakin...  recreation    0.422517
    3  bellesa audio vixen sensual passionate orgy ki...        porn    0.409875
    4    facebook watch messenger portal bulletin oculus        porn    0.284601
    5  last twitter music reset company back merchand...    shopping    0.420788
    
                                      image_domain_probs  used_domain_screenshot  \
    0  {'adv': 0.0004964823601767421, 'aggressive': 0...                    True
    1  {'adv': 0.007065971381962299, 'aggressive': 0....                    True
    2  {'adv': 0.0016623957781121135, 'aggressive': 7...                    True
    3  {'adv': 0.0008810096187517047, 'aggressive': 0...                    True
    4  {'adv': 0.030470917001366615, 'aggressive': 0....                    True
    5  {'adv': 0.01235155574977398, 'aggressive': 0.0...                    True
    
          label  label_prob                              combined_domain_probs
    0      porn    0.917735  {'adv': 0.0008730609436181221, 'aggressive': 0...
    1   finance    0.315346  {'adv': 0.004157805454510901, 'aggressive': 0....
    2      news    0.367533  {'adv': 0.006126448209980318, 'aggressive': 0....
    3      porn    0.686404  {'adv': 0.0005482309264956868, 'aggressive': 0...
    4      porn    0.223327  {'adv': 0.018425978099589416, 'aggressive': 0....
    5  shopping    0.232422  {'adv': 0.007266686965796081, 'aggressive': 0....
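
In the sample output above, label_prob for xvideos.com (0.917735) matches the mean of text_prob (0.918919) and image_prob (0.916550), and the same holds for bellesa.co. This is consistent with the combined probabilities being an equal-weight average of the text and image probabilities, although that is an inference from the numbers shown rather than a documented guarantee. The following is a minimal sketch of that assumed combination step. `combine_probs` is a hypothetical helper, not part of piedomains.

```python
def combine_probs(text_probs, image_probs):
    """Average two category->probability dicts and pick the top category.

    Assumption: equal-weight averaging, inferred from the sample output.
    """
    categories = set(text_probs) | set(image_probs)
    combined = {
        c: (text_probs.get(c, 0.0) + image_probs.get(c, 0.0)) / 2
        for c in categories
    }
    label = max(combined, key=combined.get)
    return label, combined[label], combined


# Two categories from the xvideos.com row; all other categories are omitted.
text = {"porn": 0.918919, "adv": 0.001249639527059502}
image = {"porn": 0.916550, "adv": 0.0004964823601767421}
label, prob, combined = combine_probs(text, image)
print(label, round(prob, 4))
```

Averaging explains why a confident text prediction can be pulled down by an uncertain image prediction, as in the marketwatch.com row, where label_prob is about half of text_prob.
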

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.
