piedomains

Predict categories based on domain names and their content

The package infers the kind of content hosted by a domain using the domain name or full URL, the textual content, and the screenshot of the homepage.

We use domain category labels from Shallalist and build our own training dataset by scraping each homepage and taking a screenshot of it. The final dataset used to train the model is posted on the Harvard Dataverse. Python notebooks used to build the models can be found here, and the model files can be found here.

Installation

We strongly recommend installing piedomains inside a Python virtual environment (see venv documentation).

pip install piedomains

General API

domain.pred_shalla_cat_with_text(input)

What it does:

Predicts the kind of content hosted by a domain based on the domain name or full URL and the HTML content.

The function can use locally stored HTML files or fetch fresh HTML files from the specified URLs.

If you specify a local folder, the function will look for HTML files corresponding to the domain name.

The HTML files must be stored as domainname.html.

The function returns a pandas dataframe with predicted labels and corresponding probabilities.
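
The naming rule above can be illustrated with a small helper. This is a sketch rather than the package's own code: `expected_html_name` is a hypothetical function, and it assumes that a full URL is reduced to its host name, as the sample output suggests (where `https://facebook.com/news` is reported as `facebook.com`). Whether the package also strips a leading `www.` is not stated, so the sketch leaves the host untouched.

```python
from urllib.parse import urlparse


def expected_html_name(url_or_domain: str) -> str:
    """Return the file name a cached page is assumed to use (hypothetical helper)."""
    # urlparse only recognises the host when a scheme or "//" prefix is present.
    candidate = url_or_domain if "://" in url_or_domain else "//" + url_or_domain
    host = urlparse(candidate).hostname or ""
    return f"{host}.html"


for item in ["forbes.com", "https://facebook.com/news"]:
    print(item, "->", expected_html_name(item))
```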

Inputs:

input: list of URLs or domain names. Either input or html_path must be specified.

html_path: path to the folder where the HTMLs are stored. Either input or html_path must be specified.

latest: use the latest model. The default is True.

Note: The function will by default look for an html folder at the same level as the model files.
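
Before running in offline mode it can help to check which domains a folder actually covers. The sketch below is an illustration under the naming convention described above (`domainname.html`); `cached_domains` is a hypothetical helper and is not part of piedomains.

```python
from pathlib import Path


def cached_domains(html_path: str) -> list[str]:
    """List the domains that have a cached <domain>.html file in html_path."""
    folder = Path(html_path)
    # Path.stem drops only the final suffix, so "last.fm.html" becomes "last.fm".
    return sorted(p.stem for p in folder.glob("*.html"))
```

Calling `cached_domains("path/to/htmls")` returns the domain names that an offline run could be expected to find in that folder.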

Output:

Returns a pandas dataframe with the predicted labels and probabilities.

Sample usage:

from piedomains import domain
# URLs and domains can be mixed
inputs = [
    "forbes.com",
    "https://xvideos.com",
    "last.fm",
    "https://facebook.com/news",
    "bellesa.co",
    "https://marketwatch.com/investing"
]
# with URLs/domains
result = domain.pred_shalla_cat_with_text(inputs)
# with html path where htmls are stored (offline mode)
result = domain.pred_shalla_cat_with_text(html_path="path/to/htmls")
# with URLs/domains and html path, html_path will be used to store htmls
result = domain.pred_shalla_cat_with_text(inputs, html_path="path/to/htmls")
print(result)

Sample output:

            domain  text_label  text_prob  \
0      xvideos.com        porn   0.918919
1  marketwatch.com     finance   0.627119
2       forbes.com        news   0.575000
3       bellesa.co        porn   0.962932
4     facebook.com  recreation   0.200815
5          last.fm       music   0.229545

                                  text_domain_probs  used_domain_text  \
0  {'adv': 0.001249639527059502, 'aggressive': 9....              True
1  {'adv': 0.001249639527059502, 'aggressive': 9....              True
2  {'adv': 0.010590500641848523, 'aggressive': 0....              True
3  {'adv': 0.00021545223423966907, 'aggressive': ...              True
4  {'adv': 0.006381039197812215, 'aggressive': 0....              True
5  {'adv': 0.002181818181818182, 'aggressive': 0....              True

                                      extracted_text
0  xvideos furry ass history mature rough redhead...
1  marketwatch gold stocks video chrome economy v...
2  forbes featured leadership watch money breakin...
3  bellesa audio vixen sensual passionate orgy ki...
4    facebook watch messenger portal bulletin oculus
5  last twitter music reset company back merchand...
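
The `text_domain_probs` column appears to hold one dictionary per row that maps each category to its probability. Assuming that is the case, the sketch below ranks the categories for a single row. `top_categories` is a hypothetical helper, and the probabilities in `example_probs` are illustrative values, not model output.

```python
def top_categories(probs: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable (category, probability) pairs."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Illustrative values only; a real row holds one entry per category.
example_probs = {"adv": 0.0012, "news": 0.575, "finance": 0.21, "shopping": 0.05}
print(top_categories(example_probs, k=2))
```

With a real result, the dictionaries could be read row by row, for example from `result.to_dict("records")`.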

domain.pred_shalla_cat_with_images(input)

What it does:

Predicts the kind of content hosted by a domain based on a screenshot of the homepage.

The function can use locally stored screenshot files or fetch fresh screenshots of the homepage.

If you specify a local folder, the function will look for JPEG files corresponding to the domain.

The screenshots must be stored as domainname.jpg.

The function returns a pandas dataframe with the predicted labels and corresponding probabilities.

Inputs:

input: list of domains. Either input or image_path must be specified.

image_path: path to the folder where the screenshots are stored. Either input or image_path must be specified.

latest: use the latest model. Default is True.

Note: The function will by default look for an images folder at the same level as the model files.

Output:

Returns a pandas dataframe with the predicted labels and probabilities.

Sample usage:

from piedomains import domain
domains = [
    "forbes.com",
    "xvideos.com",
    "last.fm",
    "facebook.com",
    "bellesa.co",
    "marketwatch.com"
]
# with only domains
result = domain.pred_shalla_cat_with_images(domains)
# with image path where images are stored (offline mode)
result = domain.pred_shalla_cat_with_images(image_path="path/to/images")
# with domains and image path, image_path will be used to store images
result = domain.pred_shalla_cat_with_images(domains, image_path="path/to/images")
print(result)

Sample output:

            domain image_label  image_prob  \
0       bellesa.co    shopping    0.366663
1     facebook.com        porn    0.284601
2  marketwatch.com  recreation    0.367953
3      xvideos.com        porn    0.916550
4       forbes.com  recreation    0.415165
5          last.fm    shopping    0.303097

                                  image_domain_probs  used_domain_screenshot
0  {'adv': 0.0009261096129193902, 'aggressive': 3...                    True
1  {'adv': 0.030470917001366615, 'aggressive': 0....                    True
2  {'adv': 0.006861348636448383, 'aggressive': 0....                    True
3  {'adv': 0.0004964823601767421, 'aggressive': 0...                    True
4  {'adv': 0.0016061498317867517, 'aggressive': 8...                    True
5  {'adv': 0.007956285960972309, 'aggressive': 0....                    True
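
Several image-based predictions in the sample output carry low probabilities (for example, facebook.com at about 0.28), so it may be worth separating confident rows from uncertain ones before using the labels. The sketch below is one way to do that on plain row dictionaries; `flag_low_confidence` is a hypothetical helper and the 0.5 threshold is an arbitrary assumption, not a value recommended by the package.

```python
def flag_low_confidence(rows, prob_key="image_prob", threshold=0.5):
    """Split prediction rows into (confident, uncertain) lists by probability."""
    confident, uncertain = [], []
    for row in rows:
        (confident if row[prob_key] >= threshold else uncertain).append(row)
    return confident, uncertain


# Two rows copied from the sample output above.
rows = [
    {"domain": "xvideos.com", "image_label": "porn", "image_prob": 0.916550},
    {"domain": "facebook.com", "image_label": "porn", "image_prob": 0.284601},
]
confident, uncertain = flag_low_confidence(rows)
print([row["domain"] for row in uncertain])
```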

domain.pred_shalla_cat(input)

What it does:

Predicts the kind of content hosted by a domain based on the domain name, the HTML content, and a screenshot of the homepage.

The function can use locally stored screenshots and HTML files or fetch fresh data.

If you specify local folders, the function will look for the JPEG and HTML files corresponding to the domain.

The screenshots must be stored as domainname.jpg.

The HTML files must be stored as domainname.html.

The function returns a pandas dataframe with the predicted labels and corresponding probabilities.

Archive.org Historical Classification (NEW)

domain.pred_shalla_cat_archive(input, archive_date)
domain.pred_shalla_cat_with_text_archive(input, archive_date)
domain.pred_shalla_cat_with_images_archive(input, archive_date)

What it does:

Predicts content categories using historical snapshots from archive.org.

Fetches content from the available snapshot closest to the specified date.

Supports the same analyses as the regular functions, but on historical data.

Useful for analyzing how a website's content has changed over time.
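
To make "closest available snapshot" concrete, the sketch below builds a query for the Wayback Machine availability endpoint and reads the `closest` entry from its JSON reply. It is an assumption that piedomains resolves snapshots this way; the package may use a different endpoint or client. Both function names are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode


def availability_url(domain: str, archive_date: str) -> str:
    """Build a Wayback Machine availability query for a domain and a YYYYMMDD date."""
    query = urlencode({"url": domain, "timestamp": archive_date})
    return f"https://archive.org/wayback/available?{query}"


def closest_snapshot(payload: dict):
    """Return (timestamp, url) of the closest snapshot, or None if none is available."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


print(availability_url("cnn.com", "20200101"))
```

The snapshot that comes back can be dated some distance from the requested day, so it is worth checking the returned timestamp.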

Inputs:

input: list of URLs or domain names to classify

archive_date: target date as a 'YYYYMMDD' string (e.g., '20200101' for Jan 1, 2020)

html_path: optional path for storing archived HTML files

image_path: optional path for storing archived screenshots

use_cache: whether to reuse existing archived files

latest: whether to download the latest model version
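
Because `archive_date` is a plain string, it is easy to pass a malformed value. The helpers below are a minimal sketch, using only the standard library, for producing and checking the 'YYYYMMDD' format; both names are hypothetical and not part of piedomains.

```python
from datetime import date, datetime


def to_archive_date(day: date) -> str:
    """Format a date as a YYYYMMDD string."""
    return day.strftime("%Y%m%d")


def is_valid_archive_date(text: str) -> bool:
    """Check that text is exactly eight ASCII digits forming a real calendar date."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


print(to_archive_date(date(2020, 1, 1)))  # 20200101
```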

Sample usage:

from piedomains import domain

# Classify domains using content from January 1, 2020
domains = ["amazon.com", "facebook.com", "cnn.com"]
result = domain.pred_shalla_cat_archive(domains, "20200101")
print(result[["domain", "pred_label", "pred_prob", "archive_date"]])

# Text-only classification from archive
text_result = domain.pred_shalla_cat_with_text_archive(domains, "20200101")

# Compare different time periods
old_result = domain.pred_shalla_cat_archive(domains, "20100101")  # 2010
new_result = domain.pred_shalla_cat_archive(domains, "20200101")  # 2020
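
Once two time periods have been classified, the results can be compared domain by domain. The sketch below assumes the `domain` and `pred_label` columns used in the print statement above and works on plain row dictionaries, such as those from `old_result.to_dict("records")`. `label_changes` is a hypothetical helper, and the rows in the example are made up for illustration.

```python
def label_changes(old_rows, new_rows, label_key="pred_label"):
    """Return {domain: (old_label, new_label)} for domains whose label differs."""
    old_labels = {row["domain"]: row[label_key] for row in old_rows}
    changes = {}
    for row in new_rows:
        previous = old_labels.get(row["domain"])
        if previous is not None and previous != row[label_key]:
            changes[row["domain"]] = (previous, row[label_key])
    return changes


# Made-up rows for illustration; these are not real model output.
old_rows = [{"domain": "example.com", "pred_label": "news"},
            {"domain": "example.org", "pred_label": "shopping"}]
new_rows = [{"domain": "example.com", "pred_label": "shopping"},
            {"domain": "example.org", "pred_label": "shopping"}]
print(label_changes(old_rows, new_rows))
```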

Inputs for domain.pred_shalla_cat:

input: list of domains. Either input, image_path, or html_path must be specified.

html_path: path to the folder where the HTML files are stored. Either input, image_path, or html_path must be specified.

image_path: path to the folder where the screenshots are stored. Either input, image_path, or html_path must be specified.

latest: use the latest model. Default is True.

Note: The function will by default look for an html folder at the same level as the model files.

Note: The function will by default look for an images folder at the same level as the model files.

Output:

Returns a pandas dataframe with the predicted labels and probabilities.

Sample usage:

from piedomains import domain
domains = [
    "forbes.com",
    "xvideos.com",
    "last.fm",
    "facebook.com",
    "bellesa.co",
    "marketwatch.com"
]
# with only domains
result = domain.pred_shalla_cat(domains)
# with html path where htmls are stored (offline mode)
result = domain.pred_shalla_cat(html_path="path/to/htmls")
# with image path where images are stored (offline mode)
result = domain.pred_shalla_cat(image_path="path/to/images")
print(result)

Sample output:

            domain  text_label  text_prob  \
0      xvideos.com        porn   0.918919
1  marketwatch.com     finance   0.627119
2       forbes.com        news   0.575000
3       bellesa.co        porn   0.962932
4     facebook.com  recreation   0.200815
5          last.fm       music   0.229545

                                  text_domain_probs  used_domain_text  \
0  {'adv': 0.001249639527059502, 'aggressive': 9....              True
1  {'adv': 0.001249639527059502, 'aggressive': 9....              True
2  {'adv': 0.010590500641848523, 'aggressive': 0....              True
3  {'adv': 0.00021545223423966907, 'aggressive': ...              True
4  {'adv': 0.006381039197812215, 'aggressive': 0....              True
5  {'adv': 0.002181818181818182, 'aggressive': 0....              True

                                      extracted_text image_label  image_prob  \
0  xvideos furry ass history mature rough redhead...        porn    0.916550
1  marketwatch gold stocks video chrome economy v...  recreation    0.370665
2  forbes featured leadership watch money breakin...  recreation    0.422517
3  bellesa audio vixen sensual passionate orgy ki...        porn    0.409875
4    facebook watch messenger portal bulletin oculus        porn    0.284601
5  last twitter music reset company back merchand...    shopping    0.420788

                                  image_domain_probs  used_domain_screenshot  \
0  {'adv': 0.0004964823601767421, 'aggressive': 0...                    True
1  {'adv': 0.007065971381962299, 'aggressive': 0....                    True
2  {'adv': 0.0016623957781121135, 'aggressive': 7...                    True
3  {'adv': 0.0008810096187517047, 'aggressive': 0...                    True
4  {'adv': 0.030470917001366615, 'aggressive': 0....                    True
5  {'adv': 0.01235155574977398, 'aggressive': 0.0...                    True

      label  label_prob                              combined_domain_probs
0      porn    0.917735  {'adv': 0.0008730609436181221, 'aggressive': 0...
1   finance    0.315346  {'adv': 0.004157805454510901, 'aggressive': 0....
2      news    0.367533  {'adv': 0.006126448209980318, 'aggressive': 0....
3      porn    0.686404  {'adv': 0.0005482309264956868, 'aggressive': 0...
4      porn    0.223327  {'adv': 0.018425978099589416, 'aggressive': 0....
5  shopping    0.232422  {'adv': 0.007266686965796081, 'aggressive': 0....
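
The numbers in the sample output are consistent with the combined probabilities being a simple average of the text and image probabilities. For xvideos.com, for example, the text and image values 0.918919 and 0.916550 average to roughly the 0.917735 shown in label_prob. This is an inference from the figures shown, not a documented guarantee, and `average_probs` below is a hypothetical helper that only illustrates the arithmetic.

```python
def average_probs(text_probs: dict, image_probs: dict) -> dict:
    """Average two category -> probability dicts over their shared categories."""
    shared = text_probs.keys() & image_probs.keys()
    return {cat: (text_probs[cat] + image_probs[cat]) / 2 for cat in shared}


# 'adv' values for xvideos.com (row 0) from the text and image columns above.
combined = average_probs({"adv": 0.001249639527059502}, {"adv": 0.0004964823601767421})
print(combined["adv"])  # close to the 'adv' value in combined_domain_probs for row 0
```

Under the same assumption, the final label would be the category with the highest averaged probability, for example `max(combined, key=combined.get)`.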

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.
