
Predict categories based on domain names and their content


piedomains predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.

🚀 Quickstart

Install and classify domains in 3 lines:

pip install piedomains

from piedomains import DomainClassifier
classifier = DomainClassifier()

# Classify current content
result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
print(result[['domain', 'pred_label', 'pred_prob']])

# Example output (illustrative; actual labels and probabilities will vary):
#           domain pred_label  pred_prob
# 0        cnn.com       news   0.876543
# 1     amazon.com   shopping   0.923456
# 2  wikipedia.org  education   0.891234
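
A common next step is to keep only the confident predictions. The sketch below is an illustration, not part of piedomains: confident_predictions is a hypothetical helper that works on plain records (which pandas can produce with result.to_dict("records")) and assumes the domain, pred_label, and pred_prob columns shown above. The 0.9 threshold is arbitrary, and the sample rows simply copy the output shown above rather than real model output.

```python
def confident_predictions(records, threshold=0.9):
    """Return (domain, label) pairs whose predicted probability meets the threshold."""
    return [
        (row["domain"], row["pred_label"])
        for row in records
        if row["pred_prob"] >= threshold
    ]


# Sample rows copied from the output shown above (not real model output).
sample = [
    {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
    {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
    {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
]

kept = confident_predictions(sample, threshold=0.9)
print(kept)  # [('amazon.com', 'shopping')]
```

Domains that fall below the threshold could then be reviewed by hand or re-run with a different method.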

📊 Key Features

  • High Accuracy: Combines text analysis with homepage screenshots; reported accuracy is 85-95%, depending on content type and method

  • Historical Analysis: Classify websites as they appeared on past dates, using archive.org snapshots

  • Fast & Scalable: Batch processing with caching for 1000s of domains

  • Easy Integration: Modern Python API with pandas output

  • 41 Categories: From news/finance to adult/gambling content

Usage Examples

Basic Classification

from piedomains import DomainClassifier

classifier = DomainClassifier()

# Combined analysis (most accurate)
result = classifier.classify(["github.com", "reddit.com"])

# Text-only (faster)
result = classifier.classify_by_text(["news.google.com"])

# Images-only (good for visual content)
result = classifier.classify_by_images(["instagram.com"])

Historical Analysis

# Analyze how Facebook looked in 2010 vs today
old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
new_facebook = classifier.classify(["facebook.com"])

print(f"2010: {old_facebook.iloc[0]['pred_label']}")
print(f"Today: {new_facebook.iloc[0]['pred_label']}")
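
The example above passes archive_date as an 8-digit string, which suggests a YYYYMMDD format. As a minimal sketch under that assumption (whether other formats are accepted is not stated here), the hypothetical helper to_archive_date below builds such strings from datetime.date objects, which is handy when comparing several years.

```python
from datetime import date


def to_archive_date(day: date) -> str:
    """Format a date as YYYYMMDD, the form used for archive_date in the example above."""
    return day.strftime("%Y%m%d")


# One snapshot date per year of interest (years chosen only for illustration).
snapshots = [to_archive_date(date(year, 1, 1)) for year in (2010, 2015, 2020)]
print(snapshots)  # ['20100101', '20150101', '20200101']
```

Each string could then be passed as archive_date in a separate classify call.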

Batch Processing

# Process large lists efficiently
domains = ["site1.com", "site2.com"]  # placeholder names; extend to 1000s of domains
results = classifier.classify_batch(
    domains,
    method="text",           # one of "text", "images", "combined"
    batch_size=50,           # Process 50 at a time
    show_progress=True       # Progress bar
)
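
To make the effect of batch_size concrete, the sketch below splits a list into consecutive chunks of at most 50 items. It illustrates the general chunking idea only; it is not piedomains' internal implementation, and chunked and the generated domain names are hypothetical.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 120 placeholder domains, split the way batch_size=50 would group them.
domains = [f"site{i}.com" for i in range(1, 121)]
batches = list(chunked(domains, 50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

The last batch is simply shorter when the list length is not a multiple of the batch size.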

🏷️ Supported Categories

News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.

📈 Performance

  • Speed: ~10-50 domains/minute (depends on method and network)

  • Accuracy: 85-95% depending on content type and method

  • Memory: <500MB for batch processing

  • Caching: Automatic content caching for faster re-runs
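
The speed range above gives a rough way to budget a run. The sketch below is plain arithmetic on the quoted 10-50 domains/minute figure; estimated_minutes is a hypothetical helper, and real throughput depends on method and network as noted.

```python
def estimated_minutes(n_domains: int, low_rate: float = 10.0, high_rate: float = 50.0):
    """Return (best_case, worst_case) minutes for n_domains at the given domains/minute rates."""
    return n_domains / high_rate, n_domains / low_rate


best, worst = estimated_minutes(1000)
print(f"{best:.0f}-{worst:.0f} minutes")  # 20-100 minutes
```

At the quoted rates, 1,000 uncached domains would take roughly 20 to 100 minutes.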

🔧 Installation

Requirements: Python 3.9+

# Basic installation
pip install piedomains

# For development
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e .

🔄 Migration from v0.2.x

Old API (still supported):

from piedomains import domain
result = domain.pred_shalla_cat_with_text(["example.com"])

New API (recommended):

from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify_by_text(["example.com"])

📖 Documentation

🤝 Contributing

# Setup development environment
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"

# Run tests
pytest piedomains/tests/ -v

# Run linting
flake8 piedomains/

📄 License

MIT License - see LICENSE file.

📚 Citation

If you use piedomains in research, please cite:

@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}

Legacy Documentation

For legacy API documentation, see LEGACY_API.rst
