Predict categories based on domain names and their content

piedomains: Classify website content using ML models or LLMs

🚀 What's New in v0.5.0

  • Playwright Migration: Complete transition from Selenium to modern Playwright for faster, more reliable web content extraction
  • 12.8x Performance Boost: Optimized parallel processing (13.2s → 1.0s per domain)
  • Enhanced Docker Security: Production-ready containerization with security sandboxing and resource limits
  • Unified Content Pipeline: Text and image extraction now use the same Playwright engine for consistency

Installation

pip install piedomains

Requires Python 3.11+

Basic Usage

from piedomains import DomainClassifier

classifier = DomainClassifier()
result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
print(result[['domain', 'pred_label', 'pred_prob']])

# Output:
#           domain pred_label  pred_prob
# 0        cnn.com       news   0.876543
# 1     amazon.com   shopping   0.923456
# 2  wikipedia.org  education   0.891234

Classification Methods

# Combined text + image analysis (most accurate)
result = classifier.classify(["github.com"])

# Text-only classification (faster)
result = classifier.classify_by_text(["news.google.com"])

# Image-only classification
result = classifier.classify_by_images(["instagram.com"])

# Batch processing
domains = ["cnn.com", "amazon.com", "wikipedia.org"]
results = classifier.classify_batch(domains, method="text", batch_size=50)
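
The `batch_size` argument presumably controls how many domains are handled per chunk. The following is a minimal sketch of that chunking idea in plain Python, not piedomains' actual implementation; `chunked` is a hypothetical helper and the `site{i}.example` names are placeholders.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical input: 120 placeholder domains.
domains = [f"site{i}.example" for i in range(120)]
batches = list(chunked(domains, batch_size=50))
print([len(b) for b in batches])  # [50, 50, 20]
```

Each batch could then be classified and its results saved before moving on, so a failure partway through a long list does not discard earlier work.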

Historical Analysis

# Analyze archived versions from archive.org
old_result = classifier.classify(["facebook.com"], archive_date="20100101")
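
The `archive_date` value above appears to follow a `YYYYMMDD` layout. Assuming that format, a date object can be converted with the standard library; `to_archive_date` is a hypothetical helper, not part of piedomains.

```python
from datetime import date


def to_archive_date(d: date) -> str:
    """Format a date as YYYYMMDD, the layout of the "20100101" example."""
    return d.strftime("%Y%m%d")


print(to_archive_date(date(2010, 1, 1)))  # 20100101
```

The resulting string could then be passed as `archive_date` to `classify`.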

LLM Classification

# Configure LLM provider
classifier.configure_llm(
    provider="openai",
    model="gpt-4o",
    api_key="sk-...",
    categories=["news", "shopping", "social", "tech"]
)

# LLM-powered classification
result = classifier.classify_by_llm(["example.com"])

# With custom instructions
result = classifier.classify_by_llm(
    ["site.com"],
    custom_instructions="Classify by educational value"
)

Set API keys via environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
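
On the Python side, a key exported this way can be read with `os.environ` instead of being hard-coded. The sketch below is one way to do that under stated assumptions: `load_api_key` and `ENV_VAR_BY_PROVIDER` are hypothetical names, and only `"openai"` appears as a provider string in this README; `"anthropic"` and `"google"` are guesses based on the variable names.

```python
import os

# Environment variable names from the exports above. Only "openai" is a
# provider string confirmed by this README; the other two are assumed.
ENV_VAR_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def load_api_key(provider: str) -> str:
    """Return the API key for `provider` or raise a descriptive error."""
    try:
        var_name = ENV_VAR_BY_PROVIDER[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    return key
```

The returned value could then be supplied as the `api_key` argument to `configure_llm`, which keeps the secret out of source files.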

Categories

The classifier predicts one of 41 categories based on the Shallalist taxonomy, including news, finance, shopping, education, government, adult content, gambling, social networks, and search engines.

Security & Docker

v0.5.0 includes production-ready Docker containerization for secure domain analysis:

# Build secure sandbox container
docker build -t piedomains-sandbox .

# Run with security constraints (2 GB RAM, 2 CPUs, read-only filesystem)
docker run --rm --memory=2g --cpus=2 --read-only \
  --tmpfs /tmp --tmpfs /var/tmp \
  piedomains-sandbox python -c "
from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify(['example.com'])
print(result[['domain', 'pred_label']])
"

Batch Processing in Container:

# Use the included secure classification script
cd examples/sandbox
echo -e "wikipedia.org\ngithub.com\ncnn.com" > domains.txt
python3 secure_classify.py --file domains.txt
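
When working from Python rather than the script, a one-domain-per-line file like `domains.txt` can be loaded into a list first. This is a minimal sketch and does not describe how `secure_classify.py` reads its input; `load_domains` is a hypothetical helper, and lowercasing and de-duplicating are illustrative choices.

```python
from pathlib import Path


def load_domains(path: str) -> list[str]:
    """Read one domain per line, skipping blank lines and duplicates."""
    seen: set[str] = set()
    domains: list[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        domain = line.strip().lower()
        if domain and domain not in seen:
            seen.add(domain)
            domains.append(domain)
    return domains
```

The resulting list has the same shape as the domain lists passed to `classify` elsewhere in this README.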

For testing, use known-safe domains: ["wikipedia.org", "github.com", "cnn.com"]

Development

git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Citation

@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}
