Predict categories based on domain names and their content

piedomains

Classify website content categories using machine learning models or LLMs (GPT-4, Claude, Gemini).

Installation

pip install piedomains

Requires Python 3.11+

Basic Usage

from piedomains import DomainClassifier

classifier = DomainClassifier()
result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
print(result[['domain', 'pred_label', 'pred_prob']])

# Output:
#           domain pred_label  pred_prob
# 0        cnn.com       news   0.876543
# 1     amazon.com   shopping   0.923456
# 2  wikipedia.org  education   0.891234
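
The column selection above suggests that `classify` returns a pandas DataFrame. A minimal sketch of dropping low-confidence predictions, assuming the rows have first been converted to dictionaries (for example with `result.to_dict("records")`). The helper name `confident_predictions` and the 0.9 threshold are illustrative and not part of piedomains:

```python
def confident_predictions(records, threshold=0.9):
    """Keep rows whose predicted probability is at least `threshold`."""
    return [row for row in records if row["pred_prob"] >= threshold]


# Sample rows mirroring the illustrative output above (not real model output).
records = [
    {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
    {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
    {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
]

for row in confident_predictions(records):
    print(row["domain"], row["pred_label"])
# prints: amazon.com shopping
```

Rows below the threshold are good candidates for manual review or for a second pass with a different classification method.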

Classification Methods

# Combined text + image analysis (most accurate)
result = classifier.classify(["github.com"])

# Text-only classification (faster)
result = classifier.classify_by_text(["news.google.com"])

# Image-only classification
result = classifier.classify_by_images(["instagram.com"])

# Batch processing
domains = ["cnn.com", "amazon.com", "wikipedia.org"]
results = classifier.classify_batch(domains, method="text", batch_size=50)
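
To make the `batch_size` argument concrete, here is a rough sketch of how a list of domains can be split into batches of at most 50. This is a generic chunking helper (`chunked` is a hypothetical name), not the library's internal implementation, and the `.example` domains are placeholders:

```python
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    # islice pulls up to batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


domains = [f"site{i}.example" for i in range(120)]
sizes = [len(batch) for batch in chunked(domains, 50)]
print(sizes)  # [50, 50, 20]
```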

Historical Analysis

# Analyze archived versions from archive.org
old_result = classifier.classify(["facebook.com"], archive_date="20100101")
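
The `archive_date` value in the example appears to use the eight-digit `YYYYMMDD` form. A small sketch for building and validating such strings with the standard library; the helper names `to_archive_date` and `parse_archive_date` are hypothetical, and whether piedomains accepts other formats is not stated here:

```python
from datetime import date, datetime


def to_archive_date(day):
    """Format a date as an eight-digit YYYYMMDD string."""
    return day.strftime("%Y%m%d")


def parse_archive_date(text):
    """Parse YYYYMMDD, raising ValueError for malformed or impossible dates."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {text!r}")
    return datetime.strptime(text, "%Y%m%d").date()


print(to_archive_date(date(2010, 1, 1)))  # 20100101
print(parse_archive_date("20100101"))     # 2010-01-01
```

Validating the string before calling the classifier catches typos such as `2010-01-01` or `20100230` early, before any network request is made.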

LLM Classification

# Configure LLM provider
classifier.configure_llm(
    provider="openai",
    model="gpt-4o",
    api_key="sk-...",
    categories=["news", "shopping", "social", "tech"]
)

# LLM-powered classification
result = classifier.classify_by_llm(["example.com"])

# With custom instructions
result = classifier.classify_by_llm(
    ["site.com"],
    custom_instructions="Classify by educational value"
)

Set API keys via environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
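
If you prefer to pass the key explicitly, a minimal sketch of reading it from the environment instead of hard-coding it. The variable names come from the exports above; the provider labels other than `"openai"` and the helper `api_key_for` are assumptions for illustration, not confirmed piedomains values:

```python
import os

# Environment variable per provider. Only "openai" appears as a provider
# name in this README; "anthropic" and "google" are assumed labels.
ENV_VAR_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def api_key_for(provider, environ=None):
    """Return the API key for `provider`, or raise a descriptive error."""
    environ = os.environ if environ is None else environ
    try:
        variable = ENV_VAR_BY_PROVIDER[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    key = environ.get(variable)
    if not key:
        raise RuntimeError(f"{variable} is not set")
    return key


# Usage with an explicit mapping, so the example runs without real keys.
print(api_key_for("openai", {"OPENAI_API_KEY": "sk-test"}))  # sk-test
```

The returned value could then be passed as `api_key=` to `configure_llm`, which keeps secrets out of source control.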

Categories

The models predict one of 41 categories, including news, finance, shopping, education, government, adult content, gambling, social networks, and search engines, based on the Shallalist taxonomy.

Security

When analyzing unknown domains, use Docker or isolated environments:

docker build -t piedomains-sandbox .
docker run --rm -it piedomains-sandbox python -c "
from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify(['example.com'])
print(result[['domain', 'pred_label']])
"

For testing, use known-safe domains: ["wikipedia.org", "github.com", "cnn.com"]
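
One way to enforce this in a test suite is to filter inputs against the known-safe list before they reach the classifier. The sketch below uses only the standard library; `bare_domain` and `only_known_safe` are hypothetical helpers, and `evil.example` is a placeholder for an untrusted input:

```python
from urllib.parse import urlsplit

# The known-safe domains listed above.
KNOWN_SAFE = {"wikipedia.org", "github.com", "cnn.com"}


def bare_domain(value):
    """Reduce a URL or host string to a lowercase hostname without 'www.'."""
    value = value.strip()
    if "://" not in value:
        # Prefix with // so urlsplit treats the text as a network location.
        value = "//" + value
    host = urlsplit(value).hostname or ""
    return host.removeprefix("www.")


def only_known_safe(values):
    """Return normalized domains that appear in KNOWN_SAFE, in input order."""
    return [d for d in map(bare_domain, values) if d in KNOWN_SAFE]


inputs = ["https://www.github.com/themains", "Wikipedia.org", "evil.example"]
print(only_known_safe(inputs))  # ['github.com', 'wikipedia.org']
```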

Documentation

Development

git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Citation

@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}
