Predict categories based on domain names and their content

These details have been verified by PyPI

Project links

Owner

themains

GitHub Statistics

Maintainers

rajashekar

These details have not been verified by PyPI

Project description

piedomains: AI-powered domain content classification

piedomains predicts website content categories using traditional ML models or modern LLMs (GPT-4, Claude, Gemini). It analyzes domain names, text content, and homepage screenshots to classify websites as news, shopping, adult content, education, and so on, using either the default categories or categories you define.

🚀 Quickstart

Install and classify domains in 3 lines:

pip install piedomains

from piedomains import DomainClassifier
classifier = DomainClassifier()

# Classify current content
result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
print(result[['domain', 'pred_label', 'pred_prob']])

# Expected output:
#           domain  pred_label  pred_prob
# 0        cnn.com        news   0.876543
# 1     amazon.com    shopping   0.923456
# 2  wikipedia.org   education   0.891234
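
The `pred_prob` column in the output above can be used to set aside low-confidence predictions for manual review. A minimal sketch, using plain dictionaries with the same column names in place of the DataFrame the classifier returns; the `split_by_confidence` helper and the 0.9 threshold are illustrative assumptions, not part of piedomains:

```python
# Rows mirror the columns of the expected output above (values copied from it).
rows = [
    {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
    {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
    {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
]


def split_by_confidence(rows, threshold=0.9):
    """Hypothetical helper: separate confident rows from rows needing review."""
    confident = [r for r in rows if r["pred_prob"] >= threshold]
    review = [r for r in rows if r["pred_prob"] < threshold]
    return confident, review


confident, review = split_by_confidence(rows)
print([r["domain"] for r in confident])  # ['amazon.com']
print([r["domain"] for r in review])     # ['cnn.com', 'wikipedia.org']
```

With a pandas DataFrame, the equivalent filter is `result[result["pred_prob"] >= 0.9]`.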

📊 Key Features

High Accuracy: Combines text analysis and visual screenshots for 90%+ accuracy (see Performance for the range by method)
LLM-Powered: Use GPT-4o, Claude 3.5, Gemini with custom categories and instructions
Historical Analysis: Classify websites from any point in time using archive.org
Fast & Scalable: Batch processing with caching for 1000s of domains
Easy Integration: Modern Python API with pandas output
Flexible Categories: 41 default categories or define your own with AI models

⚡ Usage Examples

Basic Classification

from piedomains import DomainClassifier

classifier = DomainClassifier()

# Combined analysis (most accurate)
result = classifier.classify(["github.com", "reddit.com"])

# Text-only (faster)
result = classifier.classify_by_text(["news.google.com"])

# Images-only (good for visual content)
result = classifier.classify_by_images(["instagram.com"])

Historical Analysis

# Analyze how Facebook looked in 2010 vs today
old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
new_facebook = classifier.classify(["facebook.com"])

print(f"2010: {old_facebook.iloc[0]['pred_label']}")
print(f"Today: {new_facebook.iloc[0]['pred_label']}")
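
The `archive_date` value in the example above is written as an eight-digit `YYYYMMDD` string. A small sketch of building such a string from a `datetime.date`, assuming that is the format the parameter expects (the example shows only `"20100101"`); `to_archive_date` is a hypothetical helper, not part of piedomains:

```python
from datetime import date


def to_archive_date(d: date) -> str:
    """Format a date as YYYYMMDD, the form used in the example above."""
    return d.strftime("%Y%m%d")


print(to_archive_date(date(2010, 1, 1)))  # 20100101
```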

Batch Processing

# Process large lists efficiently
domains = ["site1.com", "site2.com", ...] # 1000s of domains
results = classifier.classify_batch(
    domains,
    method="text",           # text|images|combined
    batch_size=50,           # Process 50 at a time
    show_progress=True       # Progress bar
)
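
To illustrate what a `batch_size` of 50 means for a long list, here is a generic chunking sketch. It is not the library's implementation; `chunked` is a hypothetical helper and the domain names are placeholders:

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 placeholder domains split into batches of 50
domains = [f"site{i}.com" for i in range(1, 121)]
batches = list(chunked(domains, 50))
print([len(b) for b in batches])  # [50, 50, 20]
```

The last batch is shorter whenever the list length is not a multiple of the batch size.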

🤖 LLM-Powered Classification

Use modern AI models (GPT-4, Claude, Gemini) for flexible, accurate classification:

from piedomains import DomainClassifier

classifier = DomainClassifier()

# Configure your preferred AI provider
classifier.configure_llm(
    provider="openai",           # openai, anthropic, google
    model="gpt-4o",              # multimodal model
    api_key="sk-...",            # or set via environment variable
    categories=["news", "shopping", "social", "tech", "education"]
)

# Text-only LLM classification
result = classifier.classify_by_llm(["cnn.com", "github.com"])

# Multimodal classification (text + screenshots)
result = classifier.classify_by_llm_multimodal(["instagram.com"])

# Custom classification instructions
result = classifier.classify_by_llm(
    ["khanacademy.org", "reddit.com"],
    custom_instructions="Classify by educational value: educational, entertainment, mixed"
)

# Track usage and costs
stats = classifier.get_llm_usage_stats()
print(f"API calls: {stats['total_requests']}, Cost: ${stats['estimated_cost_usd']:.4f}")

LLM Benefits:

Custom Categories: Define your own classification schemes
Multimodal Analysis: Combines text + visual understanding
Latest AI: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
Cost Tracking: Built-in usage monitoring and limits
Flexible Prompts: Customize instructions for specific use cases

Supported Providers:

OpenAI: GPT-4o, GPT-4-turbo, GPT-3.5-turbo
Anthropic: Claude 3.5 Sonnet, Claude 3 Opus/Haiku
Google: Gemini 1.5 Pro, Gemini Pro Vision
Others: Any litellm-supported model

# Set API keys via environment variables
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
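
When passing `api_key` explicitly rather than relying on the environment, it helps to fail early if the variable is missing. A minimal sketch using only the standard library; `require_api_key` is a hypothetical helper, and `EXAMPLE_API_KEY` is a placeholder variable set here only so the snippet runs on its own:

```python
import os


def require_api_key(var_name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running")
    return key


# Demo only: a placeholder variable so the example is self-contained.
os.environ.setdefault("EXAMPLE_API_KEY", "placeholder-value")
print(bool(require_api_key("EXAMPLE_API_KEY")))  # True
```

The returned string could then be supplied as the `api_key` argument shown in the `configure_llm` example, for instance `api_key=require_api_key("OPENAI_API_KEY")`.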

🏷️ Supported Categories

News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.

📈 Performance

Speed: ~10-50 domains/minute (depends on method and network)
Accuracy: 85-95% depending on content type and method
Memory: <500MB for batch processing
Caching: Automatic content caching for faster re-runs

🔧 Installation

Requirements: Python 3.11+

# Basic installation
pip install piedomains

# For development
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e .

💡 API Usage

from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify_by_text(["example.com"])

📖 Documentation

API Reference: https://piedomains.readthedocs.io
Examples: /examples directory
Notebooks: /notebooks (training & analysis)

🤝 Contributing

# Setup development environment
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linting
ruff check piedomains/

📄 License

MIT License - see LICENSE file.

📚 Citation

If you use piedomains in research, please cite:

@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}

Project details

Release history

0.5.0

Dec 17, 2025

0.4.2

Dec 15, 2025

This version

0.4.1

Dec 15, 2025

0.4.0

Dec 15, 2025

0.3.10

Sep 2, 2025

0.3.9

Sep 2, 2025

0.3.8

Sep 2, 2025

0.3.7

Sep 2, 2025

0.3.6

Sep 2, 2025

0.3.5

Sep 2, 2025

0.3.4

Sep 2, 2025

0.3.3

Sep 2, 2025

0.3.2

Sep 1, 2025

0.3.1

Sep 1, 2025

0.3.0

Sep 1, 2025

0.2.1

Sep 1, 2025

0.2.0

Sep 1, 2025

0.1.0

Aug 30, 2025

0.0.19

Apr 28, 2023

0.0.18

Apr 20, 2023

0.0.17

Apr 17, 2023

0.0.16

Apr 14, 2023

0.0.15

Apr 14, 2023

0.0.14

Apr 13, 2023

0.0.13

Apr 13, 2023

0.0.12

Apr 13, 2023

0.0.11

Feb 5, 2023

0.0.10

Feb 4, 2023

0.0.9

Feb 4, 2023

0.0.8

Jan 29, 2023

0.0.7

Jan 29, 2023

0.0.6

Jan 28, 2023

0.0.5

Jan 28, 2023

0.0.4

Oct 28, 2022

0.0.3

Oct 28, 2022

0.0.2

May 4, 2022

0.0.1

May 3, 2022

Download files

Source Distribution

piedomains-0.4.1.tar.gz (3.6 MB)

Uploaded Dec 15, 2025 Source

Built Distribution

piedomains-0.4.1-py3-none-any.whl (137.9 kB)

Uploaded Dec 15, 2025 Python 3

File details

Details for the file piedomains-0.4.1.tar.gz.

File metadata

Download URL: piedomains-0.4.1.tar.gz
Upload date: Dec 15, 2025
Size: 3.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for piedomains-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`7824243847e9eb128747b2f1bba04518c57c84b43f5177450f4c15cf526331b9`
MD5	`d296ce99bbe8799e285489172ac737a5`
BLAKE2b-256	`6bac23ebb7313541b53b5465c35b47acdaedaa6423000bfb66b78a40a2ac9ceb`
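
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a sketch under the assumption that the archive has been saved locally; the function names and the local path in the usage comment are illustrative:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was downloaded to the current directory):
# matches_published_digest(
#     "piedomains-0.4.1.tar.gz",
#     "7824243847e9eb128747b2f1bba04518c57c84b43f5177450f4c15cf526331b9",
# )
```

Reading in chunks keeps memory use flat regardless of file size.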

Provenance

The following attestation bundles were made for piedomains-0.4.1.tar.gz:

Publisher: python-publish.yml on themains/piedomains

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: piedomains-0.4.1.tar.gz
- Subject digest: 7824243847e9eb128747b2f1bba04518c57c84b43f5177450f4c15cf526331b9
- Sigstore transparency entry: 764249114
- Sigstore integration time: Dec 15, 2025
Source repository:
- Permalink: themains/piedomains@c363b0011aea47d56959dd4ab27d4874c5b97ed3
- Branch / Tag: refs/heads/main
- Owner: https://github.com/themains
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c363b0011aea47d56959dd4ab27d4874c5b97ed3
- Trigger Event: workflow_dispatch

File details

Details for the file piedomains-0.4.1-py3-none-any.whl.

File metadata

Download URL: piedomains-0.4.1-py3-none-any.whl
Upload date: Dec 15, 2025
Size: 137.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for piedomains-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`30ac343e5741f220d40498997c1ac0f071f27ee772981159d7fd0700d611cdaa`
MD5	`0690f5356d714b11355ab97d414c2d75`
BLAKE2b-256	`646da17fdaae4f693fd1160149f9d07c5e3c23f3377cb9e5e4e95a1950082a6c`

Provenance

The following attestation bundles were made for piedomains-0.4.1-py3-none-any.whl:

Publisher: python-publish.yml on themains/piedomains

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: piedomains-0.4.1-py3-none-any.whl
- Subject digest: 30ac343e5741f220d40498997c1ac0f071f27ee772981159d7fd0700d611cdaa
- Sigstore transparency entry: 764249115
- Sigstore integration time: Dec 15, 2025
Source repository:
- Permalink: themains/piedomains@c363b0011aea47d56959dd4ab27d4874c5b97ed3
- Branch / Tag: refs/heads/main
- Owner: https://github.com/themains
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c363b0011aea47d56959dd4ab27d4874c5b97ed3
- Trigger Event: workflow_dispatch

piedomains 0.4.1
