Predict categories based on domain names and their content
piedomains predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.
🚀 Quickstart
Install and classify domains in 3 lines:
pip install piedomains
from piedomains import DomainClassifier
classifier = DomainClassifier()
# Classify current content
result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
print(result[['domain', 'pred_label', 'pred_prob']])
# Expected output:
# domain pred_label pred_prob
# 0 cnn.com news 0.876543
# 1 amazon.com shopping 0.923456
# 2 wikipedia.org education 0.891234
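A common follow-up is to keep only predictions above a confidence cutoff. The sketch below works on plain dictionaries shaped like the rows in the expected output above, with values copied from it. The `confident_predictions` helper and the `min_prob` cutoff are illustrative assumptions, not part of piedomains.

```python
# Rows shaped like the expected output above. With a real DataFrame these
# could be obtained via result.to_dict("records") (standard pandas).
rows = [
    {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
    {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
    {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
]


def confident_predictions(rows, min_prob=0.9):
    """Hypothetical helper: keep rows whose pred_prob is at least min_prob."""
    return [row for row in rows if row["pred_prob"] >= min_prob]


for row in confident_predictions(rows):
    print(row["domain"], row["pred_label"])
```

Lowering `min_prob` trades precision for coverage; a suitable cutoff likely depends on the category mix in your data.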
📊 Key Features
High Accuracy: Combines text analysis with homepage screenshots for 85-95% accuracy, depending on content type and method
Historical Analysis: Classify websites from any point in time using archive.org
Fast & Scalable: Batch processing with caching for 1000s of domains
Easy Integration: Modern Python API with pandas output
41 Categories: From news/finance to adult/gambling content
⚡ Usage Examples
Basic Classification
from piedomains import DomainClassifier
classifier = DomainClassifier()
# Combined analysis (most accurate)
result = classifier.classify(["github.com", "reddit.com"])
# Text-only (faster)
result = classifier.classify_by_text(["news.google.com"])
# Images-only (good for visual content)
result = classifier.classify_by_images(["instagram.com"])
Historical Analysis
# Analyze how Facebook looked in 2010 vs today
old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
new_facebook = classifier.classify(["facebook.com"])
print(f"2010: {old_facebook.iloc[0]['pred_label']}")
print(f"Today: {new_facebook.iloc[0]['pred_label']}")
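The example above passes `archive_date` as an 8-digit string ("20100101"). Assuming that is a YYYYMMDD date, a minimal sketch for generating such strings from `datetime.date` objects could look like this; `to_archive_date` is a hypothetical helper, not a piedomains function.

```python
from datetime import date


def to_archive_date(day):
    """Hypothetical helper: format a date as an 8-digit YYYYMMDD string."""
    return day.strftime("%Y%m%d")


# One snapshot date per year, e.g. to pass as archive_date in a loop.
snapshot_dates = [to_archive_date(date(year, 1, 1)) for year in (2010, 2015, 2020)]
print(snapshot_dates)
```

Each string could then be passed as `archive_date` to compare how a site's predicted category changes over time.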
Batch Processing
# Process large lists efficiently
domains = ["site1.com", "site2.com"]  # ... extend to 1000s of domains
results = classifier.classify_batch(
    domains,
    method="text",       # text|images|combined
    batch_size=50,       # Process 50 at a time
    show_progress=True,  # Progress bar
)
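To illustrate what a `batch_size` of 50 means in practice, here is a minimal sketch of splitting a domain list into fixed-size chunks. It shows the general idea only and makes no claim about how `classify_batch` is implemented internally; `chunked` and the placeholder domain names are assumptions for the example.

```python
def chunked(items, batch_size):
    """Hypothetical helper: yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


domains = [f"site{i}.com" for i in range(1, 121)]  # 120 placeholder domains
batches = list(chunked(domains, batch_size=50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Smaller batches bound how much work is lost if a network error interrupts a run, at the cost of more per-batch overhead.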
🏷️ Supported Categories
News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.
📈 Performance
Speed: ~10-50 domains/minute (depends on method and network)
Accuracy: 85-95% depending on content type and method
Memory: <500MB for batch processing
Caching: Automatic content caching for faster re-runs
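The quoted throughput gives a rough way to budget a run. The sketch below is plain arithmetic on the ~10-50 domains/minute range stated above; `estimate_minutes` is a hypothetical helper, and it ignores any speed-up from caching.

```python
def estimate_minutes(n_domains, slow_rate=10, fast_rate=50):
    """Return (best_case, worst_case) minutes for n_domains.

    Rates are domains per minute, taken from the ~10-50 range quoted above.
    """
    return n_domains / fast_rate, n_domains / slow_rate


best, worst = estimate_minutes(1000)
print(f"1000 domains: roughly {best:.0f} to {worst:.0f} minutes")
```

For 1000 domains this works out to between 20 and 100 minutes, so large lists are probably best run as background jobs.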
🔧 Installation
Requirements: Python 3.9+
# Basic installation
pip install piedomains
# For development
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e .
🔄 Migration from v0.2.x
Old API (still supported):
from piedomains import domain
result = domain.pred_shalla_cat_with_text(["example.com"])
New API (recommended):
from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify_by_text(["example.com"])
📖 Documentation
API Reference: https://piedomains.readthedocs.io
Examples: /examples directory
Notebooks: /piedomains/notebooks (training & analysis)
🤝 Contributing
# Setup development environment
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"
# Run tests
pytest piedomains/tests/ -v
# Run linting
flake8 piedomains/
📄 License
MIT License - see LICENSE file.
📚 Citation
If you use piedomains in research, please cite:
@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}
Legacy Documentation
For legacy API documentation, see LEGACY_API.rst