Link Categorizer
A Python library that categorizes HTML links based on their domain names, title text, and anchor text.
Overview
Link Categorizer takes a list of Python dictionaries, each representing an HTML
anchor tag, and assigns each link to a category based on its domain name. The
library extracts the domain from the href attribute and matches it against
known patterns to determine the most appropriate category.
If a link's domain does not match any known patterns, it is assigned to the "unknown" category by default.
Installation
pip install link-categorizer
Development Installation
To install the package for development:
# Clone the repository
git clone https://github.com/yourusername/link-categorizer.git
cd link-categorizer
# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Install in development mode
pip install -e .
# Install development dependencies
pip install -e ".[dev]"
Usage
from link_categorizer import categorize_links
# Example list of link dictionaries
links = [
    {"href": "https://github.com/user/repo", "title": "Source code", "text": "GitHub Repository"},
    {"href": "https://medium.com/article", "title": "My thoughts", "text": "Read my blog"},
    {"href": "https://x.com/username", "text": "Follow me"},
    {"href": "https://twitter.com/username", "text": "Follow me on Twitter"},
    {"href": "https://docs.python.org/3/", "text": "Python Documentation"},
]
# Categorize the links
categorized_links = categorize_links(links)
# Result will be a dictionary with categories as keys and lists of links as values
# {
# "repository": [{"href": "https://github.com/user/repo", ...}],
# "blog": [{"href": "https://medium.com/article", ...}],
# "social": [{"href": "https://x.com/username", ...}, {"href": "https://twitter.com/username", ...}],
# "documentation": [{"href": "https://docs.python.org/3/", ...}]
# }
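The input dictionaries do not have to be written by hand. As a minimal sketch, assuming the href, title, and text keys shown above are all the library needs, the standard library's html.parser can collect them from raw HTML. The AnchorCollector class and extract_links function are hypothetical helpers for illustration and are not part of link-categorizer.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect <a> tags as dictionaries with href, title, and text keys."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished link dictionaries
        self._current = None  # link being built while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            # Skip anchors without an href; there is no domain to categorize.
            if attr_map.get("href"):
                self._current = {"href": attr_map["href"], "text": ""}
                if attr_map.get("title"):
                    self._current["title"] = attr_map["title"]

    def handle_data(self, data):
        # Accumulate anchor text only while inside an <a> tag.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract_links(html):
    """Return a list of link dictionaries found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample_html = (
    '<p><a href="https://github.com/user/repo" title="Source code">GitHub Repository</a> '
    'and <a href="https://x.com/username">Follow me</a></p>'
)
links = extract_links(sample_html)
```

The resulting list has the same shape as the links list in the usage example, so it could be passed to categorize_links directly.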
How Categorization Works
The library employs a straightforward domain-based approach to categorize links:
- Extract the domain name from the link's href attribute
- Compare the domain against a predefined list of domain patterns
- Assign the category associated with the matching domain pattern
- If no match is found, assign the "unknown" category
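The steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the library's actual implementation: the DOMAIN_CATEGORIES table, the subdomain matching rule, and the function names extract_domain, categorize_link, and categorize_links_sketch are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical pattern table; the library's real patterns may differ.
DOMAIN_CATEGORIES = {
    "github.com": "repository",
    "medium.com": "blog",
    "x.com": "social",
    "twitter.com": "social",
    "docs.python.org": "documentation",
}


def extract_domain(href):
    """Step 1: pull the lowercased host name out of the href."""
    host = urlparse(href).hostname or ""
    return host[4:] if host.startswith("www.") else host


def categorize_link(link):
    """Steps 2-4: match the domain against known patterns, else 'unknown'."""
    domain = extract_domain(link.get("href", ""))
    for pattern, category in DOMAIN_CATEGORIES.items():
        # Assumed rule: match the domain itself or any subdomain of it.
        if domain == pattern or domain.endswith("." + pattern):
            return category
    return "unknown"


def categorize_links_sketch(links):
    """Group link dictionaries by category, mirroring the documented result shape."""
    grouped = {}
    for link in links:
        grouped.setdefault(categorize_link(link), []).append(link)
    return grouped
```

Matching on a leading dot before the pattern keeps a domain such as box.com from being mistaken for x.com, which a plain suffix check would allow.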
Supported Categories
The library can identify various link categories including:
- Repository (code hosting platforms)
- Social media
- Blog platforms
- Documentation sites
- News outlets
- Video platforms
- E-commerce sites
- And more...
Project Structure
link-categorizer/
├── src/
│   └── link_categorizer/
│       ├── __init__.py
│       └── categorizer.py
├── tests/
├── setup.py
├── LICENSE
└── README.md
Running Tests
There are several ways to run the tests:
Method 1: Install the package in development mode
# From the project root directory
uv pip install -e .
python -m unittest discover -s tests
Method 2: Use PYTHONPATH environment variable
# From the project root directory
PYTHONPATH=. python -m unittest discover -s tests
Method 3: Run with pytest (if installed)
# From the project root directory
uv pip install pytest
python -m pytest
To run a specific test file:
# From the project root directory
python -m unittest tests.link_categorizer_test
Adding More Tests
The test structure is designed to be easily expandable:
Adding new test cases to existing categories
Open tests/link_categorizer_test.py and add new test cases to the appropriate test method.
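As a hedged illustration, a new case might be added along the lines of the sketch below. The class name TestRepositoryLinks, the method name, and the assumption that gitlab.com maps to the "repository" category are hypothetical and should be checked against the existing test file. A stand-in categorize_links is defined so the sketch runs on its own; in the real test file it would be imported with from link_categorizer import categorize_links.

```python
import unittest


def categorize_links(links):
    """Stand-in for the library function so this sketch is self-contained."""
    known = {"github.com": "repository", "gitlab.com": "repository"}
    grouped = {}
    for link in links:
        host = link["href"].split("/")[2]  # host part of an absolute URL
        grouped.setdefault(known.get(host, "unknown"), []).append(link)
    return grouped


class TestRepositoryLinks(unittest.TestCase):  # hypothetical class name
    def test_repository_links(self):
        """Links to code hosting platforms land in the repository category."""
        cases = [
            {"href": "https://github.com/user/repo", "text": "GitHub Repository"},
            # New case appended to the existing list (assumed category).
            {"href": "https://gitlab.com/user/repo", "text": "GitLab mirror"},
        ]
        for link in cases:
            # subTest reports each failing link separately instead of
            # stopping at the first failure.
            with self.subTest(href=link["href"]):
                result = categorize_links([link])
                self.assertEqual(result, {"repository": [link]})
```

Once saved in the test file, a case like this would be picked up by any of the commands listed under Running Tests.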
Best practices for tests
- Keep test methods focused on specific categories or behaviors
- Use descriptive test method names
- Add docstrings to explain what each test is verifying
- Use the self.subTest() context manager for better error reporting (already implemented)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is released under the MIT License.