A Python library for pseudonymizing Ukrainian text using Presidio analyzer with Ukrainian NER model
Ukrainian Text Pseudonymizer
A Python library for pseudonymizing Ukrainian text using the Presidio analyzer framework with a Ukrainian NER model. This tool can identify and anonymize various types of entities in Ukrainian text, including:
- Person names
- Job titles
- Locations
- Organizations
- Date/Time expressions
- Email addresses
- Credit card numbers
- URLs
- Phone numbers
Requirements
- Python 3.8
- Git LFS (for model download)
Installation
- Install the package using uv:
uv pip install pseudonymizer-uk
- Install Git LFS (required for model download):
git lfs install
- Download the Ukrainian NER model:
git clone https://huggingface.co/dchaplinsky/uk_ner_web_trf_13class
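If Git LFS was not active when the repository was cloned, the large model files are left as small text stubs (LFS pointer files), and loading the model will likely fail. The following is a minimal sketch of a check for that situation. `find_lfs_pointers` is a hypothetical helper, not part of pseudonymizer-uk; it relies only on the fact that every LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS specification.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs.

    Hypothetical helper for illustration; not part of pseudonymizer-uk.
    """
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Model directory not found: {root}")

    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            # Only the first bytes are needed to recognise a pointer stub.
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("./uk_ner_web_trf_13class")` after cloning should return an empty list. If it returns any paths, running `git lfs pull` inside the model directory usually fetches the real files.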
Usage
from pseudonymizer_uk import UkPseudonymizer
# Initialize the pseudonymizer with the path to the downloaded model
pseudonymizer = UkPseudonymizer(path_to_model="./uk_ner_web_trf_13class")
# Pseudonymize text
text = "Іван Франко народився в селі Нагуєвичі"
anonymized_text = pseudonymizer.pseudonymize(text)
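Because the underlying model is a transformer pipeline, constructing a `UkPseudonymizer` is probably the expensive step, so it makes sense to create it once and reuse it for many texts. A minimal sketch of that pattern follows; `pseudonymize_many` is a hypothetical helper that uses only the `pseudonymize` method shown above.

```python
def pseudonymize_many(pseudonymizer, texts):
    """Pseudonymize each text with one shared pseudonymizer instance.

    Hypothetical helper for illustration; not part of pseudonymizer-uk.
    """
    results = []
    for text in texts:
        if text.strip():
            results.append(pseudonymizer.pseudonymize(text))
        else:
            # Blank strings contain nothing to detect, so skip the model call.
            results.append(text)
    return results
```

For example, `pseudonymize_many(pseudonymizer, [text, another_text])` returns the results in the same order as the inputs.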
Supported Entity Types
By default, the pseudonymizer recognizes the following entity types:
- PERSON
- JOB
- LOCATION
- ORGANIZATION
- DATE_TIME
You can customize which entities to recognize by passing the entities parameter:
pseudonymizer = UkPseudonymizer(
    path_to_model="uk_ner_web_trf_13class",
    entities=["PERSON", "LOCATION"],  # Only recognize persons and locations
)
Custom Recognizers and Operators
You can extend the functionality by adding custom recognizers and operators:
from presidio_analyzer import EntityRecognizer
from presidio_anonymizer import OperatorConfig
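# Add a custom recognizer; your_custom_recognizer is a placeholder
# for an EntityRecognizer instance that you define yourself
pseudonymizer.add_custom_recognizer(your_custom_recognizer)
# Add a custom operator for the entity type that recognizer produces
pseudonymizer.add_custom_operator(
    "CUSTOM_ENTITY",
    OperatorConfig("custom", {"param": "value"}),
)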
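As a more concrete illustration, here is a sketch that detects Ukrainian IBANs ("UA" followed by 27 digits) with Presidio's `PatternRecognizer` and replaces them with a fixed placeholder. The entity name `UA_IBAN`, the helper names, the score, and the `"uk"` language code are all assumptions made for this example; check which language code the pseudonymizer's analyzer actually uses. Presidio is imported inside the functions so the regular expression can be tried out on its own.

```python
import re

# Ukrainian IBANs are 29 characters long: "UA" followed by 27 digits.
UA_IBAN_REGEX = r"\bUA\d{27}\b"


def find_ibans(text):
    """Return every substring of text that matches the IBAN pattern."""
    return re.findall(UA_IBAN_REGEX, text)


def build_iban_recognizer(language="uk"):
    """Build a Presidio PatternRecognizer for Ukrainian IBANs.

    The default language code is an assumption, not taken from the library.
    """
    from presidio_analyzer import Pattern, PatternRecognizer

    return PatternRecognizer(
        supported_entity="UA_IBAN",  # hypothetical entity name
        supported_language=language,
        patterns=[Pattern(name="ua_iban", regex=UA_IBAN_REGEX, score=0.8)],
    )


def register_iban_support(pseudonymizer):
    """Attach the recognizer and a replacement operator to a UkPseudonymizer."""
    from presidio_anonymizer.entities import OperatorConfig

    pseudonymizer.add_custom_recognizer(build_iban_recognizer())
    # "replace" is a built-in Presidio operator that substitutes new_value.
    pseudonymizer.add_custom_operator(
        "UA_IBAN",
        OperatorConfig("replace", {"new_value": "<IBAN>"}),
    )
```

If `entities` was passed to the constructor, `UA_IBAN` may also need to be included there for the new recognizer to take effect; this README does not say either way.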
Development
To set up the development environment:
- Clone the repository:
git clone https://github.com/fox-rudie/pseudonymizer-uk.git
cd pseudonymizer-uk
- Create a virtual environment and install dependencies using uv:
uv venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
uv pip install -e ".[dev]"
- Install pre-commit hooks (optional):
uv pip install pre-commit
pre-commit install
Publishing
To publish a new version to PyPI:
- Update the version in pyproject.toml and __init__.py
- Build the package:
uv pip install build
python -m build
- Upload to PyPI:
uv pip install twine
python -m twine upload dist/*
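Since the version lives in two files, it is easy to bump one and forget the other. The sketch below compares the two before building. It assumes that `pyproject.toml` contains a line of the form `version = "..."` and that `__init__.py` defines `__version__ = "..."`; both are common conventions, but neither is confirmed by this README. Regular expressions are used instead of `tomllib` because `tomllib` is not available on Python 3.8.

```python
import re


def version_from_pyproject(text):
    """Extract the version string from the contents of pyproject.toml."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', text, flags=re.MULTILINE)
    return match.group(1) if match else None


def version_from_init(text):
    """Extract __version__ from the contents of __init__.py (assumed name)."""
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', text, flags=re.MULTILINE
    )
    return match.group(1) if match else None


def versions_match(pyproject_text, init_text):
    """Return True only if both versions are present and identical."""
    pyproject_version = version_from_pyproject(pyproject_text)
    init_version = version_from_init(init_text)
    return pyproject_version is not None and pyproject_version == init_version
```

To use it, read both files with `pathlib.Path(...).read_text()` and stop the release if `versions_match` returns `False`.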
License
MIT License