Modern Python library for search-query normalization
Query Normalizer
Modern Python library for search-query normalization with bilingual support (Russian/English).
Features
- Text Cleaning: Removes BBCode, HTML/XML tags, HTML entities
- Keyboard Layout Fixing: Automatically fixes mixed latin/cyrillic layouts
- Mixed Script Detection: Handles confusable characters and mixed alphabets
- Lemmatization: Converts words to base forms (classic mode)
- Stopword Removal: Filters out common words (classic mode)
- Punctuation Preservation: Keeps punctuation for embedding models
- Dual Modes: Optimized for classic search and embedding models
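To illustrate the keyboard layout fixing feature, here is a minimal sketch of how a word typed on the wrong layout can be mapped back. It is not the library's implementation: the mapping table and the `fix_layout` helper are hypothetical names, and the sketch assumes the standard US QWERTY and Russian ЙЦУКЕН key positions. The real library also decides whether a fix should be applied at all, which this sketch does not attempt.

```python
# Hypothetical sketch, not the library's code: letters of the US QWERTY layout
# paired with the letters found on the same keys of the Russian ЙЦУКЕН layout.
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

# One translation table, built once and reused for every word.
LATIN_TO_CYRILLIC = str.maketrans(LATIN_KEYS, CYRILLIC_KEYS)


def fix_layout(word: str) -> str:
    """Re-type a lowercase Latin-layout word as if the Russian layout were active."""
    return word.lower().translate(LATIN_TO_CYRILLIC)


print(fix_layout("ghbdtn"))  # привет
```

A production version would apply this mapping only when the converted word looks more plausible than the original, which is presumably what the configurable threshold controls.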
Installation
pip install query-normalizer
Or with server support:
pip install "query-normalizer[server]"
Library Usage
from query_normalizer import QueryNormalizer
normalizer = QueryNormalizer()
# Classic mode: lemmatized, stopwords removed
result = normalizer.normalize_for_classic("Это ghbdtn алфaвиты и машины")
print(result.normalized_query) # "привет алфавит машина"
print(result.tokens) # ["привет", "алфавит", "машина"]
# Embedding mode: natural language preserved
result = normalizer.normalize_for_embedding("Это ghbdtn алфaвиты и машины")
print(result.normalized_query) # "это привет алфавиты и машины"
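The sample query contains the word "алфaвиты" with a Latin "a" hidden among Cyrillic letters. The sketch below shows one way such mixed-script words could be repaired. It is an illustration under stated assumptions, not the library's method: the lookalike table is a small hand-picked subset, and `fix_mixed_script` and `is_cyrillic` are hypothetical helpers.

```python
# Hypothetical sketch: a few Latin letters and their Cyrillic lookalikes.
# Escapes are used so the two alphabets cannot be confused in the source.
LATIN_TO_CYRILLIC_LOOKALIKES = str.maketrans({
    "a": "\u0430",  # Cyrillic а
    "c": "\u0441",  # Cyrillic с
    "e": "\u0435",  # Cyrillic е
    "o": "\u043e",  # Cyrillic о
    "p": "\u0440",  # Cyrillic р
    "x": "\u0445",  # Cyrillic х
    "y": "\u0443",  # Cyrillic у
})


def is_cyrillic(ch: str) -> bool:
    """True for characters in the basic Cyrillic Unicode block."""
    return "\u0400" <= ch <= "\u04ff"


def fix_mixed_script(word: str) -> str:
    """Replace Latin lookalikes when a word is mixed and mostly Cyrillic."""
    letters = [ch for ch in word if ch.isalpha()]
    cyrillic_count = sum(is_cyrillic(ch) for ch in letters)
    is_mixed = 0 < cyrillic_count < len(letters)
    if is_mixed and cyrillic_count * 2 > len(letters):
        return word.translate(LATIN_TO_CYRILLIC_LOOKALIKES)
    return word


# "алф" + Latin "a" + "виты"
mixed = "\u0430\u043b\u0444a\u0432\u0438\u0442\u044b"
print(fix_mixed_script(mixed))  # алфавиты
```

Purely Latin or purely Cyrillic words pass through unchanged, so ordinary English terms in a query are not rewritten.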
Configuration
You can customize normalization behavior via NormalizationConfig:
from query_normalizer import QueryNormalizer, NormalizationConfig
config = NormalizationConfig(
    keyboard_layout_fix_threshold=0.9,  # Require higher confidence before fixing the layout
    known_word_bonus=1.5,  # Increase trust in known words
    stopword_bonus=0.5,  # Increase trust in stopwords
    stop_words={"custom", "stop", "words"},  # Custom stopword list
)
normalizer = QueryNormalizer(config=config)
result = normalizer.normalize_for_classic("test query")
Available config options:
- keyboard_layout_fix_threshold: Threshold for keyboard layout fixing (default: 0.75)
- known_word_bonus: Bonus for dictionary words in language detection (default: 1.0)
- stopword_bonus: Bonus for stopwords in language detection (default: 0.25)
- english_stop_words: Custom English stopwords set
- russian_stop_words: Custom Russian stopwords set
- stop_words: Custom combined stopwords set
- keyboard_latin_to_cyrillic: Custom latin-to-cyrillic keyboard mapping
- keyboard_cyrillic_to_latin: Custom cyrillic-to-latin keyboard mapping
- script_aliases: Supported script aliases
- punctuation_tokens: Punctuation tokens to handle
CLI Usage
# Basic normalization
query-normalizer "Это ghbdtn алфaвиты и машины"
# Classic mode only
query-normalizer "test query" --mode classic
# Embedding mode only
query-normalizer "test query" --mode embedding
# Show debug info
query-normalizer "test query" --debug
Server Usage
# Install with server dependencies
pip install "query-normalizer[server]"
# Run FastAPI server
uvicorn query_normalizer.server:app --reload
The API will be available at http://127.0.0.1:8000, and the Swagger UI at http://127.0.0.1:8000/docs
API Endpoints
- POST /normalize/classic - Optimized for classic search (lemmatized, stopwords removed)
- POST /normalize/embedding - Optimized for embedding models (natural language preserved)
- POST /normalize - Both normalizations in one response
- GET /health - Health check
Example Request
curl -X POST http://127.0.0.1:8000/normalize \
-H 'Content-Type: application/json' \
-d '{"query":"Это ghbdtn алфaвиты и машины", "debug": true}'
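The same request can be sent from Python using only the standard library. This is a sketch that assumes the server is running locally at the address shown above; `build_normalize_request` and `normalize` are hypothetical helper names, and the payload fields mirror the curl example.

```python
import json
import urllib.request

# Address taken from the uvicorn example; adjust for your deployment.
BASE_URL = "http://127.0.0.1:8000"


def build_normalize_request(query: str, debug: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /normalize endpoint."""
    payload = json.dumps({"query": query, "debug": debug}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/normalize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize(query: str, debug: bool = False) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_normalize_request(query, debug)) as response:
        return json.load(response)
```

Calling `normalize("test query", debug=True)` requires the server to be up. Building the request separately makes it easy to inspect or test without network access.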
Example Response
{
  "classic": {
    "normalized_query": "привет алфавит машина",
    "tokens": ["привет", "алфавит", "машина"],
    "corrections_applied": [
      "stopword:это",
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты",
      "lemma:алфавиты->алфавит",
      "lemma:машины->машина",
      "stopword:и"
    ]
  },
  "embedding": {
    "normalized_query": "это привет алфавиты и машины",
    "tokens": [],
    "corrections_applied": [
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты"
    ]
  }
}
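The corrections_applied entries in the response follow a visible pattern: a kind, a colon, the source text, and optionally "->" followed by the replacement. The sketch below splits such entries into structured records. The format is inferred from this one example response only, so treat it as an assumption. `Correction` and `parse_correction` are hypothetical names, not part of the library.

```python
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    kind: str              # e.g. "stopword", "lemma", "keyboard-layout"
    source: str            # the text the correction was applied to
    target: Optional[str]  # the replacement, or None when the entry has no "->"


def parse_correction(entry: str) -> Correction:
    """Split one 'kind:source->target' or 'kind:source' entry."""
    kind, _, detail = entry.partition(":")
    source, arrow, target = detail.partition("->")
    return Correction(kind, source, target if arrow else None)


print(parse_correction("keyboard-layout:ghbdtn->привет"))
print(parse_correction("stopword:это"))
```

Grouping the parsed records by `kind` gives a quick view of which normalization steps fired for a query.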
Development
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=query_normalizer --cov-report=term-missing
# Format code
ruff format .
# Lint code
ruff check .
# Type check
mypy query_normalizer/
Dependencies
- pymorphy3 - Russian lemmatization
- simplemma - English lemmatization
- nltk - English stopwords
- stop-words - Russian stopwords
- confusable-homoglyphs - Mixed alphabet detection
- beautifulsoup4 - HTML/XML parsing
License
MIT License - see LICENSE for details.