Modern Python library for search-query normalization

Query Normalizer

Modern Python library for search-query normalization with bilingual support (Russian/English).

Features

Text Cleaning: Removes BBCode, HTML/XML tags, HTML entities
Keyboard Layout Fixing: Automatically corrects text typed in the wrong keyboard layout (Latin/Cyrillic)
Mixed Script Detection: Handles confusable characters and mixed alphabets
Lemmatization: Converts words to base forms (classic mode)
Stopword Removal: Filters out common words (classic mode)
Punctuation Preservation: Keeps punctuation for embedding models
Dual Modes: Optimized for classic search and embedding models
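
To make the Text Cleaning feature concrete, here is a minimal standard-library sketch of stripping BBCode, HTML/XML tags and HTML entities. It is an illustration only, not the library's implementation: the Dependencies section names beautifulsoup4 for HTML/XML parsing, while this sketch uses plain regular expressions, and the names clean_text, BBCODE_TAG, MARKUP_TAG and WHITESPACE are hypothetical.

```python
import html
import re

# Hypothetical patterns for illustration; a real parser handles far more cases.
BBCODE_TAG = re.compile(r"\[/?[a-zA-Z*]+(?:=[^\]]*)?\]")  # [b], [/b], [url=...]
MARKUP_TAG = re.compile(r"<[^>]+>")                         # <i>, </i>, <br/>
WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip BBCode and HTML/XML tags, decode entities, collapse whitespace."""
    text = BBCODE_TAG.sub(" ", raw)
    text = MARKUP_TAG.sub(" ", text)
    # Entities are decoded after tag removal so that "&lt;b&gt;" stays literal text.
    text = html.unescape(text)
    return WHITESPACE.sub(" ", text).strip()


print(clean_text("[b]Tom[/b] &amp; <i>Jerry</i>"))  # Tom & Jerry
```

Tags are replaced with a space rather than an empty string so that words separated only by markup do not get glued together.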

Installation

pip install query-normalizer

Or with server support:

pip install "query-normalizer[server]"

Library Usage

from query_normalizer import QueryNormalizer

normalizer = QueryNormalizer()

# Classic mode: lemmatized, stopwords removed
result = normalizer.normalize_for_classic("Это ghbdtn алфaвиты и машины")
print(result.normalized_query)  # "привет алфавит машина"
print(result.tokens)  # ["привет", "алфавит", "машина"]

# Embedding mode: natural language preserved
result = normalizer.normalize_for_embedding("Это ghbdtn алфaвиты и машины")
print(result.normalized_query)  # "это привет алфавиты и машины"
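
The two corrections visible in this example ("ghbdtn" becoming "привет", and the Latin "a" inside "алфaвиты" being replaced) can be sketched with plain translation tables. This is a simplified illustration under stated assumptions, not the library's code: fix_layout and fix_mixed_script are hypothetical helpers, the look-alike table is a small subset, and the sketch converts unconditionally, whereas the library decides whether to convert using its own threshold and dictionaries.

```python
# QWERTY keys and the ЙЦУКЕН letters that share the same physical positions.
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"
LATIN_TO_CYRILLIC = str.maketrans(LATIN_KEYS, CYRILLIC_KEYS)

# Latin letters that render like Cyrillic ones (small, incomplete subset).
# Escapes are used so the two scripts cannot be confused in this listing.
LATIN_LOOKALIKES = str.maketrans({
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "x": "\u0445",
    "y": "\u0443",
})


def fix_layout(word: str) -> str:
    """Re-read a word as if it had been typed on the Russian layout."""
    return word.lower().translate(LATIN_TO_CYRILLIC)


def is_cyrillic(ch: str) -> bool:
    """True for lowercase Russian letters (U+0430..U+044F plus U+0451)."""
    return "\u0430" <= ch <= "\u044f" or ch == "\u0451"


def fix_mixed_script(word: str) -> str:
    """Replace Latin look-alikes only when the word also contains Cyrillic."""
    if any(is_cyrillic(ch) for ch in word.lower()):
        return word.translate(LATIN_LOOKALIKES)
    return word


print(fix_layout("ghbdtn"))  # привет
```

Guarding fix_mixed_script on the presence of Cyrillic letters keeps ordinary English words such as "pay" untouched.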

Configuration

You can customize normalization behavior via NormalizationConfig:

from query_normalizer import QueryNormalizer, NormalizationConfig

config = NormalizationConfig(
    keyboard_layout_fix_threshold=0.9,  # Raise the threshold for keyboard layout fixing
    known_word_bonus=1.5,                # Increase trust in known words
    stopword_bonus=0.5,                  # Increase trust in stopwords
    stop_words={"custom", "stop", "words"},  # Custom stopword set
)

normalizer = QueryNormalizer(config=config)
result = normalizer.normalize_for_classic("test query")

Available config options:

keyboard_layout_fix_threshold: Threshold for keyboard layout fixing (default: 0.75)
known_word_bonus: Bonus for dictionary words in language detection (default: 1.0)
stopword_bonus: Bonus for stopwords in language detection (default: 0.25)
english_stop_words: Custom English stopwords set
russian_stop_words: Custom Russian stopwords set
stop_words: Custom combined stopwords set
keyboard_latin_to_cyrillic: Custom Latin-to-Cyrillic keyboard mapping
keyboard_cyrillic_to_latin: Custom Cyrillic-to-Latin keyboard mapping
script_aliases: Supported script aliases
punctuation_tokens: Punctuation tokens to handle
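
How the threshold and the two bonuses interact is not spelled out above. The sketch below shows one plausible reading, purely as an assumption: each token of a candidate reading earns a bonus if it is a known word or a stopword, and the layout fix is accepted when the average score reaches the threshold. SketchConfig, candidate_score and should_fix_layout are hypothetical names, and the real scoring may differ; only the three default values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in that mirrors three documented defaults."""
    keyboard_layout_fix_threshold: float = 0.75
    known_word_bonus: float = 1.0
    stopword_bonus: float = 0.25


def candidate_score(tokens, known_words, stop_words, config):
    """Average per-token evidence that a candidate reading is real language."""
    if not tokens:
        return 0.0
    total = 0.0
    for token in tokens:
        if token in known_words:
            total += config.known_word_bonus
        if token in stop_words:
            total += config.stopword_bonus
    return total / len(tokens)


def should_fix_layout(converted_tokens, known_words, stop_words, config):
    """Accept the converted reading only if it scores at or above the threshold."""
    score = candidate_score(converted_tokens, known_words, stop_words, config)
    return score >= config.keyboard_layout_fix_threshold


known = {"привет"}
stops = {"и"}
print(should_fix_layout(["привет"], known, stops, SketchConfig()))
```

Under this reading, raising keyboard_layout_fix_threshold makes the fix more conservative, while raising known_word_bonus makes a single dictionary hit count for more.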

CLI Usage

# Basic normalization
query-normalizer "Это ghbdtn алфaвиты и машины"

# Classic mode only
query-normalizer "test query" --mode classic

# Embedding mode only
query-normalizer "test query" --mode embedding

# Show debug info
query-normalizer "test query" --debug

Server Usage

# Install with server dependencies
pip install "query-normalizer[server]"

# Run FastAPI server
uvicorn query_normalizer.server:app --reload

The API will be available at http://127.0.0.1:8000, with Swagger UI at http://127.0.0.1:8000/docs.

API Endpoints

POST /normalize/classic - Optimized for classic search (lemmatized, stopwords removed)
POST /normalize/embedding - Optimized for embedding models (natural language preserved)
POST /normalize - Both normalizations in one response
GET /health - Health check
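
The POST endpoints above can also be called from Python without extra dependencies. The following is a minimal sketch using only the standard library. It assumes the server is running locally on the default address and that the request body has the "query" and "debug" fields used in this README's curl example; build_normalize_request and normalize are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed local uvicorn default


def build_normalize_request(query: str, debug: bool = False) -> urllib.request.Request:
    """Prepare a JSON POST request for the /normalize endpoint."""
    payload = json.dumps({"query": query, "debug": debug}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/normalize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize(query: str, debug: bool = False, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_normalize_request(query, debug)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect or unit-test without a running server.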

Example Request

curl -X POST http://127.0.0.1:8000/normalize \
  -H 'Content-Type: application/json' \
  -d '{"query":"Это ghbdtn алфaвиты и машины", "debug": true}'

Example Response:

{
  "classic": {
    "normalized_query": "привет алфавит машина",
    "tokens": ["привет", "алфавит", "машина"],
    "corrections_applied": [
      "stopword:это",
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты",
      "lemma:алфавиты->алфавит",
      "lemma:машины->машина",
      "stopword:и"
    ]
  },
  "embedding": {
    "normalized_query": "это привет алфавиты и машины",
    "tokens": [],
    "corrections_applied": [
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты"
    ]
  }
}
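
The corrections_applied entries in this response appear to follow a "kind:source->target" pattern, with stopword entries carrying no target. Assuming that pattern holds (it is inferred from the example only, not from a documented schema), a small hypothetical parser such as the one below could turn the strings into structured records for logging or analytics.

```python
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    kind: str               # e.g. "keyboard-layout", "lemma", "stopword"
    source: str             # the original token
    target: Optional[str]   # the replacement, or None when the token was dropped


def parse_correction(entry: str) -> Correction:
    """Split 'kind:source->target' (or 'kind:source') into its parts."""
    kind, _, detail = entry.partition(":")
    source, arrow, target = detail.partition("->")
    return Correction(kind, source, target if arrow else None)


for entry in ["keyboard-layout:ghbdtn->привет", "stopword:и"]:
    print(parse_correction(entry))
```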

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=query_normalizer --cov-report=term-missing

# Format code
ruff format .

# Lint code
ruff check .

# Type check
mypy query_normalizer/

Dependencies

pymorphy3 - Russian lemmatization
simplemma - English lemmatization
nltk - English stopwords
stop-words - Russian stopwords
confusable-homoglyphs - Mixed alphabet detection
beautifulsoup4 - HTML/XML parsing

License

MIT License - see LICENSE for details.

Release history

0.2.2 - Apr 14, 2026
0.2.1 - Apr 14, 2026 (this version)

Download files

Source Distribution

query_normalizer-0.2.1.tar.gz (53.0 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

query_normalizer-0.2.1-py3-none-any.whl (34.3 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file query_normalizer-0.2.1.tar.gz.

File metadata

Download URL: query_normalizer-0.2.1.tar.gz
Upload date: Apr 14, 2026
Size: 53.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for query_normalizer-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`978892a5edbc099871128a967215a5e9bd4014e946cf09f18e56f7103e5077ee`
MD5	`dc1439235838cc2ee3cac61daf0401fa`
BLAKE2b-256	`b2d3dafb284e20fbe191f0a1c3bcdc831c67e50725cb6a593251ee9e7cf836ef`

Provenance

The following attestation bundles were made for query_normalizer-0.2.1.tar.gz:

Publisher: release.yml on Open-Workshop/query-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: query_normalizer-0.2.1.tar.gz
- Subject digest: 978892a5edbc099871128a967215a5e9bd4014e946cf09f18e56f7103e5077ee
- Sigstore transparency entry: 1293672106
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: Open-Workshop/query-normalizer@00cae1bd60bf2d1cf6cf37b099892134b57b92b1
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Open-Workshop
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@00cae1bd60bf2d1cf6cf37b099892134b57b92b1
- Trigger Event: workflow_dispatch

File details

Details for the file query_normalizer-0.2.1-py3-none-any.whl.

File metadata

Download URL: query_normalizer-0.2.1-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 34.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for query_normalizer-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e96c732889f87adb372b10a23dba38ec781292a3280fcbffe8b23a00e39fd7bb`
MD5	`318b6654b8ad09bd31ac01ac3d62c3a9`
BLAKE2b-256	`187fb6fd706d5e90560acc8b696dd502eadd76a0d6a81fbba707aa1b0a132097`

Provenance

The following attestation bundles were made for query_normalizer-0.2.1-py3-none-any.whl:

Publisher: release.yml on Open-Workshop/query-normalizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: query_normalizer-0.2.1-py3-none-any.whl
- Subject digest: e96c732889f87adb372b10a23dba38ec781292a3280fcbffe8b23a00e39fd7bb
- Sigstore transparency entry: 1293672109
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: Open-Workshop/query-normalizer@00cae1bd60bf2d1cf6cf37b099892134b57b92b1
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Open-Workshop
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@00cae1bd60bf2d1cf6cf37b099892134b57b92b1
- Trigger Event: workflow_dispatch
