Query Normalizer

A modern Python library for search-query normalization with bilingual (Russian/English) support.

Features

  • Text Cleaning: Removes BBCode, HTML/XML tags, HTML entities
  • Keyboard Layout Fixing: Automatically fixes text typed in the wrong keyboard layout (Latin/Cyrillic)
  • Mixed Script Detection: Handles confusable characters and mixed alphabets
  • Lemmatization: Converts words to base forms (classic mode)
  • Stopword Removal: Filters out common words (classic mode)
  • Punctuation Preservation: Keeps punctuation for embedding models
  • Dual Modes: Optimized for classic search and embedding models
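
To make the keyboard-layout and mixed-script features concrete, here is a minimal standalone sketch of both ideas. It does not use the library and is not its implementation: the names `fix_layout`, `fix_mixed_script`, and `LOOKALIKES` are hypothetical, the key table is the standard QWERTY/ЙЦУКЕН layout (lowercase only), and the look-alike table is a small illustrative subset.

```python
import unicodedata

# Standard QWERTY -> ЙЦУКЕН key positions (lowercase only, for brevity).
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"
LATIN_TO_CYRILLIC_KEYS = str.maketrans(QWERTY, JCUKEN)

# Latin letters that look like Cyrillic ones (illustrative subset, not exhaustive).
LOOKALIKES = {
    "a": "\u0430",  # а
    "c": "\u0441",  # с
    "e": "\u0435",  # е
    "o": "\u043e",  # о
    "p": "\u0440",  # р
    "x": "\u0445",  # х
    "y": "\u0443",  # у
}


def fix_layout(word: str) -> str:
    """Re-map each key as if it had been typed on the Cyrillic layout."""
    return word.translate(LATIN_TO_CYRILLIC_KEYS)


def _script(ch: str) -> str:
    """Classify a character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def fix_mixed_script(word: str) -> str:
    """Replace Latin look-alikes inside an otherwise Cyrillic word."""
    scripts = {_script(ch) for ch in word}
    if not {"cyrillic", "latin"} <= scripts:
        return word  # single-script word: nothing to repair
    if any(_script(ch) == "latin" and ch not in LOOKALIKES for ch in word):
        return word  # contains Latin letters with no Cyrillic twin: leave as-is
    return "".join(LOOKALIKES.get(ch, ch) for ch in word)


print(fix_layout("ghbdtn"))  # привет
```

A real implementation also has to decide when a conversion should be applied at all. This sketch converts unconditionally and leaves that decision out.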

Installation

pip install query-normalizer

Or with server support:

pip install "query-normalizer[server]"

Library Usage

from query_normalizer import QueryNormalizer

normalizer = QueryNormalizer()

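# The sample query contains "ghbdtn" ("привет" typed on a Latin keyboard layout)
# and "алфaвиты" with a Latin "a" mixed into the Cyrillic word.
# Classic mode: lemmatized, stopwords removed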
result = normalizer.normalize_for_classic("Это ghbdtn алфaвиты и машины")
print(result.normalized_query)  # "привет алфавит машина"
print(result.tokens)  # ["привет", "алфавит", "машина"]

# Embedding mode: natural language preserved
result = normalizer.normalize_for_embedding("Это ghbdtn алфaвиты и машины")
print(result.normalized_query)  # "это привет алфавиты и машины"
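
The difference between the two modes can be illustrated with a toy sketch that starts from the already corrected text. This is an assumption-laden illustration, not the library's code: `TOY_STOPWORDS`, `TOY_LEMMAS`, `toy_classic`, and `toy_embedding` are hypothetical names, and the two tiny tables stand in for the real stopword lists and lemmatizers listed under Dependencies.

```python
# Toy stand-ins for real stopword lists and lemmatizers (hypothetical data).
TOY_STOPWORDS = {"это", "и"}
TOY_LEMMAS = {"алфавиты": "алфавит", "машины": "машина"}


def toy_classic(text: str) -> list[str]:
    """Classic mode: drop stopwords, then reduce each remaining word to its lemma."""
    tokens = text.lower().split()
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens if tok not in TOY_STOPWORDS]


def toy_embedding(text: str) -> str:
    """Embedding mode: keep the natural wording, only lowercase and tidy whitespace."""
    return " ".join(text.lower().split())


corrected = "Это привет алфавиты и машины"
print(toy_classic(corrected))    # ['привет', 'алфавит', 'машина']
print(toy_embedding(corrected))  # это привет алфавиты и машины
```

Classic mode trades word order and function words for compact, matchable tokens. Embedding mode keeps the sentence intact because embedding models generally benefit from natural phrasing.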

Configuration

You can customize normalization behavior via NormalizationConfig:

from query_normalizer import QueryNormalizer, NormalizationConfig

config = NormalizationConfig(
    keyboard_layout_fix_threshold=0.9,  # Raise the layout-fix threshold (default: 0.75)
    known_word_bonus=1.5,                # Increase trust in known words
    stopword_bonus=0.5,                  # Increase trust in stopwords
    stop_words={"custom", "stop", "words"},  # Custom stopword list
)

normalizer = QueryNormalizer(config=config)
result = normalizer.normalize_for_classic("test query")

Available config options:

  • keyboard_layout_fix_threshold: Threshold for keyboard layout fixing (default: 0.75)
  • known_word_bonus: Bonus for dictionary words in language detection (default: 1.0)
  • stopword_bonus: Bonus for stopwords in language detection (default: 0.25)
  • english_stop_words: Custom English stopwords set
  • russian_stop_words: Custom Russian stopwords set
  • stop_words: Custom combined stopwords set
  • keyboard_latin_to_cyrillic: Custom Latin-to-Cyrillic keyboard mapping
  • keyboard_cyrillic_to_latin: Custom Cyrillic-to-Latin keyboard mapping
  • script_aliases: Supported script aliases
  • punctuation_tokens: Punctuation tokens to handle

CLI Usage

# Basic normalization
query-normalizer "Это ghbdtn алфaвиты и машины"

# Classic mode only
query-normalizer "test query" --mode classic

# Embedding mode only
query-normalizer "test query" --mode embedding

# Show debug info
query-normalizer "test query" --debug

Server Usage

# Install with server dependencies
pip install "query-normalizer[server]"

# Run FastAPI server
uvicorn query_normalizer.server:app --reload

The API will be available at http://127.0.0.1:8000, and the Swagger UI at http://127.0.0.1:8000/docs.

API Endpoints

  • POST /normalize/classic - Optimized for classic search (lemmatized, stopwords removed)
  • POST /normalize/embedding - Optimized for embedding models (natural language preserved)
  • POST /normalize - Both normalizations in one response
  • GET /health - Health check

Example Request

curl -X POST http://127.0.0.1:8000/normalize \
  -H 'Content-Type: application/json' \
  -d '{"query":"Это ghbdtn алфaвиты и машины", "debug": true}'

Example Response

{
  "classic": {
    "normalized_query": "привет алфавит машина",
    "tokens": ["привет", "алфавит", "машина"],
    "corrections_applied": [
      "stopword:это",
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты",
      "lemma:алфавиты->алфавит",
      "lemma:машины->машина",
      "stopword:и"
    ]
  },
  "embedding": {
    "normalized_query": "это привет алфавиты и машины",
    "tokens": [],
    "corrections_applied": [
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты"
    ]
  }
}
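
The same request can be made from Python with only the standard library. The sketch below assumes the server is running locally on the default address shown above and that the response has the shape of the example. The helper names `build_normalize_request`, `normalize`, and `extract_queries` are hypothetical and not part of the package.

```python
import json
from urllib import request

BASE_URL = "http://127.0.0.1:8000"  # assumed local uvicorn address


def build_normalize_request(query: str, debug: bool = False) -> request.Request:
    """Build a POST /normalize request with a JSON body."""
    body = json.dumps({"query": query, "debug": debug}).encode("utf-8")
    return request.Request(
        f"{BASE_URL}/normalize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize(query: str, debug: bool = False) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with request.urlopen(build_normalize_request(query, debug), timeout=5) as resp:
        return json.load(resp)


def extract_queries(payload: dict) -> dict:
    """Pull the normalized query for each mode out of a /normalize response."""
    return {mode: payload[mode]["normalized_query"] for mode in ("classic", "embedding")}
```

With the server running, `extract_queries(normalize("Это ghbdtn алфaвиты и машины"))` would return a dictionary with one normalized query per mode.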

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=query_normalizer --cov-report=term-missing

# Format code
ruff format .

# Lint code
ruff check .

# Type check
mypy query_normalizer/

Dependencies

  • pymorphy3 - Russian lemmatization
  • simplemma - English lemmatization
  • nltk - English stopwords
  • stop-words - Russian stopwords
  • confusable-homoglyphs - Mixed alphabet detection
  • beautifulsoup4 - HTML/XML parsing

License

MIT License - see LICENSE for details.
