Query Normalizer

Modern Python library for search-query normalization with bilingual support (Russian/English).

Features

  • Text Cleaning: Removes BBCode, HTML/XML tags, HTML entities
  • Keyboard Layout Fixing: Automatically fixes text typed in the wrong Latin/Cyrillic keyboard layout
  • Mixed Script Detection: Handles confusable characters and mixed alphabets
  • Lemmatization: Converts words to base forms (classic mode)
  • Stopword Removal: Filters out common words (classic mode)
  • Punctuation Preservation: Keeps punctuation for embedding models
  • Dual Modes: Optimized for classic search and embedding models
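
As a rough illustration of the Text Cleaning step, the sketch below strips BBCode and HTML/XML tags and decodes HTML entities using only the standard library. It is not the library's implementation (the package lists beautifulsoup4 as its HTML/XML parsing dependency); `clean_text` and both regular expressions are hypothetical names chosen for this example.

```python
import html
import re

# Hypothetical patterns, deliberately simple: they do not handle nested or malformed markup.
BBCODE_RE = re.compile(r"\[/?[a-zA-Z*][^\[\]]*\]")  # [b], [/b], [url=...]
TAG_RE = re.compile(r"<[^<>]+>")                    # <i>, </i>, <br/>


def clean_text(raw: str) -> str:
    """Remove BBCode and HTML/XML tags, decode entities, collapse whitespace."""
    text = BBCODE_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)      # "&amp;" -> "&"
    return " ".join(text.split())   # collapse runs of whitespace


print(clean_text("[b]Hello[/b] <i>world</i> &amp; more"))
```

Entities are decoded after tag stripping here, so an escaped tag such as `&lt;b&gt;` survives as literal text; whether that is desirable depends on the data.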

Installation

pip install query-normalizer

Or with server support:

pip install "query-normalizer[server]"

Library Usage

from query_normalizer import QueryNormalizer

normalizer = QueryNormalizer()

# Classic mode: lemmatized, stopwords removed
result = normalizer.normalize_for_classic("Это ghbdtn алфaвиты и машины")
print(result.normalized_query)  # "привет алфавит машина"
print(result.tokens)  # ["привет", "алфавит", "машина"]

# Embedding mode: natural language preserved
result = normalizer.normalize_for_embedding("Это ghbdtn алфaвиты и машины")
print(result.normalized_query)  # "это привет алфавиты и машины"
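
The sample query above contains a Latin "a" inside the otherwise Cyrillic word "алфaвиты". A minimal sketch of how such a confusable character could be repaired follows. The look-alike table and the "mostly Cyrillic" rule are assumptions made for illustration; the library itself depends on confusable-homoglyphs for this and may work differently.

```python
# Hypothetical, intentionally small table: Latin letter -> Cyrillic look-alike.
LATIN_TO_CYRILLIC_LOOKALIKES = {
    "a": "\u0430",  # Cyrillic а
    "c": "\u0441",  # Cyrillic с
    "e": "\u0435",  # Cyrillic е
    "o": "\u043e",  # Cyrillic о
    "p": "\u0440",  # Cyrillic р
    "x": "\u0445",  # Cyrillic х
    "y": "\u0443",  # Cyrillic у
}


def is_cyrillic(ch: str) -> bool:
    """True for characters in the basic Cyrillic Unicode block."""
    return "\u0400" <= ch <= "\u04ff"


def fix_mixed_word(word: str) -> str:
    """Replace Latin look-alikes only when the word is mostly Cyrillic."""
    letters = [ch for ch in word if ch.isalpha()]
    cyrillic_count = sum(is_cyrillic(ch) for ch in letters)
    if not letters or cyrillic_count * 2 <= len(letters):
        return word  # not predominantly Cyrillic: leave untouched
    return "".join(LATIN_TO_CYRILLIC_LOOKALIKES.get(ch, ch) for ch in word)


mixed = "\u0430\u043b\u0444a\u0432\u0438\u0442\u044b"  # "алфaвиты" with a Latin "a"
print(fix_mixed_word(mixed))
```

Restricting the replacement to predominantly Cyrillic words keeps ordinary English words such as "test" unchanged.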

Configuration

You can customize normalization behavior via NormalizationConfig:

from query_normalizer import QueryNormalizer, NormalizationConfig

config = NormalizationConfig(
    keyboard_layout_fix_threshold=0.9,  # Raise the layout-fix threshold (default: 0.75)
    known_word_bonus=1.5,                # Increase trust in known words
    stopword_bonus=0.5,                  # Increase trust in stopwords
    stop_words={"custom", "stop", "words"},  # Custom stopword list
)

normalizer = QueryNormalizer(config=config)
result = normalizer.normalize_for_classic("test query")

Available config options:

  • keyboard_layout_fix_threshold: Threshold for keyboard layout fixing (default: 0.75)
  • known_word_bonus: Bonus for dictionary words in language detection (default: 1.0)
  • stopword_bonus: Bonus for stopwords in language detection (default: 0.25)
  • english_stop_words: Custom English stopwords set
  • russian_stop_words: Custom Russian stopwords set
  • stop_words: Custom combined stopwords set
  • keyboard_latin_to_cyrillic: Custom Latin-to-Cyrillic keyboard mapping
  • keyboard_cyrillic_to_latin: Custom Cyrillic-to-Latin keyboard mapping
  • script_aliases: Supported script aliases
  • punctuation_tokens: Punctuation tokens to handle

CLI Usage

# Basic normalization
query-normalizer "Это ghbdtn алфaвиты и машины"

# Classic mode only
query-normalizer "test query" --mode classic

# Embedding mode only
query-normalizer "test query" --mode embedding

# Show debug info
query-normalizer "test query" --debug

Server Usage

# Install with server dependencies
pip install "query-normalizer[server]"

# Run FastAPI server
uvicorn query_normalizer.server:app --reload

The API will be available at http://127.0.0.1:8000, with Swagger UI at http://127.0.0.1:8000/docs.

API Endpoints

  • POST /normalize/classic - Optimized for classic search (lemmatized, stopwords removed)
  • POST /normalize/embedding - Optimized for embedding models (natural language preserved)
  • POST /normalize - Both normalizations in one response
  • GET /health - Health check

Example Request

curl -X POST http://127.0.0.1:8000/normalize \
  -H 'Content-Type: application/json' \
  -d '{"query":"Это ghbdtn алфaвиты и машины", "debug": true}'
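
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above: the `query` and `debug` fields and the URL come from that example, while `build_normalize_request` and `send` are hypothetical helper names. `send` needs the server to be running locally.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address from the section above


def build_normalize_request(query: str, debug: bool = False,
                            base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /normalize request with a JSON body."""
    body = json.dumps({"query": query, "debug": debug}, ensure_ascii=False)
    return urllib.request.Request(
        f"{base_url}/normalize",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_normalize_request("test query", debug=True)
print(request.get_method(), request.full_url)
```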

Example Response:

{
  "classic": {
    "normalized_query": "привет алфавит машина",
    "tokens": ["привет", "алфавит", "машина"],
    "corrections_applied": [
      "stopword:это",
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты",
      "lemma:алфавиты->алфавит",
      "lemma:машины->машина",
      "stopword:и"
    ]
  },
  "embedding": {
    "normalized_query": "это привет алфавиты и машины",
    "tokens": [],
    "corrections_applied": [
      "keyboard-layout:ghbdtn->привет",
      "mixed-alphabet:алфaвиты->алфавиты"
    ]
  }
}
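
Judging only from the example response above, each corrections_applied entry looks like `kind:text` or `kind:before->after`. The sketch below splits such entries into parts. The format is inferred from this single example rather than from a documented specification, and `parse_correction` is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    kind: str             # e.g. "stopword", "lemma", "keyboard-layout"
    before: str           # original token
    after: Optional[str]  # replacement, or None when the entry has no "->"


def parse_correction(entry: str) -> Correction:
    """Split 'kind:before->after' (or 'kind:before') into its parts."""
    kind, _, detail = entry.partition(":")
    before, arrow, after = detail.partition("->")
    return Correction(kind, before, after if arrow else None)


for entry in ["stopword:это", "keyboard-layout:ghbdtn->привет"]:
    print(parse_correction(entry))
```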

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=query_normalizer --cov-report=term-missing

# Format code
ruff format .

# Lint code
ruff check .

# Type check
mypy query_normalizer/

Dependencies

  • pymorphy3 - Russian lemmatization
  • simplemma - English lemmatization
  • nltk - English stopwords
  • stop-words - Russian stopwords
  • confusable-homoglyphs - Mixed alphabet detection
  • beautifulsoup4 - HTML/XML parsing

License

GNU General Public License v3.0 (GPL-3.0) - see LICENSE for details.
