dataknobs-xization

Text normalization and tokenization tools.

Installation

pip install dataknobs-xization
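
To confirm that the install succeeded, the standard library can report the installed version of a distribution. The following is a minimal sketch; `installed_version` is a hypothetical helper written for this example and is not part of dataknobs-xization.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only uses importlib.metadata.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string if the package is installed, otherwise None.
    print(installed_version("dataknobs-xization"))
```

Returning `None` instead of raising keeps the check usable in scripts that want to branch on whether the package is present.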

Features

  • Markdown Chunking: Parse and chunk markdown documents for RAG applications
    • Preserves heading hierarchy and semantic structure
    • Supports code blocks, tables, lists, and other markdown constructs
    • Streaming support for large documents
    • Flexible configuration for chunk size, overlap, and heading inclusion
  • Text Normalization: Standardize text for consistent processing
  • Masking Tokenizer: Advanced tokenization with masking capabilities
  • Annotations: Text annotation system
  • Authorities: Authority management for text processing
  • Lexicon: Lexicon-based text analysis
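
To make the chunk size and overlap settings listed under Markdown Chunking concrete, here is a minimal, library-independent sketch of overlapping windows. It is an assumption-laden illustration: `chunk_with_overlap` is a hypothetical function, it splits on raw character counts, and it ignores headings and other markdown structure, which the package's chunker is described as preserving.

```python
from typing import List


def chunk_with_overlap(text: str, max_chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of at most max_chunk_size characters.

    Each window after the first starts `overlap` characters before the end
    of the previous one. Hypothetical helper, not the dataknobs-xization API.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")

    step = max_chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_with_overlap("abcdefghij", max_chunk_size=4, overlap=1))
    # -> ['abcd', 'defg', 'ghij']
```

A larger overlap repeats more context between neighbouring chunks, which tends to help retrieval at the cost of storing more text.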

Usage

Markdown Chunking

from dataknobs_xization import parse_markdown, chunk_markdown_tree

# Parse markdown into tree structure
markdown_text = """
# User Guide
## Installation
Install the package using pip.
"""

tree = parse_markdown(markdown_text)

# Generate chunks for RAG
chunks = chunk_markdown_tree(tree, max_chunk_size=500)

for chunk in chunks:
    print(f"Headings: {chunk.metadata.get_heading_path()}")
    print(f"Text: {chunk.text}\n")

For more details, see the Markdown Chunking documentation.
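
The heading path attached to each chunk above can be pictured as a stack of enclosing headings. The sketch below shows one way such a path could be tracked using only the standard library. It is not the package's implementation: `sections_with_heading_paths` is a hypothetical function, it recognises only ATX headings (`#` through `######`), and it does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple

# ATX heading: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def sections_with_heading_paths(markdown_text: str) -> List[Tuple[List[str], str]]:
    """Return (heading_path, body_text) pairs for each non-empty section.

    Hypothetical helper for illustration, not the dataknobs-xization API.
    """
    path: List[Tuple[int, str]] = []  # stack of (level, title)
    body: List[str] = []
    sections: List[Tuple[List[str], str]] = []

    def flush() -> None:
        # Emit the accumulated body under the current heading path.
        text = "\n".join(body).strip()
        if text:
            sections.append(([title for _, title in path], text))
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes every heading at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = "# User Guide\n## Installation\nInstall the package using pip.\n"
    for headings, text in sections_with_heading_paths(sample):
        print(" > ".join(headings), "|", text)
    # -> User Guide > Installation | Install the package using pip.
```

Keeping the full path with each section means a retrieved chunk still carries its context, even when the body text never repeats the section title.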

Text Normalization and Tokenization

from dataknobs_xization import MaskingTokenizer, annotations, normalize

# Text normalization
normalized = normalize.normalize_text("Hello, World!")

# Tokenization with masking
tokenizer = MaskingTokenizer()
tokens = tokenizer.tokenize("This is a sample text.")

# Working with annotations
doc = annotations.create_document("Sample text", {"metadata": "value"})
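
For readers unfamiliar with what text normalization typically involves, the following standalone sketch shows a few common steps using only the standard library. It is an illustration under assumptions, not a description of what `normalize.normalize_text` does: `simple_normalize` is a hypothetical function, and the choice of steps (Unicode NFKC, lowercasing, punctuation removal, whitespace collapsing) is made for this example.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Normalize text with a small, fixed pipeline.

    Hypothetical helper for illustration, not the dataknobs-xization API.
    """
    # Fold compatibility forms and compose combining sequences.
    text = unicodedata.normalize("NFKC", text)
    # Make comparisons case-insensitive.
    text = text.lower()
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(simple_normalize("Hello, World!"))
    # -> hello world
```

Punctuation is replaced with a space instead of being deleted so that strings such as "end.Start" do not merge into a single word.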

Dependencies

This package depends on:

  • dataknobs-common
  • dataknobs-structures
  • dataknobs-utils
  • nltk

License

See the LICENSE file in the repository root.
