dataknobs-xization

Text normalization and tokenization tools.

Installation

pip install dataknobs-xization

Features

  • Markdown Chunking: Parse and chunk markdown documents for RAG applications
    • Preserves heading hierarchy and semantic structure
    • Supports code blocks, tables, lists, and other markdown constructs
    • Streaming support for large documents
    • Flexible configuration for chunk size, overlap, and heading inclusion
  • Content Transformation: Convert JSON, YAML, and CSV to markdown for RAG ingestion
    • Generic conversion that preserves structure through headings
    • Custom schemas for specialized formatting
    • Configurable formatting options
  • Text Normalization: Standardize text for consistent processing
  • Masking Tokenizer: Advanced tokenization with masking capabilities
  • Annotations: Text annotation system
  • Authorities: Authority management for text processing
  • Lexicon: Lexicon-based text analysis

Usage

Markdown Chunking

from dataknobs_xization import parse_markdown, chunk_markdown_tree

# Parse markdown into tree structure
markdown_text = """
# User Guide
## Installation
Install the package using pip.
"""

tree = parse_markdown(markdown_text)

# Generate chunks for RAG
chunks = chunk_markdown_tree(tree, max_chunk_size=500)

for chunk in chunks:
    print(f"Headings: {chunk.metadata.get_heading_path()}")
    print(f"Text: {chunk.text}\n")

For more details, see the Markdown Chunking documentation.
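To make the idea of heading-aware chunking concrete, here is a minimal standard-library sketch. It is not the implementation behind `parse_markdown` or `chunk_markdown_tree`; the names `SketchChunk` and `sketch_chunk_markdown` are hypothetical, and the sketch ignores overlap, tables, and fenced code blocks. It only shows how a heading path can travel with each chunk while body text is split at a size limit.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title" through "###### Title"); an assumption of this sketch.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class SketchChunk:
    """Hypothetical chunk type: body text plus the headings above it."""
    heading_path: list
    text: str


def sketch_chunk_markdown(markdown, max_chunk_size=500):
    """Split body text into chunks, tagging each with its heading path."""
    chunks = []
    path = []   # stack of (level, title) for the currently open headings
    body = []   # body lines collected since the last heading

    def flush():
        text = " ".join(body).strip()
        body.clear()
        if not text:
            return
        headings = [title for _, title in path]
        # Naive fixed-width split; a real chunker would prefer sentence
        # or paragraph boundaries.
        for start in range(0, len(text), max_chunk_size):
            piece = text[start:start + max_chunk_size]
            chunks.append(SketchChunk(list(headings), piece))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        elif line.strip():
            body.append(line.strip())
    flush()
    return chunks
```

Calling `sketch_chunk_markdown(markdown_text)` on the sample above yields a single chunk whose `heading_path` is `["User Guide", "Installation"]`.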

Content Transformation

Convert structured data (JSON, YAML, CSV) to well-formatted markdown for RAG ingestion:

from dataknobs_xization import ContentTransformer, json_to_markdown

# Quick conversion
data = [
    {"name": "Chain of Thought", "description": "Step by step reasoning"},
    {"name": "Few-Shot", "description": "Learning from examples"}
]
markdown = json_to_markdown(data, title="Prompt Patterns")

# Or use the transformer class for more control
transformer = ContentTransformer(
    base_heading_level=2,
    include_field_labels=True,
    code_block_fields=["example", "code"],
    list_fields=["steps", "items"]
)

# Transform JSON
result = transformer.transform_json(data)

# Transform YAML
result = transformer.transform_yaml("config.yaml")

# Transform CSV
result = transformer.transform_csv("data.csv", title_field="name")
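As a rough illustration of "generic conversion that preserves structure through headings", the following standard-library sketch turns nested JSON-like data into markdown. It is an assumption about what such a conversion can look like, not the output format of `json_to_markdown` or `ContentTransformer`; the function names and the choice of `name` as the item title are hypothetical.

```python
def sketch_json_to_markdown(data, title=None, base_heading_level=1):
    """Render nested dicts and lists as markdown headings and labeled fields."""
    lines = []
    level = base_heading_level
    if title:
        lines += [f"{'#' * level} {title}", ""]
        level += 1
    _sketch_render(data, level, lines)
    return "\n".join(lines).rstrip() + "\n"


def _sketch_render(value, level, lines):
    hashes = "#" * min(level, 6)  # markdown defines at most six heading levels
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                # Nested structure becomes a sub-heading.
                lines += [f"{hashes} {key}", ""]
                _sketch_render(item, level + 1, lines)
            else:
                # Scalar fields become labeled lines.
                lines += [f"**{key}**: {item}", ""]
    elif isinstance(value, list):
        for index, item in enumerate(value, start=1):
            if isinstance(item, dict):
                # Assumption: use the "name" field as the heading when present.
                heading = item.get("name", f"Item {index}")
                rest = {k: v for k, v in item.items() if k != "name"}
                lines += [f"{hashes} {heading}", ""]
                _sketch_render(rest, level + 1, lines)
            else:
                lines.append(f"- {item}")
        lines.append("")
    else:
        lines += [str(value), ""]
```

With the `data` list above and `title="Prompt Patterns"`, the sketch produces a top-level heading followed by one second-level heading per record, each with a labeled `description` line.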

Custom Schemas

Register schemas for specialized formatting of known data structures:

transformer = ContentTransformer()

# Register a schema for prompt patterns
transformer.register_schema("pattern", {
    "title_field": "name",
    "description_field": "description",
    "sections": [
        {"field": "use_case", "heading": "When to Use"},
        {"field": "example", "heading": "Example", "format": "code", "language": "python"},
        {"field": "variations", "heading": "Variations", "format": "list"}
    ],
    "metadata_fields": ["category", "difficulty"]
})

# Use the schema
patterns = [
    {
        "name": "Chain of Thought",
        "description": "Prompting technique for complex reasoning",
        "use_case": "Multi-step problems requiring logical reasoning",
        "example": "Let's think step by step...",
        "category": "reasoning",
        "difficulty": "intermediate"
    }
]

markdown = transformer.transform_json(patterns, schema="pattern")
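The sketch below shows one way a schema of this shape could drive rendering. It reuses the schema keys from the example above (`title_field`, `description_field`, `sections`, `metadata_fields`), but the rendering rules, the function name `sketch_render_with_schema`, and the decision to skip sections whose field is missing are assumptions, not documented behavior of `ContentTransformer`.

```python
FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block


def sketch_render_with_schema(record, schema, heading_level=2):
    """Render one record to markdown according to a schema dictionary."""
    lines = [f"{'#' * heading_level} {record[schema['title_field']]}", ""]

    description = record.get(schema.get("description_field", ""))
    if description:
        lines += [str(description), ""]

    for section in schema.get("sections", []):
        value = record.get(section["field"])
        if value is None:
            continue  # assumption: sections without data are omitted
        lines += [f"{'#' * (heading_level + 1)} {section['heading']}", ""]
        fmt = section.get("format", "text")
        if fmt == "code":
            lines += [FENCE + section.get("language", ""), str(value), FENCE, ""]
        elif fmt == "list":
            lines += [f"- {item}" for item in value] + [""]
        else:
            lines += [str(value), ""]

    # Metadata fields are emitted as a trailing bullet list in this sketch.
    for field in schema.get("metadata_fields", []):
        if field in record:
            lines.append(f"- {field}: {record[field]}")

    return "\n".join(lines).rstrip() + "\n"
```

Applied to the `patterns` record above, this sketch emits "When to Use" and "Example" sections and leaves out "Variations", because the record has no `variations` field.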

Convenience Functions

from dataknobs_xization import json_to_markdown, yaml_to_markdown, csv_to_markdown

# Quick conversions (`data` is the list defined in the Content Transformation example)
md = json_to_markdown(data, title="My Data")
md = yaml_to_markdown("config.yaml", title="Config")
md = csv_to_markdown("data.csv", title_field="name")

Text Normalization and Tokenization

from dataknobs_xization import MaskingTokenizer, annotations, normalize

# Text normalization
normalized = normalize.normalize_text("Hello, World!")

# Tokenization with masking
tokenizer = MaskingTokenizer()
tokens = tokenizer.tokenize("This is a sample text.")

# Working with annotations
doc = annotations.create_document("Sample text", {"metadata": "value"})
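This section does not spell out which rules `normalize.normalize_text` applies. For readers unfamiliar with the idea, the following standard-library sketch shows what "standardizing text for consistent processing" commonly involves. The function `sketch_normalize_text` and its three steps are illustrative assumptions, not the library's behavior.

```python
import re
import unicodedata


def sketch_normalize_text(text):
    """Illustrative normalization: Unicode folding, case folding, whitespace cleanup."""
    # NFKC maps compatibility characters (ligatures, full-width forms) to canonical ones.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() intended for caseless matching.
    text = text.casefold()
    # Collapse any run of whitespace to a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Normalizing both queries and indexed text the same way keeps superficial differences in case, spacing, or Unicode form from causing missed matches.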

Dependencies

This package depends on:

  • dataknobs-common
  • dataknobs-structures
  • dataknobs-utils
  • nltk

License

See the LICENSE file in the repository root.
