# dataknobs-xization

Text normalization and tokenization tools.

## Installation

```bash
pip install dataknobs-xization
```

## Features
- Markdown Chunking: Parse and chunk markdown documents for RAG applications
  - Preserves heading hierarchy and semantic structure
  - Supports code blocks, tables, lists, and other markdown constructs
  - Streaming support for large documents
  - Flexible configuration for chunk size, overlap, and heading inclusion
- Content Transformation: Convert JSON, YAML, and CSV to markdown for RAG ingestion
  - Generic conversion that preserves structure through headings
  - Custom schemas for specialized formatting
  - Configurable formatting options
- Text Normalization: Standardize text for consistent processing
- Masking Tokenizer: Advanced tokenization with masking capabilities
- Annotations: Text annotation system
- Authorities: Authority management for text processing
- Lexicon: Lexicon-based text analysis
## Usage

### Markdown Chunking
```python
from dataknobs_xization import parse_markdown, chunk_markdown_tree

# Parse markdown into tree structure
markdown_text = """
# User Guide

## Installation

Install the package using pip.
"""
tree = parse_markdown(markdown_text)

# Generate chunks for RAG
chunks = chunk_markdown_tree(tree, max_chunk_size=500)
for chunk in chunks:
    print(f"Headings: {chunk.metadata.get_heading_path()}")
    print(f"Text: {chunk.text}\n")
```
For more details, see the Markdown Chunking documentation.
### Content Transformation
Convert structured data (JSON, YAML, CSV) to well-formatted markdown for RAG ingestion:
```python
from dataknobs_xization import ContentTransformer, json_to_markdown

# Quick conversion
data = [
    {"name": "Chain of Thought", "description": "Step by step reasoning"},
    {"name": "Few-Shot", "description": "Learning from examples"},
]
markdown = json_to_markdown(data, title="Prompt Patterns")

# Or use the transformer class for more control
transformer = ContentTransformer(
    base_heading_level=2,
    include_field_labels=True,
    code_block_fields=["example", "code"],
    list_fields=["steps", "items"],
)

# Transform JSON
result = transformer.transform_json(data)

# Transform YAML
result = transformer.transform_yaml("config.yaml")

# Transform CSV
result = transformer.transform_csv("data.csv", title_field="name")
```
### Custom Schemas
Register schemas for specialized formatting of known data structures:
```python
transformer = ContentTransformer()

# Register a schema for prompt patterns
transformer.register_schema("pattern", {
    "title_field": "name",
    "description_field": "description",
    "sections": [
        {"field": "use_case", "heading": "When to Use"},
        {"field": "example", "heading": "Example", "format": "code", "language": "python"},
        {"field": "variations", "heading": "Variations", "format": "list"},
    ],
    "metadata_fields": ["category", "difficulty"],
})

# Use the schema
patterns = [
    {
        "name": "Chain of Thought",
        "description": "Prompting technique for complex reasoning",
        "use_case": "Multi-step problems requiring logical reasoning",
        "example": "Let's think step by step...",
        "category": "reasoning",
        "difficulty": "intermediate",
    }
]
markdown = transformer.transform_json(patterns, schema="pattern")
```
### Convenience Functions
```python
from dataknobs_xization import json_to_markdown, yaml_to_markdown, csv_to_markdown

# Quick conversions
md = json_to_markdown(data, title="My Data")
md = yaml_to_markdown("config.yaml", title="Config")
md = csv_to_markdown("data.csv", title_field="name")
```
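These converters pair naturally with the chunking API shown earlier for RAG ingestion. A minimal end-to-end sketch using only functions that appear in this README (the records and chunk size are illustrative):

```python
from dataknobs_xization import json_to_markdown, parse_markdown, chunk_markdown_tree

# Illustrative records; any JSON-like list of dicts works the same way
data = [
    {"name": "Chain of Thought", "description": "Step by step reasoning"},
    {"name": "Few-Shot", "description": "Learning from examples"},
]

# 1. Convert the structured data to markdown
md = json_to_markdown(data, title="Prompt Patterns")

# 2. Parse the generated markdown and chunk it for a RAG index
tree = parse_markdown(md)
chunks = chunk_markdown_tree(tree, max_chunk_size=500)

for chunk in chunks:
    print(chunk.metadata.get_heading_path(), len(chunk.text))
```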
### Text Normalization and Tokenization
```python
from dataknobs_xization import normalize, MaskingTokenizer

# Text normalization
normalized = normalize.normalize_text("Hello, World!")

# Tokenization with masking
tokenizer = MaskingTokenizer()
tokens = tokenizer.tokenize("This is a sample text.")

# Working with annotations
from dataknobs_xization import annotations

doc = annotations.create_document("Sample text", {"metadata": "value"})
```
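Normalization and tokenization compose naturally into a small preprocessing step. A minimal sketch using only the calls shown above (nothing is assumed about the return types of `normalize_text` or `tokenize` beyond what the library provides):

```python
from dataknobs_xization import normalize, MaskingTokenizer


def preprocess(text: str):
    """Normalize raw text, then tokenize the normalized form."""
    cleaned = normalize.normalize_text(text)
    tokenizer = MaskingTokenizer()
    return tokenizer.tokenize(cleaned)


tokens = preprocess("Hello, World! This is a sample text.")
```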
## Dependencies

This package depends on:

- dataknobs-common
- dataknobs-structures
- dataknobs-utils
- nltk
## License
See LICENSE file in the root repository.
## File details
Details for the file dataknobs_xization-1.3.4.tar.gz.
### File metadata
- Download URL: dataknobs_xization-1.3.4.tar.gz
- Upload date:
- Size: 379.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.9
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | c18a4effcb7b6bca940742c487db1f1c57a49b68713e933cd47b13582b9e362f |
| MD5 | 8fc44bbf5aeac4decb4784a7d05e7d79 |
| BLAKE2b-256 | 2c01e349d54118ecc421f7bb6f881123bbb1bc1d4f59cd9350cabcea7eb43c9f |
## File details
Details for the file dataknobs_xization-1.3.4-py3-none-any.whl.
### File metadata
- Download URL: dataknobs_xization-1.3.4-py3-none-any.whl
- Upload date:
- Size: 102.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.9
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | a05945ab880bc61fbde5c063b8e1cdd3f5b92b46702f720af651e5def4423c59 |
| MD5 | 390d9ffb950e8398f05feebe5107c2ae |
| BLAKE2b-256 | e0cd85f5f9a1607344976395c98188bc3b95146d533641122f4c931d341ea669 |