Modular semantic text chunking framework with rich metadata (ru/en)

These details have not been verified by PyPI

Framework
- Pytest
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: Linguistic

Project description

Smart Chunker Engine

A modular framework for intelligently splitting text into semantic chunks with rich metadata (Russian/English).

Features

Multi-stage pipeline (normalization, splitting, semantics, metadata)
Flexible configuration via config
Support for Russian and English (and other languages when models are available)
Export to JSON/CSV/Parquet (via chunk_metadata_adapter)
Easy integration into ML/NLP pipelines, APIs, and CI
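To make the idea of a multi-stage pipeline concrete, here is a minimal standard-library sketch of stages chained one after another. It is not the Smart Chunker Engine implementation: the names `Chunk`, `normalize`, `split_sentences`, `add_metadata`, and `run_stages` are hypothetical, and the semantic stage is left out entirely.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus a free-form metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[Chunk]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [Chunk(p) for p in parts if p]


def add_metadata(chunks: list[Chunk]) -> list[Chunk]:
    """Attach simple positional and size metadata to each chunk."""
    for index, chunk in enumerate(chunks):
        chunk.metadata.update(
            {"index": index, "n_chars": len(chunk.text), "n_words": len(chunk.text.split())}
        )
    return chunks


def run_stages(text: str, stages: list[Callable]):
    """Feed the output of each stage into the next one."""
    data = text
    for stage in stages:
        data = stage(data)
    return data


chunks = run_stages(
    "  First sentence.   Second   one!  ",
    [normalize, split_sentences, add_metadata],
)
for chunk in chunks:
    print(chunk.text, chunk.metadata)
```

Keeping each stage a plain function makes it easy to swap or skip a step, which is the property the config-driven pipeline relies on.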

Installation

pip install -r requirements.txt
# or
pip install .

Quick Start

from smart_chunker_engine.pipeline import SmartChunkerPipeline
pipeline = SmartChunkerPipeline()
chunks = pipeline.run("Это пример текста для разбиения на чанки.")
for c in chunks:
    print(c.text)

Usage Examples

1. Basic example (Russian)

from smart_chunker_engine.pipeline import SmartChunkerPipeline
pipeline = SmartChunkerPipeline()
text = "Это пример текста для разбиения на чанки. Каждый чанк будет содержать метаданные."
chunks = pipeline.run(text)
for c in chunks:
    print(f"[{c.start}:{c.end}] {c.text}")
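The example above prints `c.start` and `c.end` next to each chunk. The README does not define these fields, so the sketch below assumes they are character offsets into the original string. It uses a hypothetical `Span` record and a naive regex sentence splitter (not the library's splitter) to show the invariant such offsets would satisfy: slicing the source text with them returns the chunk text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a chunk with character offsets."""
    start: int
    end: int
    text: str


def sentence_spans(text: str) -> list[Span]:
    """Return one span per sentence, skipping whitespace between sentences."""
    pattern = re.compile(r"\S[^.!?]*[.!?]*")
    return [Span(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "Это пример текста для разбиения на чанки. Каждый чанк будет содержать метаданные."
spans = sentence_spans(text)

for span in spans:
    # Offsets must round-trip: the slice equals the stored chunk text.
    assert text[span.start:span.end] == span.text
    print(f"[{span.start}:{span.end}] {span.text}")
```

If the offsets are in another unit (tokens or bytes, for instance), the same round-trip check applies with the corresponding indexing.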

2. Custom pipeline settings

from smart_chunker_engine.pipeline import SmartChunkerPipeline
config = {
    'split': {'chunk_size': 50},
    'boundary': {'window_size': 10, 'threshold': 0.12},
    'stats_gate': {'var_thr': 0.1},
    'triple_cluster': {'min_cluster': 2},
    'tfidf': {'top_n': 100},
    'iter_refine': {'lambda_': 0.3, 'max_iter': 2}
}
pipeline = SmartChunkerPipeline(config)
chunks = pipeline.run("Текст для теста с кастомными параметрами.")
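The config above sets only some keys in each section. Whether the library fills in the rest from defaults is not stated here, so the following is a sketch of how a partial config could be layered over a base config before it is passed to `SmartChunkerPipeline`. The helper `deep_merge` is hypothetical, and the base values are taken from the full config example in this README rather than from the library's actual defaults.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested sections of `override` update `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge section by section instead of replacing the whole section.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    'split': {'chunk_size': 100, 'overlap': 10, 'language': 'ru'},
    'boundary': {'window_size': 15, 'step': 5, 'threshold': 0.13},
    'tfidf': {'top_n': 150},
}
override = {
    'split': {'chunk_size': 50},
    'boundary': {'window_size': 10, 'threshold': 0.12},
}

config = deep_merge(base, override)
print(config)
```

A plain `{**base, **override}` would replace the whole `split` section and silently drop `overlap` and `language`, which is why the merge recurses.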

3. Exporting chunks to JSON

from smart_chunker_engine.exporter import export_chunks
export_chunks(chunks, "output.json", format="json")
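For comparison, this is roughly what a JSON export looks like with only the standard library. It is a sketch, not a description of what `export_chunks` does internally: the record fields (`start`, `end`, `text`) are borrowed from the basic example, and `dump_chunks_json` is a hypothetical helper. The point it illustrates is `ensure_ascii=False`, which keeps Cyrillic text readable in the output file instead of escaping it as `\uXXXX`.

```python
import json
import tempfile
from pathlib import Path


def dump_chunks_json(records: list[dict], path: Path) -> None:
    """Write chunk records as pretty-printed UTF-8 JSON."""
    payload = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(payload, encoding="utf-8")


# Hand-written sample records; a real run would build these from chunk objects.
records = [
    {"start": 0, "end": 10, "text": "Это пример"},
    {"start": 11, "end": 17, "text": "текста"},
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output.json"
    dump_chunks_json(records, out)
    raw = out.read_text(encoding="utf-8")
    loaded = json.loads(raw)

print(raw)
```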

4. Processing English text

from smart_chunker_engine.pipeline import SmartChunkerPipeline
config = {'split': {'chunk_size': 40}, 'spacy_model': 'en_core_web_sm'}
pipeline = SmartChunkerPipeline(config)
text = "This is an example of English text. The chunker works for multiple languages."
chunks = pipeline.run(text)

5. Processing a batch of texts

from smart_chunker_engine.pipeline import SmartChunkerPipeline
pipeline = SmartChunkerPipeline({'split': {'chunk_size': 60}})
texts = ["Первый текст для чанкинга.", "Второй текст для примера."]
all_chunks = [pipeline.run(t) for t in texts]
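The comprehension above yields a list of lists, one inner list per input text. Downstream code often wants a flat table that still records which document each chunk came from. The sketch below shows one way to flatten it; `run_stub` is a hypothetical stand-in for `pipeline.run` so the snippet runs without the library, and the `doc_id` / `chunk_id` field names are illustrative.

```python
def run_stub(text: str) -> list[str]:
    """Stand-in for pipeline.run: returns one 'chunk' per sentence."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


texts = [
    "Первый текст для чанкинга.",
    "Второй текст для примера. И ещё одно предложение.",
]
all_chunks = [run_stub(t) for t in texts]

# Flatten while keeping the document index and the position inside the document.
flat = [
    {"doc_id": doc_id, "chunk_id": chunk_id, "text": chunk}
    for doc_id, chunks in enumerate(all_chunks)
    for chunk_id, chunk in enumerate(chunks)
]

for row in flat:
    print(row)
```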

6. Integration with pandas (DataFrame)

import pandas as pd
from smart_chunker_engine.pipeline import SmartChunkerPipeline
pipeline = SmartChunkerPipeline()
df = pd.DataFrame({'text': ["Текст 1.", "Текст 2."]})
df['chunks'] = df['text'].apply(lambda t: pipeline.run(t))

Main Config Options

Stage	config key	Parameters
Split	split	chunk_size, overlap, language, chunk_type
Boundary	boundary	window_size, step, threshold, model_name
StatsGate	stats_gate	var_thr, ent_thr, gini_thr
TripleExtractor	spacy_model	model_name (ru_core_news_md, en_core_web_sm, ...)
TripleCluster	triple_cluster	min_cluster, model_name, device, batch_size
TfidfLayer	tfidf	top_n
Metablock	metablock	threshold, model_name
IterativeRefine	iter_refine	lambda_, theta_high, theta_low, epsilon, max_iter, model_name, device
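Because the config is a plain nested dict, a misspelled key is easy to miss. The sketch below checks a config against the section and parameter names listed in the table above. The helper `find_unknown_options` is hypothetical and not part of the package, and whether the library itself rejects or ignores unknown keys is not stated in this README.

```python
# Section -> parameter names, copied from the table above.
KNOWN_OPTIONS = {
    'split': {'chunk_size', 'overlap', 'language', 'chunk_type'},
    'boundary': {'window_size', 'step', 'threshold', 'model_name'},
    'stats_gate': {'var_thr', 'ent_thr', 'gini_thr'},
    'triple_cluster': {'min_cluster', 'model_name', 'device', 'batch_size'},
    'tfidf': {'top_n'},
    'metablock': {'threshold', 'model_name'},
    'iter_refine': {'lambda_', 'theta_high', 'theta_low', 'epsilon',
                    'max_iter', 'model_name', 'device'},
}
# 'spacy_model' holds a model name string rather than a nested dict.
SCALAR_KEYS = {'spacy_model'}


def find_unknown_options(config: dict) -> list[str]:
    """Return dotted names of sections or parameters not listed in the table."""
    problems = []
    for section, value in config.items():
        if section in SCALAR_KEYS:
            continue
        if section not in KNOWN_OPTIONS:
            problems.append(section)
            continue
        if not isinstance(value, dict):
            problems.append(f"{section} (expected a dict)")
            continue
        for key in value:
            if key not in KNOWN_OPTIONS[section]:
                problems.append(f"{section}.{key}")
    return problems


suspect = {
    'split': {'chunk_size': 50, 'overlp': 5},   # typo in 'overlap'
    'tfidf': {'top_n': 100},
    'spacy_model': 'en_core_web_sm',
    'metablok': {},                             # typo in 'metablock'
}
problems = find_unknown_options(suspect)
print(problems)
```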

Full config example:

config = {
    'split': {'chunk_size': 100, 'overlap': 10, 'language': 'ru'},
    'boundary': {'window_size': 15, 'step': 5, 'threshold': 0.13},
    'stats_gate': {'var_thr': 0.12, 'ent_thr': 7.5, 'gini_thr': 0.2},
    'spacy_model': 'ru_core_news_md',
    'triple_cluster': {'min_cluster': 3, 'device': 'cpu'},
    'tfidf': {'top_n': 150},
    'metablock': {'threshold': 0.22},
    'iter_refine': {'lambda_': 0.35, 'theta_high': 0.7, 'theta_low': 0.3, 'epsilon': 0.005, 'max_iter': 3}
}
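The `split` section above combines `chunk_size: 100` with `overlap: 10`. The README does not spell out the units or the exact semantics, so the sketch below assumes the common convention that `overlap` is the number of units shared by two consecutive chunks, which gives a stride of `chunk_size - overlap`. The function `overlapping_ranges` is hypothetical and only illustrates the arithmetic.

```python
def overlapping_ranges(length: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) ranges covering `length` units with the given overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap
    ranges = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        ranges.append((start, end))
        if end == length:
            break  # the last range already reaches the end of the input
        start += stride
    return ranges


ranges = overlapping_ranges(length=250, chunk_size=100, overlap=10)
print(ranges)
```

Under this assumption a 250-unit input yields three ranges, each sharing 10 units with its neighbor.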

Documentation

COMPONENTS.md: description of all modules
PIPELINE_OVERVIEW.md: architecture and data flow
FAQ.md, HOWTO.md: practical tips
Examples: examples/
Tests: tests/

License

MIT


This version

0.1.0

May 21, 2025

Download files


Source Distribution

smart_chunker_engine-0.1.0.tar.gz (50.5 kB)

Uploaded May 21, 2025 Source

Built Distribution


smart_chunker_engine-0.1.0-py3-none-any.whl (33.6 kB)

Uploaded May 21, 2025 Python 3

File details

Details for the file smart_chunker_engine-0.1.0.tar.gz.

File metadata

Download URL: smart_chunker_engine-0.1.0.tar.gz
Upload date: May 21, 2025
Size: 50.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for smart_chunker_engine-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`493e3fe11bd86e5a3660120090c03e6a5cf29d40b47803173fbfb5f92c7e3382`
MD5	`244f65a53b58a09600ac6ce5d59cd78f`
BLAKE2b-256	`a80c404cd9cbbf435c8ec33f6c33194ada56150a6ffbb2f465590806f301e843`


File details

Details for the file smart_chunker_engine-0.1.0-py3-none-any.whl.

File metadata

Download URL: smart_chunker_engine-0.1.0-py3-none-any.whl
Upload date: May 21, 2025
Size: 33.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for smart_chunker_engine-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cdb59733c11facad598aaa08355cbc6491ef00e37897a055d72c20e0caebc81e`
MD5	`cb9131e83d4d6ebef85c8c096916d21b`
BLAKE2b-256	`611d1f6ce2fd1a498254f8d7b0437cd77c9f68657d971baedf2fbfc2f8cbf077`


smart-chunker-engine 0.1.0
