Toolkit for loading, preprocessing, and clustering structured text datasets

Project description

corpus_cluster_explorer

corpus_cluster_explorer is a Python package for loading, preprocessing, and clustering structured text datasets.

The package supports two main workflows:

Full pipeline
Load a structured dataset, auto-detect its text fields, choose the fields to analyze, tokenize the corpus, extract bigrams, train Word2Vec, evaluate clustering quality, run KMeans, and export the results.
Resuming from a tokenized corpus
Load a previously saved tokenized corpus in JSONL format and continue the analysis directly from the embedding and clustering stage.

Features

loading structured text datasets:
- JSONL
- JSON
- CSV
- TSV
automatic detection of text fields
text preprocessing:
- lowercasing
- URL removal
- @username removal
- hashtag removal
- punctuation cleanup
- lemmatization of Russian words with pymorphy3
bigram extraction (gensim.Phrases)
Word2Vec training
clustering (KMeans)
quality evaluation (silhouette score)
PCA for visualization
export of:
- the tokenized corpus
- the corpus with cluster annotations
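The text cleanup steps in the feature list can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: the function name `clean_text`, the regular expressions, and the order of the steps are assumptions. Lemmatization is left out to keep the sketch dependency-free; with pymorphy3 that step is typically `MorphAnalyzer().parse(token)[0].normal_form`.

```python
import re

# Hypothetical patterns; the package may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is neither a word character nor whitespace is treated as punctuation.
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_text(text: str) -> list[str]:
    """Lowercase, strip URLs, @mentions, hashtags and punctuation, then split on whitespace."""
    text = text.lower()
    # URLs go first so that their punctuation does not survive as stray tokens.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, PUNCT_RE):
        text = pattern.sub(" ", text)
    return text.split()


tokens = clean_text("Привет, @ivan! Смотри https://example.com/page #новости Hello-world")
print(tokens)
```

Note that `\w` matches Cyrillic letters in Python 3 strings, so Russian and non-Russian tokens pass through the same patterns.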

Installation

pip install corpus-cluster-explorer

Quick start

Full pipeline

from corpus_cluster_explorer import CorpusExplorer

explorer = CorpusExplorer()

# load the data
explorer.load("data.jsonl")

# see which text fields were detected
print(explorer.text_fields)

# choose the fields to analyze
explorer.choose_fields(explorer.text_fields)

# tokenize
explorer.tokenize()

# save the tokenized corpus if needed
explorer.save_tokenized("tokenized.jsonl")

# embeddings + choosing the number of clusters
explorer.fit_embeddings()
valid_k, scores = explorer.evaluate_clusters()
print(valid_k, scores)

# clustering
explorer.cluster(4)

# save the result
explorer.save_clustered("clustered.jsonl")

Resuming from a tokenized corpus

from corpus_cluster_explorer import CorpusExplorer

explorer = CorpusExplorer()

explorer.load_tokenized("tokenized.jsonl")

explorer.fit_embeddings()
valid_k, scores = explorer.evaluate_clusters()

explorer.cluster(4)
explorer.save_clustered("clustered.jsonl")

API

Loading

explorer.load("data.jsonl")
explorer.text_fields
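How `text_fields` gets populated is not documented here. One plausible heuristic, shown purely as an assumption, is to keep fields that are mostly filled with reasonably long strings. The function name `detect_text_fields` and both thresholds are hypothetical.

```python
def detect_text_fields(records, min_avg_chars=20, min_fill=0.5):
    """Return names of fields whose values look like free text.

    records is a list of dicts; both thresholds are illustrative defaults.
    """
    lengths = {}  # field name -> lengths of its non-empty string values
    for record in records:
        for key, value in record.items():
            if isinstance(value, str) and value.strip():
                lengths.setdefault(key, []).append(len(value.strip()))

    fields = []
    for key, sizes in lengths.items():
        fill = len(sizes) / len(records)  # share of records with a non-empty string
        average = sum(sizes) / len(sizes)  # mean length of those strings
        if fill >= min_fill and average >= min_avg_chars:
            fields.append(key)
    return fields


sample = [
    {"id": "1", "lang": "ru", "text": "A long enough sentence about clustering."},
    {"id": "2", "lang": "en", "text": "Another reasonably long piece of text here."},
]
print(detect_text_fields(sample))
```

Short identifier-like strings such as `id` and `lang` fall below the average-length threshold and are skipped, which is why inspecting the detected fields before choosing them is useful.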

Choosing fields

explorer.choose_fields(["text"])

or

explorer.choose_fields(explorer.text_fields)

Tokenization

explorer.tokenize()

Statistics

explorer.token_stats()
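The return value of `token_stats()` is not described in this README. As a rough idea of what corpus-level token statistics usually cover, here is a standalone sketch; the function `summarize_tokens` and its output keys are hypothetical and not part of the package.

```python
from collections import Counter


def summarize_tokens(docs, top_n=5):
    """Compute basic statistics for a corpus given as a list of token lists."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {
        "documents": len(docs),
        "tokens": total,
        "vocabulary": len(counts),
        "avg_tokens_per_doc": total / len(docs) if docs else 0.0,
        "top": counts.most_common(top_n),
    }


stats = summarize_tokens([["a", "b", "a"], ["b", "c"]], top_n=2)
print(stats)
```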

Clustering

explorer.fit_embeddings()
explorer.evaluate_clusters()
explorer.cluster(4)
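To make the evaluate-then-cluster sequence concrete, the sketch below scores several candidate values of k with the silhouette coefficient and picks the best one. It is a plain-Python stand-in written for illustration: the package lists KMeans and silhouette score among its features, but its internals are not shown here. All function names and the deterministic farthest-point seeding are assumptions.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def evaluate_k(points, candidates):
    """Map each candidate k to the silhouette score of its clustering."""
    return {k: silhouette(points, kmeans(points, k)) for k in candidates}


# Three small, well-separated groups of 2-D points.
corners = [(0, 0), (10, 0), (0, 10)]
points = [(x + dx, y + dy) for x, y in corners for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
scores = evaluate_k(points, range(2, 5))
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

A higher silhouette score means points sit closer to their own cluster than to the nearest other one, so the k with the highest score is a reasonable value to pass on to the clustering step.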

Saving

explorer.save_tokenized("tokenized.jsonl")
explorer.save_clustered("clustered.jsonl")

CLI

corpus-explorer data.jsonl --fields text comments_text --clusters 4
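The command above passes one positional path, a multi-valued `--fields` option, and an integer `--clusters` option. The `argparse` sketch below shows how an invocation of that shape parses. It is not the package's actual parser: the positional name `path` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser that accepts the same shape of command line."""
    parser = argparse.ArgumentParser(prog="corpus-explorer")
    parser.add_argument("path", help="input dataset file")
    # nargs="+" lets --fields take one or more field names.
    parser.add_argument("--fields", nargs="+", default=None, help="text fields to analyze")
    parser.add_argument("--clusters", type=int, default=None, help="number of clusters")
    return parser


args = build_parser().parse_args(
    ["data.jsonl", "--fields", "text", "comments_text", "--clusters", "4"]
)
print(args.path, args.fields, args.clusters)
```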

Data formats

Tokenized JSONL

tokens
combined_text
field_text_map
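A tokenized JSONL file can be inspected or produced with the standard `json` module, one record per line. In the sketch below the field names come from the list above, while the value types (a list of strings, a string, and a field-to-text mapping) are assumptions. The helper names are hypothetical.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield one dict per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Illustrative record; the real value types may differ.
example = {
    "tokens": ["пример", "текст"],
    "combined_text": "Пример текста",
    "field_text_map": {"text": "Пример текста"},
}
```

Reading the file lazily with a generator keeps memory use flat for large corpora.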

Clustered JSONL

tokens
cluster_ids
cluster_labels
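A quick sanity check on a clustered file is to count how often each cluster ID occurs. The sketch assumes `cluster_ids` holds either a single ID or a list of IDs per record. That shape is a guess, since the README only names the field.

```python
from collections import Counter


def cluster_sizes(records):
    """Count occurrences of each cluster ID across records."""
    counts = Counter()
    for record in records:
        ids = record.get("cluster_ids", [])
        if not isinstance(ids, list):
            ids = [ids]  # tolerate a scalar ID
        counts.update(ids)
    return counts


records = [{"cluster_ids": [0, 1]}, {"cluster_ids": [1]}, {"cluster_ids": 2}]
print(cluster_sizes(records))
```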

Notes

Russian words are lemmatized
non-Russian tokens are kept as-is
only bigrams are used
work can be resumed from a saved tokenized corpus
it is recommended to inspect explorer.text_fields first and then choose the fields
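For readers unfamiliar with bigram extraction, the idea is to find adjacent word pairs that co-occur more often than their individual frequencies suggest and to merge each such pair into one token. The sketch below is a simplified, dependency-free illustration in the spirit of the default scorer of gensim.Phrases. It is not the package's code, and the `min_count` and `threshold` values are chosen for the toy data (gensim's own defaults are 5 and 10.0).

```python
from collections import Counter


def find_bigrams(docs, min_count=1, threshold=1.0):
    """Score adjacent pairs as (pair_count - min_count) / count(a) / count(b) * vocab_size."""
    unigrams = Counter(token for doc in docs for token in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n in pairs.items():
        score = (n - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            scores[(a, b)] = score
    return scores


def merge_bigrams(doc, bigrams, sep="_"):
    """Replace each detected pair with a single joined token, scanning left to right."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            merged.append(doc[i] + sep + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


docs = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "works"],
    ["we", "like", "machine", "learning"],
]
bigrams = find_bigrams(docs)
print([merge_bigrams(doc, bigrams) for doc in docs])
```

Because merging happens in a single pass, the result never contains phrases longer than two words.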

License

MIT

Project details

This version

0.1.2

Apr 30, 2026

Download files

Source Distribution

corpus_cluster_explorer-0.1.2.tar.gz (11.8 kB)

Uploaded Apr 30, 2026 Source

Built Distribution

corpus_cluster_explorer-0.1.2-py3-none-any.whl (13.7 kB)

Uploaded Apr 30, 2026 Python 3

File details

Details for the file corpus_cluster_explorer-0.1.2.tar.gz.

File metadata

Download URL: corpus_cluster_explorer-0.1.2.tar.gz
Upload date: Apr 30, 2026
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for corpus_cluster_explorer-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`3a9cff6909624676194d09a0c667508181af6cac93bea822af731fa2ee60c460`
MD5	`373d04e9d44b22b2bca5553021bc5958`
BLAKE2b-256	`8de9bd3b9f47252dc4f5025f94b12bf86e39944dbfd398a3bae93af1aaeb527a`
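A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a generic sketch; the helper names `sha256_of` and `matches` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches("corpus_cluster_explorer-0.1.2.tar.gz", "3a9cff6909624676194d09a0c667508181af6cac93bea822af731fa2ee60c460")`, run from the directory that holds the download.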

File details

Details for the file corpus_cluster_explorer-0.1.2-py3-none-any.whl.

File metadata

Download URL: corpus_cluster_explorer-0.1.2-py3-none-any.whl
Upload date: Apr 30, 2026
Size: 13.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for corpus_cluster_explorer-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3bce73e6164072d50a8a771fb1e67e70ffff727ee1e872b754f8a04a50b9828c`
MD5	`dfe96a5d8a08468640c26593214e29f3`
BLAKE2b-256	`0db8308a7d9ddbc0afaee5e1938fd98930cd86127bd5bcb72e0976b4d501dc45`
