A standalone NMF topic modeling tool for Turkish and English texts
Project description
NMF Standalone
A comprehensive topic modeling library for Turkish and English texts using Non-negative Matrix Factorization (NMF).
Features
- Multi-language Support: Native support for Turkish and English text processing
- Advanced NMF Algorithms: Standard NMF and Orthogonal Projective NMF (OPNMF) variants
- Modern Tokenization: BPE and WordPiece tokenizers for Turkish, traditional preprocessing for English
- Comprehensive Preprocessing: Language-specific text cleaning, emoji processing, and normalization
- Rich Visualizations: Word clouds, topic distribution plots, and co-occurrence heatmaps
- Multiple Export Formats: Excel reports, JSON results, and database storage
- Coherence Evaluation: Built-in topic coherence scoring for model evaluation
- CLI and Python API: Both command-line interface and programmatic access
Installation
pip install nmf-standalone
For visualization features:
pip install nmf-standalone[visualization]
Quick Start
Command Line Interface
# Analyze Turkish app reviews with 5 topics
nmf-standalone analyze reviews.csv --column REVIEW --language TR --topics 5 --wordclouds
# Analyze English documents with lemmatization
nmf-standalone analyze docs.xlsx --column text --language EN --topics 10 --lemmatize --excel
# Use BPE tokenizer for Turkish text
nmf-standalone analyze data.csv --column content --language TR --tokenizer bpe --topics 7
Python API
from nmf_standalone import run_topic_analysis
# Simple analysis
result = run_topic_analysis(
"data.csv",
column="text_column",
language="TR",
topics=5,
generate_wordclouds=True
)
# Access results
topics = result['topic_word_scores']
for topic_id, words in topics.items():
print(f"Topic {topic_id}: {', '.join([word for word, score in words[:5]])}")
# Advanced configuration
result = run_topic_analysis(
"reviews.csv",
column="review_text",
language="TR",
topics=7,
nmf_method="opnmf", # Use projective NMF
tokenizer_type="wordpiece",
words_per_topic=20,
export_excel=True,
topic_distribution=True
)
Supported File Formats
- CSV files: Automatic delimiter detection, UTF-8 encoding
- Excel files: .xlsx and .xls formats
- Text columns: Any column containing text data for analysis
Language Support
Turkish
- Preprocessing: Advanced text cleaning, Turkish-specific normalization
- Tokenization: BPE (Byte-Pair Encoding) and WordPiece tokenizers
- Emoji Processing: Intelligent emoji-to-text conversion
- TF-IDF: BM25 and traditional TF-IDF with Turkish language adaptations
English
- Preprocessing: Standard NLP preprocessing with optional lemmatization
- Tokenization: Traditional word-based tokenization
- Lemmatization: NLTK-based lemmatization support
- TF-IDF: Classical TF-IDF with multiple weighting schemes
Algorithm Options
NMF Methods
- Standard NMF (
nmf): Classical non-negative matrix factorization - Orthogonal Projective NMF (
opnmf): Enhanced variant for better topic separation
Tokenization (Turkish)
- BPE (
bpe): Byte-Pair Encoding for subword tokenization - WordPiece (
wordpiece): Google's WordPiece algorithm
Output Formats
Generated Files
- Word Clouds: PNG images for each topic showing prominent words
- Excel Reports: Detailed topic-word matrices with scores
- Topic Distribution: Plots showing document-topic relationships
- JSON Results: Machine-readable topic and document data
- Database Storage: SQLite databases for persistent storage
Directory Structure
Output/
└── {dataset_name}/
├── {dataset}_topics.xlsx # Excel report
├── {dataset}_coherence_scores.json # Model evaluation
├── {dataset}_document_dist.png # Topic distribution
├── top_docs_{dataset}.json # Top documents per topic
└── wordclouds/ # Word cloud images
├── Topic_00.png
├── Topic_01.png
└── ...
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
topics |
int | 5 | Number of topics to extract |
words_per_topic |
int | 15 | Top words to display per topic |
language |
str | "TR" | Language code (TR/EN) |
nmf_method |
str | "nmf" | Algorithm variant (nmf/opnmf) |
tokenizer_type |
str | "bpe" | Tokenizer for Turkish (bpe/wordpiece) |
lemmatize |
bool | False | Apply lemmatization (English only) |
generate_wordclouds |
bool | True | Create word cloud visualizations |
export_excel |
bool | True | Export Excel reports |
topic_distribution |
bool | True | Generate distribution plots |
Requirements
- Python ≥ 3.8
- NumPy ≥ 1.24.0
- Pandas ≥ 2.0.0
- scikit-learn ≥ 1.3.0
- NLTK ≥ 3.8.0
- gensim ≥ 4.3.0
- tokenizers ≥ 0.19.0
Examples
Turkish App Review Analysis
result = run_topic_analysis(
"app_reviews.csv",
column="review_text",
language="TR",
topics=8,
tokenizer_type="bpe",
generate_wordclouds=True,
export_excel=True
)
English Document Classification
result = run_topic_analysis(
"documents.xlsx",
column="content",
language="EN",
topics=12,
lemmatize=True,
nmf_method="opnmf",
words_per_topic=25
)
Medical Text Analysis
result = run_topic_analysis(
"medical_notes.csv",
column="impression",
language="EN",
topics=15,
lemmatize=True,
generate_wordclouds=True,
topic_distribution=True
)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this library in your research, please cite:
@software{nmf_standalone,
author = {Emir Karayagiz},
title = {NMF Standalone: Topic Modeling for Turkish and English Texts},
url = {https://github.com/emirkarayagiz/nmf-standalone},
version = {0.1.0},
year = {2024}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nmf_standalone-0.1.7.tar.gz.
File metadata
- Download URL: nmf_standalone-0.1.7.tar.gz
- Upload date:
- Size: 58.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fcf1835390e6b7766227fd1f1be34afe77d3747c0429a002ed21793395ad7ac8
|
|
| MD5 |
e08328becba50c3c748b32080df67507
|
|
| BLAKE2b-256 |
851dce796968908ec600bececf1637ee1f26832b5dc4d82e22d3e80bc01a64e2
|
File details
Details for the file nmf_standalone-0.1.7-py3-none-any.whl.
File metadata
- Download URL: nmf_standalone-0.1.7-py3-none-any.whl
- Upload date:
- Size: 76.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3cae29854446dde305e3282a4b46d3824811702c45b9efb6e212feee0730e5ed
|
|
| MD5 |
124f238271db22f0702ffff6c09c396f
|
|
| BLAKE2b-256 |
3fde596b468eadc668c0e4698017b07a4558278f6b5b021feb7ecbb269c2ee73
|