A standalone NMF topic modeling tool for Turkish and English texts

NMF Standalone

A comprehensive topic modeling library for Turkish and English texts using Non-negative Matrix Factorization (NMF).

Features

Multi-language Support: Native support for Turkish and English text processing
Advanced NMF Algorithms: Standard NMF and Orthogonal Projective NMF (OPNMF) variants
Modern Tokenization: BPE and WordPiece tokenizers for Turkish, traditional preprocessing for English
Comprehensive Preprocessing: Language-specific text cleaning, emoji processing, and normalization
Rich Visualizations: Word clouds, topic distribution plots, and co-occurrence heatmaps
Multiple Export Formats: Excel reports, JSON results, and database storage
Coherence Evaluation: Built-in topic coherence scoring for model evaluation
CLI and Python API: Both command-line interface and programmatic access

Installation

pip install nmf-standalone

For visualization features:

pip install "nmf-standalone[visualization]"

Quick Start

Command Line Interface

# Analyze Turkish app reviews with 5 topics
nmf-standalone analyze reviews.csv --column REVIEW --language TR --topics 5 --wordclouds

# Analyze English documents with lemmatization
nmf-standalone analyze docs.xlsx --column text --language EN --topics 10 --lemmatize --excel

# Use BPE tokenizer for Turkish text
nmf-standalone analyze data.csv --column content --language TR --tokenizer bpe --topics 7

Python API

from nmf_standalone import run_topic_analysis

# Simple analysis
result = run_topic_analysis(
    "data.csv",
    column="text_column", 
    language="TR",
    topics=5,
    generate_wordclouds=True
)

# Access results
topics = result['topic_word_scores']
for topic_id, words in topics.items():
    print(f"Topic {topic_id}: {', '.join([word for word, score in words[:5]])}")

# Advanced configuration
result = run_topic_analysis(
    "reviews.csv",
    column="review_text",
    language="TR", 
    topics=7,
    nmf_method="opnmf",  # Use projective NMF
    tokenizer_type="wordpiece",
    words_per_topic=20,
    export_excel=True,
    topic_distribution=True
)
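
The result-access loop above treats `result['topic_word_scores']` as a mapping from topic id to a list of `(word, score)` pairs. Under that assumption, a small helper can reduce the mapping to the top words per topic. This is only a sketch: `top_words` is a hypothetical helper, not part of the package, and `sample` is made-up data standing in for a real result.

```python
from typing import Dict, List, Tuple

TopicScores = Dict[str, List[Tuple[str, float]]]


def top_words(topic_word_scores: TopicScores, n: int = 5) -> Dict[str, List[str]]:
    """Return the n highest-scoring words for each topic."""
    summary = {}
    for topic_id, pairs in topic_word_scores.items():
        # Sort explicitly so the helper does not rely on the input order.
        ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
        summary[topic_id] = [word for word, _score in ranked[:n]]
    return summary


# Made-up data in the assumed shape; a real run would pass
# result["topic_word_scores"] instead.
sample = {
    "Topic 00": [("kargo", 0.91), ("teslimat", 0.74), ("gecikme", 0.52)],
    "Topic 01": [("uygulama", 0.88), ("güncelleme", 0.61), ("hata", 0.80)],
}

for topic_id, words in top_words(sample, n=2).items():
    print(f"{topic_id}: {', '.join(words)}")
```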

Supported File Formats

CSV files: Automatic delimiter detection, UTF-8 encoding
Excel files: .xlsx and .xls formats
Text columns: Any column containing text data for analysis
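
To check an input file before running an analysis, delimiter detection can be reproduced with the standard library. The sketch below uses `csv.Sniffer`; it is not necessarily how the package detects delimiters, and `detect_delimiter` and the sample text are illustrative assumptions.

```python
import csv
import io


def detect_delimiter(sample_text: str, candidates: str = ",;\t|") -> str:
    """Guess the column delimiter from a sample of the file's text."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter


# In-memory stand-in for the first lines of a UTF-8 encoded CSV file.
sample_text = (
    "id;REVIEW\n"
    "1;harika uygulama\n"
    "2;çok yavaş açılıyor\n"
    "3;güncelleme sonrası bozuldu\n"
)

delimiter = detect_delimiter(sample_text)
rows = list(csv.DictReader(io.StringIO(sample_text), delimiter=delimiter))
print(repr(delimiter), rows[0]["REVIEW"])
```

For a file on disk, read the first few kilobytes with `open(path, encoding="utf-8", newline="")` and pass that text to `detect_delimiter`.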

Language Support

Turkish

Preprocessing: Advanced text cleaning, Turkish-specific normalization
Tokenization: BPE (Byte-Pair Encoding) and WordPiece tokenizers
Emoji Processing: Intelligent emoji-to-text conversion
TF-IDF: BM25 and traditional TF-IDF with Turkish language adaptations

English

Preprocessing: Standard NLP preprocessing with optional lemmatization
Tokenization: Traditional word-based tokenization
Lemmatization: NLTK-based lemmatization support
TF-IDF: Classical TF-IDF with multiple weighting schemes
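
Both pipelines weight terms before factorization. The exact formulas the package uses are not documented here, so the following is a sketch of the textbook variants: TF-IDF with a smoothed IDF, and BM25 with the common defaults `k1=1.5` and `b=0.75`. All function names and the sample documents are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List

Docs = List[List[str]]


def smooth_idf(term: str, docs: Docs) -> float:
    """Smoothed IDF: ln((N + 1) / (df + 1)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1)) + 1.0


def bm25_idf(term: str, docs: Docs) -> float:
    """BM25 IDF in its always-positive form: ln((N - df + 0.5) / (df + 0.5) + 1)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) - df + 0.5) / (df + 0.5) + 1.0)


def bm25_tf(tf: int, doc_len: int, avg_len: float,
            k1: float = 1.5, b: float = 0.75) -> float:
    """Saturating, length-normalized term frequency used by BM25."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def tfidf_weights(doc: List[str], docs: Docs) -> Dict[str, float]:
    """Raw term count multiplied by the smoothed IDF."""
    return {t: tf * smooth_idf(t, docs) for t, tf in Counter(doc).items()}


def bm25_weights(doc: List[str], docs: Docs,
                 k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """BM25 weight of every term in one document."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    return {
        t: bm25_idf(t, docs) * bm25_tf(tf, len(doc), avg_len, k1, b)
        for t, tf in Counter(doc).items()
    }


# Tiny made-up corpus of already tokenized documents.
docs = [
    ["kargo", "geç", "geldi", "kargo"],
    ["uygulama", "çok", "yavaş"],
    ["kargo", "hızlı", "geldi"],
]

bm25 = bm25_weights(docs[0], docs)
for term, weight in sorted(tfidf_weights(docs[0], docs).items()):
    print(f"{term}: tfidf={weight:.3f} bm25={bm25[term]:.3f}")
```

The practical difference is saturation: under TF-IDF a term that appears twice counts twice as much, while under BM25 each repeat adds less than the one before, and long documents are discounted through `b`.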

Algorithm Options

NMF Methods

Standard NMF (nmf): Classical non-negative matrix factorization
Orthogonal Projective NMF (opnmf): Enhanced variant for better topic separation
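
For readers new to NMF, the standard method factors a non-negative document-term matrix V into a document-topic matrix W and a topic-term matrix H so that V ≈ WH. The sketch below uses the classic multiplicative update rules in pure Python on a toy matrix. It illustrates the idea only and is not the package's solver; the OPNMF variant is not shown, and the `terms` list and matrix values are made up.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v: Matrix, k: int, iterations: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_docs)]
    return w, h


# Toy document-term matrix: rows are documents, columns are terms.
terms = ["kargo", "teslimat", "uygulama", "hata"]
v = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
]

w, h = nmf(v, k=2)
for topic_id, row in enumerate(h):
    ranked = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)
    print(f"Topic {topic_id:02d}:", ", ".join(term for term, _ in ranked[:2]))
print(f"reconstruction error: {frobenius_error(v, w, h):.3f}")
```

Each row of H ranks the terms for one topic, and each row of W gives one document's weight on every topic.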

Tokenization (Turkish)

BPE (bpe): Byte-Pair Encoding for subword tokenization
WordPiece (wordpiece): Google's WordPiece algorithm
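
Subword tokenizers suit Turkish because a single stem takes many suffixes. As a rough illustration of how BPE learns its vocabulary, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy word list. It is a minimal pure-Python version of the algorithm, not the tokenizer the package ships, and `learn_bpe`, `merge_pair`, and `segment` are hypothetical names.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: List[str], num_merges: int) -> List[Pair]:
    """Learn up to num_merges merge rules, most frequent pair first."""
    vocab = Counter(tuple(word) for word in words)
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Split a word into subwords by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: forms of "ev" (house) with different suffixes.
words = ["evler", "evlerde", "evde", "ev"]
merges = learn_bpe(words, num_merges=3)
print(merges)
print(segment("evden", merges))
```

WordPiece follows the same merge-based scheme but chooses pairs by a likelihood-based score rather than raw frequency.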

Output Formats

Generated Files

Word Clouds: PNG images for each topic showing prominent words
Excel Reports: Detailed topic-word matrices with scores
Topic Distribution: Plots showing document-topic relationships
JSON Results: Machine-readable topic and document data
Database Storage: SQLite databases for persistent storage

Directory Structure

Output/
└── {dataset_name}/
    ├── {dataset}_topics.xlsx           # Excel report
    ├── {dataset}_coherence_scores.json # Model evaluation
    ├── {dataset}_document_dist.png     # Topic distribution
    ├── top_docs_{dataset}.json         # Top documents per topic
    └── wordclouds/                     # Word cloud images
        ├── Topic_00.png
        ├── Topic_01.png
        └── ...
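
When post-processing results in a script, it can help to compute the expected paths up front. The sketch below derives them from the tree above, assuming that `{dataset_name}` and `{dataset}` are the same string and that word cloud files are numbered from zero with two digits. `expected_outputs` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path
from typing import Dict, List, Union


def expected_outputs(dataset: str, topics: int,
                     root: str = "Output") -> Dict[str, Union[Path, List[Path]]]:
    """Build the output paths implied by the directory layout."""
    base = Path(root) / dataset
    return {
        "excel": base / f"{dataset}_topics.xlsx",
        "coherence": base / f"{dataset}_coherence_scores.json",
        "distribution": base / f"{dataset}_document_dist.png",
        "top_docs": base / f"top_docs_{dataset}.json",
        # Assumed numbering: Topic_00.png, Topic_01.png, ...
        "wordclouds": [base / "wordclouds" / f"Topic_{i:02d}.png" for i in range(topics)],
    }


paths = expected_outputs("reviews", topics=3)
for name, value in paths.items():
    print(name, value)
```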

Configuration Options

Parameter	Type	Default	Description
`topics`	int	5	Number of topics to extract
`words_per_topic`	int	15	Top words to display per topic
`language`	str	"TR"	Language code (TR/EN)
`nmf_method`	str	"nmf"	Algorithm variant (nmf/opnmf)
`tokenizer_type`	str	"bpe"	Tokenizer for Turkish (bpe/wordpiece)
`lemmatize`	bool	False	Apply lemmatization (English only)
`generate_wordclouds`	bool	True	Create word cloud visualizations
`export_excel`	bool	True	Export Excel reports
`topic_distribution`	bool	True	Generate distribution plots
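
One way to keep these options in a single place is a small dataclass that mirrors the table's defaults and rejects values outside the documented choices. This is a sketch: `TopicConfig` is a hypothetical wrapper, not part of the package, and the checks encode only what the table states.

```python
from dataclasses import asdict, dataclass


@dataclass
class TopicConfig:
    """Defaults copied from the configuration table."""
    topics: int = 5
    words_per_topic: int = 15
    language: str = "TR"
    nmf_method: str = "nmf"
    tokenizer_type: str = "bpe"
    lemmatize: bool = False
    generate_wordclouds: bool = True
    export_excel: bool = True
    topic_distribution: bool = True

    def __post_init__(self) -> None:
        if self.topics < 1 or self.words_per_topic < 1:
            raise ValueError("topics and words_per_topic must be positive")
        if self.language not in {"TR", "EN"}:
            raise ValueError(f"unsupported language: {self.language!r}")
        if self.nmf_method not in {"nmf", "opnmf"}:
            raise ValueError(f"unknown nmf_method: {self.nmf_method!r}")
        if self.tokenizer_type not in {"bpe", "wordpiece"}:
            raise ValueError(f"unknown tokenizer_type: {self.tokenizer_type!r}")


config = TopicConfig(language="EN", topics=10, lemmatize=True)
print(asdict(config))
```

Assuming the keyword names match the table, as the Python API examples suggest, the options could then be passed along as `run_topic_analysis("docs.xlsx", column="text", **asdict(config))`.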

Requirements

Python ≥ 3.8
NumPy ≥ 1.24.0
Pandas ≥ 2.0.0
scikit-learn ≥ 1.3.0
NLTK ≥ 3.8.0
gensim ≥ 4.3.0
tokenizers ≥ 0.19.0

Examples

Turkish App Review Analysis

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "app_reviews.csv",
    column="review_text",
    language="TR",
    topics=8,
    tokenizer_type="bpe",
    generate_wordclouds=True,
    export_excel=True
)

English Document Classification

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "documents.xlsx", 
    column="content",
    language="EN",
    topics=12,
    lemmatize=True,
    nmf_method="opnmf",
    words_per_topic=25
)

Medical Text Analysis

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "medical_notes.csv",
    column="impression", 
    language="EN",
    topics=15,
    lemmatize=True,
    generate_wordclouds=True,
    topic_distribution=True
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this library in your research, please cite:

@software{nmf_standalone,
  author = {Emir Karayagiz},
  title = {NMF Standalone: Topic Modeling for Turkish and English Texts},
  url = {https://github.com/emirkarayagiz/nmf-standalone},
  version = {0.1.0},
  year = {2024}
}

This version: 0.1.7 (released Jul 2, 2025)
