A standalone NMF topic modeling tool for Turkish and English texts

NMF Standalone

A comprehensive topic modeling library for Turkish and English texts using Non-negative Matrix Factorization (NMF).

Features

  • Multi-language Support: Native support for Turkish and English text processing
  • Advanced NMF Algorithms: Standard NMF and Orthogonal Projective NMF (OPNMF) variants
  • Modern Tokenization: BPE and WordPiece tokenizers for Turkish, traditional preprocessing for English
  • Comprehensive Preprocessing: Language-specific text cleaning, emoji processing, and normalization
  • Rich Visualizations: Word clouds, topic distribution plots, and co-occurrence heatmaps
  • Multiple Export Formats: Excel reports, JSON results, and database storage
  • Coherence Evaluation: Built-in topic coherence scoring for model evaluation
  • CLI and Python API: Both command-line interface and programmatic access

Installation

pip install nmf-standalone

For visualization features:

pip install "nmf-standalone[visualization]"

Quick Start

Command Line Interface

# Analyze Turkish app reviews with 5 topics
nmf-standalone analyze reviews.csv --column REVIEW --language TR --topics 5 --wordclouds

# Analyze English documents with lemmatization
nmf-standalone analyze docs.xlsx --column text --language EN --topics 10 --lemmatize --excel

# Use BPE tokenizer for Turkish text
nmf-standalone analyze data.csv --column content --language TR --tokenizer bpe --topics 7

Python API

from nmf_standalone import run_topic_analysis

# Simple analysis
result = run_topic_analysis(
    "data.csv",
    column="text_column", 
    language="TR",
    topics=5,
    generate_wordclouds=True
)

# Access results
topics = result['topic_word_scores']
for topic_id, words in topics.items():
    print(f"Topic {topic_id}: {', '.join([word for word, score in words[:5]])}")

# Advanced configuration
result = run_topic_analysis(
    "reviews.csv",
    column="review_text",
    language="TR", 
    topics=7,
    nmf_method="opnmf",  # Use Orthogonal Projective NMF (OPNMF)
    tokenizer_type="wordpiece",
    words_per_topic=20,
    export_excel=True,
    topic_distribution=True
)
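
The result-access loop above implies that result['topic_word_scores'] maps each topic ID to a sequence of (word, score) pairs. Assuming that shape, here is a minimal sketch that flattens it into rows, for example to load into a DataFrame. The flatten_topics helper and the sample data are hypothetical and not part of the package.

```python
def flatten_topics(topic_word_scores, top_n=5):
    """Return (topic_id, rank, word, score) rows for the top_n words of each topic."""
    rows = []
    for topic_id, words in topic_word_scores.items():
        for rank, (word, score) in enumerate(words[:top_n], start=1):
            rows.append((topic_id, rank, word, score))
    return rows


# Made-up data in the assumed shape; real scores would come from run_topic_analysis.
sample = {
    "Topic 0": [("kargo", 0.42), ("teslimat", 0.31), ("hızlı", 0.12)],
    "Topic 1": [("fiyat", 0.50), ("pahalı", 0.25)],
}

for row in flatten_topics(sample, top_n=2):
    print(row)
```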

Supported File Formats

  • CSV files: Automatic delimiter detection, UTF-8 encoding
  • Excel files: .xlsx and .xls formats
  • Text columns: Any column containing text data for analysis
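
This README does not describe how the package detects delimiters. To check a file yourself before running an analysis, one option is the standard library's csv.Sniffer, sketched below. The detect_delimiter helper and the candidate set are assumptions for illustration only.

```python
import csv


def detect_delimiter(sample_text, candidates=",;\t|"):
    """Guess the delimiter of a CSV sample, restricted to common candidates."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter


# In practice, read the first few kilobytes of the file with encoding="utf-8".
sample = "id;review\n1;good app\n2;bad app\n"
print(repr(detect_delimiter(sample)))
```

Restricting the candidates keeps the sniffer from picking a letter that happens to occur equally often on every line.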

Language Support

Turkish

  • Preprocessing: Advanced text cleaning, Turkish-specific normalization
  • Tokenization: BPE (Byte-Pair Encoding) and WordPiece tokenizers
  • Emoji Processing: Intelligent emoji-to-text conversion
  • TF-IDF: BM25 and traditional TF-IDF with Turkish language adaptations
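
The BM25 weighting mentioned above is not specified further in this README. As a rough illustration, the sketch below computes the common Okapi BM25 term weight with a smoothed IDF. The parameter values k1=1.5 and b=0.75 and the bm25_weights helper are assumptions, not the package's settings or API.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.5, b=0.75):
    """Return one {term: weight} dict per tokenized document (Okapi BM25).

    Assumes at least one non-empty document.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are damped.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weights.append({
            term: math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            * f * (k1 + 1) / (f + norm)
            for term, f in tf.items()
        })
    return weights


docs = [
    ["kargo", "hızlı", "geldi"],
    ["kargo", "geç", "geldi"],
    ["fiyat", "pahalı"],
]
weights = bm25_weights(docs)
print(weights[0])
```

Adding 1 inside the logarithm keeps every weight positive, which matters because NMF requires a non-negative input matrix.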

English

  • Preprocessing: Standard NLP preprocessing with optional lemmatization
  • Tokenization: Traditional word-based tokenization
  • Lemmatization: NLTK-based lemmatization support
  • TF-IDF: Classical TF-IDF with multiple weighting schemes

Algorithm Options

NMF Methods

  • Standard NMF (nmf): Classical non-negative matrix factorization
  • Orthogonal Projective NMF (opnmf): Enhanced variant for better topic separation
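
To make the standard variant concrete, here is a minimal NumPy sketch of NMF using the well-known Lee-Seung multiplicative updates for the Frobenius objective. It is an illustration under assumptions, not the package's solver: the function name, random initialization, and iteration count are chosen for the example, and the OPNMF variant is not covered.

```python
import numpy as np


def nmf_multiplicative(X, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix X (docs x terms) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics)) + eps   # document-topic weights
    H = rng.random((n_topics, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy document-term matrix with two obvious term groups.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
], dtype=float)

W, H = nmf_multiplicative(X, n_topics=2)
print(np.round(W @ H, 2))
```

Each row of H can be read as a topic (a weighting over terms), and each row of W as a document's mix of topics.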

Tokenization (Turkish)

  • BPE (bpe): Byte-Pair Encoding for subword tokenization
  • WordPiece (wordpiece): Google's WordPiece algorithm
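
For readers new to subword tokenization, the following toy sketch shows the core idea of BPE: repeatedly merge the most frequent adjacent symbol pair. It is a teaching example in plain Python with hypothetical helper names. The package lists the tokenizers library in its requirements, so its actual tokenizer will differ.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["gel", "geldi", "gelir", "git"], num_merges=2)
print(merges)
print(apply_bpe("geldim", merges))
```

Because Turkish is agglutinative, frequent stems such as "gel" tend to become single tokens while rarer suffix combinations stay split, which keeps the vocabulary compact.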

Output Formats

Generated Files

  • Word Clouds: PNG images for each topic showing prominent words
  • Excel Reports: Detailed topic-word matrices with scores
  • Topic Distribution: Plots showing document-topic relationships
  • JSON Results: Machine-readable topic and document data
  • Database Storage: SQLite databases for persistent storage

Directory Structure

Output/
└── {dataset_name}/
    ├── {dataset}_topics.xlsx           # Excel report
    ├── {dataset}_coherence_scores.json # Model evaluation
    ├── {dataset}_document_dist.png     # Topic distribution
    ├── top_docs_{dataset}.json         # Top documents per topic
    └── wordclouds/                     # Word cloud images
        ├── Topic_00.png
        ├── Topic_01.png
        └── ...
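
Given the layout above, a small helper can build the expected output paths for a dataset. This is a sketch with assumptions: the expected_outputs function is hypothetical, {dataset_name} and {dataset} are taken to be the same value, and word cloud files are assumed to be zero-based with two-digit padding, as the listing suggests.

```python
from pathlib import Path


def expected_outputs(dataset, n_topics, root="Output"):
    """Build the output paths described by the directory layout."""
    base = Path(root) / dataset
    return {
        "excel": base / f"{dataset}_topics.xlsx",
        "coherence": base / f"{dataset}_coherence_scores.json",
        "distribution": base / f"{dataset}_document_dist.png",
        "top_docs": base / f"top_docs_{dataset}.json",
        "wordclouds": [
            base / "wordclouds" / f"Topic_{i:02d}.png" for i in range(n_topics)
        ],
    }


paths = expected_outputs("reviews", n_topics=3)
for name in ("excel", "coherence", "distribution", "top_docs"):
    # exists() reports whether a previous run actually produced the file.
    print(name, paths[name], paths[name].exists())
```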

Configuration Options

Parameter            Type  Default  Description
topics               int   5        Number of topics to extract
words_per_topic      int   15       Top words to display per topic
language             str   "TR"     Language code (TR/EN)
nmf_method           str   "nmf"    Algorithm variant (nmf/opnmf)
tokenizer_type       str   "bpe"    Tokenizer for Turkish (bpe/wordpiece)
lemmatize            bool  False    Apply lemmatization (English only)
generate_wordclouds  bool  True     Create word cloud visualizations
export_excel         bool  True     Export Excel reports
topic_distribution   bool  True     Generate distribution plots
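
One way to keep these options together and catch typos early is a small dataclass that mirrors the documented defaults. The AnalysisConfig class below is a hypothetical wrapper, not part of the package. It assumes the parameter names in the table are accepted as keyword arguments, as the examples in this README suggest.

```python
from dataclasses import asdict, dataclass


@dataclass
class AnalysisConfig:
    """Documented options with their defaults, validated on creation."""
    topics: int = 5
    words_per_topic: int = 15
    language: str = "TR"
    nmf_method: str = "nmf"
    tokenizer_type: str = "bpe"
    lemmatize: bool = False
    generate_wordclouds: bool = True
    export_excel: bool = True
    topic_distribution: bool = True

    def __post_init__(self):
        if self.topics < 1 or self.words_per_topic < 1:
            raise ValueError("topics and words_per_topic must be positive")
        if self.language not in ("TR", "EN"):
            raise ValueError(f"unsupported language: {self.language!r}")
        if self.nmf_method not in ("nmf", "opnmf"):
            raise ValueError(f"unknown nmf_method: {self.nmf_method!r}")
        if self.tokenizer_type not in ("bpe", "wordpiece"):
            raise ValueError(f"unknown tokenizer_type: {self.tokenizer_type!r}")


config = AnalysisConfig(language="EN", topics=10, lemmatize=True)
kwargs = asdict(config)
print(kwargs)
# With the package installed, these could then be passed along, e.g.:
# run_topic_analysis("docs.xlsx", column="text", **kwargs)
```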

Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.24.0
  • Pandas ≥ 2.0.0
  • scikit-learn ≥ 1.3.0
  • NLTK ≥ 3.8.0
  • gensim ≥ 4.3.0
  • tokenizers ≥ 0.19.0

Examples

Turkish App Review Analysis

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "app_reviews.csv",
    column="review_text",
    language="TR",
    topics=8,
    tokenizer_type="bpe",
    generate_wordclouds=True,
    export_excel=True
)

English Document Topic Analysis

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "documents.xlsx",
    column="content",
    language="EN",
    topics=12,
    lemmatize=True,
    nmf_method="opnmf",
    words_per_topic=25
)

Medical Text Analysis

from nmf_standalone import run_topic_analysis

result = run_topic_analysis(
    "medical_notes.csv",
    column="impression",
    language="EN",
    topics=15,
    lemmatize=True,
    generate_wordclouds=True,
    topic_distribution=True
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this library in your research, please cite:

@software{nmf_standalone,
  author = {Emir Karayagiz},
  title = {NMF Standalone: Topic Modeling for Turkish and English Texts},
  url = {https://github.com/emirkarayagiz/nmf-standalone},
  version = {0.1.0},
  year = {2024}
}
