A standalone NMF topic modeling tool for Turkish and English texts

NMF Standalone

A comprehensive topic modeling library for Turkish and English texts using Non-negative Matrix Factorization (NMF).

Features

  • Multi-language Support: Native support for Turkish and English text processing
  • Advanced NMF Algorithms: Standard NMF and Orthogonal Projective NMF (OPNMF) variants
  • Modern Tokenization: BPE and WordPiece tokenizers for Turkish, traditional preprocessing for English
  • Comprehensive Preprocessing: Language-specific text cleaning, emoji processing, and normalization
  • Rich Visualizations: Word clouds, topic distribution plots, and co-occurrence heatmaps
  • Multiple Export Formats: Excel reports, JSON results, and database storage
  • Coherence Evaluation: Built-in topic coherence scoring for model evaluation
  • CLI and Python API: Both command-line interface and programmatic access

Installation

pip install nmf-standalone

For visualization features:

pip install "nmf-standalone[visualization]"

Quick Start

Command Line Interface

# Analyze Turkish app reviews with 5 topics
nmf-standalone analyze reviews.csv --column REVIEW --language TR --topics 5 --wordclouds

# Analyze English documents with lemmatization
nmf-standalone analyze docs.xlsx --column text --language EN --topics 10 --lemmatize --excel

# Use BPE tokenizer for Turkish text
nmf-standalone analyze data.csv --column content --language TR --tokenizer bpe --topics 7

Python API

from nmf_standalone import run_topic_analysis

# Simple analysis
result = run_topic_analysis(
    "data.csv",
    column="text_column",
    language="TR",
    topics=5,
    generate_wordclouds=True
)

# Access results
topics = result['topic_word_scores']
for topic_id, words in topics.items():
    print(f"Topic {topic_id}: {', '.join([word for word, score in words[:5]])}")

# Advanced configuration
result = run_topic_analysis(
    "reviews.csv",
    column="review_text",
    language="TR",
    topics=7,
    nmf_method="opnmf",  # Use projective NMF
    tokenizer_type="wordpiece",
    words_per_topic=20,
    export_excel=True,
    topic_distribution=True
)
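
The access pattern above suggests that `result['topic_word_scores']` maps each topic ID to a list of `(word, score)` pairs. Assuming that shape (it is inferred from the snippet, not from documented API), a small helper can produce a compact summary. The name `summarize_topics` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

TopicScores = Dict[str, List[Tuple[str, float]]]


def summarize_topics(topic_word_scores: TopicScores, top_n: int = 5) -> List[str]:
    """Return one line per topic listing its top_n highest-scoring words."""
    lines = []
    for topic_id, words in topic_word_scores.items():
        # Sort defensively in case the pairs are not already ordered by score.
        ranked = sorted(words, key=lambda pair: pair[1], reverse=True)[:top_n]
        top_words = ", ".join(word for word, _ in ranked)
        lines.append(f"Topic {topic_id}: {top_words}")
    return lines


# Hand-written stand-in for result['topic_word_scores'] (assumed structure).
sample_scores: TopicScores = {
    "0": [("kargo", 0.91), ("teslimat", 0.74), ("geç", 0.52)],
    "1": [("uygulama", 0.88), ("güncelleme", 0.61), ("hata", 0.57)],
}

for line in summarize_topics(sample_scores, top_n=2):
    print(line)
```

With a real run, `sample_scores` would be replaced by `result['topic_word_scores']`.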

Supported File Formats

  • CSV files: Automatic delimiter detection, UTF-8 encoding
  • Excel files: .xlsx and .xls formats
  • Text columns: Any column containing text data for analysis
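
How the library detects delimiters is not documented here. As a rough illustration of the idea, the sketch below uses the standard library's `csv.Sniffer` to guess the delimiter and then pulls one text column. The function name `read_text_column` and the sample data are assumptions made for this example.

```python
import csv
import io
from typing import List


def read_text_column(raw_csv: str, column: str) -> List[str]:
    """Detect the delimiter of raw_csv and return the non-empty values of one column."""
    # Sniff the delimiter from a sample, restricted to common candidates.
    dialect = csv.Sniffer().sniff(raw_csv[:4096], delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(raw_csv), dialect=dialect)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return [row[column].strip() for row in reader if row[column] and row[column].strip()]


# Semicolon-delimited sample; a file on disk would first be read with
# open(path, encoding="utf-8", newline="").read().
sample = "id;REVIEW\n1;harika uygulama\n2;çok yavaş\n3;\n"
print(read_text_column(sample, "REVIEW"))
```

Dropping empty cells up front avoids feeding blank documents into the vectorization step.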

Language Support

Turkish

  • Preprocessing: Advanced text cleaning, Turkish-specific normalization
  • Tokenization: BPE (Byte-Pair Encoding) and WordPiece tokenizers
  • Emoji Processing: Intelligent emoji-to-text conversion
  • TF-IDF: BM25 and traditional TF-IDF with Turkish language adaptations

English

  • Preprocessing: Standard NLP preprocessing with optional lemmatization
  • Tokenization: Traditional word-based tokenization
  • Lemmatization: NLTK-based lemmatization support
  • TF-IDF: Classical TF-IDF with multiple weighting schemes
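
Both pipelines end in a term-weighting step, and the Turkish one lists BM25. The exact BM25 variant the library uses is not stated, so the following is only a sketch of one common formulation, with an IDF that stays non-negative (NMF needs non-negative input). The function name `bm25_weights` and the defaults `k1=1.5`, `b=0.75` are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_weights(docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[Dict[str, float]]:
    """Compute a BM25 weight for every (document, term) pair."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through this factor.
        length_norm = 1.0 - b + b * len(doc) / avg_len
        row = {}
        for term, tf in term_freq.items():
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            row[term] = idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        weighted.append(row)
    return weighted


docs = [["good", "app"], ["bad", "app"], ["good", "good", "game"]]
weights = bm25_weights(docs)
print(weights[1])
```

Within a single document, a term that appears in fewer documents overall receives the larger weight at equal term frequency.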

Algorithm Options

NMF Methods

  • Standard NMF (nmf): Classical non-negative matrix factorization
  • Orthogonal Projective NMF (opnmf): Projective variant whose orthogonality constraint encourages topics with less word overlap
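
To make the factorization concrete, here is a minimal sketch of standard NMF using the classic multiplicative update rules in NumPy. It is not the library's implementation and does not cover the opnmf variant. The function name `nmf_multiplicative` and the toy matrix are illustrative assumptions.

```python
import numpy as np


def nmf_multiplicative(X, n_topics, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative matrix X (docs x terms) as W @ H."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics)) + eps   # document-topic weights
    H = rng.random((n_topics, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy document-term matrix with two obvious word groups.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf_multiplicative(X, n_topics=2)
error = np.linalg.norm(X - W @ H)
print(f"reconstruction error: {error:.3f}")
```

Each row of `H` ranks terms for one topic, and each row of `W` gives one document's topic weights.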

Tokenization (Turkish)

  • BPE (bpe): Byte-Pair Encoding for subword tokenization
  • WordPiece (wordpiece): Google's WordPiece algorithm
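
For readers new to subword tokenization, the following toy sketch shows how BPE learns merge rules: it repeatedly fuses the most frequent adjacent symbol pair. It is a teaching example in pure Python, not the tokenizer the library ships. The function name `learn_bpe_merges` and the small corpus are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab: Dict[Tuple[str, ...], int] = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Turkish is agglutinative, so shared stems and suffixes tend to merge early.
corpus = ["evler", "evlerde", "evlerden", "kitaplar", "kitaplarda"]
merges = learn_bpe_merges(corpus, num_merges=6)
print(merges)
```

WordPiece follows the same merge-based idea but scores candidate pairs by likelihood gain rather than raw frequency.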

Output Formats

Generated Files

  • Word Clouds: PNG images for each topic showing prominent words
  • Excel Reports: Detailed topic-word matrices with scores
  • Topic Distribution: Plots showing document-topic relationships
  • JSON Results: Machine-readable topic and document data
  • Database Storage: SQLite databases for persistent storage

Directory Structure

Output/
└── {dataset_name}/
    ├── {dataset}_topics.xlsx           # Excel report
    ├── {dataset}_coherence_scores.json # Model evaluation
    ├── {dataset}_document_dist.png     # Topic distribution
    ├── top_docs_{dataset}.json         # Top documents per topic
    └── wordclouds/                     # Word cloud images
        ├── Topic_00.png
        ├── Topic_01.png
        └── ...
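
As a convenience, the layout above can be turned into a small path helper for checking which outputs a run produced. This sketch assumes `{dataset_name}` and `{dataset}` refer to the same name and that the root folder is `Output`. The function name `expected_outputs` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(dataset: str, root: str = "Output") -> Dict[str, Path]:
    """Map a short label to each path shown in the directory layout."""
    base = Path(root) / dataset
    return {
        "excel_report": base / f"{dataset}_topics.xlsx",
        "coherence_scores": base / f"{dataset}_coherence_scores.json",
        "topic_distribution": base / f"{dataset}_document_dist.png",
        "top_documents": base / f"top_docs_{dataset}.json",
        "wordclouds": base / "wordclouds",
    }


# Report which outputs are present after a run (dataset name is illustrative).
for label, path in expected_outputs("reviews").items():
    status = "found" if path.exists() else "missing"
    print(f"{label:20s} {path} ({status})")
```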

Configuration Options

Parameter            Type  Default  Description
topics               int   5        Number of topics to extract
words_per_topic      int   15       Top words to display per topic
language             str   "TR"     Language code (TR/EN)
nmf_method           str   "nmf"    Algorithm variant (nmf/opnmf)
tokenizer_type       str   "bpe"    Tokenizer for Turkish (bpe/wordpiece)
lemmatize            bool  False    Apply lemmatization (English only)
generate_wordclouds  bool  True     Create word cloud visualizations
export_excel         bool  True     Export Excel reports
topic_distribution   bool  True     Generate distribution plots
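
One way to keep these options in one place is a small dataclass that mirrors the documented names and defaults and rejects unsupported values early. `AnalysisConfig` is a hypothetical helper, not part of the package. Whether `run_topic_analysis` accepts every field as a keyword argument is assumed from the examples in this README.

```python
from dataclasses import asdict, dataclass


@dataclass
class AnalysisConfig:
    """Hypothetical container mirroring the documented parameters and defaults."""
    topics: int = 5
    words_per_topic: int = 15
    language: str = "TR"
    nmf_method: str = "nmf"
    tokenizer_type: str = "bpe"
    lemmatize: bool = False
    generate_wordclouds: bool = True
    export_excel: bool = True
    topic_distribution: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the documented choices before a long run starts.
        if self.topics < 1 or self.words_per_topic < 1:
            raise ValueError("topics and words_per_topic must be positive")
        if self.language not in {"TR", "EN"}:
            raise ValueError(f"unsupported language: {self.language!r}")
        if self.nmf_method not in {"nmf", "opnmf"}:
            raise ValueError(f"unsupported nmf_method: {self.nmf_method!r}")
        if self.tokenizer_type not in {"bpe", "wordpiece"}:
            raise ValueError(f"unsupported tokenizer_type: {self.tokenizer_type!r}")
        if self.lemmatize and self.language != "EN":
            raise ValueError("lemmatize applies to English only")


config = AnalysisConfig(language="EN", topics=10, lemmatize=True)
kwargs = asdict(config)
# kwargs could then be forwarded, e.g.
# run_topic_analysis("docs.xlsx", column="text", **kwargs)
print(kwargs)
```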

Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.24.0
  • Pandas ≥ 2.0.0
  • scikit-learn ≥ 1.3.0
  • NLTK ≥ 3.8.0
  • gensim ≥ 4.3.0
  • tokenizers ≥ 0.19.0

Examples

Turkish App Review Analysis

result = run_topic_analysis(
    "app_reviews.csv",
    column="review_text",
    language="TR",
    topics=8,
    tokenizer_type="bpe",
    generate_wordclouds=True,
    export_excel=True
)

English Document Topic Analysis

result = run_topic_analysis(
    "documents.xlsx",
    column="content",
    language="EN",
    topics=12,
    lemmatize=True,
    nmf_method="opnmf",
    words_per_topic=25
)

Medical Text Analysis

result = run_topic_analysis(
    "medical_notes.csv",
    column="impression",
    language="EN",
    topics=15,
    lemmatize=True,
    generate_wordclouds=True,
    topic_distribution=True
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this library in your research, please cite:

@software{nmf_standalone,
  author = {Emir Karayagiz},
  title = {NMF Standalone: Topic Modeling for Turkish and English Texts},
  url = {https://github.com/emirkarayagiz/nmf-standalone},
  version = {0.1.0},
  year = {2024}
}
