FastText-based response similarity analyzer for human raters

Simars - FastText-based similarity analysis of answers & responses for human raters

Simars is a toolkit for analyzing response similarity with FastText embeddings. It is designed to support human raters in educational assessment and text analysis tasks.

🚀 Features

  • Text Preprocessing: Comprehensive Korean and English text cleaning and tokenization
  • FastText Integration: Training and fine-tuning of FastText models with Korean support
  • Dimensionality Reduction: Support for UMAP, PCA, and t-SNE algorithms
  • Clustering: HDBSCAN clustering for response grouping
  • Interactive Visualization: Plotly-based interactive scatter plots with multiple visualization modes
  • Jamo Processing: Advanced Korean text processing with Jamo decomposition
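
FastText represents a word through its character n-grams, which is why short or misspelled responses can still land near the expected answer. The sketch below only illustrates that idea in plain Python. The helper name `char_ngrams` and its defaults are assumptions made for this example, not part of the Simars or gensim API, and real FastText additionally hashes the n-grams into buckets.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return FastText-style character n-grams of a word.

    The word is wrapped in "<" and ">" so that prefixes and suffixes
    get n-grams distinct from those in the middle of the word.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Hypothetical illustration with one of the sample responses.
print(char_ngrams("허무감", min_n=2, max_n=3))
```

Because "허무" and "허무감" share several of these n-grams, their vectors share components, which is the property the similarity analysis relies on.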

📦 Installation

pip install simars

# Install spaCy English model (required for text processing)
python -m spacy download en_core_web_sm

Development Installation

git clone https://github.com/h000000nkim/simars.git
cd simars
pip install -e ".[dev]"

🔧 Dependencies

  • Core: gensim, numpy, pandas, scikit-learn, umap-learn
  • NLP: jamo, pecab, spacy
  • Visualization: plotly
  • Clustering: hdbscan
  • Development: pytest, ruff, mkdocs-material

📖 Quick Start

Basic Usage

import simars
import numpy as np

# Sample data
answers = np.array([["허무"], ["흡수율"], ["부사어"]])
# Rows differ in length, so dtype=object is required on NumPy 1.24+
responses = np.array([
    ["허무", "공허", "무상", "허무감", "초월"],
    ["흡수율", "흡수", "반사율", "알베도"],
    ["부사어", "부사", "수식어", "부가어"]
], dtype=object)
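informations = np.array([
    "문학 문제에 대한 정서적 태도",  # emotional attitude in a literature question
    "과학 제재의 핵심 개념",  # key concept in a science passage
    "문법 성분 분석"  # analysis of grammatical constituents
])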

# Initialize Simars
analyzer = simars.Fastrs(
    answers=answers,
    responses=responses,
    informations=informations
)

# Preprocess text data
analyzer.preprocess()

# Train FastText model
model = analyzer.train(
    vector_size=100,
    window=5,
    min_count=1,
    epochs=10
)

# Reduce dimensionality
coordinates = analyzer.reduce(method="umap", n_neighbors=5)

# Perform clustering
analyzer.hdbscanize()

# Visualize results
figures = analyzer.visualize()
for fig in figures:
    fig.show()

Advanced Usage with Custom Data Structure

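import simars

# Using dictionary format
data = {
    "item1": {
        "answer": ["정답1"],  # correct answer 1
        "response": ["정답1", "오답1", "오답2"],  # correct answer 1, wrong answers 1 and 2
        "information": "문항 설명"  # item description
    },
    "item2": {
        "answer": ["정답2"],  # correct answer 2
        "response": ["정답2", "유사답", "오답"],  # correct answer 2, near-correct answer, wrong answer
        "information": "다른 문항 설명"  # another item description
    }
}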

analyzer = simars.Fastrs(data=data)
analyzer.preprocess()

# Fine-tune existing model
pretrained_model = simars.util.get_pretrained_model()
analyzer.finetune(model=pretrained_model, epochs=5)

# Advanced reduction with custom parameters
coordinates = analyzer.reduce(
    method="umap",
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine"
)
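
If your data already exists as parallel lists like those in Basic Usage, the dictionary format can be built with a small helper. This is a minimal sketch: `to_item_dict` is a hypothetical function written for this example, and the `item1`, `item2` key naming simply mirrors the sample above rather than a documented requirement.

```python
def to_item_dict(answers, responses, informations):
    """Build {"itemN": {"answer", "response", "information"}} from parallel lists."""
    if not (len(answers) == len(responses) == len(informations)):
        raise ValueError("answers, responses and informations must have the same length")
    return {
        f"item{index}": {
            "answer": list(answer),
            "response": list(response),
            "information": information,
        }
        for index, (answer, response, information) in enumerate(
            zip(answers, responses, informations), start=1
        )
    }


# Sample values taken from the Basic Usage example.
data = to_item_dict(
    answers=[["허무"], ["흡수율"]],
    responses=[["허무", "공허", "무상"], ["흡수율", "흡수"]],
    informations=["문학 문제에 대한 정서적 태도", "과학 제재의 핵심 개념"],
)
print(list(data))
```

The length check fails early when an item is missing its answer or description, which is easier to debug than a mismatch discovered during preprocessing.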

🛠️ Core Components

Fastrs Class

Main class for response similarity analysis:

  • preprocess(): Clean, tokenize, and prepare text data
  • train(): Train new FastText model from scratch
  • finetune(): Fine-tune existing FastText model
  • reduce(): Reduce embeddings dimensionality
  • hdbscanize(): Perform clustering analysis
  • visualize(): Create interactive visualizations
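
Similarity between embedding vectors is commonly scored with cosine similarity. The sketch below ranks responses by their similarity to an answer vector. It uses made-up three-dimensional vectors in place of FastText output, and it is an illustration of the general technique, not a description of how Fastrs computes similarity internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (assumed for illustration only).
answer_vector = [1.0, 0.0, 0.0]
response_vectors = {
    "response_a": [2.0, 0.0, 0.0],  # same direction as the answer
    "response_b": [1.0, 1.0, 0.0],  # partially aligned
    "response_c": [0.0, 0.0, 3.0],  # orthogonal to the answer
}

# Most similar response first.
ranked = sorted(
    response_vectors,
    key=lambda name: cosine_similarity(answer_vector, response_vectors[name]),
    reverse=True,
)
print(ranked)
```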

Item Class

Individual item processor for detailed analysis:

  • clean(): Text cleaning with customizable options
  • tokenize(): Korean/English tokenization
  • jamoize(): Korean Jamo decomposition
  • formatize(): Prepare data for FastText training

Preprocessing Module

Advanced text preprocessing functions:

  • clean(): Multi-option text cleaning
  • tokenize(): Morphological analysis with PeCab
  • jamoize(): Korean character decomposition
  • formatize(): Data formatting for training
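
To make Jamo decomposition concrete, here is a standalone sketch that splits precomposed Hangul syllables into conjoining Jamo using the Unicode arithmetic for the Hangul Syllables block. The function name `decompose_jamo` is an assumption for this example. Whether `jamoize()` emits conjoining or compatibility Jamo, and how it treats non-Hangul characters, is not specified here.

```python
import unicodedata

HANGUL_BASE = 0xAC00      # first precomposed syllable, "가"
SYLLABLE_COUNT = 11172    # number of precomposed Hangul syllables
LEAD_BASE = 0x1100        # first leading consonant
VOWEL_BASE = 0x1161       # first vowel
TAIL_BASE = 0x11A7        # one before the first trailing consonant
VOWEL_COUNT = 21
TAIL_COUNT = 28           # includes the "no trailing consonant" slot


def decompose_jamo(text: str) -> str:
    """Split each Hangul syllable into lead, vowel and optional tail Jamo."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            lead, rest = divmod(offset, VOWEL_COUNT * TAIL_COUNT)
            vowel, tail = divmod(rest, TAIL_COUNT)
            out.append(chr(LEAD_BASE + lead))
            out.append(chr(VOWEL_BASE + vowel))
            if tail:  # tail == 0 means the syllable has no final consonant
                out.append(chr(TAIL_BASE + tail))
        else:
            out.append(ch)  # leave non-Hangul characters unchanged
    return "".join(out)


decomposed = decompose_jamo("흡수율")
# Cross-check against Unicode canonical decomposition (NFD).
assert decomposed == unicodedata.normalize("NFD", "흡수율")
print(len(decomposed))
```

Decomposing before training gives FastText sub-syllable units to build n-grams from, so responses that differ by a single consonant or vowel still share most of their features.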

Visualization Module

Interactive plotting with Plotly:

  • scatter(): Unified scatter plot function
  • Multiple plot types: simple, value count, labeled, combined
  • Customizable themes and color schemes

📊 Visualization Types

Simple Scatter Plot

Basic 2D visualization highlighting answers vs responses.

Value Count Scatter Plot

3D visualization showing response frequency on the z-axis.
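
The frequency dimension amounts to counting how often each distinct response occurs. A minimal sketch with the standard library follows. The repeated sample responses are invented for illustration, and how Simars itself derives the counts is an assumption.

```python
from collections import Counter

# Hypothetical raw responses to one item, with repeats.
raw_responses = ["허무", "공허", "허무", "무상", "허무", "공허"]

# Count occurrences of each distinct response.
counts = Counter(raw_responses)

# One record per distinct response; "count" is what a z-axis would encode.
points = [
    {"response": text, "count": count}
    for text, count in counts.most_common()
]
print(points)
```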

Labeled Scatter Plot

Color-coded visualization based on clustering results.

Combined Scatter Plot

3D visualization combining clustering and frequency information.

⚙️ Configuration

Simars uses JSON configuration files for customization:

  • color_schemes.json: Color themes for visualizations
  • plot_config.json: Plot layout and styling options
  • reduction_defaults.json: Default parameters for dimensionality reduction
  • fasttext_defaults.json: Default FastText training parameters
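
The schema of these files is not documented in this README. As a sketch of the general pattern, the example below layers user overrides on top of defaults. The `umap` key and its nested layout are assumptions, and the default values are borrowed from the Advanced Usage example rather than read from `reduction_defaults.json`.

```python
import json

# Hypothetical defaults; the real file contents may differ.
ASSUMED_DEFAULTS = {
    "umap": {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where overrides replace defaults, merging nested dicts."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# In practice this text would come from a JSON file on disk.
user_json = '{"umap": {"n_neighbors": 5}}'
config = merge_config(ASSUMED_DEFAULTS, json.loads(user_json))
print(config)
```

Merging recursively means an override file only has to list the values that change, and the defaults themselves are left unmodified.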

🧪 Testing

Run the test suite:

# All tests
pytest

# Unit tests only
pytest tests/unit/

# Integration tests only
pytest tests/integration/

# With coverage
pytest --cov=simars tests/

📚 Use Cases

  • Educational Assessment: Analyze student response patterns
  • Content Analysis: Group similar text responses
  • Quality Assurance: Identify outlier responses for review
  • Research: Study response similarity patterns in surveys
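
For the quality assurance case, HDBSCAN marks points that belong to no cluster with the label -1, so those responses are natural candidates for manual review. The sketch below filters on that label. The labels are made up for illustration, and `flag_outliers` is a hypothetical helper, not a Simars function.

```python
NOISE_LABEL = -1  # HDBSCAN's label for points assigned to no cluster


def flag_outliers(responses, labels):
    """Return the responses whose cluster label marks them as noise."""
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    return [
        response
        for response, label in zip(responses, labels)
        if label == NOISE_LABEL
    ]


responses = ["흡수율", "흡수", "반사율", "알베도"]
labels = [0, 0, 0, -1]  # invented cluster labels, one per response

needs_review = flag_outliers(responses, labels)
print(needs_review)
```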

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Hoon Kim - h000000nkim@gmail.com

📈 Roadmap

  • Support for additional languages
  • Web interface for easy usage
  • Additional clustering algorithms
  • Export functionality for results
  • Integration with popular ML frameworks
