FastText-based response similarity analyzer for human raters

Simars - FastText-based similarity analysis of answers & responses for human raters

Simars is a toolkit for analyzing the similarity of answers and responses with FastText embeddings. It is designed to support human raters in educational assessment and text analysis tasks.

🚀 Features

Text Preprocessing: Comprehensive Korean and English text cleaning and tokenization
FastText Integration: Training and fine-tuning of FastText models with Korean support
Dimensionality Reduction: Support for UMAP, PCA, and t-SNE algorithms
Clustering: HDBSCAN clustering for response grouping
Interactive Visualization: Plotly-based interactive scatter plots with multiple visualization modes
Jamo Processing: Decomposition of Korean Hangul syllables into Jamo (individual letters)

📦 Installation

pip install simars

# Install spaCy English model (required for text processing)
python -m spacy download en_core_web_sm

Development Installation

git clone https://github.com/h000000nkim/simars.git
cd simars
pip install -e ".[dev]"

🔧 Dependencies

Core: gensim, numpy, pandas, scikit-learn, umap-learn
NLP: jamo, pecab, spacy
Visualization: plotly
Clustering: hdbscan
Development: pytest, ruff, mkdocs-material
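Two of the packages above are imported under a different name than they are installed under (scikit-learn as sklearn, umap-learn as umap). The following is a minimal sketch, not part of simars, for checking which of the listed runtime dependencies are importable in the current environment. The helper name missing_dependencies is hypothetical.

```python
from importlib.util import find_spec

# PyPI distribution name -> import name for the runtime dependencies
# listed above. Two differ: scikit-learn -> sklearn, umap-learn -> umap.
IMPORT_NAMES = {
    "gensim": "gensim",
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "umap-learn": "umap",
    "jamo": "jamo",
    "pecab": "pecab",
    "spacy": "spacy",
    "plotly": "plotly",
    "hdbscan": "hdbscan",
}


def missing_dependencies(import_names=IMPORT_NAMES):
    """Return the distribution names whose import module cannot be found."""
    return sorted(
        dist for dist, module in import_names.items() if find_spec(module) is None
    )


missing = missing_dependencies()
print("missing:", ", ".join(missing) if missing else "none")
```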

📖 Quick Start

Basic Usage

import simars
import numpy as np

# Sample data
answers = np.array([["허무"], ["흡수율"], ["부사어"]])
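# The rows have different lengths, so dtype=object is required:
# NumPy 1.24+ raises ValueError for ragged input without it.
responses = np.array([
    ["허무", "공허", "무상", "허무감", "초월"],
    ["흡수율", "흡수", "반사율", "알베도"],
    ["부사어", "부사", "수식어", "부가어"]
], dtype=object)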
informations = np.array([
    "문학 문제에 대한 정서적 태도",
    "과학 제재의 핵심 개념",
    "문법 성분 분석"
])

# Initialize Simars
analyzer = simars.Fastrs(
    answers=answers,
    responses=responses,
    informations=informations
)

# Preprocess text data
analyzer.preprocess()

# Train FastText model
model = analyzer.train(
    vector_size=100,
    window=5,
    min_count=1,
    epochs=10
)

# Reduce dimensionality
coordinates = analyzer.reduce(method="umap", n_neighbors=5)

# Perform clustering
analyzer.hdbscanize()

# Visualize results
figures = analyzer.visualize()
for fig in figures:
    fig.show()

Advanced Usage with Custom Data Structure

# Using dictionary format
data = {
    "item1": {
        "answer": ["정답1"],
        "response": ["정답1", "오답1", "오답2"],
        "information": "문항 설명"
    },
    "item2": {
        "answer": ["정답2"],
        "response": ["정답2", "유사답", "오답"],
        "information": "다른 문항 설명"
    }
}

analyzer = simars.Fastrs(data=data)
analyzer.preprocess()

# Fine-tune existing model
pretrained_model = simars.util.get_pretrained_model()
analyzer.finetune(model=pretrained_model, epochs=5)

# Advanced reduction with custom parameters
coordinates = analyzer.reduce(
    method="umap",
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine"
)
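If the data already exists as three parallel sequences, as in Basic Usage, it can be reshaped into the dictionary layout shown above with plain Python. This is a sketch only: the helper to_item_dict is hypothetical, and the "item1", "item2" key pattern is copied from the example rather than confirmed as a requirement of simars.

```python
def to_item_dict(answers, responses, informations):
    """Zip three parallel sequences into the per-item dictionary layout."""
    if not (len(answers) == len(responses) == len(informations)):
        raise ValueError("answers, responses and informations must have equal length")
    return {
        # Keys follow the "item1", "item2" pattern of the example above.
        f"item{i}": {"answer": list(a), "response": list(r), "information": info}
        for i, (a, r, info) in enumerate(
            zip(answers, responses, informations), start=1
        )
    }


data = to_item_dict(
    answers=[["허무"], ["흡수율"]],
    responses=[["허무", "공허", "무상"], ["흡수율", "흡수"]],
    informations=["문학 문제에 대한 정서적 태도", "과학 제재의 핵심 개념"],
)
print(list(data))
```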

🛠️ Core Components

Fastrs Class

Main class for response similarity analysis:

preprocess(): Clean, tokenize, and prepare the text data
train(): Train a new FastText model from scratch
finetune(): Fine-tune an existing FastText model
reduce(): Reduce the dimensionality of the embeddings
hdbscanize(): Cluster the responses with HDBSCAN
visualize(): Create interactive visualizations
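Once responses are embedded as vectors, their closeness to the answer can be measured numerically. The sketch below uses cosine similarity, a common choice for embeddings, in pure Python. It does not call simars, the function names are hypothetical, and the two-dimensional vectors are made-up placeholders rather than real FastText output. Whether simars uses this exact measure internally is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def rank_by_similarity(answer_vec, response_vecs):
    """Sort (label, score) pairs by descending similarity to the answer."""
    scored = [
        (label, cosine_similarity(answer_vec, vec))
        for label, vec in response_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors for illustration only.
answer = [1.0, 0.0]
responses = {"close": [0.9, 0.1], "orthogonal": [0.0, 1.0], "opposite": [-1.0, 0.0]}
for label, score in rank_by_similarity(answer, responses):
    print(f"{label}: {score:.3f}")
```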

Item Class

Individual item processor for detailed analysis:

clean(): Text cleaning with customizable options
tokenize(): Korean/English tokenization
jamoize(): Korean Jamo decomposition
formatize(): Prepare data for FastText training
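Jamo decomposition splits each precomposed Hangul syllable into its initial consonant, vowel, and optional final consonant, which gives a subword model such as FastText smaller units to work with. The sketch below uses the standard Unicode arithmetic for the Hangul Syllables block (U+AC00 to U+D7A3). It illustrates the idea only and is not necessarily how jamoize() or the jamo package implement it. The function name decompose is hypothetical.

```python
# Compatibility Jamo in the order defined by the Unicode Hangul algorithm.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"     # 21 vowels
JONGSEONG = ["", *"ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"]  # none + 27 finals

HANGUL_BASE = 0xAC00
SYLLABLE_COUNT = 19 * 21 * 28  # 11172 precomposed syllables


def decompose(text):
    """Return a list of Jamo; non-Hangul characters pass through unchanged."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            out.append(CHOSEONG[offset // (21 * 28)])
            out.append(JUNGSEONG[(offset % (21 * 28)) // 28])
            final = JONGSEONG[offset % 28]
            if final:
                out.append(final)
        else:
            out.append(ch)
    return out


print(decompose("허무"))  # ['ㅎ', 'ㅓ', 'ㅁ', 'ㅜ']
```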

Preprocessing Module

Advanced text preprocessing functions:

clean(): Multi-option text cleaning
tokenize(): Morphological analysis with PeCab
jamoize(): Korean character decomposition
formatize(): Data formatting for training
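To make the cleaning step concrete, here is a minimal sketch of option-driven text cleaning using only the standard library. The function name clean_text and its two options are hypothetical and do not reflect the actual signature or option set of clean() in simars.

```python
import re
import unicodedata


def clean_text(text, lowercase=True, strip_punctuation=True):
    """Normalize, optionally lowercase and de-punctuate, then tidy whitespace."""
    # NFC keeps Hangul syllables in their precomposed form.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # \w matches Unicode word characters, so Hangul is preserved.
        text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("  Hello,   World! "))  # hello world
print(clean_text("허무감!!"))             # 허무감
```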

Visualization Module

Interactive plotting with Plotly:

scatter(): Unified scatter plot function
Multiple plot types: simple, value count, labeled, combined
Customizable themes and color schemes

📊 Visualization Types

Simple Scatter Plot

Basic 2D visualization highlighting answers vs responses.

Value Count Scatter Plot

3D visualization showing response frequency on the z-axis.

Labeled Scatter Plot

Color-coded visualization based on clustering results.

Combined Scatter Plot

3D visualization combining clustering and frequency information.
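The frequency axis in the value count and combined plots rests on counting how often each distinct response occurs. A minimal sketch of that counting step with the standard library follows. It does not use simars or Plotly, the helper names are hypothetical, and the 2D coordinates are made-up placeholders standing in for the output of a dimensionality reduction.

```python
from collections import Counter


def value_counts(responses):
    """Return (response, count) pairs, most frequent first."""
    return Counter(responses).most_common()


def add_frequency_axis(coords_2d, responses):
    """Attach each response's count as a third coordinate (x, y, z)."""
    counts = Counter(responses)
    return {label: (x, y, counts[label]) for label, (x, y) in coords_2d.items()}


raw = ["허무", "공허", "허무", "무상", "허무", "공허"]
print(value_counts(raw))

# Placeholder 2D coordinates for illustration only.
coords = {"허무": (0.0, 0.0), "공허": (0.4, 0.1), "무상": (1.2, -0.3)}
print(add_frequency_axis(coords, raw))
```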

⚙️ Configuration

simars uses JSON configuration files for customization:

color_schemes.json: Color themes for visualizations
plot_config.json: Plot layout and styling options
reduction_defaults.json: Default parameters for dimensionality reduction
fasttext_defaults.json: Default FastText training parameters
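The contents of these files are not shown here, so the following is only a sketch of the general pattern of layering user overrides on top of JSON defaults. The keys are borrowed from the train() call in Quick Start, but the real layout of fasttext_defaults.json is an assumption, and load_params is a hypothetical helper rather than a simars function.

```python
import json

# Values mirror the Quick Start train() call; the file layout is assumed.
DEFAULTS = {"vector_size": 100, "window": 5, "min_count": 1, "epochs": 10}


def load_params(json_text, defaults=DEFAULTS):
    """Merge overrides given as JSON text into a copy of the defaults."""
    overrides = json.loads(json_text) if json_text.strip() else {}
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise KeyError(f"unknown parameter(s): {', '.join(unknown)}")
    return {**defaults, **overrides}


params = load_params('{"epochs": 5, "window": 3}')
print(params)
```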

🧪 Testing

Run the test suite:

# All tests
pytest

# Unit tests only
pytest tests/unit/

# Integration tests only
pytest tests/integration/

# With coverage
pytest --cov=simars tests/

📚 Use Cases

Educational Assessment: Analyze student response patterns
Content Analysis: Group similar text responses
Quality Assurance: Identify outlier responses for review
Research: Study response similarity patterns in surveys

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Hoon Kim - h000000nkim@gmail.com

📈 Roadmap

Support for additional languages
Web interface for easy usage
Additional clustering algorithms
Export functionality for results
Integration with popular ML frameworks

Release history

0.1.0 (this version), released Sep 11, 2025

Download files

Source Distribution

simars-0.1.0.tar.gz (12.8 MB)

Uploaded Sep 11, 2025 Source

Built Distribution

simars-0.1.0-py3-none-any.whl (18.7 kB)

Uploaded Sep 11, 2025 Python 3

File details

Details for the file simars-0.1.0.tar.gz.

File metadata

Download URL: simars-0.1.0.tar.gz
Upload date: Sep 11, 2025
Size: 12.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for simars-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2f3b0055cca7e0ff0c53f49bf7a0c58854dc4b3e019fce40f942b22b48e483ef`
MD5	`08b560e822ab9c282ce9e0a317a4621b`
BLAKE2b-256	`f55b143dd9535ebd0755551189b3097ddc3b28f02c3d60ea6fb96fc0e353b16e`

File details

Details for the file simars-0.1.0-py3-none-any.whl.

File metadata

Download URL: simars-0.1.0-py3-none-any.whl
Upload date: Sep 11, 2025
Size: 18.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for simars-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e37ceabb49cdea2c3c8887b50682ff1293e44ff8e3834b9e09632d52d764ccaf`
MD5	`4e605ed914ed767f01a5cb0babd9732c`
BLAKE2b-256	`7d0a07ad92e14c99ae0e3b97a6abc77f8278955cd267daf2085147d70a8268d7`
