A toolkit for evaluating the culture of MLX large language models (LLMs) on the CD Eval benchmark.
CultureKit
Note: This repository is currently in alpha testing. Features and APIs may change without notice.
A toolkit for evaluating the culture of Large Language Models (LLMs) on the CD Eval benchmark. Supports MLX, Azure OpenAI, and Azure Foundry models.
Overview
CultureKit provides tools and utilities for evaluating how cultural biases and perspectives are reflected in large language models (LLMs). The toolkit focuses on measuring and analyzing model responses against the CD Eval benchmark, which tests models on cultural dimensions.
Features
- Multiple Model Support: Works with MLX models, Azure OpenAI, and Azure Foundry models
- Comprehensive Evaluation: Tools for scoring models against the CD Eval benchmark
- Result Visualization: Notebook for analyzing and visualizing evaluation results
- CLI: Command line interface for easy model evaluation
Installation
From PyPI
pip install culturekit
Note: MLX dependencies are primarily designed for macOS/Apple Silicon. On other platforms, MLX functionality will be disabled, but Azure-based models will still work.
Using uv
# Clone the repository
git clone https://github.com/decisions-lab/culturekit.git
cd culturekit
# Install with uv
uv sync
# Or install with dev dependencies
uv sync --extra dev
Using pip from source
# Clone the repository
git clone https://github.com/decisions-lab/culturekit.git
cd culturekit
# Install with pip
pip install -e .
Quick Start
CultureKit comes with a CLI for easy model evaluation:
Evaluating Models
# Run evaluation on an MLX model (macOS only)
uv run python -m culturekit eval --model "mlx-community/Qwen1.5-0.5B-MLX" --model_type mlx
# Run evaluation on an Azure OpenAI model
uv run python -m culturekit eval --model "gpt-4o-mini" --model_type azure_openai --azure_deployment "deployment-name"
# Run evaluation on an Azure Foundry model
uv run python -m culturekit eval --model "foundry-model" --model_type azure_foundry
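When several models need to be evaluated in one go, the commands above can be assembled from Python. This is a minimal sketch that uses only the flags shown in this README (`--model`, `--model_type`, `--azure_deployment`); the helper name `build_eval_command` is hypothetical and not part of CultureKit.

```python
import subprocess


def build_eval_command(model, model_type, azure_deployment=None):
    """Build the argument list for one `culturekit eval` invocation."""
    cmd = [
        "uv", "run", "python", "-m", "culturekit", "eval",
        "--model", model,
        "--model_type", model_type,
    ]
    # Only the Azure OpenAI example in this README passes a deployment name.
    if azure_deployment is not None:
        cmd += ["--azure_deployment", azure_deployment]
    return cmd


def run_eval(model, model_type, azure_deployment=None):
    """Run the evaluation and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_eval_command(model, model_type, azure_deployment), check=True
    )


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    jobs = [
        ("mlx-community/Qwen1.5-0.5B-MLX", "mlx", None),
        ("gpt-4o-mini", "azure_openai", "deployment-name"),
        ("foundry-model", "azure_foundry", None),
    ]
    for job in jobs:
        print(" ".join(build_eval_command(*job)))
```

Building the command as a list rather than a single string avoids shell quoting problems with model names that contain slashes or spaces.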
Scoring Results
# Generate scoring
uv run python -m culturekit score --responses_path "results.jsonl" --output_path "scores.json"
Note: If you've activated the virtual environment (source .venv/bin/activate), you can use python directly instead of uv run python.
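Before passing a responses file to `score`, it can help to confirm that the file parses. The sketch below assumes only what the `.jsonl` extension conventionally implies, namely one JSON object per line. The fields inside each record are not documented here, so the code does not inspect them, and `load_jsonl` is a hypothetical helper rather than a CultureKit function.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a truncated run is easy to spot.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return records


if __name__ == "__main__":
    results = Path("results.jsonl")
    if results.exists():
        print(f"{len(load_jsonl(results))} records parsed from {results}")
```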
Environment Setup
For Azure OpenAI and Azure Foundry models, you need to set up environment variables. Create a .env file in the src/culturekit directory:
# Azure OpenAI Configuration
OPENAI_API_VERSION=2023-03-15-preview
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=deployment_name
# Azure Foundry Configuration
AZURE_FOUNDRY_ENDPOINT=https://your-foundry-endpoint.models.ai.azure.com
AZURE_API_KEY=your_api_key
See the Environment Setup guide for more details.
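A missing variable is easier to diagnose before an evaluation starts than halfway through it. The following sketch reports which of the variables listed above are unset for a given model type. It assumes every variable in each group is required, which this README does not state explicitly, and it reads the current process environment rather than parsing the .env file. The names `REQUIRED_ENV` and `missing_env_vars` are hypothetical.

```python
import os

# Variable names copied from the configuration listing above; treating all of
# them as mandatory is an assumption.
REQUIRED_ENV = {
    "azure_openai": (
        "OPENAI_API_VERSION",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_DEPLOYMENT",
    ),
    "azure_foundry": (
        "AZURE_FOUNDRY_ENDPOINT",
        "AZURE_API_KEY",
    ),
}


def missing_env_vars(model_type, env=None):
    """Return the names that are unset or empty for this model type."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV.get(model_type, ()) if not env.get(name)]


if __name__ == "__main__":
    for model_type in REQUIRED_ENV:
        missing = missing_env_vars(model_type)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{model_type}: {status}")
```

Model types without an entry in the table, such as `mlx`, yield an empty list because no variables are listed for them.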
Documentation
For more detailed information, see the documentation:
Dataset
The toolkit uses the CD Eval benchmark for evaluating cultural dimensions in LLMs. The dataset includes diverse scenarios representing different cultural perspectives and contexts.
Development
Prerequisites
- Python 3.11+
- uv (install with curl -LsSf https://astral.sh/uv/install.sh | sh)
Setup Development Environment
# Clone the repository
git clone https://github.com/decisions-lab/culturekit.git
cd culturekit
# Install dependencies (including dev dependencies)
uv sync --extra dev
Using uv
Here are the essential uv commands for working with this repository:
Installing Dependencies
# Install all dependencies (including dev dependencies)
uv sync --extra dev
# Install only production dependencies
uv sync
# Update all dependencies to latest compatible versions
uv sync --upgrade --extra dev
Running Commands
# Run Python scripts in the virtual environment
uv run python -m culturekit eval --model "model-name" --model_type mlx
# Run CLI commands
uv run culturekit --help
# Run tests
uv run pytest
# Run linting/formatting
uv run black .
uv run isort .
uv run flake8 .
uv run mypy .
Managing Dependencies
# Add a new dependency
uv add package-name
# Add a dev dependency
uv add --dev package-name
# Add a dependency with version constraint
uv add "package-name>=1.0.0"
# Remove a dependency
uv remove package-name
# Update a specific package
uv sync --upgrade-package package-name
Building and Publishing
# Build the package
uv build
# Publish to PyPI (requires authentication)
uv publish
Other Useful Commands
# Show installed packages
uv pip list
# Show dependency tree
uv tree
# Activate the virtual environment (if needed)
source .venv/bin/activate # On macOS/Linux
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
- Thanks to Apple's MLX team for their excellent machine learning framework
- CD Eval benchmark creators for providing a standard for cultural dimensions evaluation
Citation
@software{culturekit2025,
author = {Devansh Gandhi},
title = {CultureKit: A toolkit for evaluating the culture of MLX large language models},
year = {2025},
url = {https://github.com/decisions-lab/culturekit},
version = {0.0.1}
}
File details
Details for the file culturekit-0.0.2.tar.gz.
File metadata
- Download URL: culturekit-0.0.2.tar.gz
- Upload date:
- Size: 558.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 512793bcab79fa59aa3fae78a65f2c33bf42b2522af05578144b64457ad5f27a |
| MD5 | f78275df2c029fe3bc7a2ae16126c50d |
| BLAKE2b-256 | 0ad857504ff3410224ac3b8c6d627b1f6f49e6c52ab6e7c867f2f02715ec16ac |
File details
Details for the file culturekit-0.0.2-py3-none-any.whl.
File metadata
- Download URL: culturekit-0.0.2-py3-none-any.whl
- Upload date:
- Size: 13.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72c84378c37c1a92757e1a35fdb30730da91a6982e9bc5ad066e2f2477f510a9 |
| MD5 | bc5fa90f4895fa1cc561c1cfc0d2cac5 |
| BLAKE2b-256 | 6c9f9876246668baf3f3678999f42d3f83266375f592c1150d2ee7436021fb27 |