A toolkit for evaluating the culture of MLX large language models (LLMs) on the CD Eval benchmark.

CultureKit

Python 3.11+ | uv | License: MIT | Status: Alpha

Note: This repository is currently in alpha testing. Features and APIs may change without notice.

A toolkit for evaluating the culture of Large Language Models (LLMs) on the CD Eval benchmark. Supports MLX, Azure OpenAI, and Azure Foundry models.

Overview

CultureKit provides tools for evaluating how cultural biases and perspectives are reflected in the responses of large language models (LLMs). It measures and analyzes those responses against the CD Eval benchmark, which tests models on cultural dimensions.

Features

  • Multiple Model Support: Works with MLX models, Azure OpenAI, and Azure Foundry models
  • Comprehensive Evaluation: Tools for scoring models against the CD Eval benchmark
  • Result Visualization: Notebook for analyzing and visualizing evaluation results
  • CLI: Command line interface for easy model evaluation

Installation

From PyPI

pip install culturekit

Note: MLX dependencies are primarily designed for macOS/Apple Silicon. On other platforms, MLX functionality will be disabled, but Azure-based models will still work.

Using uv

# Clone the repository
git clone https://github.com/decisions-lab/culturekit.git
cd culturekit

# Install with uv
uv sync

# Or install with dev dependencies
uv sync --extra dev

Using pip from source

# Clone the repository
git clone https://github.com/decisions-lab/culturekit.git
cd culturekit

# Install with pip
pip install -e .

Quick Start

CultureKit comes with a CLI for easy model evaluation:

Evaluating Models

# Run evaluation on an MLX model (macOS only)
uv run python -m culturekit eval --model "mlx-community/Qwen1.5-0.5B-MLX" --model_type mlx

# Run evaluation on an Azure OpenAI model
uv run python -m culturekit eval --model "gpt-4o-mini" --model_type azure_openai --azure_deployment "deployment-name"

# Run evaluation on an Azure Foundry model
uv run python -m culturekit eval --model "foundry-model" --model_type azure_foundry

Scoring Results

# Generate scoring
uv run python -m culturekit score --responses_path "results.jsonl" --output_path "scores.json"

Note: If you've activated the virtual environment (source .venv/bin/activate), you can use python directly instead of uv run python.

Environment Setup

For Azure OpenAI and Azure Foundry models, you need to set up environment variables. Create a .env file in the src/culturekit directory:

# Azure OpenAI Configuration
OPENAI_API_VERSION=2023-03-15-preview
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=deployment_name

# Azure Foundry Configuration
AZURE_FOUNDRY_ENDPOINT=https://your-foundry-endpoint.models.ai.azure.com
AZURE_API_KEY=your_api_key

See the Environment Setup guide for more details.

Documentation

For more detailed information, see the documentation:

Dataset

The toolkit uses the CD Eval benchmark for evaluating cultural dimensions in LLMs. The dataset includes diverse scenarios representing different cultural perspectives and contexts.

Development

Prerequisites

  • Python 3.11+
  • uv (install with curl -LsSf https://astral.sh/uv/install.sh | sh)

Setup Development Environment

# Clone the repository
git clone https://github.com/decisions-lab/culturekit.git
cd culturekit

# Install dependencies (including dev dependencies)
uv sync --extra dev

Using uv

Here are the essential uv commands for working with this repository:

Installing Dependencies

# Install all dependencies (including dev dependencies)
uv sync --extra dev

# Install only production dependencies
uv sync

# Update all dependencies to latest compatible versions
uv sync --upgrade --extra dev

Running Commands

# Run the CultureKit module inside the project's virtual environment
uv run python -m culturekit eval --model "model-name" --model_type mlx

# Run CLI commands
uv run culturekit --help

# Run tests
uv run pytest

# Run linting/formatting
uv run black .
uv run isort .
uv run flake8 .
uv run mypy .

Managing Dependencies

# Add a new dependency
uv add package-name

# Add a dev dependency
uv add --dev package-name

# Add a dependency with version constraint
uv add "package-name>=1.0.0"

# Remove a dependency
uv remove package-name

# Update a specific package
uv sync --upgrade-package package-name

Building and Publishing

# Build the package
uv build

# Publish to PyPI (requires authentication)
uv publish

Other Useful Commands

# Show installed packages
uv pip list

# Show dependency tree
uv tree

# Activate the virtual environment (if needed)
source .venv/bin/activate  # On macOS/Linux

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • Thanks to Apple's MLX team for their excellent machine learning framework
  • CD Eval benchmark creators for providing a standard for cultural dimensions evaluation

Citation

@software{culturekit2025,
  author = {Devansh Gandhi},
  title = {CultureKit: A toolkit for evaluating the culture of MLX large language models},
  year = {2025},
  url = {https://github.com/decisions-lab/culturekit},
  version = {0.0.1}
}
