Data science MCP server with an Apache Arrow-based cache for efficient data handling

Arrow Cache MCP

A powerful data science toolkit and MCP server leveraging Apache Arrow for efficient memory management and data processing.

Features

  • High Performance Data Handling: Efficient memory management using Apache Arrow
  • Intelligent Memory Management: Automatic partitioning and compression of large datasets
  • Multiple File Format Support: Load datasets from CSV, Parquet, Arrow, Feather, JSON, Excel, and more
  • SQL Query Capabilities: Advanced SQL analytics with DuckDB integration
  • Visualization Support: Create plots and charts from your datasets
  • AI-Powered Data Analysis: Optional Claude integration for natural language queries
  • Memory Overflow Protection: Automatic spilling to disk when memory limits are reached

Components

Resources

The server exposes cached datasets as resources:

  • Custom arrowcache:// URI scheme for accessing datasets
  • Each dataset resource has detailed metadata about size, shape, and structure

Tools

The server implements the following tools:

  • run_sql_query: Execute SQL queries against cached datasets

    • Use _cache_<dataset_name> syntax in FROM clause
  • load_dataset: Load a dataset from a file or URL into the cache

    • Support for many common file formats
  • get_dataset_sample: Get a sample of rows from a dataset

    • Useful for previewing large datasets
  • get_dataset_info: Get detailed information about a dataset

    • Schema, row counts, column statistics, etc.
  • remove_dataset: Remove a dataset from the cache

  • create_plot: Create visualizations from datasets

    • Support for various plot types (line, bar, scatter, etc.)
  • get_memory_usage: Get detailed memory usage statistics
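To make the _cache_<dataset_name> convention concrete, here is a minimal sketch that builds an MCP tools/call request for run_sql_query. The JSON-RPC envelope follows the Model Context Protocol, but the argument key "query" and the helper names are assumptions; the tool's actual input schema, as reported by the server, is authoritative.

```python
def cache_table(dataset_name: str) -> str:
    """Return the table name used in FROM clauses: _cache_<dataset_name>."""
    return f"_cache_{dataset_name}"


def build_sql_tool_call(dataset_name: str, request_id: int = 1) -> dict:
    """Build an MCP tools/call request for run_sql_query.

    The argument key "query" is an assumption; check the tool's input
    schema (for example via tools/list) for the real parameter names.
    """
    sql = f"SELECT COUNT(*) AS n_rows FROM {cache_table(dataset_name)}"
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "run_sql_query",
            "arguments": {"query": sql},
        },
    }
```

For a dataset cached under the name nyc_taxi, the generated statement reads from _cache_nyc_taxi.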

Installation

pip install arrow-cache-mcp

With optional dependencies:

# For geospatial data support
pip install "arrow-cache-mcp[geospatial]"

# For enhanced visualization
pip install "arrow-cache-mcp[viz]"

# For all features
pip install "arrow-cache-mcp[geospatial,viz]"

Configuration

Environment Variables

  • ARROW_CACHE_MEMORY_LIMIT: Maximum memory usage in bytes (default: 4GB)
  • ARROW_CACHE_SPILL_DIRECTORY: Directory for spilling data to disk when memory limit is reached
  • ARROW_CACHE_SPILL_TO_DISK: Whether to allow spilling to disk (true/false)
  • ANTHROPIC_API_KEY: API key for Claude integration (optional)
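The variables above could be read along the following lines. This is only a sketch of one plausible parsing scheme, not the package's own code: the CacheSettings class, the load_settings helper, the interpretation of 4GB as 4 GiB, and the choice of "true" as the default for ARROW_CACHE_SPILL_TO_DISK are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the documented "4GB" default is read as 4 GiB.
DEFAULT_MEMORY_LIMIT = 4 * 1024**3


@dataclass
class CacheSettings:
    memory_limit: int                 # bytes
    spill_directory: Optional[str]    # None when unset
    spill_to_disk: bool
    anthropic_api_key: Optional[str]  # None disables the optional Claude integration


def load_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Read the documented variables; the parsing rules here are assumptions."""
    # Assumption: spilling is enabled unless explicitly set to something else.
    raw_flag = env.get("ARROW_CACHE_SPILL_TO_DISK", "true")
    return CacheSettings(
        memory_limit=int(env.get("ARROW_CACHE_MEMORY_LIMIT", DEFAULT_MEMORY_LIMIT)),
        spill_directory=env.get("ARROW_CACHE_SPILL_DIRECTORY"),
        spill_to_disk=raw_flag.strip().lower() == "true",
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),
    )
```

Passing the environment as a mapping makes the function easy to exercise with a plain dictionary instead of the real process environment.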

Quickstart

Configure in Claude Desktop

On macOS: ~/Library/Application\ Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%/Claude/claude_desktop_config.json

Development/Unpublished Servers Configuration

"mcpServers": {
  "arrow-cache-mcp": {
    "command": "uv",
    "args": [
      "--directory",
      "/PATH/TO/arrow-cache-mcp",
      "run",
      "arrow-cache-mcp"
    ]
  }
}

Published Servers Configuration

"mcpServers": {
  "arrow-cache-mcp": {
    "command": "uvx",
    "args": [
      "arrow-cache-mcp"
    ]
  }
}
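
Editing the configuration file by hand risks clobbering other entries or leaving invalid JSON. The sketch below merges a server entry into the "mcpServers" object while preserving everything else in the file. The add_server helper is hypothetical, and the sketch assumes the file, if present, contains a single JSON object.

```python
import json
from pathlib import Path


def add_server(config_path: Path, name: str, command: str, args: list) -> dict:
    """Merge one server entry into the "mcpServers" object of a config file."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)
    # Create "mcpServers" if missing, then add or overwrite this one entry.
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the published configuration, a call such as add_server(path, "arrow-cache-mcp", "uvx", ["arrow-cache-mcp"]) would do, where path points at the claude_desktop_config.json location for your platform.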

Example Usage

  1. Load a dataset:
     I'd like to load the NYC Yellow Taxi dataset from January 2023
  2. Query the data:
     How many taxi trips were there per day of the week?
  3. Create a visualization:
     Create a bar chart showing average fare amount by day of week

Development

Building and Publishing

To prepare the package for distribution:

  1. Sync dependencies and update the lockfile:
uv sync
  2. Build package distributions:
uv build

This will create source and wheel distributions in the dist/ directory.

  3. Publish to PyPI:
uv publish

Note: You'll need to set PyPI credentials via environment variables or command flags.

Debugging

Since MCP servers run over stdio, debugging can be challenging. For the best debugging experience, we recommend using the MCP Inspector.

You can launch the MCP Inspector with npx using this command:

npx @modelcontextprotocol/inspector uv --directory /PATH/TO/arrow-cache-mcp run arrow-cache-mcp

Upon launching, the Inspector will display a URL that you can access in your browser to begin debugging.

Architecture

Arrow Cache MCP builds on these key components:

  1. Arrow Cache: A memory-managed caching system for Arrow tables
  2. DuckDB: A high-performance analytical database
  3. PyArrow: Python bindings for Apache Arrow
  4. MCP Protocol: Model Context Protocol for AI agent integration

The system combines automatic partitioning, compression, and spilling to disk so that datasets exceeding the configured memory limit can still be processed while reducing the risk of out-of-memory errors.
