DataScience MCP with Apache Arrow-based cache for efficient data handling
Arrow Cache MCP
A powerful data science toolkit and MCP server leveraging Apache Arrow for efficient memory management and data processing.
Features
- High Performance Data Handling: Efficient memory management using Apache Arrow
- Intelligent Memory Management: Automatic partitioning and compression of large datasets
- Multiple File Format Support: Load datasets from CSV, Parquet, Arrow, Feather, JSON, Excel, and more
- SQL Query Capabilities: Advanced SQL analytics with DuckDB integration
- Visualization Support: Create plots and charts from your datasets
- AI-Powered Data Analysis: Optional Claude integration for natural language queries
- Memory Overflow Protection: Automatic spilling to disk when memory limits are reached
Components
Resources
The server exposes cached datasets as resources:
- Custom arrowcache:// URI scheme for accessing datasets
- Each dataset resource has detailed metadata about size, shape, and structure
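As a rough illustration of how a client might take such a URI apart, the sketch below uses Python's standard urllib.parse. The layout arrowcache://<dataset_name> and the helper name dataset_from_uri are assumptions made for this example; the exact URI structure is not documented here.

```python
from urllib.parse import urlparse


def dataset_from_uri(uri: str) -> str:
    """Return the dataset name, assuming the form arrowcache://<dataset_name>."""
    parsed = urlparse(uri)
    if parsed.scheme != "arrowcache":
        raise ValueError(f"not an arrowcache URI: {uri!r}")
    # Assumption: the dataset name sits in the host position; fall back to the path.
    name = parsed.netloc or parsed.path.lstrip("/")
    if not name:
        raise ValueError(f"no dataset name in URI: {uri!r}")
    return name


print(dataset_from_uri("arrowcache://yellow_taxi"))
```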
Tools
The server implements the following tools:
- run_sql_query: Execute SQL queries against cached datasets
  - Use _cache_<dataset_name> syntax in FROM clause
- load_dataset: Load a dataset from a file or URL into the cache
  - Support for many common file formats
- get_dataset_sample: Get a sample of rows from a dataset
  - Useful for previewing large datasets
- get_dataset_info: Get detailed information about a dataset
  - Schema, row counts, column statistics, etc.
- remove_dataset: Remove a dataset from the cache
- create_plot: Create visualizations from datasets
  - Support for various plot types (line, bar, scatter, etc.)
- get_memory_usage: Get detailed memory usage statistics
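To make the tool interface more concrete, here is a minimal sketch of the JSON-RPC message an MCP client would send to invoke run_sql_query. The tools/call envelope follows the Model Context Protocol; the argument name "query", the dataset name yellow_taxi, and the column tpep_pickup_datetime are assumptions for illustration and may not match the server's actual input schema.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Assumption: a dataset was loaded under the name "yellow_taxi", so SQL sees
# it as _cache_yellow_taxi. The timestamp column name is also assumed.
sql = (
    "SELECT dayname(tpep_pickup_datetime) AS weekday, COUNT(*) AS trips "
    "FROM _cache_yellow_taxi GROUP BY weekday ORDER BY trips DESC"
)

# Assumption: the tool accepts its SQL text in an argument called "query".
request = build_tool_call(1, "run_sql_query", {"query": sql})
print(request)
```

In practice an MCP client library builds this envelope for you; the point is only that the dataset name appears in SQL with the _cache_ prefix.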
Installation
pip install arrow-cache-mcp
With optional dependencies:
# For geospatial data support
pip install "arrow-cache-mcp[geospatial]"
# For enhanced visualization
pip install "arrow-cache-mcp[viz]"
# For all features
pip install "arrow-cache-mcp[geospatial,viz]"
Configuration
Environment Variables
- ARROW_CACHE_MEMORY_LIMIT: Maximum memory usage in bytes (default: 4GB)
- ARROW_CACHE_SPILL_DIRECTORY: Directory for spilling data to disk when memory limit is reached
- ARROW_CACHE_SPILL_TO_DISK: Whether to allow spilling to disk (true/false)
- ANTHROPIC_API_KEY: API key for Claude integration (optional)
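The sketch below shows one way these variables could be read and interpreted. It is not the package's own loader: the CacheSettings class and load_settings function are hypothetical, "4GB" is read as 4 GiB, and spilling is assumed to be enabled unless the flag is explicitly set to false.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_MEMORY_LIMIT = 4 * 1024**3  # assumption: "4GB" interpreted as 4 GiB


@dataclass
class CacheSettings:
    memory_limit: int
    spill_directory: Optional[str]
    spill_to_disk: bool


def load_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Build settings from environment-style key/value pairs."""
    raw_limit = env.get("ARROW_CACHE_MEMORY_LIMIT")
    memory_limit = int(raw_limit) if raw_limit else DEFAULT_MEMORY_LIMIT
    # Assumption: spilling stays on unless the variable is explicitly "false".
    spill_flag = env.get("ARROW_CACHE_SPILL_TO_DISK", "true").strip().lower()
    return CacheSettings(
        memory_limit=memory_limit,
        spill_directory=env.get("ARROW_CACHE_SPILL_DIRECTORY"),
        spill_to_disk=spill_flag == "true",
    )


# Example: a 2 GiB limit with spilling turned off.
settings = load_settings(
    {
        "ARROW_CACHE_MEMORY_LIMIT": str(2 * 1024**3),
        "ARROW_CACHE_SPILL_TO_DISK": "false",
    }
)
print(settings)
```

Note that the memory limit is given in bytes, so a 2 GiB limit is written as 2147483648 rather than "2GB".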
Quickstart
Configure in Claude Desktop
On macOS: ~/Library/Application\ Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%/Claude/claude_desktop_config.json
Development/Unpublished Servers Configuration
"mcpServers": {
  "arrow-cache-mcp": {
    "command": "uv",
    "args": [
      "--directory",
      "/PATH/TO/arrow-cache-mcp",
      "run",
      "arrow-cache-mcp"
    ]
  }
}
Published Servers Configuration
"mcpServers": {
  "arrow-cache-mcp": {
    "command": "uvx",
    "args": [
      "arrow-cache-mcp"
    ]
  }
}
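The fragments above belong inside the top-level JSON object of claude_desktop_config.json. If you would rather not edit the file by hand, a small script can merge the entry while keeping any other settings. This is a minimal sketch: the add_server helper is hypothetical, and the demo writes to a temporary file rather than the real configuration path.

```python
import json
import tempfile
from pathlib import Path


def add_server(config_path: Path, name: str, command: str, args: list) -> dict:
    """Add or replace one entry under "mcpServers", keeping other settings."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demo against a scratch file; point config_path at the real file to apply it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "claude_desktop_config.json"
    result = add_server(demo_path, "arrow-cache-mcp", "uvx", ["arrow-cache-mcp"])
    written = json.loads(demo_path.read_text(encoding="utf-8"))

print(json.dumps(result, indent=2))
```

Restart Claude Desktop after changing the file so the new server entry is picked up.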
Example Usage
- Load a dataset:
I'd like to load the NYC Yellow Taxi dataset from January 2023
- Query the data:
How many taxi trips were there per day of the week?
- Create a visualization:
Create a bar chart showing average fare amount by day of week
Development
Building and Publishing
To prepare the package for distribution:
- Sync dependencies and update lockfile:
uv sync
- Build package distributions:
uv build
This will create source and wheel distributions in the dist/ directory.
- Publish to PyPI:
uv publish
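Note: You'll need to set PyPI credentials via environment variables or command flags:
- Token: --token or UV_PUBLISH_TOKEN
- Or username/password: --username/UV_PUBLISH_USERNAME and --password/UV_PUBLISH_PASSWORD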
Debugging
Since MCP servers run over stdio, debugging can be challenging. For the best debugging experience, we recommend using the MCP Inspector.
You can launch the MCP Inspector via npm with this command:
npx @modelcontextprotocol/inspector uv --directory /PATH/TO/arrow-cache-mcp run arrow-cache-mcp
Upon launching, the Inspector will display a URL that you can access in your browser to begin debugging.
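Because stdout carries the protocol messages on a stdio transport, stray print calls can corrupt the stream, so diagnostic output is usually sent to stderr instead. The following is a minimal sketch using Python's standard logging module; the logger name is hypothetical and not taken from the package.

```python
import logging
import sys


def configure_logging(level: int = logging.DEBUG) -> logging.Logger:
    """Return a logger that writes to stderr, leaving stdout for the protocol."""
    logger = logging.getLogger("arrow_cache_mcp.debug")  # hypothetical name
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records away from any stdout root handler
    return logger


log = configure_logging()
log.debug("loading dataset %s", "yellow_taxi")
```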
Architecture
Arrow Cache MCP builds on these key components:
- Arrow Cache: A memory-managed caching system for Arrow tables
- DuckDB: A high-performance analytical database
- PyArrow: Python bindings for Apache Arrow
- MCP Protocol: Model Context Protocol for AI agent integration
The system combines automatic partitioning, compression, and spilling to disk so that it can handle datasets larger than the configured memory limit while reducing the risk of out-of-memory errors.
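To illustrate the spilling idea in isolation, here is a toy cache that keeps values in memory up to a byte budget and moves the least recently used ones to disk. It is not Arrow Cache's implementation: the SpillingCache class, the LRU policy, and the use of raw bytes in place of Arrow tables are all assumptions made to keep the sketch self-contained.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path


class SpillingCache:
    """Toy cache: holds values in memory up to a byte budget, spills the rest."""

    def __init__(self, memory_limit: int, spill_dir: str):
        self.memory_limit = memory_limit
        self.spill_dir = Path(spill_dir)
        self.in_memory = OrderedDict()  # key -> bytes, least recently used first
        self.on_disk = {}               # key -> Path of the spilled file
        self.used = 0                   # bytes currently held in memory

    def put(self, key: str, value: bytes) -> None:
        self.remove(key)
        self.in_memory[key] = value
        self.used += len(value)
        self._spill_if_needed()

    def get(self, key: str) -> bytes:
        if key in self.in_memory:
            self.in_memory.move_to_end(key)  # mark as recently used
            return self.in_memory[key]
        path = self.on_disk.pop(key)         # raises KeyError for unknown keys
        value = path.read_bytes()
        path.unlink()
        self.put(key, value)                 # reload; others may spill in turn
        return value

    def remove(self, key: str) -> None:
        if key in self.in_memory:
            self.used -= len(self.in_memory.pop(key))
        elif key in self.on_disk:
            self.on_disk.pop(key).unlink()

    def _spill_if_needed(self) -> None:
        # Evict least recently used entries until the budget is met, always
        # keeping the newest entry in memory. Keys are assumed filename-safe.
        while self.used > self.memory_limit and len(self.in_memory) > 1:
            key, value = self.in_memory.popitem(last=False)
            path = self.spill_dir / f"{key}.spill"
            path.write_bytes(value)
            self.on_disk[key] = path
            self.used -= len(value)


with tempfile.TemporaryDirectory() as tmp:
    cache = SpillingCache(memory_limit=10, spill_dir=tmp)
    cache.put("a", b"12345678")   # 8 bytes, fits within the budget
    cache.put("b", b"abcdefgh")   # 16 bytes in total, so "a" is spilled
    spilled = sorted(cache.on_disk)
    restored = cache.get("a")     # "a" is reloaded and "b" is spilled instead
    spilled_after = sorted(cache.on_disk)

print(spilled, restored, spilled_after)
```

A real system would spill partitions of a table rather than whole values, which is where automatic partitioning comes in: smaller units can be evicted and reloaded without moving the entire dataset.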