High-performance ODBC to Arrow data connector for InterBase/Firebird

ibarrow

High-performance ODBC to Arrow data conversion for Python, built with Rust.

Features

  • 🚀 High Performance: Built with Rust for maximum speed
  • 🔄 ODBC Integration: Direct connection to any ODBC-compatible database
  • 📊 Arrow Format: Native Apache Arrow support for efficient data processing
  • 🐼 Pandas/Polars Ready: Seamless integration with popular Python data libraries
  • 🛡️ Type Safe: Rust-powered reliability with Python convenience
  • 🎯 Two-Level API: Simple wrappers for common use + raw functions for advanced control

Installation

pip install ibarrow

Repository

Prerequisites

Important: You need an ODBC driver installed on your system for ibarrow to work.

Windows

Linux

  • SQL Server: Microsoft ODBC Driver for SQL Server on Linux
  • PostgreSQL: sudo apt-get install odbc-postgresql (Ubuntu/Debian) or sudo yum install postgresql-odbc (RHEL/CentOS)
  • MySQL: sudo apt-get install libmyodbc (Ubuntu/Debian) or sudo yum install mysql-connector-odbc (RHEL/CentOS)

macOS

  • Note: macOS support is currently not available. Please use Windows or Linux for now.

Verify ODBC Installation

You can verify your ODBC drivers are installed by checking the system:

Windows:

# Open the ODBC Data Source Administrator, then check the Drivers tab
odbcad32.exe

Linux:

# List available drivers
odbcinst -q -d
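
The same check can be scripted from Python. The sketch below is only an illustration and is not part of ibarrow: the helper names `parse_odbcinst_output` and `list_odbc_drivers` are hypothetical, and it assumes the unixODBC `odbcinst` tool prints one bracketed driver name per line.

```python
import shutil
import subprocess


def parse_odbcinst_output(text):
    """Extract driver names from `odbcinst -q -d` output such as '[Firebird]'."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            names.append(line[1:-1])
    return names


def list_odbc_drivers():
    """Return installed driver names, or an empty list if odbcinst is not on PATH."""
    if shutil.which("odbcinst") is None:
        return []
    result = subprocess.run(
        ["odbcinst", "-q", "-d"], capture_output=True, text=True, check=False
    )
    return parse_odbcinst_output(result.stdout)
```

Calling `list_odbc_drivers()` before `ibarrow.connect()` gives an early, readable failure when no driver is registered.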

API Architecture

ibarrow provides a two-level API designed for different user needs:

🎯 High-Level API (Recommended for 95% of users)

  • query_polars(): Direct Polars DataFrame (zero-copy, fastest)
  • query_pandas(): Direct Pandas DataFrame (maximum compatibility)

🔧 Low-Level API (For advanced users)

  • query_arrow_ipc(): Raw Arrow IPC bytes (maximum compatibility)
  • query_arrow_c_data(): Raw Arrow C Data Interface (maximum performance)

📋 When to Use Each Level

User Type            | Recommended Function | Use Case
Beginners            | query_polars()       | 95% of cases - simple and fast
Pandas Users         | query_pandas()       | When you need Pandas compatibility
Advanced Users       | query_arrow_ipc()    | When you need raw Arrow data
Performance Critical | query_arrow_c_data() | When you need maximum control

Quick Start

🚀 Recommended Usage (95% of cases)

import ibarrow

# Direct Polars DataFrame (zero-copy, fastest)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Query and get Polars DataFrame
df = conn.query_polars("SELECT * FROM your_table")

print(df)

With Custom Batch Size

import ibarrow

# Create config with custom batch size
config = ibarrow.QueryConfig(batch_size=2000)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Query with custom batch size
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

Advanced Configuration

import ibarrow

# Create custom configuration
config = ibarrow.QueryConfig(
    batch_size=2000,           # Rows per batch
    read_only=True,            # Read-only connection
    connection_timeout=30,      # Connection timeout in seconds
    query_timeout=60,          # Query timeout in seconds
    max_text_size=32768,       # Max text field size
    max_binary_size=16384,     # Max binary field size
    isolation_level="read_committed"  # Transaction isolation (lowercase name, as listed under QueryConfig)
)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Use the connection
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

Direct DataFrame Integration

import ibarrow

# Direct conversion to Polars DataFrame (uses pl.read_ipc internally)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame
df_polars = conn.query_polars("SELECT * FROM your_table")

# Get Pandas DataFrame
df_pandas = conn.query_pandas("SELECT * FROM your_table")

print(df_polars)
print(df_pandas)

⚡ Zero-Copy Performance (Arrow C Data Interface)

For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:

import ibarrow
import polars as pl
import pyarrow as pa

# Zero-copy conversion to Polars DataFrame (fastest)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame directly
df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)

# Or get raw PyCapsules for manual control
schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")

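# Convert to a PyArrow Table using zero-copy.
# PyArrow's _import_from_c expects raw integer addresses; PyCapsules go
# through _import_from_c_capsule (PyArrow 14+). This assumes the two
# capsules describe a struct array, i.e. one record batch.
batch = pa.RecordBatch._import_from_c_capsule(schema_capsule, array_capsule)
table = pa.Table.from_batches([batch])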

# Convert to Polars
df = pl.from_arrow(table)

Arrow C Data Interface Benefits:

  • 🚀 Zero serialization: Data passes directly via pointers
  • 💾 Zero copies: Eliminates memory overhead
  • ⚡ Maximum speed: Ideal for large datasets
  • 🔄 Compatibility: Works with PyArrow, Polars, Pandas

Manual Arrow IPC Usage

import ibarrow
import polars as pl

# Get raw Arrow IPC bytes
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Arrow IPC bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Convert to Polars DataFrame manually
df = pl.read_ipc(arrow_bytes)
print(df)

API Reference

ibarrow.connect(dsn, user, password, config=None)

Creates a connection object for database operations.

Parameters:

  • dsn (str): ODBC Data Source Name
  • user (str): Database username
  • password (str): Database password
  • config (QueryConfig, optional): Configuration object

Returns: IbarrowConnection object
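
The examples in this README hard-code credentials for brevity. As a sketch only, the arguments to `connect()` could instead be read from environment variables. The helper `connection_kwargs_from_env` and the variable names `IBARROW_DSN`, `IBARROW_USER`, and `IBARROW_PASSWORD` are hypothetical; ibarrow itself does not read them.

```python
import os


def connection_kwargs_from_env(env=None):
    """Build keyword arguments for ibarrow.connect() from environment variables.

    The variable names below are illustrative assumptions, not an ibarrow feature.
    """
    env = os.environ if env is None else env
    required = ("IBARROW_DSN", "IBARROW_USER", "IBARROW_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "dsn": env["IBARROW_DSN"],
        "user": env["IBARROW_USER"],
        "password": env["IBARROW_PASSWORD"],
    }


# Usage (requires ibarrow and a configured DSN):
#   import ibarrow
#   conn = ibarrow.connect(**connection_kwargs_from_env())
```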

conn.query_arrow_ipc(sql)

Execute a SQL query and return Arrow IPC bytes.

Parameters:

  • sql (str): SQL query to execute

Returns: bytes - Arrow IPC format data

Raises:

  • PyConnectionError: Database connection issues
  • PySQLError: SQL syntax or execution errors
  • PyArrowError: Arrow data processing errors

conn.query_polars(sql)

Execute a SQL query and return a Polars DataFrame directly.

Parameters: Same as query_arrow_ipc

Returns: polars.DataFrame - Ready-to-use DataFrame

Note: Uses pl.read_ipc() directly with bytes for optimal performance.

conn.query_pandas(sql)

Execute a SQL query and return a Pandas DataFrame directly.

Parameters: Same as query_arrow_ipc

Returns: pandas.DataFrame - Ready-to-use DataFrame

Note: Converts Arrow IPC to Pandas via PyArrow for compatibility.

QueryConfig

Configuration class for advanced query settings.

Parameters:

  • batch_size (int, optional): Number of rows per batch for processing (default: 1000)
  • read_only (bool, optional): Read-only connection to avoid locks (default: True)
  • connection_timeout (int, optional): Connection timeout in seconds
  • query_timeout (int, optional): Query timeout in seconds
  • max_text_size (int, optional): Maximum text field size in bytes (default: 65536)
  • max_binary_size (int, optional): Maximum binary field size in bytes (default: 65536)
  • isolation_level (str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"

Configuration Benefits

  • batch_size: Controls memory usage and performance. Larger batches use more memory per fetch but need fewer round trips to the driver
  • read_only: Prevents locks and improves performance for read-only operations (effective only if ODBC driver supports this flag)
  • connection_timeout: Protects against hanging connections
  • query_timeout: Prevents long-running queries from blocking
  • max_text_size: Handles large text fields (VARCHAR, TEXT) efficiently
  • max_binary_size: Handles large binary data (BLOB, VARBINARY) efficiently
  • isolation_level: Controls transaction isolation for concurrent access
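
To get a feel for how batch_size, max_text_size, and max_binary_size interact, here is a back-of-the-envelope estimate. It rests on an assumption this README does not confirm: that each batch reserves room for batch_size rows, with every text or binary column sized at its configured maximum. The function `estimate_batch_bytes` is hypothetical and only does arithmetic.

```python
def estimate_batch_bytes(batch_size, fixed_width_bytes, text_columns=0,
                         binary_columns=0, max_text_size=65536,
                         max_binary_size=65536):
    """Rough upper bound on buffer bytes for one batch.

    Assumes (not confirmed by ibarrow) that variable-width columns are
    reserved at their configured maximum size for every row in the batch.
    """
    per_row = (fixed_width_bytes
               + text_columns * max_text_size
               + binary_columns * max_binary_size)
    return batch_size * per_row


# Two 8-byte integer columns plus one text column capped at 32768 bytes,
# fetched 2000 rows at a time: 2000 * (16 + 32768) bytes.
print(estimate_batch_bytes(2000, fixed_width_bytes=16, text_columns=1,
                           max_text_size=32768))  # 65568000
```

Under this assumption, lowering max_text_size for tables with short strings reduces the estimate far more than lowering batch_size does.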

Implementation Notes

  • read_only: Currently implemented via ODBC connection string (ReadOnly=1).
  • batch_size: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.
  • query_timeout: Implemented via statement handle using stmt.set_query_timeout(), which is more reliable than connection string timeouts.
  • isolation_level: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted").
  • query_polars: Optimized to use pl.read_ipc() directly with bytes, avoiding the overhead of io.BytesIO wrapper for better performance.
  • Native Types: ODBC native types (INT, DECIMAL, FLOAT) are always preserved as Arrow native types (Int64Array, DecimalArray, Float64Array), avoiding expensive string conversions for maximum performance.
  • Pipelining: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).
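
The isolation-level mapping described above can be pictured with a small sketch. Only the pair "read_committed" to "Isolation Level=ReadCommitted" comes from this README; the CamelCase form of the other values is extrapolated from that one example, and the function name `isolation_fragment` is hypothetical. Actual connection-string keywords are driver-specific.

```python
SUPPORTED_ISOLATION_LEVELS = (
    "read_uncommitted",
    "read_committed",
    "repeatable_read",
    "serializable",
    "snapshot",
)


def isolation_fragment(level):
    """Map a common isolation-level name to a connection-string fragment.

    Illustrative only: the CamelCase values are an assumption extrapolated
    from the single documented example.
    """
    name = level.strip().lower()
    if name not in SUPPORTED_ISOLATION_LEVELS:
        raise ValueError(f"unsupported isolation level: {level!r}")
    camel = "".join(part.capitalize() for part in name.split("_"))
    return f"Isolation Level={camel}"


print(isolation_fragment("read_committed"))  # Isolation Level=ReadCommitted
```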

Performance Comparison

Serialization vs Zero-Copy

Method             | Level | Serialization    | Memory Copies      | Performance | Ideal Use
query_polars       | High  | Zero             | Zero               | ⭐⭐⭐⭐⭐ | 95% of cases (recommended)
query_pandas       | High  | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Pandas compatibility
query_arrow_ipc    | Low   | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Maximum compatibility
query_arrow_c_data | Low   | Zero             | Zero               | ⭐⭐⭐⭐⭐ | Maximum performance

Typical Benchmarks (1M rows)

query_polars:         ~30ms   (zero-copy + direct conversion) ⭐ RECOMMENDED
query_pandas:         ~600ms  (serialization + pyarrow + pandas)
query_arrow_ipc:      ~500ms  (serialization + deserialization)
query_arrow_c_data:   ~50ms   (zero-copy via pointers)
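
Figures like these depend heavily on the driver, network, and schema, so it may be worth measuring on your own data. A minimal timing harness, assuming nothing about ibarrow beyond the method names listed above (`best_time` is a hypothetical helper):

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock duration of fn() in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Usage (requires an open ibarrow connection named `conn`):
#   sql = "SELECT * FROM your_table"
#   for name in ("query_polars", "query_pandas", "query_arrow_ipc"):
#       method = getattr(conn, name)
#       print(name, best_time(lambda: method(sql)))
```

Taking the minimum over several runs reduces noise from caching and other processes, though it still includes the database's own query time.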

🚀 Built-in Performance Optimizations

Native Types (Always Enabled):

- INT columns → Int64Array (zero-copy)
- DECIMAL columns → DecimalArray (zero-copy)  
- FLOAT columns → Float64Array (zero-copy)
- Performance: ~30ms for 1M numeric rows

Pipelining (Always Enabled):

- Memory usage: Constant (~10MB) regardless of dataset size
- Processing: Streaming (fetch + write immediately)
- Latency: Lower (Python can start consuming data before completion)
- Example: 80GB dataset uses only ~10MB RAM
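
The constant-memory claim follows from the streaming pattern itself: only one batch is held at a time. The sketch below models that pattern in plain Python. It is not ibarrow's Rust implementation; `fetch_batches` stands in for ODBC fetches and `write` for the Arrow writer.

```python
def fetch_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def stream(rows, batch_size, write):
    """Hand each batch to `write` as soon as it is fetched.

    Returns (total_rows, peak_rows_held) to show that the peak never
    exceeds batch_size, however many rows flow through.
    """
    total = peak = 0
    for batch in fetch_batches(rows, batch_size):
        peak = max(peak, len(batch))
        write(batch)
        total += len(batch)
    return total, peak


print(stream(range(10_500), 1000, write=lambda batch: None))  # (10500, 1000)
```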

When to Use Each Method

🎯 High-Level API (Recommended)

  • query_polars(): 95% of cases - Simple, fast, zero-copy
  • query_pandas(): When you need Pandas compatibility

🔧 Low-Level API (Advanced)

  • query_arrow_ipc(): Maximum compatibility, save to disk
  • query_arrow_c_data(): Maximum performance, full control over data
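
If the output format is chosen at runtime, for example from a configuration file, a small dispatcher keeps the choice in one place. This is a sketch: the helper `run_query` and the labels "polars", "pandas", "ipc", and "c_data" are hypothetical, while the method names are the ones documented above.

```python
_METHODS = {
    "polars": "query_polars",
    "pandas": "query_pandas",
    "ipc": "query_arrow_ipc",
    "c_data": "query_arrow_c_data",
}


def run_query(conn, sql, output="polars"):
    """Call the connection method that matches the requested output label."""
    try:
        method_name = _METHODS[output]
    except KeyError:
        raise ValueError(
            f"unknown output {output!r}; expected one of {sorted(_METHODS)}"
        ) from None
    return getattr(conn, method_name)(sql)


# Usage (requires an open ibarrow connection named `conn`):
#   df = run_query(conn, "SELECT * FROM your_table", output="pandas")
```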

Error Handling

The library provides specific exception types for different error scenarios:

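import ibarrow

# Placeholder connection details and query; replace with your own
dsn, user, password = "your_dsn", "username", "password"
sql = "SELECT * FROM your_table"

try:
    # Create connection
    conn = ibarrow.connect(dsn, user, password)

    # Run the query and get a Polars DataFrame
    df = conn.query_polars(sql)
except ibarrow.PyConnectionError as e:
    print(f"Connection failed: {e}")
except ibarrow.PySQLError as e:
    print(f"SQL error: {e}")
except ibarrow.PyArrowError as e:
    print(f"Arrow processing error: {e}")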
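
Because connection failures are reported separately from SQL errors, transient connection problems can be retried without masking bad queries. The helper below is a generic sketch, not an ibarrow feature: `with_retries` is a hypothetical name, and the exception classes to retry on are passed in by the caller.

```python
import time


def with_retries(fn, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on an exception listed in `retry_on`, back off and try again.

    Delays double after each failure (base_delay, 2*base_delay, ...).
    The last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage (requires ibarrow and a configured DSN):
#   import ibarrow
#   conn = with_retries(
#       lambda: ibarrow.connect("your_dsn", "username", "password"),
#       retry_on=(ibarrow.PyConnectionError,),
#   )
```

Leaving `PySQLError` out of `retry_on` means a syntax error fails immediately instead of being retried.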

Requirements

  • Python 3.8+
  • ODBC driver for your database
  • Rust toolchain (for development)

Development

Setup

# Clone the repository
git clone https://github.com/thomazyujibaba/ibarrow.git
cd ibarrow

# Install maturin
pip install "maturin[patchelf]"

# Install in development mode
maturin develop

Running Tests

# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/ -v

Building

# Build wheel
maturin build --release

# Build and install
maturin develop

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

Common ODBC Issues

"Driver not found" errors:

  • Ensure the ODBC driver is properly installed
  • Check that the driver name in your DSN matches exactly
  • Verify the driver architecture (32-bit vs 64-bit) matches your Python installation
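
To find out which architecture your Python interpreter uses, the standard library is enough. The function name `python_bitness` is just an illustrative label.

```python
import platform
import struct


def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


print(f"Python is {python_bitness()}-bit on {platform.machine()}")
```

A 64-bit interpreter needs a 64-bit ODBC driver and a DSN registered in the 64-bit ODBC administrator.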

Connection timeout errors:

  • Check network connectivity to the database server
  • Verify firewall settings
  • Ensure the database server is running and accessible

Permission errors:

  • Verify database credentials
  • Check user permissions on the database
  • Ensure the ODBC driver has necessary privileges

Performance issues:

  • Adjust batch_size in QueryConfig for optimal memory usage
  • Use read_only=True for read-only operations
  • Consider connection pooling for high-frequency queries
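
ibarrow does not document a built-in pool, so pooling would have to live in application code. The sketch below shows one way to do it with the standard library. It assumes connection objects can be reused across queries, which this README does not state, and the `ConnectionPool` class is hypothetical.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that lends out pre-created connections."""

    def __init__(self, factory, size=2):
        # `factory` is any zero-argument callable that returns a connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection; it returns to the pool when the block exits."""
        conn = self._idle.get(timeout=timeout)  # raises queue.Empty on timeout
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Usage (requires ibarrow and a configured DSN):
#   import ibarrow
#   pool = ConnectionPool(
#       lambda: ibarrow.connect("your_dsn", "username", "password"), size=4
#   )
#   with pool.connection() as conn:
#       df = conn.query_polars("SELECT * FROM your_table")
```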

Support

For issues and questions, please use the GitHub Issues page.
