Skip to main content

High-performance ODBC to Arrow data connector for InterBase/Firebird

Project description

ibarrow

High-performance ODBC to Arrow data conversion for Python, built with Rust.

Features

  • 🚀 High Performance: Built with Rust for maximum speed
  • 🔄 ODBC Integration: Direct connection to any ODBC-compatible database
  • 📊 Arrow Format: Native Apache Arrow support for efficient data processing
  • 🐼 Pandas/Polars Ready: Seamless integration with popular Python data libraries
  • 🛡️ Type Safe: Rust-powered reliability with Python convenience
  • 🎯 Two-Level API: Simple wrappers for common use + raw functions for advanced control

Installation

pip install ibarrow

Repository

Prerequisites

Important: You need an ODBC driver installed on your system for ibarrow to work.

Windows

Linux

  • SQL Server: Microsoft ODBC Driver for SQL Server on Linux
  • PostgreSQL: sudo apt-get install odbc-postgresql (Ubuntu/Debian) or sudo yum install postgresql-odbc (RHEL/CentOS)
  • MySQL: sudo apt-get install libmyodbc (Ubuntu/Debian) or sudo yum install mysql-connector-odbc (RHEL/CentOS)

macOS

  • Note: macOS support is currently not available. Please use Windows or Linux for now.

Verify ODBC Installation

You can verify your ODBC drivers are installed by checking the system:

Windows:

# Check installed drivers
odbcad32.exe

Linux/macOS:

# List available drivers
odbcinst -q -d

API Architecture

ibarrow provides a two-level API designed for different user needs:

🎯 High-Level API (Recommended for 95% of users)

  • query_polars(): Direct Polars DataFrame (zero-copy, fastest)
  • query_pandas(): Direct Pandas DataFrame (maximum compatibility)

🔧 Low-Level API (For advanced users)

  • query_arrow_ipc(): Raw Arrow IPC bytes (maximum compatibility)
  • query_arrow_c_data(): Raw Arrow C Data Interface (maximum performance)

📋 When to Use Each Level

User Type Recommended Function Use Case
Beginners query_polars() 95% of cases - simple and fast
Pandas Users query_pandas() When you need Pandas compatibility
Advanced Users query_arrow_ipc() When you need raw Arrow data
Performance Critical query_arrow_c_data() When you need maximum control

Quick Start

🔗 Connection Methods

ibarrow supports two ways to connect to databases:

Method 1: DSN (Data Source Name)

# Requires pre-configured DSN in ODBC Data Sources
conn = ibarrow.connect(
    dsn="my_database_dsn",
    user="username", 
    password="password"
)

Method 2: Direct Connection String (Recommended)

# Direct connection like pyodbc - no DSN configuration needed
conn = ibarrow.connect(
    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
    user="username",
    password="password"
)

🚀 Recommended Usage (95% of cases)

import ibarrow

# Option 1: Using DSN (Data Source Name)
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Option 2: Using direct connection string (like pyodbc)
conn = ibarrow.connect(
    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
    user="username",
    password="password"
)

# Query and get Polars DataFrame
df = conn.query_polars("SELECT * FROM your_table")

print(df)

With Custom Batch Size

import ibarrow

# Create config with custom batch size
config = ibarrow.QueryConfig(batch_size=2000)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Query with custom batch size
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

Advanced Configuration

import ibarrow

# Create custom configuration
config = ibarrow.QueryConfig(
    batch_size=2000,           # Rows per batch
    read_only=True,            # Read-only connection
    connection_timeout=30,      # Connection timeout in seconds
    query_timeout=60,          # Query timeout in seconds
    max_text_size=32768,       # Max text field size
    max_binary_size=16384,     # Max binary field size
    isolation_level="READ_COMMITTED"  # Transaction isolation
)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Use the connection
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

Direct DataFrame Integration

import ibarrow

# Direct conversion to Polars DataFrame (uses pl.read_ipc internally)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame
df_polars = conn.query_polars("SELECT * FROM your_table")

# Get Pandas DataFrame
df_pandas = conn.query_pandas("SELECT * FROM your_table")

print(df_polars)
print(df_pandas)

⚡ Zero-Copy Performance (Arrow C Data Interface)

For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:

import ibarrow
import polars as pl
import pyarrow as pa

# Zero-copy conversion to Polars DataFrame (fastest)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame directly
df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)

# Or get raw PyCapsules for manual control
schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")

# Convert to PyArrow Table using zero-copy
schema = pa.Schema._import_from_c(schema_capsule)
array = pa.Array._import_from_c(array_capsule)
table = pa.Table.from_arrays([array], schema=schema)

# Convert to Polars
df = pl.from_arrow(table)

Arrow C Data Interface Benefits:

  • 🚀 Zero serialization: Data passes directly via pointers
  • 💾 Zero copies: Eliminates memory overhead
  • Maximum speed: Ideal for large datasets
  • 🔄 Compatibility: Works with PyArrow, Polars, Pandas

Manual Arrow IPC Usage

import ibarrow
import polars as pl

# Get raw Arrow IPC bytes
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Arrow IPC bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Convert to Polars DataFrame manually
df = pl.read_ipc(arrow_bytes)
print(df)

API Reference

ibarrow.connect(dsn, user, password, config=None)

Creates a connection object for database operations.

Parameters:

  • dsn (str): ODBC Data Source Name or full connection string
    • DSN format: "your_dsn" (requires pre-configured DSN)
    • Connection string format: "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;" (direct connection)
  • user (str): Database username
  • password (str): Database password
  • config (QueryConfig, optional): Configuration object

Returns: IbarrowConnection object

Connection String Examples:

# SQL Server
dsn = "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"

# PostgreSQL
dsn = "DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;DATABASE=mydb;"

# MySQL
dsn = "DRIVER={MySQL ODBC 8.0 Driver};SERVER=localhost;PORT=3306;DATABASE=mydb;"

# Oracle
dsn = "DRIVER={Oracle in OraClient19Home1};DBQ=localhost:1521/XE;"

query_arrow_ipc(sql)

Execute a SQL query and return Arrow IPC bytes.

Parameters:

  • sql (str): SQL query to execute

Returns: bytes - Arrow IPC format data

Raises:

  • PyConnectionError: Database connection issues
  • PySQLError: SQL syntax or execution errors
  • PyArrowError: Arrow data processing errors

conn.query_polars(sql)

Execute a SQL query and return a Polars DataFrame directly.

Parameters: Same as query_arrow_ipc

Returns: polars.DataFrame - Ready-to-use DataFrame

Note: Uses pl.read_ipc() directly with bytes for optimal performance.

query_pandas(sql)

Execute a SQL query and return a Pandas DataFrame directly.

Parameters: Same as query_arrow_ipc

Returns: pandas.DataFrame - Ready-to-use DataFrame

Note: Converts Arrow IPC to Pandas via PyArrow for compatibility.

QueryConfig

Configuration class for advanced query settings.

Parameters:

  • batch_size (int, optional): Number of rows per batch for processing (default: 1000)
  • read_only (bool, optional): Read-only connection to avoid locks (default: True)
  • connection_timeout (int, optional): Connection timeout in seconds
  • query_timeout (int, optional): Query timeout in seconds
  • max_text_size (int, optional): Maximum text field size in bytes (default: 65536)
  • max_binary_size (int, optional): Maximum binary field size in bytes (default: 65536)
  • isolation_level (str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"

Configuration Benefits

  • batch_size: Controls memory usage and performance. Larger batches = more memory but faster processing
  • read_only: Prevents locks and improves performance for read-only operations (effective only if ODBC driver supports this flag)
  • connection_timeout: Protects against hanging connections
  • query_timeout: Prevents long-running queries from blocking
  • max_text_size: Handles large text fields (VARCHAR, TEXT) efficiently
  • max_binary_size: Handles large binary data (BLOB, VARBINARY) efficiently
  • isolation_level: Controls transaction isolation for concurrent access

Implementation Notes

  • read_only: Currently implemented via ODBC connection string (ReadOnly=1).
  • batch_size: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.
  • query_timeout: Implemented via statement handle using stmt.set_query_timeout(), which is more reliable than connection string timeouts.
  • isolation_level: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted").
  • query_polars: Uses Arrow IPC stream with pl.read_ipc() for maximum compatibility and performance.
  • Native Types: Always preserves ODBC native types (INT, DECIMAL, FLOAT) as Arrow native types (Int64Array, Float64Array), avoiding expensive string conversions for maximum performance.
  • Pipelining: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).

Performance Comparison

Serialization vs Zero-Copy

Method Level Serialization Memory Copies Performance Ideal Use
query_polars High Arrow IPC Stream 1x (serialization) ⭐⭐⭐⭐ 95% of cases (recommended)
query_pandas High Arrow IPC Stream 1x (serialization) ⭐⭐⭐ Pandas compatibility
query_arrow_ipc Low Arrow IPC Stream 1x (serialization) ⭐⭐⭐ Maximum compatibility
query_arrow_c_data Low Zero Zero ⭐⭐⭐⭐⭐ Maximum performance

Typical Benchmarks (1M rows)

query_polars:         ~200ms  (Arrow IPC + polars.read_ipc) ⭐ RECOMMENDED
query_pandas:         ~600ms  (Arrow IPC + pyarrow + pandas)
query_arrow_ipc:      ~500ms  (Arrow IPC serialization)
query_arrow_c_data:   ~50ms   (zero-copy via pointers)

🚀 Built-in Performance Optimizations

Native Types (Always Enabled):

- INT columns → Int64Array (native Arrow types)
- DECIMAL columns → DecimalArray (native Arrow types)  
- FLOAT columns → Float64Array (native Arrow types)
- Performance: ~200ms for 1M numeric rows (Arrow IPC)

Pipelining (Always Enabled):

- Memory usage: Constant (~10MB) regardless of dataset size
- Processing: Streaming (fetch + write immediately)
- Latency: Lower (Python can start consuming data before completion)
- Example: 80GB dataset uses only ~10MB RAM

When to Use Each Method

🎯 High-Level API (Recommended)

  • query_polars(): 95% of cases - Simple, fast, zero-copy
  • query_pandas(): When you need Pandas compatibility

🔧 Low-Level API (Advanced)

  • query_arrow_ipc(): Maximum compatibility, save to disk
  • query_arrow_c_data(): Maximum performance, full control over data

Error Handling

The library provides specific exception types for different error scenarios:

import ibarrow

try:
    # Create connection
    conn = ibarrow.connect(dsn, user, password)
    
    # Query with batch size
    df = conn.query_polars(sql)
except ibarrow.PyConnectionError as e:
    print(f"Connection failed: {e}")
except ibarrow.PySQLError as e:
    print(f"SQL error: {e}")
except ibarrow.PyArrowError as e:
    print(f"Arrow processing error: {e}")

Requirements

  • Python 3.8+
  • ODBC driver for your database
  • Rust toolchain (for development)

Development

Setup

# Clone the repository
git clone https://github.com/thomazyujibaba/ibarrow.git
cd ibarrow

# Install maturin
pip install maturin[patchelf]

# Install in development mode
maturin develop

Running Tests

# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/ -v

Building

# Build wheel
maturin build --release

# Build and install
maturin develop

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

Common ODBC Issues

"Driver not found" errors:

  • Ensure the ODBC driver is properly installed
  • Check that the driver name in your DSN matches exactly
  • Verify the driver architecture (32-bit vs 64-bit) matches your Python installation

Connection timeout errors:

  • Check network connectivity to the database server
  • Verify firewall settings
  • Ensure the database server is running and accessible

Permission errors:

  • Verify database credentials
  • Check user permissions on the database
  • Ensure the ODBC driver has necessary privileges

Performance issues:

  • Adjust batch_size in QueryConfig for optimal memory usage
  • Use read_only=True for read-only operations
  • Consider connection pooling for high-frequency queries

Support

For issues and questions, please use the GitHub Issues page.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ibarrow-0.1.9.tar.gz (50.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ibarrow-0.1.9-cp312-cp312-manylinux_2_34_x86_64.whl (994.2 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

File details

Details for the file ibarrow-0.1.9.tar.gz.

File metadata

  • Download URL: ibarrow-0.1.9.tar.gz
  • Upload date:
  • Size: 50.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for ibarrow-0.1.9.tar.gz
Algorithm Hash digest
SHA256 7e95d14241d04011628e1a6e835f995d8109083c4489fb4f84a0b450dd79c525
MD5 8774d66af586f3a69fc4531090ab9c97
BLAKE2b-256 02fcbf3284be2e1e279acb4696a83a75416325465bdc91d73027116b796e2eeb

See more details on using hashes here.

File details

Details for the file ibarrow-0.1.9-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ibarrow-0.1.9-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 62694426e02eb9587930f87dc6b73d3b51ac984cd97c5d6f063b1d7ad3244ad8
MD5 164d8a88f8e6454808db718f133ea48b
BLAKE2b-256 d66b71dd1edbec1f9e6e6f7e8a4ccd71036fb912d74571d311c1ab38bc3b82b5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page