High-performance ODBC to Arrow data connector for InterBase/Firebird

ibarrow

High-performance ODBC to Arrow data conversion for Python, built with Rust.

Features

🚀 High Performance: Built with Rust for maximum speed
🔄 ODBC Integration: Direct connection to any ODBC-compatible database
📊 Arrow Format: Native Apache Arrow support for efficient data processing
🐼 Pandas/Polars Ready: Seamless integration with popular Python data libraries
🛡️ Type Safe: Rust-powered reliability with Python convenience
🎯 Two-Level API: Simple wrappers for common use + raw functions for advanced control

Installation

pip install ibarrow

Repository

GitHub: https://github.com/thomazyujibaba/ibarrow
PyPI: https://pypi.org/project/ibarrow/
Documentation: https://github.com/thomazyujibaba/ibarrow#readme

Prerequisites

Important: You need an ODBC driver installed on your system for ibarrow to work.

Windows

SQL Server: ODBC Driver for SQL Server
PostgreSQL: psqlODBC
MySQL: MySQL Connector/ODBC
Oracle: Oracle Instant Client + ODBC

Linux

SQL Server: Microsoft ODBC Driver for SQL Server on Linux
PostgreSQL: sudo apt-get install odbc-postgresql (Ubuntu/Debian) or sudo yum install postgresql-odbc (RHEL/CentOS)
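MySQL: sudo apt-get install libmyodbc (older Ubuntu/Debian releases; newer releases no longer ship this package, so install MySQL Connector/ODBC from MySQL's own downloads instead) or sudo yum install mysql-connector-odbc (RHEL/CentOS)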

macOS

Note: macOS support is currently not available. Please use Windows or Linux for now.

Verify ODBC Installation

You can verify your ODBC drivers are installed by checking the system:

Windows:

# Open the ODBC Data Source Administrator and check the Drivers tab
odbcad32.exe

Linux:

# List available drivers
odbcinst -q -d
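
The same check can be scripted. The following is a minimal sketch, assuming unixODBC's odbcinst is on the PATH and prints one bracketed driver name per line; the helper names parse_odbcinst_drivers and list_odbc_drivers are hypothetical and not part of ibarrow.

```python
import shutil
import subprocess
from typing import List


def parse_odbcinst_drivers(output: str) -> List[str]:
    """Extract driver names from `odbcinst -q -d` output such as "[PostgreSQL Unicode]"."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        # Driver sections are printed as "[Driver Name]"; skip everything else.
        if line.startswith("[") and line.endswith("]"):
            names.append(line[1:-1])
    return names


def list_odbc_drivers() -> List[str]:
    """Return installed driver names, or an empty list if odbcinst is not available."""
    if shutil.which("odbcinst") is None:
        return []
    result = subprocess.run(
        ["odbcinst", "-q", "-d"], capture_output=True, text=True, check=False
    )
    return parse_odbcinst_drivers(result.stdout)
```

An empty list means either that no driver is registered or that odbcinst itself is missing, so treat it as a prompt to check the installation rather than as proof.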

API Architecture

ibarrow provides a two-level API designed for different user needs:

🎯 High-Level API (Recommended for 95% of users)

query_polars(): Direct Polars DataFrame (zero-copy, fastest)
query_pandas(): Direct Pandas DataFrame (maximum compatibility)

🔧 Low-Level API (For advanced users)

query_arrow_ipc(): Raw Arrow IPC bytes (maximum compatibility)
query_arrow_c_data(): Raw Arrow C Data Interface (maximum performance)

📋 When to Use Each Level

| User Type | Recommended Function | Use Case |
|---|---|---|
| Beginners | `query_polars()` | 95% of cases - simple and fast |
| Pandas Users | `query_pandas()` | When you need Pandas compatibility |
| Advanced Users | `query_arrow_ipc()` | When you need raw Arrow data |
| Performance Critical | `query_arrow_c_data()` | When you need maximum control |
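
If the output format is chosen at runtime, the mapping above can be wrapped in a small dispatcher. This is only a sketch: run_query and its output keys ("polars", "pandas", "ipc", "c_data") are hypothetical names, and it assumes a connection object that exposes the four methods listed in this README.

```python
def run_query(conn, sql, output="polars"):
    """Call the query method that matches the requested output format."""
    # Hypothetical keys on the left, method names from this README on the right.
    methods = {
        "polars": "query_polars",
        "pandas": "query_pandas",
        "ipc": "query_arrow_ipc",
        "c_data": "query_arrow_c_data",
    }
    if output not in methods:
        raise ValueError(
            f"unknown output {output!r}; expected one of {sorted(methods)}"
        )
    return getattr(conn, methods[output])(sql)
```

With a real connection this would be called as run_query(conn, "SELECT * FROM your_table", output="pandas").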

Quick Start

🚀 Recommended Usage (95% of cases)

import ibarrow

# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Query and get Polars DataFrame
df = conn.query_polars("SELECT * FROM your_table")

print(df)

With Custom Batch Size

import ibarrow

# Create config with custom batch size
config = ibarrow.QueryConfig(batch_size=2000)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Query with custom batch size
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

Advanced Configuration

import ibarrow

# Create custom configuration
config = ibarrow.QueryConfig(
    batch_size=2000,                   # Rows per batch
    read_only=True,                    # Read-only connection
    connection_timeout=30,             # Connection timeout in seconds
    query_timeout=60,                  # Query timeout in seconds
    max_text_size=32768,               # Max text field size in bytes
    max_binary_size=16384,             # Max binary field size in bytes
    isolation_level="read_committed",  # Transaction isolation (see QueryConfig for supported values)
)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Use the connection
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

Direct DataFrame Integration

import ibarrow

# Direct conversion to Polars DataFrame (uses pl.read_ipc internally)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame
df_polars = conn.query_polars("SELECT * FROM your_table")

# Get Pandas DataFrame
df_pandas = conn.query_pandas("SELECT * FROM your_table")

print(df_polars)
print(df_pandas)

⚡ Zero-Copy Performance (Arrow C Data Interface)

For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:

import ibarrow
import polars as pl
import pyarrow as pa

# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame directly
df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)

# Or get raw PyCapsules for manual control
schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")

# Import into PyArrow without copying.
# Assumption: the capsules follow the Arrow PyCapsule Interface
# ("arrow_schema" / "arrow_array") and describe one record batch.
# Requires pyarrow >= 14.
batch = pa.RecordBatch._import_from_c_capsule(schema_capsule, array_capsule)
table = pa.Table.from_batches([batch])

# Convert to Polars
df = pl.from_arrow(table)

Arrow C Data Interface Benefits:

🚀 Zero serialization: Data passes directly via pointers
💾 Zero copies: Eliminates memory overhead
⚡ Maximum speed: Ideal for large datasets
🔄 Compatibility: Works with PyArrow, Polars, Pandas

Manual Arrow IPC Usage

import ibarrow
import polars as pl

# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Arrow IPC bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Convert to Polars DataFrame manually
df = pl.read_ipc(arrow_bytes)
print(df)
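
Because query_arrow_ipc returns plain bytes, the result can also be written to disk and read back later. A minimal sketch, assuming only that the connection has a query_arrow_ipc(sql) method returning bytes; save_query_ipc is a hypothetical helper, not part of ibarrow.

```python
from pathlib import Path


def save_query_ipc(conn, sql, path):
    """Run the query, write the Arrow IPC bytes to `path`, and return the byte count."""
    data = conn.query_arrow_ipc(sql)
    target = Path(path)
    target.write_bytes(data)
    return len(data)
```

For example, save_query_ipc(conn, "SELECT * FROM your_table", "your_table.arrow") would store the result in a file named your_table.arrow (a placeholder name).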

API Reference

`ibarrow.connect(dsn, user, password, config=None)`

Creates a connection object for database operations.

Parameters:

dsn (str): ODBC Data Source Name
user (str): Database username
password (str): Database password
config (QueryConfig, optional): Configuration object

Returns: IbarrowConnection object

`conn.query_arrow_ipc(sql)`

Execute a SQL query and return Arrow IPC bytes.

Parameters:

sql (str): SQL query to execute

Returns: bytes - Arrow IPC format data

Raises:

PyConnectionError: Database connection issues
PySQLError: SQL syntax or execution errors
PyArrowError: Arrow data processing errors

`conn.query_polars(sql)`

Execute a SQL query and return a Polars DataFrame directly.

Parameters: Same as query_arrow_ipc

Returns: polars.DataFrame - Ready-to-use DataFrame

Note: Uses pl.read_ipc() directly with bytes for optimal performance.

`conn.query_pandas(sql)`

Execute a SQL query and return a Pandas DataFrame directly.

Parameters: Same as query_arrow_ipc

Returns: pandas.DataFrame - Ready-to-use DataFrame

Note: Converts Arrow IPC to Pandas via PyArrow for compatibility.

`QueryConfig`

Configuration class for advanced query settings.

Parameters:

batch_size (int, optional): Number of rows per batch for processing (default: 1000)
read_only (bool, optional): Read-only connection to avoid locks (default: True)
connection_timeout (int, optional): Connection timeout in seconds
query_timeout (int, optional): Query timeout in seconds
max_text_size (int, optional): Maximum text field size in bytes (default: 65536)
max_binary_size (int, optional): Maximum binary field size in bytes (default: 65536)
isolation_level (str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"

Configuration Benefits

batch_size: Controls memory usage and performance. Larger batches use more memory but need fewer fetch round trips to the database
read_only: Prevents locks and improves performance for read-only operations (effective only if ODBC driver supports this flag)
connection_timeout: Protects against hanging connections
query_timeout: Prevents long-running queries from blocking
max_text_size: Handles large text fields (VARCHAR, TEXT) efficiently
max_binary_size: Handles large binary data (BLOB, VARBINARY) efficiently
isolation_level: Controls transaction isolation for concurrent access
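
To get a feel for how batch_size and max_text_size interact, here is a rough worst-case estimate. It assumes every text cell in a batch reserves max_text_size bytes, which is a common ODBC bulk-fetch strategy but is not confirmed for ibarrow; both function names are hypothetical.

```python
def estimate_text_buffer_bytes(batch_size, text_columns, max_text_size=65536):
    """Worst-case buffer size if each text cell reserves max_text_size bytes."""
    return batch_size * text_columns * max_text_size


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


# 1000 rows x 4 text columns x 65536 bytes per cell
print(to_mib(estimate_text_buffer_bytes(1000, 4)))  # → 250.0
```

Under this assumption, doubling batch_size while halving max_text_size leaves the estimate unchanged, so the two settings are best tuned together.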

Implementation Notes

read_only: Currently implemented via ODBC connection string (ReadOnly=1).
batch_size: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.
query_timeout: Implemented via statement handle using stmt.set_query_timeout(), which is more reliable than connection string timeouts.
isolation_level: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted").
query_polars: Optimized to use pl.read_ipc() directly with bytes, avoiding the overhead of io.BytesIO wrapper for better performance.
Native Types: Always preserves ODBC native types (INT, DECIMAL, FLOAT) as Arrow native types (Int64Array, Float64Array), avoiding expensive string conversions for maximum performance.
Pipelining: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).
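
The read_only and isolation_level notes above both describe options carried in the ODBC connection string. The sketch below shows what such a mapping could look like. Only ReadOnly=1 and Isolation Level=ReadCommitted come from the notes; the other keyword spellings, the case-insensitive matching, and the name connection_options are assumptions, and real drivers may expect different values.

```python
# Common names mapped to connection-string values.
# Only "ReadCommitted" is taken from the notes above; the rest are assumed spellings.
ISOLATION_LEVELS = {
    "read_uncommitted": "ReadUncommitted",
    "read_committed": "ReadCommitted",
    "repeatable_read": "RepeatableRead",
    "serializable": "Serializable",
    "snapshot": "Snapshot",
}


def connection_options(read_only=True, isolation_level=None):
    """Build the extra connection-string fragment for the given settings."""
    parts = []
    if read_only:
        parts.append("ReadOnly=1")
    if isolation_level is not None:
        key = isolation_level.strip().lower()
        if key not in ISOLATION_LEVELS:
            raise ValueError(f"unsupported isolation level: {isolation_level!r}")
        parts.append(f"Isolation Level={ISOLATION_LEVELS[key]}")
    return ";".join(parts)


print(connection_options(True, "read_committed"))
# → ReadOnly=1;Isolation Level=ReadCommitted
```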

Performance Comparison

Serialization vs Zero-Copy

| Method | Level | Serialization | Memory Copies | Performance | Ideal Use |
|---|---|---|---|---|---|
| `query_polars` | High | Zero | Zero | ⭐⭐⭐⭐⭐ | 95% of cases (recommended) |
| `query_pandas` | High | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Pandas compatibility |
| `query_arrow_ipc` | Low | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Maximum compatibility |
| `query_arrow_c_data` | Low | Zero | Zero | ⭐⭐⭐⭐⭐ | Maximum performance |

Typical Benchmarks (1M rows)

query_polars:         ~30ms   (zero-copy + direct conversion) ⭐ RECOMMENDED
query_pandas:         ~600ms  (serialization + pyarrow + pandas)
query_arrow_ipc:      ~500ms  (serialization + deserialization)
query_arrow_c_data:   ~50ms   (zero-copy via pointers)
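
Timings like these depend on the driver, network, and schema, so it is worth measuring on your own data. A minimal timing harness, using only the standard library; time_call is a hypothetical helper and the callable you pass in is up to you.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the median wall-clock time of fn() in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

For example, time_call(lambda: conn.query_polars("SELECT * FROM your_table")) and the same call with query_pandas give directly comparable numbers for one table.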

🚀 Built-in Performance Optimizations

Native Types (Always Enabled):

- INT columns → Int64Array (zero-copy)
- DECIMAL columns → DecimalArray (zero-copy)  
- FLOAT columns → Float64Array (zero-copy)
- Performance: ~30ms for 1M numeric rows

Pipelining (Always Enabled):

- Memory usage: Constant (~10MB) regardless of dataset size
- Processing: Streaming (fetch + write immediately)
- Latency: Lower (Python can start consuming data before completion)
- Example: 80GB dataset uses only ~10MB RAM
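
The fetch-then-write pattern described above can be illustrated in plain Python. This is a conceptual sketch only, not ibarrow's Rust implementation: iter_batches and stream_to_sink are hypothetical names, and any iterable stands in for the database cursor.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows without materializing the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def stream_to_sink(rows, batch_size, write):
    """Pass each batch to write() as soon as it is fetched; return the row count."""
    total = 0
    for batch in iter_batches(rows, batch_size):
        write(batch)  # only the current batch is held by this loop
        total += len(batch)
    return total


print([len(b) for b in iter_batches(range(10), 4)])  # → [4, 4, 2]
```

Only one batch is alive inside the loop at a time, which is why the working set depends on batch_size rather than on the total number of rows.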

When to Use Each Method

🎯 High-Level API (Recommended)

query_polars(): 95% of cases - Simple, fast, zero-copy
query_pandas(): When you need Pandas compatibility

🔧 Low-Level API (Advanced)

query_arrow_ipc(): Maximum compatibility, save to disk
query_arrow_c_data(): Maximum performance, full control over data

Error Handling

The library provides specific exception types for different error scenarios:

import ibarrow

# Placeholder connection details and query
dsn = "your_dsn"
user = "username"
password = "password"
sql = "SELECT * FROM your_table"

try:
    # Create connection
    conn = ibarrow.connect(dsn, user, password)

    # Run the query and get a Polars DataFrame
    df = conn.query_polars(sql)
except ibarrow.PyConnectionError as e:
    print(f"Connection failed: {e}")
except ibarrow.PySQLError as e:
    print(f"SQL error: {e}")
except ibarrow.PyArrowError as e:
    print(f"Arrow processing error: {e}")
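
Connection failures are often transient, so one option is to retry them while letting SQL errors surface immediately. A minimal sketch with exponential backoff; query_with_retry is a hypothetical helper, and which exceptions are worth retrying is an assumption you should adapt.

```python
import time


def query_with_retry(run, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call run(); on the given exception type(s), wait and retry with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return run()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            sleep(base_delay * 2 ** attempt)
```

With ibarrow this could be used as query_with_retry(lambda: conn.query_polars(sql), retry_on=ibarrow.PyConnectionError), so that PySQLError and PyArrowError are not retried.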

Requirements

Python 3.8+
ODBC driver for your database
Rust toolchain (for development)

Development

Setup

# Clone the repository
git clone https://github.com/thomazyujibaba/ibarrow.git
cd ibarrow

# Install maturin
pip install "maturin[patchelf]"

# Install in development mode
maturin develop

Running Tests

# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/ -v

Building

# Build wheel
maturin build --release

# Build and install
maturin develop

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

Common ODBC Issues

"Driver not found" errors:

Ensure the ODBC driver is properly installed
Check that the driver name in your DSN matches exactly
Verify the driver architecture (32-bit vs 64-bit) matches your Python installation
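
To find out which architecture your Python interpreter uses, you can check its pointer size. A small standard-library sketch; python_bitness is a hypothetical helper name.

```python
import struct


def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


print(f"This Python is {python_bitness()}-bit")
```

The ODBC driver must be installed in the same bitness as the value reported here.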

Connection timeout errors:

Check network connectivity to the database server
Verify firewall settings
Ensure the database server is running and accessible
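
A quick way to separate network problems from ODBC configuration problems is a plain TCP check against the database host. A minimal sketch using only the standard library; can_connect is a hypothetical helper, and the host and port are whatever your server uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For a Firebird server on its default port this would be can_connect("db.example.com", 3050), where db.example.com is a placeholder. A False result points at connectivity or firewall settings rather than at the driver.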

Permission errors:

Verify database credentials
Check user permissions on the database
Ensure the ODBC driver has necessary privileges

Performance issues:

Adjust batch_size in QueryConfig for optimal memory usage
Use read_only=True for read-only operations
Consider connection pooling for high-frequency queries
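
ibarrow does not document a built-in pool, so the pooling suggestion would have to be implemented on the application side. The following is a minimal sketch of a fixed-size pool built on queue.Queue; ConnectionPool is a hypothetical class, and it assumes that connections can safely be reused across queries, which you should verify for your driver.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that hands out connections created by `factory`."""

    def __init__(self, factory, size=4):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)
```

A pool could be created with ConnectionPool(lambda: ibarrow.connect(dsn="your_dsn", user="username", password="password"), size=4) and then used as `with pool.connection() as conn:` around each query.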

Support

For issues and questions, please use the GitHub Issues page.

Release history

0.1.9 (Sep 24, 2025)
0.1.8 (Sep 24, 2025)
0.1.7 (Sep 24, 2025)
0.1.6 (Sep 24, 2025)
0.1.4.1 (Sep 23, 2025)
0.1.4 (Sep 23, 2025)
0.1.3.1 (Sep 23, 2025)
0.1.2 (Sep 23, 2025)
0.1.1 (Sep 23, 2025)
0.1.0 (Sep 23, 2025, this version)
