High-performance ODBC to Arrow data connector for InterBase/Firebird
Project description
ibarrow
High-performance ODBC to Arrow data conversion for Python, built with Rust.
Features
- 🚀 High Performance: Built with Rust for maximum speed
- 🔄 ODBC Integration: Direct connection to any ODBC-compatible database
- 📊 Arrow Format: Native Apache Arrow support for efficient data processing
- 🐼 Pandas/Polars Ready: Seamless integration with popular Python data libraries
- 🛡️ Type Safe: Rust-powered reliability with Python convenience
- 🎯 Two-Level API: Simple wrappers for common use + raw functions for advanced control
Installation
pip install ibarrow
Repository
- GitHub: https://github.com/thomazyujibaba/ibarrow
- PyPI: https://pypi.org/project/ibarrow/
- Documentation: https://github.com/thomazyujibaba/ibarrow#readme
Prerequisites
Important: You need an ODBC driver installed on your system for ibarrow to work.
Windows
- SQL Server: ODBC Driver for SQL Server
- PostgreSQL: psqlODBC
- MySQL: MySQL Connector/ODBC
- Oracle: Oracle Instant Client + ODBC
Linux
- SQL Server: Microsoft ODBC Driver for SQL Server on Linux
- PostgreSQL:
sudo apt-get install odbc-postgresql(Ubuntu/Debian) orsudo yum install postgresql-odbc(RHEL/CentOS) - MySQL:
sudo apt-get install libmyodbc(Ubuntu/Debian) orsudo yum install mysql-connector-odbc(RHEL/CentOS)
macOS
- Note: macOS support is currently not available. Please use Windows or Linux for now.
Verify ODBC Installation
You can verify your ODBC drivers are installed by checking the system:
Windows:
# Check installed drivers
odbcad32.exe
Linux/macOS:
# List available drivers
odbcinst -q -d
API Architecture
ibarrow provides a two-level API designed for different user needs:
🎯 High-Level API (Recommended for 95% of users)
query_polars(): Direct Polars DataFrame (zero-copy, fastest)query_pandas(): Direct Pandas DataFrame (maximum compatibility)
🔧 Low-Level API (For advanced users)
query_arrow_ipc(): Raw Arrow IPC bytes (maximum compatibility)query_arrow_c_data(): Raw Arrow C Data Interface (maximum performance)
📋 When to Use Each Level
| User Type | Recommended Function | Use Case |
|---|---|---|
| Beginners | query_polars() |
95% of cases - simple and fast |
| Pandas Users | query_pandas() |
When you need Pandas compatibility |
| Advanced Users | query_arrow_ipc() |
When you need raw Arrow data |
| Performance Critical | query_arrow_c_data() |
When you need maximum control |
Quick Start
🔗 Connection Methods
ibarrow supports two ways to connect to databases:
Method 1: DSN (Data Source Name)
# Requires pre-configured DSN in ODBC Data Sources
conn = ibarrow.connect(
dsn="my_database_dsn",
user="username",
password="password"
)
Method 2: Direct Connection String (Recommended)
# Direct connection like pyodbc - no DSN configuration needed
conn = ibarrow.connect(
dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
user="username",
password="password"
)
🚀 Recommended Usage (95% of cases)
import ibarrow
# Option 1: Using DSN (Data Source Name)
conn = ibarrow.connect(
dsn="your_dsn",
user="username",
password="password"
)
# Option 2: Using direct connection string (like pyodbc)
conn = ibarrow.connect(
dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
user="username",
password="password"
)
# Query and get Polars DataFrame
df = conn.query_polars("SELECT * FROM your_table")
print(df)
With Custom Batch Size
import ibarrow
# Create config with custom batch size
config = ibarrow.QueryConfig(batch_size=2000)
# Create connection with configuration
conn = ibarrow.connect(
dsn="your_dsn",
user="username",
password="password",
config=config
)
# Query with custom batch size
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
Advanced Configuration
import ibarrow
# Create custom configuration
config = ibarrow.QueryConfig(
batch_size=2000, # Rows per batch
read_only=True, # Read-only connection
connection_timeout=30, # Connection timeout in seconds
query_timeout=60, # Query timeout in seconds
max_text_size=32768, # Max text field size
max_binary_size=16384, # Max binary field size
isolation_level="READ_COMMITTED" # Transaction isolation
)
# Create connection with configuration
conn = ibarrow.connect(
dsn="your_dsn",
user="username",
password="password",
config=config
)
# Use the connection
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
Direct DataFrame Integration
import ibarrow
# Direct conversion to Polars DataFrame (uses pl.read_ipc internally)
# Create connection
conn = ibarrow.connect(
dsn="your_dsn",
user="username",
password="password"
)
# Get Polars DataFrame
df_polars = conn.query_polars("SELECT * FROM your_table")
# Get Pandas DataFrame
df_pandas = conn.query_pandas("SELECT * FROM your_table")
print(df_polars)
print(df_pandas)
⚡ Zero-Copy Performance (Arrow C Data Interface)
For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:
import ibarrow
import polars as pl
import pyarrow as pa
# Zero-copy conversion to Polars DataFrame (fastest)
# Create connection
conn = ibarrow.connect(
dsn="your_dsn",
user="username",
password="password"
)
# Get Polars DataFrame directly
df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)
# Or get raw PyCapsules for manual control
schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")
# Convert to PyArrow Table using zero-copy
schema = pa.Schema._import_from_c(schema_capsule)
array = pa.Array._import_from_c(array_capsule)
table = pa.Table.from_arrays([array], schema=schema)
# Convert to Polars
df = pl.from_arrow(table)
Arrow C Data Interface Benefits:
- 🚀 Zero serialization: Data passes directly via pointers
- 💾 Zero copies: Eliminates memory overhead
- ⚡ Maximum speed: Ideal for large datasets
- 🔄 Compatibility: Works with PyArrow, Polars, Pandas
Manual Arrow IPC Usage
import ibarrow
import polars as pl
# Get raw Arrow IPC bytes
# Create connection
conn = ibarrow.connect(
dsn="your_dsn",
user="username",
password="password"
)
# Get Arrow IPC bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
# Convert to Polars DataFrame manually
df = pl.read_ipc(arrow_bytes)
print(df)
API Reference
ibarrow.connect(dsn, user, password, config=None)
Creates a connection object for database operations.
Parameters:
dsn(str): ODBC Data Source Name or full connection string- DSN format:
"your_dsn"(requires pre-configured DSN) - Connection string format:
"DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"(direct connection)
- DSN format:
user(str): Database usernamepassword(str): Database passwordconfig(QueryConfig, optional): Configuration object
Returns: IbarrowConnection object
Connection String Examples:
# SQL Server
dsn = "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"
# PostgreSQL
dsn = "DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;DATABASE=mydb;"
# MySQL
dsn = "DRIVER={MySQL ODBC 8.0 Driver};SERVER=localhost;PORT=3306;DATABASE=mydb;"
# Oracle
dsn = "DRIVER={Oracle in OraClient19Home1};DBQ=localhost:1521/XE;"
query_arrow_ipc(sql)
Execute a SQL query and return Arrow IPC bytes.
Parameters:
sql(str): SQL query to execute
Returns: bytes - Arrow IPC format data
Raises:
PyConnectionError: Database connection issuesPySQLError: SQL syntax or execution errorsPyArrowError: Arrow data processing errors
conn.query_polars(sql)
Execute a SQL query and return a Polars DataFrame directly.
Parameters: Same as query_arrow_ipc
Returns: polars.DataFrame - Ready-to-use DataFrame
Note: Uses pl.read_ipc() directly with bytes for optimal performance.
query_pandas(sql)
Execute a SQL query and return a Pandas DataFrame directly.
Parameters: Same as query_arrow_ipc
Returns: pandas.DataFrame - Ready-to-use DataFrame
Note: Converts Arrow IPC to Pandas via PyArrow for compatibility.
QueryConfig
Configuration class for advanced query settings.
Parameters:
batch_size(int, optional): Number of rows per batch for processing (default: 1000)read_only(bool, optional): Read-only connection to avoid locks (default: True)connection_timeout(int, optional): Connection timeout in secondsquery_timeout(int, optional): Query timeout in secondsmax_text_size(int, optional): Maximum text field size in bytes (default: 65536)max_binary_size(int, optional): Maximum binary field size in bytes (default: 65536)isolation_level(str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"
Configuration Benefits
batch_size: Controls memory usage and performance. Larger batches = more memory but faster processingread_only: Prevents locks and improves performance for read-only operations (effective only if ODBC driver supports this flag)connection_timeout: Protects against hanging connectionsquery_timeout: Prevents long-running queries from blockingmax_text_size: Handles large text fields (VARCHAR, TEXT) efficientlymax_binary_size: Handles large binary data (BLOB, VARBINARY) efficientlyisolation_level: Controls transaction isolation for concurrent access
Implementation Notes
read_only: Currently implemented via ODBC connection string (ReadOnly=1).batch_size: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.query_timeout: Implemented via statement handle usingstmt.set_query_timeout(), which is more reliable than connection string timeouts.isolation_level: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted").query_polars: Uses Arrow IPC stream withpl.read_ipc()for maximum compatibility and performance.- Native Types: Always preserves ODBC native types (INT, DECIMAL, FLOAT) as Arrow native types (Int64Array, Float64Array), avoiding expensive string conversions for maximum performance.
- Pipelining: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).
Performance Comparison
Serialization vs Zero-Copy
| Method | Level | Serialization | Memory Copies | Performance | Ideal Use |
|---|---|---|---|---|---|
query_polars |
High | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐⭐ | 95% of cases (recommended) |
query_pandas |
High | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Pandas compatibility |
query_arrow_ipc |
Low | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Maximum compatibility |
query_arrow_c_data |
Low | Zero | Zero | ⭐⭐⭐⭐⭐ | Maximum performance |
Typical Benchmarks (1M rows)
query_polars: ~200ms (Arrow IPC + polars.read_ipc) ⭐ RECOMMENDED
query_pandas: ~600ms (Arrow IPC + pyarrow + pandas)
query_arrow_ipc: ~500ms (Arrow IPC serialization)
query_arrow_c_data: ~50ms (zero-copy via pointers)
🚀 Built-in Performance Optimizations
Native Types (Always Enabled):
- INT columns → Int64Array (native Arrow types)
- DECIMAL columns → DecimalArray (native Arrow types)
- FLOAT columns → Float64Array (native Arrow types)
- Performance: ~200ms for 1M numeric rows (Arrow IPC)
Pipelining (Always Enabled):
- Memory usage: Constant (~10MB) regardless of dataset size
- Processing: Streaming (fetch + write immediately)
- Latency: Lower (Python can start consuming data before completion)
- Example: 80GB dataset uses only ~10MB RAM
When to Use Each Method
🎯 High-Level API (Recommended)
query_polars(): 95% of cases - Simple, fast, zero-copyquery_pandas(): When you need Pandas compatibility
🔧 Low-Level API (Advanced)
query_arrow_ipc(): Maximum compatibility, save to diskquery_arrow_c_data(): Maximum performance, full control over data
Error Handling
The library provides specific exception types for different error scenarios:
import ibarrow
try:
# Create connection
conn = ibarrow.connect(dsn, user, password)
# Query with batch size
df = conn.query_polars(sql)
except ibarrow.PyConnectionError as e:
print(f"Connection failed: {e}")
except ibarrow.PySQLError as e:
print(f"SQL error: {e}")
except ibarrow.PyArrowError as e:
print(f"Arrow processing error: {e}")
Requirements
- Python 3.8+
- ODBC driver for your database
- Rust toolchain (for development)
Development
Setup
# Clone the repository
git clone https://github.com/thomazyujibaba/ibarrow.git
cd ibarrow
# Install maturin
pip install maturin[patchelf]
# Install in development mode
maturin develop
Running Tests
# Install test dependencies
pip install pytest pytest-cov
# Run tests
pytest tests/ -v
Building
# Build wheel
maturin build --release
# Build and install
maturin develop
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Troubleshooting
Common ODBC Issues
"Driver not found" errors:
- Ensure the ODBC driver is properly installed
- Check that the driver name in your DSN matches exactly
- Verify the driver architecture (32-bit vs 64-bit) matches your Python installation
Connection timeout errors:
- Check network connectivity to the database server
- Verify firewall settings
- Ensure the database server is running and accessible
Permission errors:
- Verify database credentials
- Check user permissions on the database
- Ensure the ODBC driver has necessary privileges
Performance issues:
- Adjust
batch_sizeinQueryConfigfor optimal memory usage - Use
read_only=Truefor read-only operations - Consider connection pooling for high-frequency queries
Support
For issues and questions, please use the GitHub Issues page.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ibarrow-0.1.2.tar.gz.
File metadata
- Download URL: ibarrow-0.1.2.tar.gz
- Upload date:
- Size: 46.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
380af6cb2fad17adcaee4ecc325cfb2d38afc5e103681e2ddbb4b4d9eb0d27da
|
|
| MD5 |
ff41b70ae3d1f2655a128cc514599eac
|
|
| BLAKE2b-256 |
1b5d3451fc0e73b8d74aa9857d3f83590e68d4db05b60858f66ab4f8cec7b9d6
|
File details
Details for the file ibarrow-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: ibarrow-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 972.2 kB
- Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
69bec6dda23edd9ac3fc6b3165376ba7b52644ef2606e43bb249daa11481031c
|
|
| MD5 |
0a30066f51a5ebb4915a70b7a1ccba80
|
|
| BLAKE2b-256 |
0a76daf93ff73943d0a0e4cc610ba95718d0f9881c7c958520de938d4f797d6f
|