Skip to main content

A JDBC-first Python database bridge powered by JPype

Project description

jdbc4py

Python access to databases whose native drivers are weak, commercial, or missing — and a fast Arrow bulk-read path across any database with a JDBC driver.

jdbc4py embeds a JVM in-process via JPype, loads your JDBC driver JAR, and hands back a standard DBAPI 2.0 connection — plus near-zero-copy Arrow reads into pandas, Polars, and DuckDB.


When to use this

Reach for jdbc4py when:

  • Your database's native Python driver is weak, commercial, or nonexistent — SAP HANA, Sybase ASE/IQ, or any JDBC-only source.
  • You need to join or query across two different database systems in one SQL statement (e.g. PostgreSQL ⋈ MySQL) — see cross-database queries.
  • You query many different databases from one codebase and want one API (and one fast Arrow path) instead of a different driver per engine.
  • You need fast bulk reads into pandas / Arrow / Polars from a JDBC database.

Use something else when:

  • You're on PostgreSQL, MySQL, or SQLite — the native drivers (psycopg, etc.) are faster and need no JVM. Use them.
  • You need sub-second connect latency (serverless, high-frequency OLTP) — jdbc4py runs a JVM in-process, so the first connection pays ~0.5–1 s of JVM startup (use the pool to amortize it).
  • You can't run a JVM at all.

vs. the alternatives

  • Native driver (psycopg, hdbcli, …): faster for single-database, row-by-row access and connect latency. jdbc4py wins on bulk Arrow reads and on databases with no good native driver.
  • JayDeBeApi: the other JPype-based JDBC bridge. jdbc4py adds the Arrow bulk path, a connection pool, richer type mapping, a SQLAlchemy dialect, and an asyncio facade.
  • ADBC (Arrow Database Connectivity): the emerging Arrow-native standard, no JVM — the better choice if your database has an ADBC driver. jdbc4py covers the long tail of databases that only ship JDBC.

Features

  • Arrow bulk reads — results serialize on the JVM side and arrive as a near-zero-copy Arrow table; pandas/Polars/DuckDB reads beat native driver + SQLAlchemy + pandas by 1.8–4.8× on large analytical workloads (see BENCHMARKS.md).
  • Broad database reach — one API over any JDBC driver, including enterprise databases with thin Python support. See Supported Databases.
  • Cross-database queries — register several connections and run a single SQL query that joins tables across different database systems; jdbc4py fetches each source via Arrow and executes the join in an in-memory DuckDB.
  • DBAPI 2.0 + SQLAlchemy — a standard connection/cursor, plus a postgresql+jdbc4py://… dialect so SQLAlchemy, pandas, and ORMs work unchanged.
  • Write paths — insert Arrow, pandas, Polars, or DuckDB data directly into any JDBC table.
  • Server-side streamingstream=True for huge results: instant time-to-first-batch, bounded memory.
  • Connection pool — idle eviction, broken-connection replacement, configurable timeouts.
  • Rich type mapping — numeric/decimal, temporal (nanosecond on the Arrow path), JSON, UUID, arrays, ranges, intervals, vectors/embeddings, and full Unicode.
  • asynciojdbc4py.aio runs blocking JDBC calls off the event loop.
  • Operational niceties — cancel a running query from any thread, uniform metadata helpers, and credential redaction in exceptions and logs.

Requirements

  • Python 3.11+
  • Java 11+ (JRE is enough; JDK only needed if building the helper JAR)

Installation

pip install jdbc4py                 # core DBAPI 2.0 (row-by-row fetch)
pip install "jdbc4py[dataframe]"    # + Arrow bulk reads into pandas / Polars / DuckDB

Requires a Java 11+ runtime (a JRE is enough — no JDK needed). The wheel bundles the prebuilt Arrow helper JAR, so there's no build step at install time. You still bring your own JDBC driver JAR for each database — see docs/installation.md for driver placement and environment configuration, and Building from Source to work on jdbc4py itself.


Quick Start

from jdbc4py import connect

with connect(
    engine="postgres",
    host="localhost",
    database="mydb",
    user="me",
    password="secret",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE active = ?", [True])
        rows = cur.fetchall()

Bulk read into a DataFrame:

df = conn.read_sql("SELECT * FROM events", format="pandas")
# or "polars", "arrow", "duckdb"

Write a DataFrame back:

conn.write_polars(df, "staging", if_exists="replace")

Stream a very large result without buffering it client-side:

df = conn.read_sql("SELECT * FROM huge_table", format="pandas", stream=True)

Use it through SQLAlchemy (and therefore pandas, ORMs, Alembic):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql+jdbc4py://me:secret@localhost:5432/mydb")
df = pd.read_sql("SELECT * FROM events", engine)

Cross-database queries

Join tables that live in different database systems in a single query. Register each connection under an alias, then reference it as alias.table; jdbc4py pulls each source over Arrow and runs the join in an in-memory DuckDB:

from jdbc4py import connect, FederatedSession

with FederatedSession() as fed:
    fed.add_connection("pg", connect(engine="postgres", host="pg-host",
                                     database="analytics", user="me", password="secret"))
    fed.add_connection("mysql", connect(engine="mysql", host="mysql-host",
                                        database="users", user="me", password="secret"))

    df = fed.execute(
        "SELECT c.name, SUM(o.total) AS spend "
        "FROM pg.customers c "
        "JOIN mysql.orders o ON c.id = o.customer_id "
        "GROUP BY c.name",
        format="pandas",
    )

Supported Databases

Database Engine key Notes
PostgreSQL postgres Full type coverage including arrays, ranges, hstore
MySQL mysql
MariaDB mariadb
SQL Server mssql
Oracle oracle
ClickHouse clickhouse
Trino trino
Presto presto
Snowflake snowflake connect_snowflake() helper
Databricks databricks connect_databricks() helper
Google Cloud Spanner spanner connect_spanner() helper
Google BigQuery bigquery connect_bigquery() helper
Sybase ASE / SAP IQ sybase_ase / sybase_iq jConnect or jTDS driver
SAP HANA hana on-prem & HANA Cloud; ngdbc.jar driver
SQLite sqlite Useful for testing

Any database with a JDBC driver works via the jdbc_url= + driver= + jar= path.


Connection Pool

from jdbc4py import ConnectionPool

pool = ConnectionPool(
    engine="postgres",
    host="localhost",
    database="mydb",
    user="me",
    password="secret",
    min_size=2,
    max_connections=10,
)

with pool.connect() as conn:
    df = conn.read_sql("SELECT * FROM orders", format="pandas")

Benchmarks

On a 50K-row × 50-col read, jdbc4py's Arrow path is the fastest pandas loader on every local engine tested, beating the native driver + SQLAlchemy + pandas by ~1.8× on PostgreSQL, 2.3× MySQL, 3.2× MariaDB, 3.3× Oracle, and 4.8× on SQL Server. On Databricks it's 2.8× faster than databricks-sql-connector; on Snowflake the native connector wins (it reads Snowflake's native Arrow wire format directly).

The win is the Arrow bulk path, not row access: row-by-row fetchall() goes through the JVM and is slower than a native C driver (it creates Python objects one cell at a time), and the first connection pays JVM startup. Rule of thumb — use read_sql(format="pandas"/"arrow") for large reads, pool your connections, and reach for a native driver only if you need fast per-row tuple access.

Full numbers across PostgreSQL/MySQL/MariaDB/SQL Server/ClickHouse/Oracle plus deployed SAP HANA, Snowflake, and Databricks: BENCHMARKS.md


Documentation

Document Contents
docs/installation.md Java setup, driver JARs, environment variables
docs/architecture.md How jdbc4py works internally
docs/performance.md Arrow path, batching, pool tuning
docs/driver-management.md Driver discovery, custom JARs, JDBC4PY_DRIVER_PATHS
docs/jdbc-internals.md Cursor, type system, cancellation, pool internals
docs/design-decisions.md Why JPype, why Arrow IPC, tradeoffs
docs/troubleshooting.md Common errors and fixes
docs/faq.md Frequently asked questions
docs/operator-guide.md Production deployment and security

Building from Source

git clone https://github.com/ryuk-me/jdbc4py
cd jdbc4py
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,dataframe]"

# Build the Java helper JAR (enables Arrow bulk reads)
./java/jdbc4py-helper/build.sh

# Unit tests
pytest -q

# Integration tests (requires Docker)
docker compose --profile core up -d
JDBC4PY_RUN_INTEGRATION=1 pytest tests/integration/ -q

Contributing

Bug reports and pull requests are welcome. For large changes, open an issue first to discuss the approach. Tests are required for new features.


License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

jdbc4py-0.1.1-py3-none-any.whl (5.4 MB view details)

Uploaded Python 3

File details

Details for the file jdbc4py-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: jdbc4py-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 5.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for jdbc4py-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4e9728ec29217888e672e7697dd51124fd7a7e9f67ffaecea175718522178b99
MD5 cae6143ff185371d576aaf2c35bd3755
BLAKE2b-256 01cba37d8e0ffc5528d7a654246b7ea7e447408223341bd7d98a0484807cd55d

See more details on using hashes here.

Provenance

The following attestation bundles were made for jdbc4py-0.1.1-py3-none-any.whl:

Publisher: release.yml on Ryuk-me/jdbc4py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page