Convert database tables to Parquet files.
db2pq: export PostgreSQL and WRDS data to Parquet
db2pq is a Python library for moving data from PostgreSQL into Apache Parquet files.
It is designed for both general PostgreSQL sources and the WRDS PostgreSQL service.
What it does
- Export a single PostgreSQL table to Parquet.
- Export an Ibis PostgreSQL table expression to Parquet.
- Export all tables in a PostgreSQL schema to Parquet.
- Export WRDS tables to Parquet.
- Update Parquet files only when the WRDS source table is newer.
- Mirror WRDS tables into a local PostgreSQL database.
- Read last_modified metadata embedded in Parquet files.
Installation
Install from PyPI:
pip install --upgrade db2pq
This installs the bundled psycopg PostgreSQL client dependency, so most users
do not need a separate system libpq installation.
Install optional SAS support (used by wrds_update_pq(..., use_sas=True)):
pip install --upgrade "db2pq[sas]"
Install optional pandas support (needed for DataFrame outputs from
pq_last_modified(...)):
pip install --upgrade "db2pq[pandas]"
Install optional Ibis export support (needed for ibis_to_pq(...)):
pip install --upgrade "db2pq[ibis]"
Install optional ADBC export support (needed for engine="adbc" in
PostgreSQL-to-Parquet helpers):
pip install --upgrade "db2pq[adbc]"
Install both optional SAS and pandas support:
pip install --upgrade "db2pq[sas,pandas]"
Environment variables
db2pq supports explicit function arguments and environment-based defaults.
It also loads a local .env file automatically (via python-dotenv) when resolving defaults.
Connection defaults:
- PGUSER: PostgreSQL user (falls back to the local OS user)
- PGHOST: PostgreSQL host (default: localhost)
- PGDATABASE: PostgreSQL database (default: PGUSER)
- PGPORT: PostgreSQL port (default: 5432)
WRDS + output defaults:
- WRDS_ID: WRDS username (required for WRDS helpers unless passed directly)
- WRDS_USER: accepted as a synonym for WRDS_ID for compatibility with Tidy Finance-style setups
- WRDS_PASSWORD: if present and no WRDS .pgpass entry exists, db2pq can offer to save it to .pgpass
- DATA_DIR: base directory where Parquet files are written
Example shell setup:
export WRDS_ID="your_wrds_id"
export DATA_DIR="$HOME/pq_data"
If WRDS_ID is not set, WRDS helpers such as wrds_update_pq() and
wrds_pg_to_pq() will prompt for it on first use and suggest adding it to a
local .env file in the calling project. If your WRDS PostgreSQL password is
not yet stored in ~/.pgpass (or PGPASSFILE), db2pq will prompt for it
securely and save it for future connections. For compatibility with the Tidy
Finance Python setup, db2pq also recognizes WRDS_USER and can offer to
copy WRDS_PASSWORD into .pgpass.
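The default-resolution order described above can be sketched roughly as follows. Note that connection_defaults is a hypothetical helper written for illustration, not part of the db2pq API:

```python
import getpass
import os


def connection_defaults(env=os.environ):
    """Sketch: resolve PostgreSQL connection defaults from the environment."""
    user = env.get("PGUSER") or getpass.getuser()  # fall back to the OS user
    return {
        "user": user,
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", user),  # database defaults to the user name
        "port": int(env.get("PGPORT", "5432")),
    }
```

Explicit function arguments always win over these environment-derived defaults.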
WRDS SSH setup (for SAS-based metadata)
wrds_update_pq(..., use_sas=True) uses SSH to execute SAS remotely. Configure
an SSH key for your WRDS account first:
ssh-keygen -t ed25519 -C "your_wrds_id@wrds"
cat ~/.ssh/id_ed25519.pub | \
ssh your_wrds_id@wrds-cloud-sshkey.wharton.upenn.edu \
"mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
Workflows
1) WRDS to Parquet
Use this path when WRDS is the source of truth and the Parquet repository is the destination.
Most users should start with wrds_update_pq(), which updates only when the
WRDS source appears newer than the existing Parquet file:
from db2pq import wrds_update_pq
wrds_update_pq("dsi", "crsp")
Use wrds_pg_to_pq() when you want a direct export without the update check:
from db2pq import wrds_pg_to_pq
wrds_pg_to_pq(
table_name="dsi",
schema="crsp",
wrds_id="your_wrds_id", # or set WRDS_ID in the environment
engine="adbc",
numeric_mode="float64",
)
Use wrds_sql_to_pq() when you want to run a custom SQL query against WRDS
and save the result into the standard Parquet repository layout:
from db2pq import wrds_sql_to_pq
wrds_sql_to_pq(
sql="""
SELECT gvkey, datadate, fyear, at
FROM comp.funda
WHERE indfmt = 'INDL'
AND datafmt = 'STD'
AND consol = 'C'
AND popsrc = 'D'
AND fyear BETWEEN 1991 AND 2008
""",
table_name="funda",
schema="comp",
alt_table_name="funda_filtered",
)
Common options on this path:
- where to filter rows before export
- keep/drop to select columns
- alt_table_name to rename the Parquet file
- archive=True to archive replaced files
- use_sas=True when you want SAS-based metadata for the update check
If WRDS_ID is missing, db2pq will ask for it and suggest adding
WRDS_ID=... to your project's .env file. If no matching WRDS password is
found in .pgpass, db2pq will prompt for your WRDS PostgreSQL password and
store it for next time.
2) WRDS to PostgreSQL
Use this path when you want to materialize WRDS tables into a local PostgreSQL database first.
wrds_update_pg() is the main helper here:
from db2pq import wrds_update_pg
wrds_update_pg(
table_name="dsi",
schema="crsp",
dbname="research",
)
This path is useful when you want to:
- keep a local PostgreSQL mirror of selected WRDS tables
- build indexes or comments locally with process_sql() and set_table_comment()
- decouple downstream Parquet generation from live WRDS access
Related helpers:
- wrds_get_tables(schema, wrds_id=None) to list WRDS tables
- process_sql(sql, ...) to run SQL against the destination PostgreSQL database
- set_table_comment(...) to manage destination table comments
3) Local PostgreSQL to Parquet
Use this path when the source table is already in your local PostgreSQL
database, including tables created by wrds_update_pg().
pg_update_pq() mirrors the behavior of wrds_update_pq() but reads from the
local PostgreSQL source:
from db2pq import pg_update_pq
pg_update_pq(
table_name="dsi",
schema="crsp",
database="research",
)
This path relies on the local PostgreSQL table comment for change detection.
If the table has no parseable last-modified comment, pg_update_pq() will not
export by default and will tell you to use force=True.
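The change-detection rule behind this path amounts to a timestamp comparison. The sketch below is illustrative only; needs_update is a hypothetical helper, while the real implementation parses the table comment and the Parquet file's embedded metadata:

```python
from datetime import datetime, timezone


def needs_update(source_modified, pq_modified, force=False):
    """Sketch of the update check: export when the source is newer.

    source_modified: datetime parsed from the source table comment,
    or None if the comment is missing or unparseable.
    pq_modified: datetime from the existing Parquet file's metadata,
    or None if no file exists yet.
    """
    if force:
        return True
    if source_modified is None:
        # No parseable last-modified comment: refuse to export by default.
        raise ValueError("No last-modified metadata on source; use force=True")
    if pq_modified is None:
        return True  # no Parquet file yet, so export
    return source_modified > pq_modified
```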
Use db_to_pq() when you want a direct local PostgreSQL to Parquet export
without the update check:
from db2pq import db_to_pq
pq_file = db_to_pq(
table_name="my_table",
schema="public",
host="localhost",
database="mydb",
engine="adbc",
numeric_mode="float64",
)
print(pq_file)
engine="duckdb" is the recommended default. engine="adbc" is available,
but ADBC-based paths should currently be treated as experimental.
Use db_schema_to_pq() when you want to export an entire local PostgreSQL
schema:
from db2pq import db_schema_to_pq
files = db_schema_to_pq(schema="public")
print(files)
Other export paths
ibis_to_pq() remains available when your starting point is an Ibis table
expression rather than a table name:
import ibis

from db2pq import ibis_to_pq

con = ibis.postgres.connect(database="mydb")  # any Ibis PostgreSQL connection
expr = con.table("my_table").filter(lambda t: t.id > 100)
ibis_to_pq(expr, "my_table.parquet", engine="duckdb", compression="zstd")
Parquet layout
Files are organized as:
<DATA_DIR>/<schema>/<table>.parquet
For example:
/data/crsp/dsi.parquet
When archive=True, replaced files are moved under:
<DATA_DIR>/<schema>/<archive_dir>/<table>_<timestamp>.parquet
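Under these conventions, the live and archive paths could be derived as below. pq_path and archive_path are hypothetical helpers shown only to illustrate the layout, and the timestamp format is an assumption:

```python
from datetime import datetime, timezone
from pathlib import Path


def pq_path(data_dir, schema, table):
    """Live Parquet file: <DATA_DIR>/<schema>/<table>.parquet."""
    return Path(data_dir) / schema / f"{table}.parquet"


def archive_path(data_dir, schema, table, archive_dir="archive", when=None):
    """Path a replaced file is moved to when archive=True."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%d%H%M%S")  # illustrative timestamp format
    return Path(data_dir) / schema / archive_dir / f"{table}_{stamp}.parquet"
```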
How it works
At a high level, the core PostgreSQL-to-Parquet flow has three stages:
1. Query planning: db2pq inspects PostgreSQL metadata, applies keep/drop, normalizes user-supplied col_types, handles timestamp conversion rules, and builds a SQL SELECT.
2. Query execution: the planned query is executed through either:
   - engine="duckdb": DuckDB reads PostgreSQL and produces Arrow output
   - engine="adbc": the PostgreSQL ADBC driver streams Arrow record batches directly
3. Parquet writing: PyArrow writes the resulting Arrow batches/tables to Parquet, normalizing timestamps, repairing eligible decimal columns on the ADBC path, and buffering row groups with both row-count and byte-size limits.
This means the main export helpers share the same SQL-planning logic even when they use different execution engines.
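The dual row-count/byte-size buffering in the writing stage can be sketched as follows. flush_points is a hypothetical function operating on (rows, nbytes) pairs; the real writer buffers Arrow record batches:

```python
def flush_points(batches, max_rows, max_bytes):
    """Yield buffered groups of (rows, nbytes) batches, flushing a row group
    when adding the next batch would exceed either limit."""
    buf, rows, nbytes = [], 0, 0
    for batch_rows, batch_bytes in batches:
        if buf and (rows + batch_rows > max_rows or nbytes + batch_bytes > max_bytes):
            yield buf  # flush the current row group
            buf, rows, nbytes = [], 0, 0
        buf.append((batch_rows, batch_bytes))
        rows += batch_rows
        nbytes += batch_bytes
    if buf:
        yield buf  # final, possibly partial, row group
```

Whichever limit trips first ends the row group, which keeps both row counts and file-page sizes bounded.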
Engine defaults
"duckdb" remains the default engine. You can override it per call:
db_to_pq("dsi", "crsp", engine="adbc")
or set a process-wide default:
from db2pq import set_default_engine
set_default_engine("adbc")
You can inspect the current setting with get_default_engine(), and the
environment variable DB2PQ_ENGINE provides the same kind of session-level
default when you prefer configuration outside Python.
In general, DuckDB-backed paths are the stable default. ADBC-backed paths are
still experimental and may be faster or slower depending on the workload and
driver behavior. If you need the most reliable path, prefer engine="duckdb".
Public API
From db2pq:
- db_to_pq(table_name, schema, ...)
- pg_update_pq(table_name, schema, ...)
- ibis_to_pq(table, out_file, engine=None, **writer_kwargs)
- wrds_pg_to_pq(table_name, schema, ...)
- db_schema_to_pq(schema, ...)
- wrds_update_pq(table_name, schema, ...)
- wrds_update_schema(schema, ...)
- pq_list_files(schema, data_dir=None, archive=False, archive_dir=None)
- pq_last_modified(table_name=None, schema=None, data_dir=None, file_name=None, archive=False, archive_dir="archive")
- pq_archive(table_name=None, schema=None, data_dir=None, file_name=None, archive_dir=None)
- pq_restore(file_basename, schema, data_dir=None, archive=True, archive_dir=None)
- pq_remove(table_name=None, schema=None, data_dir=None, file_name=None, archive=False, archive_dir="archive")
- db_schema_tables(schema, ...)
- wrds_get_tables(schema, wrds_id=None)
- wrds_update_pg(table_name, schema, ...)
- pq_to_pg(table_name, schema, ...)
- pq_update_pg(table_name, schema, ...)
- process_sql(sql, ...)
- set_table_comment(schema=..., table_name=..., comment=..., ...)
- get_wrds_username(wrds_id=None)
- get_wrds_conninfo(username=None)
- find_pgpass_entry(conninfo, **kwargs)
- has_pgpass_password(conninfo, **kwargs)
- save_password(conninfo, password=None, **kwargs)
- ensure_wrds_credentials(wrds_id=None, interactive=True)
- set_default_engine(engine)
- get_default_engine()
- close_adbc_cached()
The main workflow entry points are:
- WRDS to Parquet: wrds_update_pq(), wrds_pg_to_pq()
- WRDS to PostgreSQL: wrds_update_pg()
- Local PostgreSQL to Parquet: pg_update_pq(), db_to_pq(), db_schema_to_pq()
- Parquet to PostgreSQL: pq_update_pg(), pq_to_pg()
wrds_update_pq() supports SQL-style filtering via where, for example:
wrds_update_pq("funda", "comp", where="indfmt = 'INDL' AND datafmt = 'STD'")
pg_update_pq() provides the same update-to-parquet pattern for a local
PostgreSQL source table, using the local table comment for change detection.
pq_update_pg() provides the reverse update path, from a Parquet file in the
local repository into PostgreSQL, using Parquet last_modified metadata for
change detection.
Notes
- WRDS PostgreSQL access uses host wrds-pgdata.wharton.upenn.edu and port 9737.
- batched=True (default) lowers memory usage for large tables.
- engine="adbc" streams Arrow record batches directly from PostgreSQL into Parquet and may reduce RAM use versus the default DuckDB path.
- On the ADBC path, numeric_mode="text" casts PostgreSQL NUMERIC columns to TEXT, and numeric_mode="float64" casts them to DOUBLE PRECISION. numeric_mode="decimal" transports them as TEXT and converts eligible columns back to Arrow decimals using PostgreSQL precision/scale metadata; columns without usable metadata remain strings. col_types still takes precedence over the mode.
- col_types can be used to cast selected columns before writing Parquet.
- keep/drop accept regex pattern(s) in both wrds_update_pq() and wrds_update_pg(). If both are supplied, drop is applied before keep.
- tz defaults to "UTC" in both update paths: wrds_update_pq() uses it to interpret naive source timestamps before writing timezone-aware UTC Parquet timestamps; wrds_update_pg() converts source timestamp without time zone columns using AT TIME ZONE '<tz>'.
- process_sql() executes SQL against the PostgreSQL destination resolved from the same PGUSER/PGHOST/PGDATABASE/PGPORT defaults used by wrds_update_pg(). set_table_comment() can be used the same way for destination PostgreSQL table comments without passing an engine argument.
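The drop-before-keep ordering can be sketched as follows. select_columns is a hypothetical helper shown with single-pattern arguments; the real keep/drop options also accept lists of patterns:

```python
import re


def select_columns(columns, keep=None, drop=None):
    """Sketch: apply drop first, then keep, each a regex matched
    against column names."""
    if drop is not None:
        columns = [c for c in columns if not re.search(drop, c)]
    if keep is not None:
        columns = [c for c in columns if re.search(keep, c)]
    return columns
```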
Development
For development, install in editable mode from a clone of this repository:
pip install -e .
With optional SAS dependency:
pip install -e ".[sas]"
Project docs
- Docs index: docs/README.md
- Contributor guide: CONTRIBUTING.md
- Release process: RELEASING.md
- Changelog: CHANGELOG.md
License
MIT License. See LICENSE.