Skip to main content

Efficient queries of trade and book data in bars. Supports the WRDS TAQ millisecond database.

Project description

TAQY

The taqy project is Python library for accessing and summarizing time-and-quote information. Its initial remit is to generate more manageable “bars” from Wharton Research Data Services (WRDS) subsecond-level trade, bid, and offer US equity data based on the NYSE TAQ.

Brian K. Boonstra, 2025

Motivation

Downloading full market data (hundreds of millions of rows) is often impractical. Instead, this repository employs an approach to generate price “bars” on the server side using SQL queries. While many people and LLMs can write SQL after a fashion, our particular use case demands somewhat carefully architected SQL. To get decent performance we must take advantage of row counting tricks and the like. Rather than expecting academics and nonexperts learn SQL to that level, I provide this library.

The essence of this library is expressed in two functions, described in greater detail below:

• taq_nbbo_bars_on_date()
• taq_trade_bars_on_date()

For more information on the WRDS data service, please see https://wrds-www.wharton.upenn.edu.

For more detailed information about the NYSE TAQ data set, please refer to the official documentation:
https://www.nyse.com/publicdocs/nyse/data/Daily_TAQ_Client_Spec_v3.0.pdf

Installation

You can install this library with pip install taqy. A conda version is in the works.

Usage

General Commentary

The most important thing to know here is that our data from WRDS is not dividend and split adjusted. That is standard for intraday data sets, but means you will need to make adjustments if you are joining this data with commonly used adjusted open/close data sets.

Though WRDS does provide information about trading halts and limit-up/limit-down conditions, there is no support (at present) for making use of that information in taqy itself. We also do not presently filter on quote conditions found in the qu_cond field.

Please note that windows without any data will be absent from the dataframes you obtain here. If you have 1 minute bars and PBPB did not trade for 6 minutes, you will have 6 "missing" bars in the output.

Though the routines here are as efficient as I can make them, calls usually take many seconds to complete. Though this API allows for many tickers and for bar windows of just one minute, you will generally find that big data queries are still cumbersome or subject to timeouts.

At present, we only support tickers without suffixes. Enhancing the code to support suffixes is not hard, so it will probably be finished some day.

WRDS began providing nanosecond-level detail in its timestamps a few years ago. taqy incorporates the nanosecond information when it is available.

Connecting to WRDS

In order access the WRDS TAQ database, you need to have a valid WRDS account. This will give you API access via SQL calls to their PostgreSQL server. Doing so involves having a "Connection" object available in your Python session.

For convenience, you can manage your own database connection object and use the wrds_db argument, or you can use the lightweight wrapper provided by taqy, making database connections a little more convenient. In the examples below, I presume you have done the latter. Here is how you start it:

from taqy.usequity import connect_to_wrds

connect_to_wrds()

NBBO Bars

National best bid and offer (NBBO) bars tell us, for each bar window timestamp, what the price and size of the best bid and offer were across all US exchanges. We also obtain the time of the most recent quote (though not in this case whether it came from a bid update or offer update).

At present, bar_minutes must either evenly divide 30, or it must be 60. In the case of 60 minute bars, one of the windows really only has half an hour of data since US trading hours are 09:30-16:00 New York time.

For convenience, the tickers argument may be a list or a single string containing the one ticker of interest.

import datetime
from taqy.usequity import taq_nbbo_bars_on_date

nbbo_bars = taq_nbbo_bars_on_date(
    tickers=['SPY', 'PBPB', 'HLIT'],
    date=datetime.date(2024,2,29),
    bar_minutes=60
).set_index(['ticker', 'date', 'window_time']).unstack(0)
best_bid best_bidsizeshares best_ask best_asksizeshares time_of_last_quote
ticker HLIT PBPB SPY HLIT PBPB SPY HLIT PBPB SPY HLIT PBPB SPY HLIT PBPB SPY
date window_time
2024-02-29 2024-02-29 10:00:00-05:00 13.26 13.74 508.49 600 100 200 13.27 13.78 508.5 300 100 400 2024-02-29 09:59:55.164173375-05:00 2024-02-29 09:59:59.670322279-05:00 2024-02-29 09:59:59.979579648-05:00
2024-02-29 11:00:00-05:00 13.13 13.75 507.44 500 300 200 13.14 13.8 507.45 500 400 300 2024-02-29 10:59:56.719858721-05:00 2024-02-29 10:58:52.825894793-05:00 2024-02-29 10:59:59.993879808-05:00
2024-02-29 12:00:00-05:00 13.01 13.81 506.65 700 400 300 13.02 13.84 506.66 800 200 500 2024-02-29 11:59:51.922733889-05:00 2024-02-29 11:59:43.408821488-05:00 2024-02-29 11:59:59.999457024-05:00
2024-02-29 13:00:00-05:00 13.05 13.84 507.36 1100 400 300 13.06 13.87 507.37 400 100 300 2024-02-29 12:59:51.822595396-05:00 2024-02-29 12:59:56.123094269-05:00 2024-02-29 12:59:59.995182080-05:00
2024-02-29 14:00:00-05:00 13.09 13.85 507.8 1000 400 200 13.1 13.88 507.81 1100 200 500 2024-02-29 13:59:59.601319642-05:00 2024-02-29 13:58:04.797251444-05:00 2024-02-29 13:59:59.947843072-05:00
2024-02-29 15:00:00-05:00 13.11 13.86 507.74 500 200 2200 13.12 13.88 507.75 700 200 300 2024-02-29 14:59:55.053310002-05:00 2024-02-29 14:59:50.046429293-05:00 2024-02-29 14:59:59.978044672-05:00
2024-02-29 16:00:00-05:00 13.13 13.88 507.99 28200 1000 108900 13.14 13.89 508.0 1100 300 53700 2024-02-29 15:59:59.836964072-05:00 2024-02-29 15:59:59.743245800-05:00 2024-02-29 15:59:59.999648512-05:00

Trade Bars

Trade bars tell us, for each bar window timestamp, simple statistics that occurred in the window preceding the timestamp.

It is optional to include information about the first and last trades within the window via include_first_and_last, since doing so gives us a noticeably more complex SQL query.

At present, bar_minutes must either evenly divide 30, or it must be 60. As with NBBO, one of the 60 minute bars will only come from 30 minutes of data.

For convenience, the tickers argument may be a list or a single string containing the one ticker of interest. Similarly, restrict_to_exchanges can be a string denoting a single exchange.

By default, trades will be summarized across all exchanges (or the ones specified by restrict_to_exchanges) in the dataset. By toggling group_by_exchange, your summary statistics will instead come back grouped by exchange.

Please note the exchange is provided by the column ex in the dataframe.

import datetime
from taqy.usequity import taq_trade_bars_on_date

trade_bars = taq_trade_bars_on_date(
    ['SPY', 'PBPB', 'HLIT'], 
    date=datetime.date(2024,2,29), 
    group_by_exchange=True,
    restrict_to_exchanges=('M','P'),
    include_first_and_last=True
).transpose()
ticker HLIT HLIT HLIT SPY SPY SPY
date 2024-02-29 00:00:00 2024-02-29 00:00:00 2024-02-29 00:00:00 2024-02-29 00:00:00 2024-02-29 00:00:00 2024-02-29 00:00:00
window_time 2024-02-29 10:00:00-05:00 2024-02-29 10:30:00-05:00 2024-02-29 11:00:00-05:00 2024-02-29 15:30:00-05:00 2024-02-29 16:00:00-05:00 2024-02-29 16:00:00-05:00
vwap 13.22803 13.133143 13.130639 507.914929 508.810897 509.067939
last_trade_price 13.25 13.1 13.13 508.14 509.39 508.0
last_trade_size 100 13 16 84 41 900
last_trade_time 2024-02-29 09:57:34.293454515-05:00 2024-02-29 10:29:17.202749055-05:00 2024-02-29 10:59:46.808839585-05:00 2024-02-29 15:29:59.696625664-05:00 2024-02-29 15:59:01.865104640-05:00 2024-02-29 15:59:59.988947456-05:00
num_trades 55 122 80 4800 21 27174
total_qty 2036 6593 6414 469365 524 5716498
mean_price_ignoring_size 13.226364 13.137541 13.134937 507.907977 508.63381 508.989108
median_size 25.0 33.5 56.5 100.0 5.0 100.0
median_price 13.23 13.12 13.14 507.89 508.58 508.97
median_notional 330.5 439.52 742.96 50780.0 2543.75 50910.0
min_price 13.19 13.09 13.08 507.65 508.15 507.94
max_price 13.27 13.28 13.17 508.19 509.44 509.74
min_size 1 1 1 1 1 1
max_size 100 600 500 1178 100 16041
first_trade_price 13.2 13.27 13.12 507.75 508.15 508.14
first_trade_size 100 200 22 100 1 93
first_trade_time 2024-02-29 09:30:00.092696390-05:00 2024-02-29 10:00:00.737432736-05:00 2024-02-29 10:30:11.467083039-05:00 2024-02-29 15:00:00.011126272-05:00 2024-02-29 15:30:03.499381248-05:00 2024-02-29 15:30:00.000087808-05:00
ex P P P P M P

Implementation Notes

WRDS Tables

The WRDS database has both yearly and daily tables. Given the density of data we deal with here, it makes the most sense to query the daily tables with loop rather than a giant query across many dates.

The tables WRDS provides include our main tables of interest complete_nbbo_* and ctm_*. Tables presently unused include cqm_*, which is all quotes (but not enough information to really "build the book"). nbbom_* is best bid and offer by exchange, luld_* is limit up limit down. mastm_* is the "master" table, including special information like trading halts.

', '.join([t for t in DEFAULT_WRDS_CONNECTION.list_tables(library='taqm_2024') if '2024' not in t or '20240124' in t or t.endswith('2024')])

complete_nbbo_2024, complete_nbbo_20240124, cqm_2024, cqm_20240124, ctm_2024, ctm_20240124, luld_cqm_2024, luld_cqm_20240124, luld_ctm_2024, luld_ctm_20240124, mastm_2024, mastm_20240124, nbbom_2024, nbbom_20240124, wct_2024, wct_20240124, wrds_iid_2024

Development Notes

Developer Usage

Developers may find themselves preferring to use the wrds_db argument in order to have better control over connectivity. The trick of a connection in a global variable in contrast is very convenient for researchers.

Any time you want to avoid using the database access cache, or clear out the cache, you can manipulate the taqy.usequity.CACHED_QUERIES variable.

Testing

Running Tests

You can run tests with pytest tests.

The testing routines here do not usually query WRDS. Instead they mock out the SQL queries with a substitute that queries the filesystem, specifically the entries in tests/mock_sql_responses. For someone who wanted to enhance taqy with customary usage of a filesystem cache, the code controlling this in conftest.py may provide useful guidance.

In order to update or add a test, you can run pytest with an --avoid-db-cache argument. Once satisfied with base DB access, you should then run with --write-db-cache.

Regression test output can be saved or updated with the argument --update-gold.

Consistency Checks

In addition to fairly standard regression tests, the tests here also include checks on consistency. Here, we check:

  • multi-ticker queries agree with single-ticker queries for NBBO bars
  • trade bars asking for fist and last trade information (complicated SQL) agree with trade bars not asking for extra information
  • trade bars that are K minutes agree with trade bars of length N*K minutes about total trade count, etc.

Choices of Statistics

The trade statistics chosen here would be quite easy to enhance with, e.g., deciles, or any other aggregate function natively supported by the WRDS PostgreSQL.

SQL Implementation

For a lot of people, it is easiest to use simple SQL queries and then employ pandas groupby() or similar routines to work out the bars. However, sending these large datasets over the internet makes that practice infeasible in our case.

Here, I make heavy use of common table expressions (WITH ...) to keep queries efficient and as clear as I can. The NBBO query uses a CTE that extracts row numbers in descending order within each window, which lets us identify the last quote before our window time by just choosing row number 1. S SELECT DISTINCT then allow the PostgreSQL query optimizer to short-circuit quickly with the data we need.

Because the trade bars want summary statistics over entire window interval, no such short circuit trick is feasible for trades and I just use GROUP BY. This is not as bad as it seems, since trade data is generally 10-100 times less voluminous than quote data.

When first and last trade information is requested (by the include_first_and_last flag), I again use windowed row numbering to find the relevant data, along with CTE expressions to construct the final query.

SQL Examples

Here are examples of SQL queries sent by taqy to WRDS:

SQL Example: NBBO Bars
            WITH windowable_nbbo AS (
                SELECT
                    sym_root AS ticker
                    , date
                    , time_m
                    , time_m_nano
                    , sym_root
                    , qu_cond
                    , best_bid
                    , best_bidsizeshares
                    , best_ask
                    , best_asksizeshares
                    , EXTRACT(HOUR FROM time_m) AS hour_of_day
                    , 1 * DIV(EXTRACT(MINUTE FROM time_m),1) AS minute_of_hour
                    , ROW_NUMBER() OVER (PARTITION BY sym_root, EXTRACT(HOUR FROM time_m), DIV(EXTRACT(MINUTE FROM time_m),1) ORDER BY time_m DESC) AS rownum
                FROM taqm_2024.complete_nbbo_20240229
                WHERE 1=1
                  AND sym_root = 'LLY'
                  AND sym_suffix IS NULL
                  AND time_m > '09:30:00' AND time_m < '16:00:00'
            )
            SELECT DISTINCT ON (ticker, date, hour_of_day, minute_of_hour)
                ticker
                , date
                , date + (EXTRACT(HOUR FROM time_m) || ':' || 1 * DIV(EXTRACT(MINUTE FROM time_m),1))::interval + ( '00:1' )::interval AS window_time
                , best_bid
                , best_bidsizeshares
                , best_ask
                , best_asksizeshares
                , time_m AS time_of_last_quote
                , time_m_nano AS time_of_last_quote_ns
            FROM windowable_nbbo
            WHERE windowable_nbbo.rownum = 1
SQL Example: Trade Bars
            WITH 
              windowable_trades AS (
                SELECT
                    sym_root AS ticker
                    , date
                    , time_m
                    , time_m_nano
                    , sym_root
                    , EXTRACT(HOUR FROM time_m) AS hour_of_day
                    , 30 * DIV(EXTRACT(MINUTE FROM time_m),30) AS minute_of_hour
                    , ROW_NUMBER() OVER (PARTITION BY sym_root, date, EXTRACT(HOUR FROM time_m), DIV(EXTRACT(MINUTE FROM time_m),30), ex ORDER BY time_m DESC) AS rownum
                    , ROW_NUMBER() OVER (PARTITION BY sym_root, date, EXTRACT(HOUR FROM time_m), DIV(EXTRACT(MINUTE FROM time_m),30), ex ORDER BY time_m ASC) AS asc_rownum
                    , price
                    , size
                    , ex
                FROM taqm_2024.ctm_20240229
                WHERE ex IN ('M', 'P')
                  AND sym_root IN ('SPY', 'PBPB', 'HLIT')
                  AND sym_suffix IS NULL
                  AND time_m > '09:30:00' AND time_m < '16:00:00'
            ),
            last_trades AS (
              SELECT DISTINCT ON (ticker, date, hour_of_day, minute_of_hour, ex)
                    ticker
                    , date
                    , date + (EXTRACT(HOUR FROM time_m) || ':' || 30 * DIV(EXTRACT(MINUTE FROM time_m),30))::interval + ( '00:30' )::interval AS window_time
                    , time_m AS last_trade_time
                    , time_m_nano AS last_trade_time_ns
                    , price AS last_trade_price
                    , size as last_trade_size
                    , ex AS last_trade_ex
                FROM windowable_trades
                WHERE windowable_trades.rownum = 1
            ),
            first_trades AS (
              SELECT DISTINCT ON (ticker, date, hour_of_day, minute_of_hour, ex)
                    ticker
                    , date
                    , date + (EXTRACT(HOUR FROM time_m) || ':' || 30 * DIV(EXTRACT(MINUTE FROM time_m),30))::interval + ( '00:30' )::interval AS window_time
                    , time_m AS first_trade_time
                    , time_m_nano AS first_trade_time_ns
                    , price AS first_trade_price
                    , size as first_trade_size
                    , ex AS first_trade_ex
                FROM windowable_trades
                WHERE windowable_trades.asc_rownum = 1
            ),
            trade_stats_in_bar AS (
                SELECT
                    sym_root AS ticker
                    , date
                    , date + (EXTRACT(HOUR FROM time_m) || ':' || 30 * DIV(EXTRACT(MINUTE FROM time_m),30))::interval + ( '00:30' )::interval AS window_time
                    , COUNT(size) AS num_trades
                    , SUM(size) AS total_qty
                    , SUM(price * size) / SUM(size) AS vwap
                    , AVG(price) AS mean_price_ignoring_size
                    , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY size) AS median_size
                    , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
                    , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price*size) AS median_notional
                    , MAX(price) AS max_price
                    , MIN(price) AS min_price
                    , MAX(size) AS max_size
                    , MIN(size) AS min_size
                    , ex
                FROM taqm_2024.ctm_20240229 
                WHERE ex IN ('M', 'P')
                  AND sym_root IN ('SPY', 'PBPB', 'HLIT')
                  AND sym_suffix IS NULL
                  AND time_m > '09:30:00' AND time_m < '16:00:00'
                GROUP BY
                  sym_root, date, EXTRACT(HOUR FROM time_m), DIV(EXTRACT(MINUTE FROM time_m),30), ex
            )
            SELECT
                  trade_stats_in_bar.ticker
                , trade_stats_in_bar.date
                , trade_stats_in_bar.window_time
                , vwap
                , last_trade_price
                , last_trade_size
                , last_trade_time
                , num_trades
                , total_qty
                , mean_price_ignoring_size
                , median_size
                , median_price
                , median_notional
                , min_price
                , max_price
                , min_size
                , max_size
                , first_trade_price
                , first_trade_size
                , first_trade_time
                , first_trade_time_ns
                , last_trade_time_ns
                , ex
            FROM trade_stats_in_bar 
                JOIN first_trades 
                  ON  trade_stats_in_bar.ex=first_trades.first_trade_ex
                  AND trade_stats_in_bar.ticker=first_trades.ticker 
                  AND trade_stats_in_bar.date=first_trades.date 
                  AND trade_stats_in_bar.window_time=first_trades.window_time
                JOIN last_trades 
                  ON  trade_stats_in_bar.ex=last_trades.last_trade_ex
                  AND trade_stats_in_bar.ticker=last_trades.ticker 
                  AND trade_stats_in_bar.date=last_trades.date 
                  AND trade_stats_in_bar.window_time=last_trades.window_time
            ORDER BY trade_stats_in_bar.ticker, trade_stats_in_bar.date, trade_stats_in_bar.window_time, trade_stats_in_bar.ex

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

taqy-0.42.4.tar.gz (28.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

taqy-0.42.4-py3-none-any.whl (25.0 kB view details)

Uploaded Python 3

File details

Details for the file taqy-0.42.4.tar.gz.

File metadata

  • Download URL: taqy-0.42.4.tar.gz
  • Upload date:
  • Size: 28.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for taqy-0.42.4.tar.gz
Algorithm Hash digest
SHA256 a8a7ca666fe2ed7733a65dcda772669dda725fcb35a17ab5cc870f937a1beaf1
MD5 44684aba19e3f48c9f9fed630cc078a6
BLAKE2b-256 33d8391b032006aa5956a5dbf6091eb2732d70944579bd42dc82049470d5fa84

See more details on using hashes here.

File details

Details for the file taqy-0.42.4-py3-none-any.whl.

File metadata

  • Download URL: taqy-0.42.4-py3-none-any.whl
  • Upload date:
  • Size: 25.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for taqy-0.42.4-py3-none-any.whl
Algorithm Hash digest
SHA256 7cc8e7f6fabe4f7969321aa6e35507c3af7d4a5333aece0e8c4b272b9643e3d6
MD5 5344da95d0ec50995b1174dfbd139506
BLAKE2b-256 7acb5a58019502744220cf10c6f7ebbe6e8822f715de307c154b3e350578d273

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page