Nanobind/C++ parsers for Polygon, bulk S3, and WebSocket market data.

massive-speedup

Native C++/nanobind readers for Polygon/Massive flat-file market data.

See INSTALL.md for installation details and DEVELOPMENT.md for release and PyPI publishing notes.

CSV Gzip Files

Install/build the native extension:

pip3 install -e .

Iterate parsed records directly from a .csv.gz file:

import massive_speedup

for trade in massive_speedup.FlatFiles.Stock.Trade.parse("trades.csv.gz"):
    print(trade.ticker, trade.sip_timestamp, trade.price)

for quote in massive_speedup.FlatFiles.Stock.Quote.parse("quotes.csv.gz"):
    print(quote.ticker, quote.bid_price, quote.ask_price)

for quote in massive_speedup.FlatFiles.currency.Quote.parse("currency_quotes.csv.gz"):
    print(quote.ticker, quote.participant_timestamp)

You can also iterate raw CSV fields as bytes tuples:

for row in massive_speedup.FlatFiles.Stock.Trade.parse_raw("trades.csv.gz"):
    print(row[0], row[8])
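
For comparison, the shape of "raw CSV fields as bytes tuples" can be reproduced with the standard library alone. This is only a sketch and not how the native parser works: iter_raw_rows is a hypothetical helper, and it assumes that the header row is skipped and that no field contains a quoted comma.

```python
import gzip
from typing import Iterator, Tuple


def iter_raw_rows(path: str, skip_header: bool = True) -> Iterator[Tuple[bytes, ...]]:
    """Yield each CSV row of a gzip file as a tuple of bytes fields.

    Assumption: fields contain no quoted commas, so a plain split is enough.
    """
    with gzip.open(path, "rb") as handle:
        if skip_header:
            next(handle, None)  # discard the header line if there is one
        for line in handle:
            line = line.rstrip(b"\r\n")
            if line:  # ignore blank lines
                yield tuple(line.split(b","))
```

Fields stay as bytes, so numeric columns still need an explicit conversion such as float(row[1]).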

Record Access

Parsed records expose read-only attributes and are iterable in CSV field order:

trade = next(massive_speedup.FlatFiles.Stock.Trade.parse("trades.csv.gz"))

print(trade.ticker)
print(trade.conditions)
print(trade.sip_timestamp)
print(trade.pack())
print(list(trade))

Packed records do not include the ticker. Reconstruct with the ticker from the file name:

packed = trade.pack()
trade2 = massive_speedup.StockTrade.from_packed(packed, trade.ticker)

Window Aggregation

The native aggregators consume iterables of parsed records and yield C++ result objects exposed through nanobind. Result attributes are read-only and lazily converted to Python objects on first access. The aggregation interval and offset are expressed in seconds; the returned window_start is still nanoseconds since epoch.

import massive_speedup

trades = massive_speedup.FlatFiles.Stock.Trade.parse("trades.csv.gz")

for bar in massive_speedup.FlatFiles.Stock.Trade.Aggregator(
    trades,
    interval_seconds=60,
):
    print(
        bar.ticker,
        bar.window_start,
        bar.open,
        bar.close,
        bar.high,
        bar.low,
        bar.avg,
        bar.volume_weighted_avg,
        bar.volume,
        bar.transactions,
        bar.stddev,
    )

Available aggregators:

  • massive_speedup.StockTradeAggregator / FlatFiles.Stock.Trade.Aggregator
  • massive_speedup.StockQuoteAggregator / FlatFiles.Stock.Quote.Aggregator
  • massive_speedup.CurrencyQuoteAggregator / FlatFiles.currency.Quote.Aggregator

Stock trades aggregate price and use size for volume and volume_weighted_avg. Stock quotes aggregate ask and bid prices separately and use ask/bid sizes for ask/bid volume-weighted averages. Currency quotes aggregate ask and bid prices separately and omit volume and volume-weighted averages because the source rows have no size field.

quotes = massive_speedup.StockQuoteDatabase("/data/massive-db", "2026-01-23", "A")

for quote_bar in massive_speedup.StockQuoteAggregator(
    quotes,
    interval_seconds=1,
    offset_seconds=0,
):
    print(quote_bar.ask_open, quote_bar.ask_close, quote_bar.bid_avg)

Aggregators stream consecutive (ticker, window_start) groups. Use input ordered by ticker and timestamp, such as the native database iterators or default Massive/Polygon flat-file order. stddev is population standard deviation.
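
The streaming behavior described above can be sketched in pure Python. This is an illustration under stated assumptions, not the native implementation: the Trade tuple, window_start, and aggregate are hypothetical names, the bucketing rule (windows begin at offset + k * interval) is assumed, and falling back to avg when a window has zero volume is a choice made only for this sketch.

```python
import itertools
import math
from typing import Iterable, Iterator, NamedTuple

NS_PER_SECOND = 1_000_000_000


class Trade(NamedTuple):  # hypothetical stand-in for a parsed trade record
    ticker: str
    sip_timestamp: int  # nanoseconds since epoch
    price: float
    size: int


def window_start(ts_ns: int, interval_seconds: int, offset_seconds: int = 0) -> int:
    """Assumed bucketing rule: windows begin at offset + k * interval."""
    interval_ns = interval_seconds * NS_PER_SECOND
    offset_ns = offset_seconds * NS_PER_SECOND
    return (ts_ns - offset_ns) // interval_ns * interval_ns + offset_ns


def aggregate(
    trades: Iterable[Trade], interval_seconds: int, offset_seconds: int = 0
) -> Iterator[dict]:
    def key(trade: Trade):
        return trade.ticker, window_start(
            trade.sip_timestamp, interval_seconds, offset_seconds
        )

    # groupby merges only consecutive equal keys, hence the ordering requirement.
    for (ticker, start), group in itertools.groupby(trades, key=key):
        rows = list(group)
        prices = [t.price for t in rows]
        volume = sum(t.size for t in rows)
        avg = sum(prices) / len(prices)
        weighted = sum(t.price * t.size for t in rows)
        yield {
            "ticker": ticker,
            "window_start": start,  # nanoseconds since epoch
            "open": prices[0],
            "close": prices[-1],
            "high": max(prices),
            "low": min(prices),
            "avg": avg,
            "volume_weighted_avg": weighted / volume if volume else avg,
            "volume": volume,
            "transactions": len(rows),
            # Population standard deviation: divide by N, not N - 1.
            "stddev": math.sqrt(sum((p - avg) ** 2 for p in prices) / len(prices)),
        }
```

If the input were not sorted by ticker and timestamp, the same (ticker, window_start) key could appear in several non-adjacent runs and produce more than one bar for that window.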

Build Database Files

Build fixed-length binary database files from one or more input .csv.gz files:

massive-speedup-build-database --database /data/massive-db 2026-01-23.csv.gz

The input type is inferred from the CSV header. Output layout is:

{database}/{stock_trade|stock_quote|currency_quote}/{YYYY-MM-DD}/{ticker}

Existing ticker files are not overwritten by default. When a ticker file already exists, the builder keeps reading the input until the next ticker begins, so only missing ticker files are written. Use --force to rebuild existing ticker files, which is useful after a binary record format change:

massive-speedup-build-database --force --database /data/massive-db 2026-01-23.csv.gz

Date-level idempotency uses an .incomplete marker in {database}/{type}/{YYYY-MM-DD}. If the date directory exists without .incomplete, the input file is skipped. If the directory is new, .incomplete is created before processing and removed only after successful completion. Use --force to process a date even when its directory exists and .incomplete is absent.
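
The marker protocol can be sketched with pathlib. This is a minimal illustration, not the builder's code: process_date and its build callback are hypothetical, and re-running a date whose directory still contains .incomplete is an assumption about how an interrupted build is resumed.

```python
from pathlib import Path
from typing import Callable


def process_date(
    date_dir: Path, build: Callable[[Path], None], force: bool = False
) -> bool:
    """Run build(date_dir) under an .incomplete marker; return True if it ran."""
    marker = date_dir / ".incomplete"
    if date_dir.exists() and not marker.exists() and not force:
        return False  # a finished date: skip the input file
    date_dir.mkdir(parents=True, exist_ok=True)
    marker.touch()  # created before any ticker file is written
    build(date_dir)  # if this raises, the marker stays behind
    marker.unlink()  # removed only after successful completion
    return True
```

Because the marker is removed last, a crash at any point leaves the date visibly unfinished, and the next run picks it up again.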

Use --benchmark to print throughput:

massive-speedup-build-database --benchmark --database /data/massive-db *.csv.gz

Database Files

Open a fixed-length binary file through mmap and iterate records:

records = massive_speedup.StockTradeDatabase(
    "/data/massive-db",
    "2026-01-23",
    "A",
)

for trade in records:
    print(trade.sip_timestamp, trade.price)
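
Fixed-length records are what make mmap access cheap: record i lives at byte offset i * record_size, so no index structure is needed. The sketch below shows the idea with the standard library. The record layout (int64 timestamp, float64 price, uint32 size) and the FixedRecordFile class are hypothetical and do not describe the actual on-disk format.

```python
import mmap
import struct
from typing import Tuple

# Hypothetical layout: int64 timestamp (ns), float64 price, uint32 size.
RECORD = struct.Struct("<qdI")


class FixedRecordFile:
    """Read-only random access to fixed-length records through mmap."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self) -> int:
        return len(self._map) // RECORD.size

    def __getitem__(self, index: int) -> Tuple[int, float, int]:
        if index < 0:
            index += len(self)  # support negative indexes such as -1
        if not 0 <= index < len(self):
            raise IndexError(index)
        # The byte offset is a plain multiplication; nothing is scanned.
        return RECORD.unpack_from(self._map, index * RECORD.size)

    def close(self) -> None:
        self._map.close()
        self._file.close()
```

Iterating such an object works through __getitem__, and the operating system pages data in only as records are touched.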

Merge stock trades and quotes for one date and ticker in SIP timestamp order:

for trade, quote in massive_speedup.stock_trade_quote_timeline(
    "/data/massive-db",
    "2026-01-23",
    "A",
):
    if trade is not None:
        print("trade", trade.sip_timestamp, trade.price, quote)
    else:
        print("quote", quote.sip_timestamp, quote.bid_price, quote.ask_price)

Quote rows yield (None, current_quote). Trade rows yield (trade, last_quote), where last_quote is None until the first quote has appeared. When a trade and quote have the same SIP timestamp, the quote is yielded first.
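
These ordering rules can be reproduced in a few lines with heapq.merge. The sketch below uses hypothetical Trade and Quote tuples and a hypothetical timeline function; it mirrors the documented behavior (quote first on equal timestamps, last_quote is None until a quote appears) and makes no claim about the native implementation.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Trade(NamedTuple):  # hypothetical stand-in
    sip_timestamp: int
    price: float


class Quote(NamedTuple):  # hypothetical stand-in
    sip_timestamp: int
    bid_price: float
    ask_price: float


def timeline(
    trades: Iterable[Trade], quotes: Iterable[Quote]
) -> Iterator[Tuple[Optional[Trade], Optional[Quote]]]:
    # Tag quotes with 0 and trades with 1 so quotes sort first on equal timestamps.
    tagged_quotes = ((q.sip_timestamp, 0, q) for q in quotes)
    tagged_trades = ((t.sip_timestamp, 1, t) for t in trades)
    last_quote = None
    merged = heapq.merge(tagged_quotes, tagged_trades, key=lambda item: item[:2])
    for _, kind, record in merged:
        if kind == 0:
            last_quote = record
            yield None, record  # quote row
        else:
            yield record, last_quote  # trade row with the prevailing quote
```

Both inputs must already be sorted by SIP timestamp, since heapq.merge only interleaves sorted streams and does not sort them.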

Database files support indexing and timestamp search:

first = records[0]
last = records[-1]

index = records.index_before_timestamp(1769161728012983416)
near_open = records.index_before_timestamp(1769161728012983416, galloping=0)
next_index = records.index_after_timestamp(1769161728012983416, galloping=index + 1)
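
The exact meaning of the galloping argument is not spelled out here. The calls above pass an index, which suggests a starting point for a galloping (exponential) search, but that reading is an inference. As background, the sketch below shows the general technique on a sorted list of timestamps. gallop_right is a hypothetical helper that returns the same value as bisect.bisect_right while probing outward from a hint in doubling steps.

```python
import bisect
from typing import Sequence


def gallop_right(timestamps: Sequence[int], target: int, hint: int = 0) -> int:
    """Return bisect_right(timestamps, target), searching outward from hint."""
    n = len(timestamps)
    if n == 0:
        return 0
    hint = min(max(hint, 0), n - 1)  # clamp the hint into range
    step = 1
    if timestamps[hint] <= target:
        # Gallop forward: lo always points at an element <= target.
        lo, hi = hint, hint + 1
        while hi < n and timestamps[hi] <= target:
            lo, hi = hi, hi + step
            step *= 2
        return bisect.bisect_right(timestamps, target, lo + 1, min(hi, n))
    # Gallop backward: hi always points at an element > target.
    lo, hi = hint - 1, hint
    while lo >= 0 and timestamps[lo] > target:
        hi, lo = lo, lo - step
        step *= 2
    return bisect.bisect_right(timestamps, target, max(lo, 0), hi)
```

When the answer lies close to the hint, the search touches only a handful of elements, which is why a hint such as the previous result helps when scanning forward through a file. Under this sketch's convention, gallop_right(...) - 1 is the last index whose timestamp is less than or equal to the target.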

Timestamp arguments are nanoseconds since epoch. Database readers also accept datetime.time values, which are resolved using the reader's date:

import datetime as dt

index = records.index_before_timestamp(dt.time(9, 30))
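
Resolving a datetime.time against a date amounts to combining the two and counting nanoseconds from the epoch. The sketch below shows one way to do that with exact integer arithmetic. time_to_epoch_ns is a hypothetical helper, and the timezone is an explicit parameter because the timezone the reader applies is not stated here.

```python
import datetime as dt

NS_PER_SECOND = 1_000_000_000
EPOCH = dt.datetime(1970, 1, 1, tzinfo=dt.timezone.utc)


def time_to_epoch_ns(
    date: dt.date, time: dt.time, tz: dt.tzinfo = dt.timezone.utc
) -> int:
    """Combine a date and a wall-clock time in tz into nanoseconds since epoch."""
    moment = dt.datetime.combine(date, time, tzinfo=tz)
    delta = moment - EPOCH
    # Integer arithmetic avoids the float rounding of datetime.timestamp().
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * NS_PER_SECOND + delta.microseconds * 1_000
```

For example, time_to_epoch_ns(dt.date(2026, 1, 23), dt.time(9, 30), dt.timezone(dt.timedelta(hours=-5))) gives the instant of 09:30 at a fixed UTC-5 offset; zoneinfo.ZoneInfo("America/New_York") could be passed instead to follow daylight saving rules.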

Find the closest record before or after a participant timestamp:

before = records.find_before_participant_timestamp(
    1769161728012624580,
)
after = records.find_after_participant_timestamp(
    1769161728012624580,
    fuzz=250_000_000,
    galloping=True,
)
strict_before = records.find_before_participant_timestamp(
    1769161728012624580,
    on=False,
)

find_before_participant_timestamp returns the record with the highest participant timestamp less than or equal to the target. find_after_participant_timestamp returns the record with the lowest participant timestamp greater than or equal to the target. Set on=False for strict < or > comparisons. fuzz is a nanosecond scan window around the searched timestamp and defaults to one second (1_000_000_000). Both methods return records, not indexes.
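
The comparison rules can be made concrete with a small reference sketch. Everything in it is illustrative: Record, find_before, and find_after are hypothetical names, the linear scan is chosen for clarity rather than speed, and applying fuzz to the SIP timestamp (the order in which records are stored) is an assumption about what the scan window means.

```python
from typing import Iterable, NamedTuple, Optional


class Record(NamedTuple):  # hypothetical stand-in for a database record
    sip_timestamp: int
    participant_timestamp: int


def _in_window(record: Record, target: int, fuzz: int) -> bool:
    # Assumption: the scan window is measured on the SIP timestamp.
    return abs(record.sip_timestamp - target) <= fuzz


def find_before(
    records: Iterable[Record], target: int, fuzz: int = 1_000_000_000, on: bool = True
) -> Optional[Record]:
    """Highest participant timestamp <= target (or < target when on=False)."""
    candidates = [
        r for r in records
        if _in_window(r, target, fuzz)
        and (r.participant_timestamp <= target if on else r.participant_timestamp < target)
    ]
    return max(candidates, key=lambda r: r.participant_timestamp, default=None)


def find_after(
    records: Iterable[Record], target: int, fuzz: int = 1_000_000_000, on: bool = True
) -> Optional[Record]:
    """Lowest participant timestamp >= target (or > target when on=False)."""
    candidates = [
        r for r in records
        if _in_window(r, target, fuzz)
        and (r.participant_timestamp >= target if on else r.participant_timestamp > target)
    ]
    return min(candidates, key=lambda r: r.participant_timestamp, default=None)
```

A scan window is needed at all because participant timestamps are not guaranteed to be monotonic in a file ordered by SIP timestamp, so a plain binary search on them would not be reliable.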

Stock database readers also expose NYSE market session timestamps in nanoseconds:

print(records.market_open)
print(records.market_close)
