
This repository is now packaged as the featureSQL Python module. You can install it in one of two ways:

  1. From PyPI:

    pip install featureSQL
    

    This makes the featureSQL console script available on your PATH and lets you import featureSQL from any Python program.

  2. From source (development mode):

    cd /path/to/featureSQL
    pip install -e .
    

    This installs the package in editable mode, so local edits are reflected immediately without reinstalling. Useful when working on the project.

After installation you can still use the original CLI helpers directly by importing:

from featureSQL.cli import Run

and invoking Run().download(...), or by running featureSQL at the shell. The remainder of this document is a how-to reference that you can include in your own projects; it covers the most common workflows.
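
If you prefer to stay in Python, here is a minimal sketch of the same workflow. The keyword arguments are assumed to mirror the CLI flags shown in the sections below; check help(Run.download) for the actual signature:

from featureSQL.cli import Run

# hedged sketch: the argument names mirror the CLI flags and may differ slightly
runner = Run()
runner.download(
    region="US",
    start="2026-01-01",
    end="2026-02-28",
    symbols="AAPL,AMZN",    # a comma-separated string or a Python list/tuple (see Notes & tips)
    data_path="./source",
    store_type="fs",
)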


1. Download a list of symbols (OCHLVF) to CSV

# 1. prepare a text file listing tickers, one per line
cat > .testsymbols.txt <<'EOF'
AAPL
AMZN
GOOG
TSLA
EOF

# 2. run the collector, writing CSVs into a directory (they’ll land in
#    the `feature-csv` subfolder of whatever you pass to --data_path)
uv run -m featureSQL.cli download \
    --region US \
    --start 2026-01-01 \
    --end   2026-02-28 \
    --symbols_file ./.testsymbols.txt \
    --data_path ./source \
    --store_type fs   # Output to local fs (default). Change to 'gcs' and set data_path to bucket name for GCS.

GCS setup: when using --store_type gcs you must configure two environment variables before running any commands or tests:

export GCS_SC_JSON='{"type":"service_account", …}'    # credentials
export GCS_BUCKET_NAME='your-bucket-name'               # target bucket

The unit tests will skip network‑dependent scenarios if GCS_BUCKET_NAME is unset, and they mock the client in other cases. However, any manual invocation against a real bucket requires both variables to be set.
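
As an illustration of that skip behaviour, a guard of the following shape is enough (a hedged sketch of the usual pytest pattern, not the project's actual test code):

import os
import pytest

@pytest.mark.skipif(
    not os.environ.get("GCS_BUCKET_NAME"),
    reason="GCS_BUCKET_NAME is unset; skipping network-dependent GCS tests",
)
def test_against_real_bucket():
    bucket = os.environ["GCS_BUCKET_NAME"]
    ...  # exercise --store_type gcs against this bucket here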

Publishing: to push a new release to PyPI, follow these steps:

  1. build the package:
    uv run -m build
    
  2. optionally test the release on TestPyPI:
    uv run -m twine upload --repository testpypi dist/*
    uv pip install --index-url https://test.pypi.org/simple/ --no-deps featureSQL
    
  3. when ready for production, upload to PyPI:
    uv run -m twine upload dist/*
    
2. Dump the downloaded CSVs into a binary dataset

Once the CSVs are in place, run the dumper to build the binary dataset:

uv run -m featureSQL.dump_bin dump_all \
    --data_path ./source/feature-csv \
    --dump_dir   ./source/ \
    --exclude_fields symbol,date \
    --store_type fs   # exclude symbol/date so metadata isn't treated as floats
# ./source/ here is the /path/to/output/dir/ referenced below

> ⛑️ **Note:** recent releases automatically coerce a broad range of
> timestamp strings (including ISO‑8601 offsets) when building the
> calendar.  Invalid or malformed dates are dropped rather than causing a
> crash.  Additionally, the dumper now skips any non‑numeric columns
> (e.g. `symbol` or `date`) instead of trying to encode them, so you
> usually don’t need to provide `--exclude_fields` unless you want to
> explicitly whittle down the feature set.

After the command finishes you’ll have a structure like:

/path/to/output/dir/
  calendars/day.txt          # trading dates
  instruments/all.txt        # ticker start/end dates
  features/<ticker>/<field>.day.bin …

This directory can be consumed by any tool that understands the same layout – it’s a fully initialised dataset.

💡 run dump_all only once per collection. To add new data later, use dump_update (see next section).


3. Download symbols and produce the binary dataset in one go

The built‑in CLI helper (exposed via the featureSQL script or as python -m featureSQL.cli) can drive both phases and even run ad‑hoc queries against a binary dataset:

uv run -m featureSQL.cli download \
    --region US \
    --start 2026-01-01 \
    --end   2026-02-28 \
    --symbols AMZN,GOOG,TSLA \
    --data_path source  \
    --out_format bin \
    --store_type fs

This will:

  1. fetch OCHLVF data from Yahoo and save raw CSVs under {data_path}/feature-csv (or ./source/feature-csv if you didn’t pass --data_path),
  2. run the binary dumper. On the first invocation the helper uses DumpDataAll to initialise the dataset; subsequent runs use DumpDataUpdate to append new days. The target directory is the same as --data_path, and you can still pass --exclude_fields symbol,date.

For the first run the output directory should be empty or non-existent (or you can run dump_all manually as shown above); thereafter the same command appends new days to the existing bins.

4. Query the binary dataset using SQL

Once you have an initialised dataset you can issue SQL against it. The query subcommand lazily loads only the symbols mentioned in the query and keeps a small LRU cache in memory.

Note: every table implicitly includes a date column of type TIMESTAMP. This column is considered the primary key and is always NOT NULL—you must provide a date value on INSERT, and attempts to insert duplicate dates will fail with a constraint error.

DDL support: the engine understands basic CREATE, DROP, ALTER TABLE … ADD/DROP COLUMN and simple UPDATE/INSERT/DELETE statements. Adding or removing a column automatically creates or deletes the corresponding .day.bin file and keeps its length in sync with existing data, so you can evolve your schema in-place.
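
As a rough picture of what ADD COLUMN means on disk, a new field file is just the shared date offset followed by one float per existing row, initialised to NaN so the column reads back as NULL. This is a hedged sketch based on the bin layout described in the viewing section below, not the engine's actual code:

import numpy as np

# hedged sketch: create features/<ticker>/disc.day.bin for a table that
# already has n_rows rows starting at calendar index date_index
def init_new_column(bin_path, date_index, n_rows):
    arr = np.empty(n_rows + 1, dtype="<f")   # header slot + one value per row
    arr[0] = date_index                      # same date offset as the other fields
    arr[1:] = np.nan                         # reads back as NULL until real data is written
    arr.tofile(bin_path)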

uv run -m featureSQL.cli query \
    --data_path source \
    --max_symbols 100 \
    --max_memory 2000000000 \
    --store_type fs \
    "select date, open, close, high, low, volume, adjclose from AAPL where volume > 1000000"

Joins work transparently as long as both tables have been dumped:

uv run -m featureSQL.cli query --data_path source --store_type fs \
    "select a.open, n.close from AAPL a join NVDA n on a.date = n.date"

(The cache flags are optional; omit them to use unlimited resources.)


4.1 Example: run a sequence of SQL commands

The CLI can also be used interactively or in a script to exercise the entire life‑cycle of a DuckDB table. One convenient property of the command is that it continues executing subsequent statements even if one of the earlier ones fails (for instance, attempting to DROP or DESCRIBE a table that doesn’t exist). In such cases it prints a friendly message but exits with a zero status, making it safe to run whole scripts without worrying about short‑circuiting.

Below is a step-by-step sequence that mirrors the package’s unit tests; it’s useful when you want to verify that basic DDL/DML operations are behaving as expected.

# run a batch of SQL statements through the query subcommand
# (uv run uses the same Python runtime as the package)
uv run -m featureSQL.cli query --data_path source --store_type fs \
    "
    drop table foo;                     -- not present; the CLI prints a message but continues
    describe foo;                       -- should report a missing table
    create table foo (id int, price float, amount float);
    describe foo;                       -- now shows four columns (the implicit date plus three numeric)
    insert into foo values ('2026-03-01',1,1.0,100.0),('2026-03-02',2,2.0,200.0);
    select * from foo order by id;      -- two rows present
    update foo set id = 10 where price=1;
    select id from foo where price=1;   -- id==10
    alter table foo add disc float;     -- add a new numeric column
    select * from foo;                  -- new column shows as NULL
    alter table foo drop disc;          -- remove the column again
    drop table foo;                     -- remove the table
    describe foo;                       -- fails again
    "

You can run the same commands one‑by‑one in an interactive Python REPL if preferred; they simply exercise the underlying DuckDB engine that our DuckQueryService uses. This sequence is especially handy for sanity checks when changing the SQL layer or adding new features.
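
For instance, a quick way to try similar statements against plain DuckDB in a REPL looks like this (a sketch only; plain DuckDB has neither featureSQL’s implicit date column nor its bin-file mapping):

import duckdb

con = duckdb.connect()   # in-memory database
con.execute("create table foo (date timestamp, id int, price float, amount float)")
con.execute("insert into foo values ('2026-03-01', 1, 1.0, 100.0)")
print(con.execute("select * from foo order by id").fetchall())
con.execute("alter table foo add column disc float")
con.execute("drop table foo")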

Notes & tips

  • Symbols list: --symbols accepts a comma-separated string, a Python list/tuple, or a --symbols_file path. If you provide a file that exists but contains no tickers, the downloader now does nothing (the previous behaviour fell through to downloading the entire US universe).
  • Reloading tickers: use --reload_symbols to refresh the cached symbol list.
  • Alternate flows: to skip the binning step, omit --out_format or set it to yahoo (the default).

Maintaining binary dataset integrity

Because the dumper appends new days based on the existing calendar, you must take care when fetching overlapping or out‑of‑order date ranges:

  1. Always use a single start date that is at or before any previously downloaded data. For a continuous collection you can simply run:

    uv run -m featureSQL.cli download \
        --region US \
        --start 2025-01-01 \
        --end   $(date +%Y-%m-%d) \
        --symbols TSLA \
        --data_path source --out_format bin
    

    The collector will skip existing CSVs and the dumper will append only the new days, keeping the existing bins and calendar consistent.

  2. If you need to back-fill a gap or move the start date earlier, delete the old bin files (or the entire source/features tree) and run with --out_format bin again (or manually rerun uv run -m featureSQL.dump_bin dump_all) so that DumpDataAll recomputes the calendar from scratch. The update mode never rewrites the date index, so appending older data without rebuilding will cause the printed dates to be incorrect.

  3. Automate detection if desired. You can extend the downloader to inspect the last date in existing CSVs and request only the missing days (a sketch follows this list), and/or modify the dumper to warn when the incoming data’s maximum date does not exceed the current calendar end.
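
A minimal sketch of the detection idea in item 3, assuming each CSV under feature-csv has a date column (the column name, date format, and fallback start date are assumptions to adapt):

import csv
import glob
from datetime import datetime, timedelta

def next_start_date(csv_dir="./source/feature-csv"):
    """Return the day after the most recent date already present in the raw CSVs."""
    last = None
    for path in glob.glob(f"{csv_dir}/*.csv"):
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                d = datetime.strptime(row["date"][:10], "%Y-%m-%d").date()
                if last is None or d > last:
                    last = d
    if last is None:
        return "2025-01-01"   # fall back to the original collection start
    return (last + timedelta(days=1)).isoformat()

# e.g. pass the result as --start to the download command shown above
print(next_start_date())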

Following these practices prevents “wrong” date offsets from appearing when you use uv run -m featureSQL.cli view … (or simply featureSQL view …), and ensures the binary dataset remains a faithful time series.

Feel free to copy‑paste these examples into your own docs or scripts!


Viewing the contents of a bin file

A new CLI subcommand makes this easy without writing Python. Once you have an initialised dataset you can inspect any field file with:

uv run -m featureSQL.cli view /path/to/output/dir/features/aapl/open.day.bin --store_type fs

By default the command prints the starting date index and the shape of the array, followed by each raw float value. If the dataset contains a calendar file (calendars/day.txt) the subcommand will automatically find it and show the corresponding date for each value. You can also supply an explicit calendar path:

uv run -m featureSQL.cli view path/to/bin/file --calendar_file path/to/calendars/day.txt --store_type fs

Internally the helper still uses numpy.fromfile, so the Python snippet below remains valid if you prefer to inspect the file manually.

import numpy as np
# optional: helper for converting codes to filenames (not used in this snippet)
from featureSQL.dump_bin import code_to_fname

# point to a particular field file (e.g. open.day.bin) for one symbol
bin_path = "/path/to/output/dir/features/aapl/open.day.bin"
arr = np.fromfile(bin_path, dtype="<f")

# first value is the date offset, the rest are data values
date_index = int(arr[0])
values = arr[1:]
print(date_index, values.shape)
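
If you also want the dates (what the view subcommand shows when it finds the calendar), a short sketch that pairs the calendar with the values, assuming the date offset is an index into calendars/day.txt:

import numpy as np

arr = np.fromfile("/path/to/output/dir/features/aapl/open.day.bin", dtype="<f")
date_index, values = int(arr[0]), arr[1:]

with open("/path/to/output/dir/calendars/day.txt") as fh:
    calendar = [line.strip() for line in fh if line.strip()]

# the first value for this symbol corresponds to calendar[date_index]
for d, v in zip(calendar[date_index:], values):
    print(d, v)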

Alternatively, many tools provide helpers once a dataset is loaded; you can query the field of a symbol and receive a NumPy array. Use whatever API your application or library supplies – the underlying files remain the same.

# pseudo-code using a generic loader
data = my_loader.load("/path/to/output/dir")
print(data['AAPL']['open'])

If you just want to look at the calendar, it’s a text file under calendars/day.txt with one date per line.
