Simple, high-speed batch data reader for ML applications.

Faucet ML

Faucet ML is a Python package that enables high speed mini-batch data reading from common data warehouses for machine learning model training.

Faucet ML is designed for cases where:

  • Datasets are too large to fit into memory
  • Model training requires mini-batches of data (SGD based algorithms)

Installation

pip install faucetml

Supported data warehouses

  • Google BigQuery
  • Snowflake (soon)
  • Amazon Redshift (soon)

Suggestions for other DBs to support? Open an issue and let us know.

More about Faucet

Many training datasets are too large to fit in memory, but model training would benefit from using all of the training data. Naively issuing one query per mini-batch of data is unnecessarily expensive due to round-trip network costs. Faucet is a library that solves these issues by:

  • Fetching large "chunks" of data in non-blocking background threads
    • where chunks are much larger than mini-batches, but still fit in memory
  • Caching chunks locally
  • Returning mini-batches from cached chunks in O(1) time
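
For intuition, here is a minimal, self-contained sketch of this chunk-prefetching pattern. This is not faucetml's implementation; fetch_chunk and ChunkedBatchReader are made-up names, with fetch_chunk standing in for the real warehouse query:

import queue
import threading
from typing import Iterator, List, Optional


def fetch_chunk(chunk_idx: int, chunk_size: int) -> Optional[List[dict]]:
    # Hypothetical stand-in for a warehouse query that returns one large chunk
    # of rows; faucetml issues BigQuery queries here instead.
    if chunk_idx >= 3:  # pretend the table holds exactly 3 chunks
        return None
    return [{"row_id": chunk_idx * chunk_size + i} for i in range(chunk_size)]


class ChunkedBatchReader:
    """Fetch big chunks in a background thread, cache them locally, and slice
    cached chunks into mini-batches (illustrative only)."""

    def __init__(self, batch_size: int, chunk_size: int, max_cached_chunks: int = 2):
        self.batch_size = batch_size
        self.chunk_size = chunk_size
        self._cache: queue.Queue = queue.Queue(maxsize=max_cached_chunks)
        threading.Thread(target=self._prefetch, daemon=True).start()

    def _prefetch(self) -> None:
        # Background thread: keep fetching chunks until the table is exhausted.
        chunk_idx = 0
        while True:
            chunk = fetch_chunk(chunk_idx, self.chunk_size)
            self._cache.put(chunk)  # blocks while the local cache is full
            if chunk is None:       # None signals "no more data"
                return
            chunk_idx += 1

    def batches(self) -> Iterator[List[dict]]:
        # Foreground: slicing a cached chunk is cheap, so each mini-batch is
        # returned without another network round trip.
        while True:
            chunk = self._cache.get()
            if chunk is None:
                return
            for start in range(0, len(chunk), self.batch_size):
                yield chunk[start : start + self.batch_size]


reader = ChunkedBatchReader(batch_size=4, chunk_size=16)  # toy sizes for the demo
for batch in reader.batches():
    pass  # train(batch)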

Examples

Using Faucet is meant to be simple and painless.

BigQuery

Faucet takes in a BigQuery table with the following schema:

features <STRUCT>
labels <STRUCT>

For example:

|                     features                     |     labels     |
|--------------------------------------------------|----------------|
| {"age": 16, "ctr": 0.02, "noise": 341293, ...}   | {"clicked": 0} |

Initialize the data reader:

from faucetml.data_reader import get_data_reader

data_reader = get_data_reader(
    datastore="bigquery",
    credential_path="path/to/bigquery/creds.json",
    hash_on_feature="noise",  # feature used to hash for random sampling
    table_name="project.dataset.training_table",
    ds="2020-01-21",  # date partition to read
    epochs=2,  # passes over the data
    batch_size=1024,  # rows per mini-batch
    chunk_size=1024 * 100,  # rows fetched per background chunk
    exclude_features=["noise"],  # features to drop from returned batches
    table_sample_percent=100,  # percent of the table to sample
    test_split_percent=20,  # percent of sampled rows held out for eval
    skip_small_batches=False,  # whether to drop batches smaller than batch_size
)

Start reading data and training:

for epoch in range(2):

    # training loop
    data_reader.prep_for_epoch()
    batch = data_reader.get_batch()
    while batch is not None:
        train(batch)
        batch = data_reader.get_batch()

    # evaluation loop
    data_reader.prep_for_eval()
    batch = data_reader.get_batch(eval=True)
    while batch is not None:
        test(batch)
        batch = data_reader.get_batch(eval=True)

Future features

  • Support more data warehouses
  • Add preprocessing to data reading
  • Support reading features from Feast

Suggestions for other features? Open an issue and let us know.

