Simple, high-speed batch data reader & preprocessor for ML applications.

Faucet ML

Faucet ML is a Python package for high-speed mini-batch data reading and preprocessing from BigQuery during machine learning model training.

Faucet ML is designed for cases where:

  • Datasets are too large to fit into memory
  • Model training requires mini-batches of data (SGD-based algorithms)

Features:

  • High-speed batch data reading from BigQuery
  • Automatic feature identification and preprocessing via PyTorch
  • Integration with Feast feature store (coming soon)

Installation

pip install faucetml

More about Faucet

Many training datasets are too large to fit in memory, but model training would benefit from using all of the training data. Naively issuing one query per mini-batch is unnecessarily expensive due to round-trip network costs. Faucet is a library that solves these issues by:

  • Fetching large "chunks" of data in non-blocking background threads
    • where chunks are much larger than mini-batches, but still fit in memory
  • Caching chunks locally
  • Returning mini-batches from cached chunks in O(1) time
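
The pattern in the list above can be sketched in plain Python. This is an illustrative sketch only, not Faucet's actual implementation: `ChunkedBatchReader`, `fetch_chunk`, and `fake_fetch` are hypothetical names, and the "warehouse" is faked with an in-memory function so the example runs without BigQuery. A background thread fills a bounded queue with chunks, and `get_batch()` slices mini-batches out of the chunk it currently holds, so only one round trip is paid per chunk rather than per mini-batch.

```python
import queue
import threading


class ChunkedBatchReader:
    """Hypothetical sketch: prefetch chunks in a background thread and
    serve mini-batches from the locally held chunk."""

    def __init__(self, fetch_chunk, num_chunks, batch_size, prefetch=1):
        self.fetch_chunk = fetch_chunk  # callable: chunk index -> list of rows
        self.num_chunks = num_chunks
        self.batch_size = batch_size
        # Bounded queue: the producer blocks once `prefetch` chunks are waiting,
        # which keeps memory use capped.
        self._chunks = queue.Queue(maxsize=prefetch)
        self._current = []  # chunk currently being served
        self._pos = 0       # offset of the next mini-batch within _current
        self._done = False
        threading.Thread(target=self._producer, daemon=True).start()

    def _producer(self):
        # Runs in the background so fetching overlaps with training.
        for i in range(self.num_chunks):
            self._chunks.put(self.fetch_chunk(i))
        self._chunks.put(None)  # sentinel: no more chunks

    def get_batch(self):
        # Swap in the next chunk only when the current one is exhausted.
        while self._pos >= len(self._current):
            if self._done:
                return None
            chunk = self._chunks.get()
            if chunk is None:
                self._done = True
                return None
            self._current, self._pos = chunk, 0
        # Serving a batch is a local slice; no network call is involved.
        batch = self._current[self._pos:self._pos + self.batch_size]
        self._pos += self.batch_size
        return batch


def fake_fetch(i):
    # Stand-in for one large warehouse query returning 10 rows.
    return list(range(i * 10, (i + 1) * 10))


reader = ChunkedBatchReader(fake_fetch, num_chunks=3, batch_size=4)
batches = []
batch = reader.get_batch()
while batch is not None:
    batches.append(batch)
    batch = reader.get_batch()
```

In this sketch the last mini-batch of each chunk may be shorter than `batch_size`; a real reader might instead carry leftover rows into the next chunk.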

Examples

See examples for detailed IPython notebooks on how to use Faucet.

# initialize the client (assumes get_client has already been imported from faucetml)
fml = get_client(
    datastore="bigquery",
    credential_path="bq_creds.json",
    table_name="my_training_table",
    ds="2020-01-20",
    epochs=2,
    batch_size=1024,
    chunk_size=1024 * 10000,
    test_split_percent=20,
)
# train & test
for epoch in range(2):

    # training loop
    fml.prep_for_epoch()
    batch = fml.get_batch()
    while batch is not None:
        train(batch)
        batch = fml.get_batch()

    # evaluation loop
    fml.prep_for_eval()
    batch = fml.get_batch(eval=True)
    while batch is not None:
        test(batch)
        batch = fml.get_batch(eval=True)
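
The "fetch, check for None, fetch again" loop above can be wrapped in a small generator if you prefer a `for` loop. This is a convenience sketch, not part of Faucet: `iter_batches` and `StubClient` are hypothetical names, and the only behavior assumed from the client is what the example above shows, namely that `get_batch()` and `get_batch(eval=True)` return `None` once the data is exhausted.

```python
def iter_batches(client, evaluate=False):
    """Yield batches from a client until get_batch() returns None."""
    # Pass eval=True only for evaluation, mirroring the two call styles above.
    kwargs = {"eval": True} if evaluate else {}
    batch = client.get_batch(**kwargs)
    # `is not None` is used deliberately: comparing a tensor or array
    # with == can produce an element-wise result instead of a bool.
    while batch is not None:
        yield batch
        batch = client.get_batch(**kwargs)


class StubClient:
    """Stand-in client used only to make this example runnable."""

    def __init__(self, train_batches, eval_batches):
        self._train = list(train_batches)
        self._eval = list(eval_batches)

    def get_batch(self, eval=False):
        source = self._eval if eval else self._train
        return source.pop(0) if source else None


stub = StubClient(train_batches=[[1, 2], [3, 4]], eval_batches=[[5, 6]])
train_seen = list(iter_batches(stub))
eval_seen = list(iter_batches(stub, evaluate=True))
```

With a real client, the training loop would then read `for batch in iter_batches(fml): train(batch)`, called after `fml.prep_for_epoch()`.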

Future features

  • Support more data warehouses (Redshift, Hive, etc.)
  • Support reading features & preprocessing specs from Feast

Suggestions for other features? Open an issue and let us know.
