Packages for fast dataflow and workflow processing

MLFastFlow

A Python package for fast dataflow and workflow processing.

Installation

pip install mlfastflow

Features

  • Easy-to-use data sourcing with the Sourcing class
  • Flexible vector search capabilities
  • Optimized for data processing workflows

Quick Start

from mlfastflow import Sourcing

# Create a sourcing instance
sourcing = Sourcing(
    query_df=your_query_dataframe,
    db_df=your_database_dataframe,
    columns_for_sourcing=["column1", "column2"],
    label="your_label"
)

# Process your data
sourced_db_df_without_label, sourced_db_df_with_label = (
    sourcing.sourcing()
)
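Before constructing Sourcing, it can help to confirm that every name in columns_for_sourcing is actually present in both DataFrames. The sketch below is a hypothetical pre-flight helper, not part of MLFastFlow. It works on plain sequences of column names, so with pandas you would pass query_df.columns and db_df.columns. That Sourcing needs the columns in both frames is an assumption made for this example.

```python
from typing import Dict, Iterable, List


def missing_sourcing_columns(
    query_columns: Iterable[str],
    db_columns: Iterable[str],
    columns_for_sourcing: Iterable[str],
) -> Dict[str, List[str]]:
    """Report which requested columns are absent from each frame.

    Hypothetical helper (not an MLFastFlow API). Assumes the sourcing
    columns must exist in both the query and the database DataFrame.
    """
    wanted = list(columns_for_sourcing)
    query_set = set(query_columns)
    db_set = set(db_columns)
    return {
        "query_df": [c for c in wanted if c not in query_set],
        "db_df": [c for c in wanted if c not in db_set],
    }


# Demo with literal column lists instead of real DataFrames.
report = missing_sourcing_columns(
    query_columns=["column1", "column2", "your_label"],
    db_columns=["column1", "other"],
    columns_for_sourcing=["column1", "column2"],
)
print(report)
```

An empty list under both keys means the inputs are consistent with the columns you plan to pass.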

BigQuery Integration

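MLFastFlow provides a BigQueryClient class that wraps common Google BigQuery and Google Cloud Storage (GCS) operations: running SQL, uploading DataFrames to tables, exporting query results to GCS, loading GCS files into tables, and managing GCS folders.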

Initialization

from mlfastflow import BigQueryClient

# Initialize the client with your GCP credentials
bq_client = BigQueryClient(
    project_id="your-gcp-project-id",
    dataset_id="your_dataset",
    key_file="/path/to/your/service-account-key.json"
)
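To keep the key file path and project name out of source code, the three constructor arguments can be read from environment variables. This is a minimal sketch, not an MLFastFlow feature: the helper name and the MLFF_* variable names are hypothetical and chosen only for this example.

```python
import os
from typing import Mapping, Optional


def bq_client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for BigQueryClient from environment variables.

    The variable names below are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    names = {
        "project_id": "MLFF_PROJECT_ID",
        "dataset_id": "MLFF_DATASET_ID",
        "key_file": "MLFF_KEY_FILE",
    }
    # Fail early, and name every variable that is unset or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in names.items()}


# Demo with an explicit mapping instead of the real environment.
kwargs = bq_client_kwargs(
    {
        "MLFF_PROJECT_ID": "your-gcp-project-id",
        "MLFF_DATASET_ID": "your_dataset",
        "MLFF_KEY_FILE": "/path/to/your/service-account-key.json",
    }
)
print(kwargs)
```

With the variables set in the real environment, the client could then be created as BigQueryClient(**bq_client_kwargs()).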

Running SQL Queries

# Execute a SQL query and get results as a pandas DataFrame
df = bq_client.sql2df("SELECT * FROM your_dataset.your_table LIMIT 10")

# Or simply run a query without returning results
bq_client.run_sql("CREATE TABLE your_dataset.new_table AS SELECT * FROM your_dataset.source_table")
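Both sql2df and run_sql take the SQL text as a plain string, so table names that come from configuration or user input end up interpolated into the query. The sketch below shows one way to validate such names first. The helper is hypothetical, and its pattern is deliberately stricter than BigQuery's own naming rules.

```python
import re

# Conservative pattern: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table(dataset_id: str, table_id: str) -> str:
    """Return a backtick-quoted `dataset.table` reference after validation.

    Hypothetical helper (not an MLFastFlow API).
    """
    for part in (dataset_id, table_id):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"`{dataset_id}.{table_id}`"


sql = f"SELECT * FROM {qualified_table('your_dataset', 'your_table')} LIMIT 10"
print(sql)
```

The resulting string can be passed to sql2df or run_sql as in the examples above.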

DataFrame to BigQuery

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'value': [100, 200, 300]
})

# Upload DataFrame to BigQuery
bq_client.df2table(
    df=df,
    table_id="your_table_name",
    if_exists="fail"  # Options: 'fail', 'append'
)

BigQuery to Google Cloud Storage

# Export query results to GCS as Parquet files (default)
bq_client.sql2gcs(
    sql="SELECT * FROM your_dataset.your_table",
    destination_uri="gs://your-bucket/path/to/export/",
    destination_format="PARQUET"  # Options: 'PARQUET', 'CSV', 'JSON', 'AVRO'
)
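The destination_uri combines a bucket name and an object prefix in one gs:// string. When the two parts are needed separately, for example to log where an export went, a small parser is enough. This is a sketch using only the standard library; split_gcs_uri is a hypothetical helper, not an MLFastFlow function.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix' into (bucket, prefix).

    Hypothetical helper (not an MLFastFlow API).
    """
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a GCS URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the object prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, prefix


bucket, prefix = split_gcs_uri("gs://your-bucket/path/to/export/")
print(bucket, prefix)
```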

Google Cloud Storage to BigQuery

# Load data from GCS to BigQuery
bq_client.gcs2table(
    gcs_uri="gs://your-bucket/path/to/data/*.parquet",
    table_id="your_destination_table",
    write_disposition="WRITE_TRUNCATE",  # Options: 'WRITE_TRUNCATE', 'WRITE_APPEND', 'WRITE_EMPTY'
    source_format="PARQUET"  # Options: 'PARQUET', 'CSV', 'JSON', 'AVRO', 'ORC'
)
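The three write_disposition values are BigQuery's standard load-job settings. The pure-Python model below illustrates what each one does to an existing table. It makes no BigQuery calls, and it assumes gcs2table passes the value through to the load job unchanged, which the text above does not confirm.

```python
from typing import List


def apply_write_disposition(
    existing: List[dict], incoming: List[dict], write_disposition: str
) -> List[dict]:
    """Model the rows a table holds after a load with the given disposition.

    Illustration only: plain lists stand in for the destination table.
    """
    if write_disposition == "WRITE_TRUNCATE":
        # Existing rows are replaced by the loaded data.
        return list(incoming)
    if write_disposition == "WRITE_APPEND":
        # Loaded rows are added after the existing ones.
        return list(existing) + list(incoming)
    if write_disposition == "WRITE_EMPTY":
        # The load only succeeds if the table holds no data yet.
        if existing:
            raise RuntimeError("destination table already contains data")
        return list(incoming)
    raise ValueError(f"unknown write_disposition: {write_disposition!r}")


old_rows = [{"id": 1}]
new_rows = [{"id": 2}, {"id": 3}]
print(apply_write_disposition(old_rows, new_rows, "WRITE_TRUNCATE"))
print(apply_write_disposition(old_rows, new_rows, "WRITE_APPEND"))
```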

GCS Folder Management

# Create a folder in GCS
bq_client.create_gcs_folder("gs://your-bucket/new-folder/")

# Delete a folder and all its contents
success, deleted_count = bq_client.delete_gcs_folder(
    gcs_folder_path="gs://your-bucket/folder-to-delete/",
    dry_run=True  # Set to False to actually delete
)
print(f"Would delete {deleted_count} files" if success else "Error occurred")

Resource Management

# Explicitly close the client when done to free resources
bq_client.close()
# Optionally drop the reference so the closed client is not reused by accident
bq_client = None

For more detailed examples and advanced usage, refer to the documentation.

License

MIT

Author

Xileven
