
pipen-gcs

A plugin for pipen to handle files in Google Cloud Storage.

[!NOTE] Since v0.16.0, pipen has supported cloud files natively; see the pipen documentation for more information. However, when the pipeline working directory is a local path but the input/output files are in the cloud, the cloud files must be handled manually, inside the job script. This plugin avoids that by downloading the input files and uploading the output files automatically.

[!NOTE] Also note that this plugin does not synchronize the meta files to cloud storage; pipen already handles those when needed. The plugin only handles input/output files when the working directory is a local path. When the pipeline output directory is a cloud path, the output files are uploaded to cloud storage automatically.


Installation

pip install -U pipen-gcs

Usage

from pipen import Proc, Pipen
import pipen_gcs  # Import and enable the plugin

class MyProc(Proc):
    input = "infile:file"
    input_data = ["gs://bucket/path/to/file"]
    output = "outfile:file:{{in.infile.name}}.out"
    # We can deal with the files as if they are local
    script = "cat {{in.infile}} > {{out.outfile}}"

class MyPipen(Pipen):
    starts = MyProc
    # input files/directories will be downloaded to /tmp
    # output files/directories will be generated in /tmp and then uploaded
    #   to the cloud storage
    plugin_opts = {"gcs_cache": "/tmp"}

if __name__ == "__main__":
    # The working directory is a local path
    # The output directory can be a local path, but if it is a cloud path,
    #   the output files will be uploaded to the cloud storage automatically
    MyPipen(workdir="./.pipen", outdir="./myoutput").run()
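The idea behind gcs_cache can be illustrated with a small sketch: each gs:// URL is mirrored to a path under the local cache directory. The `local_cache_path` helper below is hypothetical, written only to illustrate the mapping; it is not the plugin's actual code, and the plugin's real cache layout may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def local_cache_path(gs_url: str, cache_dir: str = "/tmp") -> str:
    """Map a gs://bucket/key URL to a path under the local cache directory.

    Hypothetical illustration of the caching idea; not the plugin's
    actual implementation.
    """
    parsed = urlparse(gs_url)  # scheme="gs", netloc=bucket, path="/key"
    if parsed.scheme != "gs":
        raise ValueError(f"not a GCS url: {gs_url}")
    # Join cache dir, bucket name, and object key into one local path
    return str(PurePosixPath(cache_dir) / parsed.netloc / parsed.path.lstrip("/"))

print(local_cache_path("gs://bucket/path/to/file"))
# /tmp/bucket/path/to/file
```

With a layout like this, the job script can read and write plain local paths, and the plugin takes care of moving the files to and from the bucket.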

[!NOTE] When checking the meta information of jobs (for example, whether a job is cached), the plugin makes pipen use the cloud files.

Configuration

  • gcs_cache: The local directory where cloud storage files are cached.
  • gcs_loglevel: The log level for the plugin. Default is INFO.
  • gcs_logmax: The maximum number of files to log while syncing. Default is 5.
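These options are passed through plugin_opts, as in the usage example above. The values below are illustrative, not recommendations:

```python
# Illustrative plugin_opts for pipen-gcs (values are examples only)
plugin_opts = {
    "gcs_cache": "/tmp/gcs-cache",  # local cache directory for cloud files
    "gcs_loglevel": "DEBUG",        # plugin log level (default: INFO)
    "gcs_logmax": 10,               # max files to log per sync (default: 5)
}
```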

Download files


Source Distribution

pipen_gcs-1.0.0.tar.gz (8.0 kB)

Uploaded Source

Built Distribution


pipen_gcs-1.0.0-py3-none-any.whl (8.3 kB)

Uploaded Python 3

File details

Details for the file pipen_gcs-1.0.0.tar.gz.

File metadata

  • Download URL: pipen_gcs-1.0.0.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.3 Linux/6.11.0-1018-azure

File hashes

Hashes for pipen_gcs-1.0.0.tar.gz

  • SHA256: b31937bc9c75d56100b3174c4eb9f8c57121a5f6063d6c4250d860522780e31b
  • MD5: 06227080743a7ed85a4553e88db8c887
  • BLAKE2b-256: 0763780dfa9443515eaa5c73c9227265c5c57229b1afa141a25328b35f4838ff


File details

Details for the file pipen_gcs-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: pipen_gcs-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 8.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.3 Linux/6.11.0-1018-azure

File hashes

Hashes for pipen_gcs-1.0.0-py3-none-any.whl

  • SHA256: 03576580bd74322012d29efbec6ed9c71e6116ec5482c85a6c84d5d3a148edbc
  • MD5: 6ff2ddc9616f120a9911fd6d0698aa93
  • BLAKE2b-256: b91f2d906f0e035cfa91d00cc9c7466256570992cb3642b0296e126df30367ea

