bucket-pull

A small CLI tool that downloads a bucket directory to a local path.
It aims to mimic gsutil cp -r when copying from a bucket to a local path.

# download whole bucket to a local dir
bucket-pull gs://mybucketname ./mybucketname

# download a directory and all its contents to a local dir
bucket-pull gs://mybucketname/mydir ./
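
The directory example above can be sketched as a mapping from object names to local paths. This is a minimal sketch, assuming the layout suggested by the log output in the multi-processing notes (gs://.../mydir copied to /tmp/ ends up under /tmp/mydir/). The helpers `split_gs_url` and `local_path_for` are hypothetical names used for illustration, not bucket-pull's actual functions, and the layout for a whole-bucket download is not covered here.

```python
import posixpath


def split_gs_url(url: str) -> tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not url.startswith(scheme):
        raise ValueError(f"not a gs:// URL: {url!r}")
    bucket, _, prefix = url[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {url!r}")
    return bucket, prefix.strip("/")


def local_path_for(object_name: str, prefix: str, dest: str) -> str:
    """Map an object under `prefix` to a path under `dest`.

    Assumption: the last component of the prefix is kept, as `cp -r` does,
    so 'mydir/a/1.txt' copied to '/tmp/' becomes '/tmp/mydir/a/1.txt'.
    """
    parent = posixpath.dirname(prefix)  # "" for "mydir", "x" for "x/mydir"
    relative = object_name[len(parent):].lstrip("/")
    return posixpath.join(dest, relative)


if __name__ == "__main__":
    bucket, prefix = split_gs_url("gs://mybucketname/mydir")
    print(bucket, prefix)
    print(local_path_for("mydir/a/b/2.txt", prefix, "/tmp/"))
```

Keeping the last prefix component is what makes the result look like a recursive copy of the directory itself rather than of its contents.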

Get it

pip install bucket-pull

https://pypi.org/project/bucket-pull/

Run it

# as a CLI
$ bucket-pull gs://mybucketname ./mybucketname

# as a module
$ python -m bucket_pull gs://mybucketname ./mybucketname

Auth & Permissions

The utility uses the Google SDK with Client-Provided Authentication.

To run with an exported service account (SA) key, set the GOOGLE_APPLICATION_CREDENTIALS environment variable:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-credentials.json  bucket-pull gs://bucket/mydir /tmp/some/path

The account you connect with needs at least storage.buckets.get on the bucket, which can be granted with the roles/storage.legacyBucketReader role:

gsutil iam ch serviceAccount:SERVICEACCOUNT@PROJECT.iam.gserviceaccount.com:legacyBucketReader  gs://bucket

Some notable differences

gsutil behaves somewhat oddly when the destination path doesn't exist:

$ gsutil cp -r  gs://smoss-tech-test-bucket/mydir/ ./doesnotexist/actually/
$ echo $?
0
$ ls ./doesnotexist
ls: cannot access './doesnotexist': No such file or directory

It does not create the destination path (not that weird),
but it doesn't complain either and exits with 0.

Here bucket-pull diverges and throws an error instead.
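
The fail-fast behaviour could look roughly like the check below. This is only a sketch: the function name `require_existing_dir`, the exception type and the message are assumptions made for illustration, and bucket-pull's real error may differ.

```python
import os


def require_existing_dir(dest: str) -> str:
    """Return `dest` if it is an existing directory, otherwise raise.

    Hypothetical helper: unlike the gsutil run shown above, a missing
    destination is reported as an error instead of being ignored.
    """
    if not os.path.isdir(dest):
        raise FileNotFoundError(f"destination directory does not exist: {dest}")
    return dest


if __name__ == "__main__":
    try:
        require_existing_dir("./doesnotexist/actually/")
    except FileNotFoundError as err:
        print(f"error: {err}")
        raise SystemExit(1)  # non-zero exit code, so scripts can detect the failure
```

Checking before any download starts means a typo in the destination surfaces immediately, not after the transfer has silently done nothing.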

Notes on multi-processing

With the -m flag we can enable multi-processing.

Here bucket-pull has opted for threading,
so there is no true parallelism and we only ever make use of one CPU.
However, since the work is mostly IO bound (disk and network), there is
still some gain to be had from multiple threads waiting on IO.
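
The effect of threads on IO-bound work can be illustrated without touching a bucket. The sketch below assumes a thread pool from `concurrent.futures` (bucket-pull may well use the `threading` module differently) and replaces the real download with `time.sleep`, which, like a blocking network or disk call, lets other threads run while it waits. `fake_download` and `download_all` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name: str, delay: float = 0.1) -> str:
    """Stand-in for one download: sleeping simulates waiting on network/disk."""
    time.sleep(delay)
    return name


def download_all(names, threads: int = 1):
    """Run the downloads one by one, or on a pool of `threads` worker threads."""
    if threads <= 1:
        return [fake_download(n) for n in names]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() keeps results in input order even if downloads finish out of order
        return list(pool.map(fake_download, names))


def timed(fn, *args, **kwargs):
    """Return (result, elapsed wall-clock seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    files = ["32mb.file", "64mb.file", "128mb.file", "a/1.txt", "a/b/2.txt"]
    _, sequential = timed(download_all, files)
    _, threaded = timed(download_all, files, threads=5)
    print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

Only one thread executes Python bytecode at a time, but the waits overlap, so the threaded run finishes in roughly the time of the slowest single download.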

A "very" scientific comparison:

# with multithreading
time ./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/ -m
...
Downloading to /tmp/mydir/32mb.file
Downloading to /tmp/mydir/128mb.file
Downloading to /tmp/mydir/64mb.file
Downloading to /tmp/mydir/a/1.txt
Downloading to /tmp/mydir/a/b/2.txt
./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/ -m  5.86s user 5.24s system 22% cpu 50.149 total

# single thread
time ./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/ 
...
Downloading to /tmp/mydir/128mb.file
Downloading to /tmp/mydir/32mb.file
Downloading to /tmp/mydir/64mb.file
Downloading to /tmp/mydir/a/1.txt
Downloading to /tmp/mydir/a/b/2.txt
./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/  4.80s user 4.38s system 12% cpu 1:13.83 total
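
To put a number on the two runs above, the wall-clock totals reported by `time` can be compared directly. The helper `wall_seconds` is a hypothetical name, and a single run of each mode is of course not a rigorous benchmark.

```python
def wall_seconds(total: str) -> float:
    """Convert a `time` total such as '1:13.83' or '50.149' to seconds."""
    seconds = 0.0
    for part in total.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


single = wall_seconds("1:13.83")  # single-thread run
multi = wall_seconds("50.149")    # run with -m

speedup = single / multi
saved = 1 - multi / single
print(f"speedup: {speedup:.2f}x, wall time saved: {saved:.0%}")
# prints "speedup: 1.47x, wall time saved: 32%"
```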

Project details

Release history

0.0.2 (this version): Feb 6, 2023

0.0.1: Feb 6, 2023

Download files

Source Distribution

bucket-pull-0.0.2.tar.gz (3.5 kB)

Uploaded Feb 6, 2023 Source

Built Distribution

bucket_pull-0.0.2-py3-none-any.whl (4.1 kB)

Uploaded Feb 6, 2023 Python 3

Hashes for bucket-pull-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`6645f883a30d5a410407e813dff378e3d07682161e3595f0b1ce61bb0649eb4a`
MD5	`ee9d4d8df33d24bbf35bb6a96f662584`
BLAKE2b-256	`e109973347ebc7cda7191bed850062ffbd4cb6da6cf72e44fade9f41cac7302c`

Hashes for bucket_pull-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b356c6d08d4a1d9f2dc1767a42ddadb7abc93ae4157d18e8d9cc0cdcc86e75fd`
MD5	`14aea27412e19b4776446d4288814adb`
BLAKE2b-256	`ddf92638bfd0af7e18d6a96f44dc786d2600acb8e8a74f75186e280dfa3a8193`