Skip to main content

Deterministic File Lib to make working with Files across Object Storage easier

Project description

file_io

Deterministic File Lib to make working with Files across Object Storage easier

Quickstart

!pip install --upgrade git+https://github.com/trisongz/file-io.git
!pip install --upgrade file-io


from fileio import File

pathlike = File('gs://path/to/item.txt')
pathlike = File('s3://path/to/item.txt')

Changelogs


May 21, 2022 v0.3.1

  • Complete Overhaul and refactor.

Aug 31, 2021 v0.3.0alpha

  • Major refactor to remove tensorflow as primary dependency
  • Started secondary support of gs using google-cloud-storage
  • Started primary support of s3 using tensorflow
  • Working on secondary support of s3 using aioaws
  • Planning to integrate async support
  • Planning to add deeper integration with smart_open
  • Planning to add support for supabase storage
  • Started adding auto-auth support: s3, gs, supabase
  • Added compat module for previous File API to prevent breakage
    • All previous File APIs are still usable.
    • Does not check for tensorflow dependency. So using without tensorflow will break

Aug 3, 2021 - v0.1.16

  • A lot. But its pretty lazily done.

July 7, 2021 - v0.1.15

  • Modified behavior of open and direct __call__
  • Remove Explicit need for Tensorflow in setup, but still require it at the moment.
    • This may help with macos Tensorflow installations using tensorflow-macos

July 2, 2021 - v0.1.13

  • Change .textread to return string rather than list
    • .textreadlines replaces original function
  • Update .textlist to support option for stripping newlines and have replacements
    • strip_newlines = True, will strip all newlines prior to return
    • replacements: [ list | dict | str ] = None, will iterate through and replace
  • Update .base(filename, with_ext=True) to allow return without File Extension
  • Add .readfile method to return .read() API
  • Add .mod_fname(filename, new_name=None, prefix=None, suffix=None, ext=None, directory=None, create_dirs=True, filename_only=False, space_replace='_')
    • src = 'gs://mybucket/path/file.txt'
    • res = File.mod_fname(src, newname='newfile', ext='json', directory='/newdir', prefix='test_', suffix='_001')
    • >> res = /newdir/test_newfile_001.json

June 30, 2021 - v0.1.11

  • Added Dill as default pickler if installed
  • Ability to set any pickle method that supports .dumps/.loads call with File.set_pickler(name='pickler') or File.set_pickler(function=cloudpickle)
  • Hotfix to change method to dumps/loads
  • Hotfix for .gsutil method which did not initialize properly.

June 11, 2021 - v0.1.8

  • Hotfix for methods .split_file/.split_files

June 9, 2021 - v0.1.7

  • Hotfix for Method .get_local
  • Hotfix for method .jlgs

May 28, 2021 - v0.1.6

  • Added Method to get User Dir
    • File.userdir

May 21, 2021 - v0.1.5

  • Added TSV/CSV Write Methods
    • File.csvwrite
    • File.tsvwrite

May 20, 2021 - v0.1.4

  • Hotfix for file.split_file(s) method to also return resulting filenames with output_files key

May 20, 2021 - v0.1.3

  • Py Version Requirement Fix

May 19, 2021 - v0.1.2

  • Minor Fixes
  • Added Methods for Splitting Files/Items
    • File.calc_splits
    • File.split_items
    • File.split_file
    • File.split_files

May 12, 2021 - v0.1.1

  • Minor Fixes
  • Added Method
    • File.fmv

May 12, 2021 - v0.1.0

  • Refactored Library
  • Organized Methods
  • Added MultiThreaded Wrapper
    • from fileio import MultiThreadPipeline
  • Added gsutil wrapper method
    • File.gsutil
  • Added Methods for Yaml
    • File.yload
    • File.yloads
    • File.ydump
    • File.ydumps
    • File.yparse
  • Updated Methods for Json
    • File.jsonload
    • File.jsonloads
    • File.jsondump
    • File.jsondumps
    • File.jp
    • File.jwrite
    • File.jg
    • File.jgs
  • Updated Methods for Jsonlines
    • File.jll
    • File.jlp
    • File.jldumps
    • File.jlwrite
    • File.jlwrites
    • File.jlg
    • File.jlgs
    • File.jlload
    • File.jlw
    • File.jlsample
  • Updated Methods for Text
    • File.textload
    • File.textwrite
    • File.textread
    • File.textlist
  • Added Methods for Requests
    • File.rget
    • File.rpost
    • File.reqsess
  • Added Methods for URL Encoding/Decoding
    • File.urlencode
    • File.urldecode
  • Added Methods for Hashing
    • File.hash
    • File.checkhash
  • Added Methods to Disable/Enable TQDM
    • File.enable_progress
    • File.disable_progress
  • Added Utility Methods
    • File.cat
    • File.backup
    • File.findir
    • File.append_ext
    • File.copydir
    • File.dirglob
    • File.absdir
    • File.get_local
    • File.finalize
    • File.print
    • File.set_printer
  • Fixed/Updated Methods
    • File.isfile
    • File.download
    • File.batch_download
    • File.pexists
    • File.whichpath
    • File.copy
    • File.bcopy
  • Added TFDSIODataset

Previous Version


from fileio import File

'''
Recognized File Extensions

.json               - json
.jsonl/.jsonlines   - jsonlines
.csv                - csv
.tsv                - tsv with "\t" seperator
.txt                - txtlines
.pkl                - pickle
.pt                 - pytorch
.tfrecords          - tensorflow
'''

# Main auto classes
File.open(filename, mode='r', auto=True, device=None) # device is specific to pytorch. Set auto=False to get a barebones Posix via Gfile
File.save(data, filename, overwrite=False) # if not overwrite, will attempt to append for newline files
File.load(filenames, device=None) # yields generators per file, meaning you can have different file types
File.download(url, dirpath=None, filename=None, overwrite=False) # Downloads a single url
File.gdown(url, extract=True, verbose=False) # uses gdown lib to grab a google drive drive

# Main i/o classes (Not Binary)
File.read(filename) # 'r'
File.write(filename) # 'w'
File.append(filename) # 'a'

# Binary
File.wb(filename) # 'wb'
File.rb(filename) # 'rb'


# Batch downloaders
File.batch_download(urls, directory=None, overwrite=False) # downloads all urls into a directory, skipping if overwrite = True and exists
File.batch_gdown(urls, directory=None, extract=True, verbose=False) # downloads all gdrive urls to a directory

# Extension Specific 

# .json
File.jsonload(filename)
File.jsondump(dict, filename)

# .jsonl/.jsonlines (Single File)
File.jlg(filename)
File.jlw(data, filename, mode='auto', verbose=True)

# Multifile Readers

# .jsonl/.jsonlines
File.jgs(filenames)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

file-io-0.3.2rc1.tar.gz (52.6 kB view details)

Uploaded Source

Built Distribution

file_io-0.3.2rc1-py3-none-any.whl (60.1 kB view details)

Uploaded Python 3

File details

Details for the file file-io-0.3.2rc1.tar.gz.

File metadata

  • Download URL: file-io-0.3.2rc1.tar.gz
  • Upload date:
  • Size: 52.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.13

File hashes

Hashes for file-io-0.3.2rc1.tar.gz
Algorithm Hash digest
SHA256 5feee0b970d783bf3b4eb16d530a9851b7f45bc7c991f50d27efaa90abf75c9b
MD5 5826c66f76a89ffb35a8669e29980387
BLAKE2b-256 a267c056e68072eff3c671b3b307144bfe4371143a808e5f221fdc94f46b22a9

See more details on using hashes here.

File details

Details for the file file_io-0.3.2rc1-py3-none-any.whl.

File metadata

  • Download URL: file_io-0.3.2rc1-py3-none-any.whl
  • Upload date:
  • Size: 60.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.13

File hashes

Hashes for file_io-0.3.2rc1-py3-none-any.whl
Algorithm Hash digest
SHA256 76b0c63f8d4bacc521685d37e9f72d9c580f392a9d9be3680c9c1a5a52f32c9c
MD5 59c3b25c142ba8db4cf269387f6df167
BLAKE2b-256 71e20cf2651d58280340a25697a9e7ebbf8da217090cdd0c6a73e40fe5a26dfe

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page