Utils for streaming large files (S3, HDFS, gzip, bz2...)

What?

smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS, HTTP, or local (compressed) files. It’s a drop-in replacement for Python’s built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

smart_open is well-tested, well-documented and sports a simple, Pythonic API:

>>> from smart_open import smart_open

>>> # stream lines from an S3 object
>>> for line in smart_open('s3://mybucket/mykey.txt', 'rb'):
...     print(line.decode('utf8'))

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in smart_open('./foo.txt.gz', encoding='utf8'):
...     print(line)

>>> # can use context managers too:
>>> with smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write(u"some content\n".encode('utf8'))

>>> with smart_open('s3://mybucket/mykey.txt', 'rb') as fin:
...     for line in fin:
...         print(line.decode('utf8'))
...     fin.seek(0)  # seek to the beginning
...     b1000 = fin.read(1000)  # read 1000 bytes

>>> # stream from HDFS
>>> for line in smart_open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
...     print(line)

>>> # stream from HTTP
>>> for line in smart_open('http://example.com/index.html'):
...     print(line)

>>> # stream from WebHDFS
>>> for line in smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
...     print(line)

>>> # stream content *into* S3 (write mode):
>>> with smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in [b'first line\n', b'second line\n', b'third line\n']:
...         fout.write(line)

>>> # stream content *into* HDFS (write mode):
>>> with smart_open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
...     for line in [b'first line\n', b'second line\n', b'third line\n']:
...         fout.write(line)

>>> # stream content *into* WebHDFS (write mode):
>>> with smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
...     for line in [b'first line\n', b'second line\n', b'third line\n']:
...         fout.write(line)

>>> # stream using a completely custom s3 server, like s3proxy:
>>> for line in smart_open('s3u://user:secret@host:port@mybucket/mykey.txt', 'rb'):
...     print(line.decode('utf8'))

>>> # you can also use a boto.s3.key.Key instance directly:
>>> import boto
>>> key = boto.connect_s3().get_bucket("my_bucket").get_key("my_key")
>>> with smart_open(key, 'rb') as fin:
...     for line in fin:
...         print(line.decode('utf8'))

>>> # stream to a DigitalOcean Spaces bucket, taking credentials from a boto profile
>>> with smart_open('s3://bucket-for-experiments/file.txt', 'wb',
...                 endpoint_url='https://ams3.digitaloceanspaces.com',
...                 profile_name='digitalocean') as fout:
...     fout.write(b'here we stand')

Why?

Working with large S3 files using Amazon’s default Python libraries, boto and boto3, is a pain. boto’s key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files, because the whole payload is loaded into RAM with no streaming. boto’s multipart upload functionality, which large files require, comes with nasty hidden gotchas and a lot of boilerplate.

smart_open shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.
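
To give a feel for the boilerplate involved, here is a rough sketch of the buffering that a streaming multipart writer has to do: collect small writes until a part is large enough, then hand the part off. It is an illustration only and does not reproduce smart_open's code. PartBufferingWriter and the upload_part callback are hypothetical names, and the callback stands in for whatever call really sends a part. The 5 MiB default mirrors S3's minimum size for every part except the last.

```python
import io


class PartBufferingWriter:
    """Collect written bytes and hand them off in parts.

    upload_part is a hypothetical callback taking (part_number, data);
    it stands in for the real "send one part" call.
    """

    def __init__(self, upload_part, min_part_size=5 * 1024 * 1024):
        self._upload_part = upload_part
        self._min_part_size = min_part_size
        self._buffer = io.BytesIO()
        self._part_number = 0

    def write(self, data):
        self._buffer.write(data)
        # Only ship a part once enough bytes have accumulated.
        if self._buffer.tell() >= self._min_part_size:
            self._flush_part()
        return len(data)

    def _flush_part(self):
        self._part_number += 1
        self._upload_part(self._part_number, self._buffer.getvalue())
        self._buffer = io.BytesIO()

    def close(self):
        # The final part is allowed to be smaller than the minimum.
        if self._buffer.tell() > 0:
            self._flush_part()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # A real writer would abort the upload on error; this sketch just closes.
        self.close()


parts = []
# A tiny part size so that three short lines already span two parts.
with PartBufferingWriter(lambda n, data: parts.append((n, data)),
                         min_part_size=16) as fout:
    for line in [b'first line\n', b'second line\n', b'third line\n']:
        fout.write(line)
```

At no point does the writer hold more than roughly one part in memory, which is what makes streaming a very large file feasible.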

Installation

pip install smart_open

Or, if you prefer to install from the source tar.gz:

python setup.py test  # run unit tests
python setup.py install

To run the unit tests (optional), you’ll also need to install mock, moto and responses (pip install mock moto responses). The tests are also run automatically with Travis CI on every commit push and pull request.

S3-Specific Options

The S3 reader supports gzipped content transparently, as long as the key is obviously a gzipped file (e.g. ends with “.gz”).

There are a few optional keyword arguments that are useful only for S3 access.

The host and profile_name arguments are both passed to boto.connect_s3() as keyword arguments:

>>> smart_open('s3://', host='s3.amazonaws.com')
>>> smart_open('s3://', profile_name='my-profile')

The s3_session argument allows you to provide a custom boto3.Session instance for connecting to S3:

>>> import boto3
>>> smart_open('s3://', s3_session=boto3.Session())

The s3_upload argument accepts a dict of any parameters accepted by initiate_multipart_upload:

>>> smart_open('s3://', s3_upload={ 'ServerSideEncryption': 'AES256' })

Since going over all (or select) keys in an S3 bucket is a very common operation, there’s also an extra method smart_open.s3_iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

>>> import boto
>>> from smart_open import smart_open, s3_iter_bucket
>>> # get all JSON files under "mybucket/foo/"
>>> bucket = boto.connect_s3().get_bucket('mybucket')
>>> for key, content in s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
...     print(key, len(content))
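
The filter-then-fetch-in-parallel pattern behind this can be sketched without any S3 access. The snippet below is an assumption-laden illustration, not the library's implementation: iter_bucket is a hypothetical function, FAKE_BUCKET is a made-up in-memory dict standing in for a bucket, and a thread pool replaces the process pool so that the example stays self-contained.

```python
from multiprocessing.pool import ThreadPool

# Made-up stand-in for a bucket: key name -> content.
FAKE_BUCKET = {
    'foo/a.json': b'{"a": 1}',
    'foo/b.txt': b'not json',
    'foo/c.json': b'{"c": [1, 2, 3]}',
    'bar/d.json': b'{}',
}


def iter_bucket(bucket, prefix='', accept_key=lambda key: True, workers=4):
    """Yield (key, content) for matching keys, fetching them concurrently."""

    def fetch(key):
        # Stand-in for a download; a real version would hit the network here.
        return key, bucket[key]

    # Filter first, so that only wanted keys are ever fetched.
    wanted = [k for k in bucket if k.startswith(prefix) and accept_key(k)]
    pool = ThreadPool(workers)
    try:
        # imap_unordered hands back results as they finish, in no fixed order.
        for key, content in pool.imap_unordered(fetch, wanted):
            yield key, content
    finally:
        pool.terminate()


results = dict(iter_bucket(FAKE_BUCKET, prefix='foo/',
                           accept_key=lambda key: key.endswith('.json')))
for key in sorted(results):
    print(key, len(results[key]))
```

Because results arrive in completion order, callers that need a stable order should sort or collect them first, as the loop above does.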

For more info (S3 credentials in URI, minimum S3 part size…) and full method signatures, check out the API docs:

>>> import smart_open
>>> help(smart_open.smart_open_lib)

Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there. Suggestions, pull requests and improvements are welcome!


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.


Source Distribution

smart_open-1.8.0.tar.gz (40.7 kB)

Uploaded Source

File details

Details for the file smart_open-1.8.0.tar.gz.

File metadata

  • Download URL: smart_open-1.8.0.tar.gz
  • Upload date:
  • Size: 40.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.29.1 CPython/2.7.15rc1

File hashes

Hashes for smart_open-1.8.0.tar.gz
Algorithm Hash digest
SHA256 a52206bb69c38c5f08709ec2ee5704b0f86fc0a770935b5dad9b5841bfd5f502
MD5 43415e6bb245e679cdd04097a62eb288
BLAKE2b-256 ffc8de7dcf34d4b5f2ae94fe1055e0d6418fb97a63c9dc3428edd264704983a2
