Utils for streaming large files (S3, HDFS, gzip, bz2...)
smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS or local (compressed) files. It is well tested (using moto), well documented and sports a simple, Pythonic API:
>>> # stream lines from an S3 object
>>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
...     print line

>>> # you can also use a boto.s3.key.Key instance directly:
>>> key = boto.connect_s3().get_bucket("my_bucket").get_key("my_key")
>>> with smart_open.smart_open(key) as fin:
...     for line in fin:
...         print line

>>> # can use context managers too:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     fin.seek(0)  # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes

>>> # stream from HDFS
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...     print line

>>> # stream from WebHDFS
>>> for line in smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
...     print line

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

>>> # stream content *into* WebHDFS (write mode):
>>> with smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
Since going over all (or select) keys in an S3 bucket is a very common operation, there’s also an extra method smart_open.s3_iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):
>>> # get all JSON files under "mybucket/foo/"
>>> bucket = boto.connect_s3().get_bucket('mybucket')
>>> for key, content in s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
...     print key, len(content)
For more info (S3 credentials in URI, minimum S3 part size…) and full method signatures, check out the API docs:
>>> import smart_open
>>> help(smart_open.smart_open_lib)
There are a few optional keyword arguments that are useful only for S3 access.
>>> smart_open.smart_open('s3://', host='s3.amazonaws.com')
>>> smart_open.smart_open('s3://', profile_name='my-profile')
These are both passed on to boto.connect_s3() as keyword arguments.
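Under the hood this is roughly equivalent to building the boto connection yourself and handing the resulting key to smart_open, as in the key-instance example above. A minimal sketch, assuming the placeholder bucket and key names:

>>> # build the boto connection explicitly, then stream via smart_open;
>>> # 'mybucket' and 'mykey.txt' are placeholder names
>>> conn = boto.connect_s3(host='s3.amazonaws.com', profile_name='my-profile')
>>> key = conn.get_bucket('mybucket').get_key('mykey.txt')
>>> with smart_open.smart_open(key) as fin:
...     print fin.read(100)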
Working with large S3 files using Amazon’s default Python library, boto, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using boto’s multipart upload functionality, and a lot of boilerplate.
smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.
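For example, uploading a large file with plain boto means materializing the whole payload in RAM before the upload starts, while smart_open streams it piece by piece. A rough sketch of the contrast, with placeholder file and bucket names:

>>> # plain boto: the entire file must fit in memory before the upload starts
>>> key = boto.connect_s3().get_bucket('mybucket').new_key('mykey.txt')
>>> key.set_contents_from_string(open('huge_file.txt').read())

>>> # smart_open: stream the same content line by line, in constant memory
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in open('huge_file.txt'):
...         fout.write(line)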
The module has no dependencies beyond Python >= 2.6 (or Python >= 3.3) and boto:
pip install smart_open
Or, if you prefer to install from the source tar.gz:
python setup.py test    # run unit tests
python setup.py install
To run the unit tests (optional), you’ll also need to install mock, moto and responses (https://github.com/getsentry/responses): pip install mock moto responses. The tests are also run automatically with Travis CI on every commit push & pull request.
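As a rough illustration of how moto is used, the sketch below creates a fake in-memory bucket and round-trips some content through smart_open. The test name is made up, and bytes-vs-text details may differ between Python 2 and 3:

import boto
import smart_open
from moto import mock_s3

@mock_s3
def test_s3_roundtrip():
    # moto intercepts boto's S3 calls, so no real AWS account is touched
    boto.connect_s3().create_bucket('mybucket')

    # write via smart_open, then stream the content back and compare
    with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
        fout.write("hello world\n")

    with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
        assert fin.read() == "hello world\n"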
smart_open is an ongoing effort. Suggestions, pull requests and improvements are welcome!
On the roadmap:
- better documentation for the default file:// scheme
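For reference, the default scheme already covers plain local paths as well as explicit file:// URIs; a small sketch, assuming the file:// handler accepts absolute paths like this (paths are placeholders):

>>> # plain local paths and explicit file:// URIs are handled the same way
>>> for line in smart_open.smart_open('./foo.txt'):
...     print line

>>> for line in smart_open.smart_open('file:///home/radim/foo.txt'):
...     print line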