smart-open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Database :: Front-Ends
- System :: Distributed Computing

Project description

What?

smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS, HTTP, or local (compressed) files. It is well tested (using moto), well documented and sports a simple, Pythonic API:

>>> import boto
>>> import smart_open

>>> # stream lines from an S3 object
>>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
...     print(line)

>>> # using a completely custom s3 server, like s3proxy:
>>> for line in smart_open.smart_open('s3u://user:secret@host:port@mybucket/mykey.txt'):
...     print(line)

>>> # you can also use a boto.s3.key.Key instance directly:
>>> key = boto.connect_s3().get_bucket("my_bucket").get_key("my_key")
>>> with smart_open.smart_open(key) as fin:
...     for line in fin:
...         print(line)

>>> # can use context managers too:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print(line)
...     fin.seek(0)  # seek to the beginning
...     print(fin.read(1000))  # read 1000 bytes

>>> # stream from HDFS
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...     print(line)

>>> # stream from HTTP
>>> for line in smart_open.smart_open('http://example.com/index.html'):
...     print(line)

>>> # stream from WebHDFS
>>> for line in smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
...     print(line)

>>> # stream content *into* S3 (write mode; 'wb' expects bytes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in [b'first line', b'second line', b'third line']:
...         fout.write(line + b'\n')

>>> # stream content *into* HDFS (write mode):
>>> with smart_open.smart_open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
...     for line in [b'first line', b'second line', b'third line']:
...         fout.write(line + b'\n')

>>> # stream content *into* WebHDFS (write mode):
>>> with smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
...     for line in [b'first line', b'second line', b'third line']:
...         fout.write(line + b'\n')

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print(line)

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write(b"some content\n")
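
The compressed-file examples above depend on choosing a (de)compressor from the file extension. The following is a minimal sketch of that idea using only the Python 3 standard library. `open_by_extension`, `roundtrip` and the `_OPENERS` table are hypothetical names made up for illustration; they are not part of smart_open, and the library itself may dispatch differently.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical extension-to-opener table (an assumption, not smart_open's own).
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(path, mode="rb"):
    """Open `path` in binary mode, (de)compressing based on its extension."""
    _, ext = os.path.splitext(path)
    # Fall back to the builtin open() for unknown extensions.
    opener = _OPENERS.get(ext, open)
    return opener(path, mode)


def roundtrip(directory, name, payload):
    """Write `payload` (bytes) to `name` in `directory`, then read it back."""
    path = os.path.join(directory, name)
    with open_by_extension(path, "wb") as fout:
        fout.write(payload)
    with open_by_extension(path, "rb") as fin:
        return fin.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("foo.txt", "foo.txt.gz", "foo.txt.bz2"):
            print(name, roundtrip(tmp, name, b"some content\n"))
```

The caller reads and writes plain bytes in every case; only the file name decides whether compression is applied.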

Since going over all (or select) keys in an S3 bucket is a very common operation, there’s also an extra function, smart_open.s3_iter_bucket(), that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

>>> # get all JSON files under "mybucket/foo/"
>>> bucket = boto.connect_s3().get_bucket('mybucket')
>>> for key, content in smart_open.s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
...     print(key, len(content))
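
To make the filter-then-fetch-in-parallel pattern concrete, here is a small self-contained sketch. It is not smart_open's implementation: `FAKE_BUCKET` and `iter_bucket` are hypothetical, the "bucket" is just a dict, and a thread pool stands in for the process pool mentioned above so that the example runs without any S3 access.

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for a bucket: key name -> object content (bytes).
FAKE_BUCKET = {
    "foo/a.json": b"{}",
    "foo/b.json": b"[1, 2]",
    "foo/readme.txt": b"not json",
    "bar/c.json": b"{}",
}


def iter_bucket(bucket, prefix="", accept_key=lambda name: True, workers=4):
    """Yield (key name, content) pairs for matching keys, fetched in parallel."""
    names = [
        name for name in sorted(bucket)
        if name.startswith(prefix) and accept_key(name)
    ]
    pool = ThreadPool(workers)
    try:
        # imap_unordered yields each result as soon as it is ready,
        # so the output order is not guaranteed.
        for pair in pool.imap_unordered(lambda n: (n, bucket[n]), names):
            yield pair
    finally:
        pool.terminate()


if __name__ == "__main__":
    matches = iter_bucket(
        FAKE_BUCKET, prefix="foo/", accept_key=lambda n: n.endswith(".json")
    )
    for name, content in matches:
        print(name, len(content))
```

Filtering on the key name before fetching means that rejected keys cost no download at all.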

For more info (S3 credentials in URI, minimum S3 part size…) and full method signatures, check out the API docs:

>>> import smart_open
>>> help(smart_open.smart_open_lib)

S3-Specific Options

There are a few optional keyword arguments that are useful only for S3 access.

>>> smart_open.smart_open('s3://', host='s3.amazonaws.com')
>>> smart_open.smart_open('s3://', profile_name='my-profile')

These are both passed to boto.connect_s3() as keyword arguments. The S3 reader supports gzipped content, as long as the key is obviously a gzipped file (e.g. ends with “.gz”).
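
One way to read gzipped content as a stream, without holding the whole object in memory, is to decompress it chunk by chunk. The sketch below uses the standard library's zlib for that. `is_gzip_key` and `iter_decompressed` are hypothetical helpers written for this illustration; how smart_open's S3 reader handles gzip internally is not shown here.

```python
import gzip
import zlib


def is_gzip_key(key_name):
    """Mirror the example in the text: treat keys ending in '.gz' as gzipped."""
    return key_name.endswith(".gz")


def iter_decompressed(chunks):
    """Incrementally gunzip an iterable of byte chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


if __name__ == "__main__":
    payload = b"one line of text\n" * 1000
    compressed = gzip.compress(payload)
    # Simulate a network stream by slicing the compressed bytes into chunks.
    chunks = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
    print(sum(len(part) for part in iter_decompressed(chunks)))
```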

Why?

Working with large S3 files using Amazon’s default Python library, boto, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using boto’s multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.
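
Part of the boilerplate mentioned above comes from having to group many small writes into upload parts of a minimum size. The sketch below shows one way such buffering could work. `PartBuffer` and its `upload_part` callback are hypothetical and do not talk to S3; this is an illustration of the general technique, not the code smart_open or boto actually uses.

```python
import io


class PartBuffer:
    """Collect small writes and emit parts of at least `min_part_size` bytes."""

    def __init__(self, upload_part, min_part_size):
        # Hypothetical callback with signature (part_number, data) -> None.
        self._upload_part = upload_part
        self._min_part_size = min_part_size
        self._buffer = io.BytesIO()
        self._part_number = 0

    def write(self, data):
        self._buffer.write(data)
        if self._buffer.tell() >= self._min_part_size:
            self._flush()

    def close(self):
        # The final part is allowed to be smaller than min_part_size.
        if self._buffer.tell() > 0:
            self._flush()

    def _flush(self):
        self._part_number += 1
        self._upload_part(self._part_number, self._buffer.getvalue())
        self._buffer = io.BytesIO()


if __name__ == "__main__":
    parts = []
    buf = PartBuffer(lambda n, data: parts.append((n, data)), min_part_size=16)
    for line in [b"first line\n", b"second line\n", b"x\n"]:
        buf.write(line)
    buf.close()
    print(parts)
```

Only one part is held in memory at a time, which is what makes writing very large objects feasible.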

Installation

pip install smart_open

Or, if you prefer to install from the source tar.gz:

python setup.py test  # run unit tests
python setup.py install

To run the unit tests (optional), you’ll also need to install mock, moto and responses (https://github.com/getsentry/responses): pip install mock moto responses. The tests are also run automatically with Travis CI on every commit push and pull request.

Todo

smart_open is an ongoing effort. Suggestions, pull requests and improvements are welcome!

On the roadmap:

- better documentation for the default file:// scheme

Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there.

Release history

7.0.4

Mar 26, 2024

7.0.3

Mar 21, 2024

7.0.2

Mar 21, 2024

7.0.1

Feb 26, 2024

7.0.0

Feb 26, 2024

6.4.0

Sep 7, 2023

6.3.0

Dec 12, 2022

6.2.0

Sep 14, 2022

6.1.0

Aug 21, 2022

6.0.0

Apr 24, 2022

5.2.1

Aug 28, 2021

5.2.0

Aug 18, 2021

5.1.0

May 25, 2021

5.0.0

Mar 30, 2021

4.2.0

Feb 15, 2021

4.1.2

Jan 18, 2021

4.1.0

Dec 30, 2020

4.0.1

Nov 27, 2020

4.0.0

Nov 24, 2020

3.0.0

Oct 8, 2020

2.2.1

Oct 1, 2020

2.2.0

Sep 25, 2020

2.1.1

Aug 27, 2020

2.1.0

Jul 1, 2020

2.0.0

Apr 28, 2020

1.11.1

Apr 8, 2020

1.11.0

Apr 8, 2020

1.10.1

Apr 26, 2020

1.10.0

Mar 16, 2020

1.9.0

Nov 3, 2019

1.8.4

Jun 2, 2019

1.8.3

Apr 26, 2019

1.8.2

Apr 17, 2019

1.8.1

Apr 9, 2019

1.8.0

Jan 17, 2019

1.7.1

Sep 19, 2018

1.7.0

Sep 19, 2018

1.6.0

Jun 29, 2018

1.5.7

Mar 18, 2018

1.5.6

Dec 28, 2017

This version

1.5.5

Dec 6, 2017

1.5.4

Nov 30, 2017

1.5.3

May 18, 2017

1.5.2

Apr 12, 2017

1.5.1

Mar 17, 2017

1.5.0

Mar 14, 2017

1.4.0

Feb 13, 2017

1.3.5

Oct 5, 2016

1.3.4

Aug 26, 2016

1.3.3

May 16, 2016

1.3.2

Jan 3, 2016

1.3.1

Dec 18, 2015

1.3.0

Sep 19, 2015

1.3.0rc1 pre-release

Sep 17, 2015

1.2.1

Apr 10, 2015

1.2.0

Apr 9, 2015

1.1.0

Feb 1, 2015

1.0.2

Jan 25, 2015

1.0.1

Jan 25, 2015

0.1.1

Jan 24, 2015

0.1.0

Jan 20, 2015

Download files

Source Distribution

smart_open-1.5.5.tar.gz (32.0 kB)

Uploaded Dec 6, 2017 Source

Hashes for smart_open-1.5.5.tar.gz
Algorithm	Hash digest
SHA256	`9e2591241e92f552cf4225b1b70bbaeace0122fd9068bfc769567aeb947e1a4e`
MD5	`a19a828e58fa78e7d60a40df51ae2804`
BLAKE2b-256	`fbc9fa4099bf0818edc38fcb04db7b905227d5bbffccfce71e90bc42e52d56e7`