Python module which connects to Amazon's S3 REST API
Overview
s3 is a connector to S3, Amazon’s Simple Storage Service REST API.
Use it to upload, download, delete, and copy files in S3, test them for existence, or update their metadata.
S3 files may have metadata in addition to their content. Metadata is a set of key/value pairs. Metadata may be set when the file is uploaded or it can be updated subsequently.
S3 files are stored in S3 buckets. Buckets can be created, listed, configured, and deleted. The bucket configuration can be read and the bucket contents can be listed.
In addition to the s3 Python module, this package contains a command line tool also named s3. The tool imports the module and offers a command line interface to some of the module’s capability.
Installation
From PyPI
$ pip install s3
From source
$ hg clone ssh://hg@bitbucket.org/prometheus/s3
$ pip install -e s3
The installation is successful if you can import s3 and run the command line tool. The following commands must produce no errors:
$ python -c 'import s3'
$ s3 --help
API to remote storage
S3 Buckets
Buckets store files. Buckets may be created and deleted. They may be listed, configured, and loaded with files. The configuration can be read, and the files in the bucket can be listed.
Bucket names must be unique across S3 so it is best to use a prefix on all bucket names. S3 forbids underscores in bucket names, and although it allows periods, these confound DNS and should be avoided.
We prefix all our bucket names with: com-prometheus-
All the bucket configuration options work the same way - the caller provides XML or JSON data and perhaps headers or params as well.
s3 accepts a python object for the data argument instead of a string. The object will be converted to XML or JSON as required.
Likewise, s3 returns a python dict instead of the XML or JSON string returned by S3. However, that string is readily available if need be, because the response returned by requests.request() is exposed to the caller.
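For example, here is a minimal sketch (the bucket name is illustrative, and storage is an s3.Storage instance set up as shown in the Examples section below) of reading a bucket configuration as a parsed dict while the raw response remains available:

# Returns the location configuration parsed into a python dict.
location = storage.bucket_get_location('com-prometheus-my-bucket')

# The raw requests response from the same call is still available.
raw_xml = storage.response.text          # the XML string S3 actually returned
status = storage.response.status_code    # e.g. 200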
S3 Filenames
An S3 file name consists of a bucket and a key. This pair of strings uniquely identifies the file within S3.
The S3Name class is instantiated with a key and a bucket; the key is required and the bucket defaults to None.
The Storage class methods take a remote_name argument which can be either a string (the key) or an instance of the S3Name class. When no bucket is given (or the bucket is None), the default_bucket established when the connection was instantiated is used. If no bucket is given and there is no default bucket, a ValueError is raised.
In other words, the S3Name class provides a means of using a bucket other than the default_bucket.
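For example, a short sketch (the bucket and key names are hypothetical, and storage is an s3.Storage instance as set up in the Examples section below):

# A plain string key: the file goes to the connection's default_bucket.
storage.write("report.csv", "reports/report.csv")

# An S3Name with an explicit bucket: the same key, stored in another bucket.
remote = s3.S3Name("reports/report.csv", bucket="com-prometheus-other-bucket")
storage.write("report.csv", remote)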
S3 Directories
Although S3 storage is flat (buckets simply contain keys), S3 lets you impose a directory tree structure on a bucket by using a delimiter in your keys.
For example, if you name a key ‘a/b/f’, and use ‘/’ as the delimiter, then S3 will consider that ‘a’ is a directory, ‘b’ is a sub-directory of ‘a’, and ‘f’ is a file in ‘b’.
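As a sketch (the bucket, prefix, and key names are hypothetical), the bucket_list_keys method described under Storage Methods below separates files from 'directories' when a delimiter is given:

# Files directly under 'a/' are yielded first as S3Key instances, then the
# common prefixes (sub-directories such as 'a/b/') are yielded as strings.
files, directories = [], []
for item in storage.bucket_list_keys('com-prometheus-my-bucket',
                                     delimiter='/', prefix='a/'):
    if isinstance(item, s3.S3Key):    # assumes S3Key is exposed by the s3 module
        files.append(item.key)
    else:
        directories.append(item)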
Headers and Metadata
Additional http headers may be sent using the methods which write data. These methods accept an optional headers argument which is a python dict. The headers control various aspects of how the file may be handled. S3 supports a variety of headers. These are not discussed here. See Amazon’s S3 documentation for more info on S3 headers.
Those headers whose key begins with the special prefix x-amz-meta- are considered to be metadata headers and are used to set the metadata attributes of the file.
The methods which read files also return the metadata which consists of only those response headers which begin with x-amz-meta-.
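For example, a hedged sketch (the header values and file names are illustrative) showing that only the x-amz-meta- headers come back as metadata:

headers = {
    'Content-Type': 'text/plain',          # ordinary S3 header, not metadata
    'x-amz-meta-author': 'prometheus',     # metadata header
}
storage.write("example.txt", "example-in-s3", headers=headers)
exists, metadata = storage.exists("example-in-s3")
# metadata holds only the x-amz-meta- entries,
# e.g. {'x-amz-meta-author': 'prometheus'}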
Python classes for S3 data
To facilitate the transfer of data between S3 and applications, the package defines various classes which correspond to the data returned by S3.
All attributes of these classes are strings.
- S3Bucket
creation_date
name
- S3Key
e_tag
key
last_modified
owner
size
storage_class
- S3Owner
display_name
id
XML strings and Python objects
An XML string consists of a series of nested tags. An XML tag can be represented in python as an entry in a dict. An OrderedDict from the collections module should be used when the order of the keys is important.
The opening tag (everything between the ‘<’ and the ‘>’) is the key and everything between the opening tag and the closing tag is the value of the key.
Since every value must be enclosed in a tag, not every python object can represent XML in this way. In particular, lists may only contain dicts which have a single key.
For example this XML:
<a xmlns="foo">
  <b1>
    <c1> 1 </c1>
  </b1>
  <b2>
    <c2> 2 </c2>
  </b2>
</a>
is equivalent to this object:
{'a xmlns="foo"': [{'b1': {'c1': 1}}, {'b2': {'c2': 2}}] }
Storage Methods
The arguments remote_source, remote_destination, and remote_name may be either a string, or an S3Name instance.
local_name is a string and is the name of the file on the local system. This string is passed directly to open().
bucket is a string and is the name of the bucket.
headers is a python dict used to encode additional request headers.
params is either a python dict used to encode the request parameters, or a string containing all the text of the url query string after the ‘?’.
data is a string or an object and is the body of the message. The object will be converted to an XML or JSON string as appropriate.
All methods return on success or raise StorageError on failure.
Upon return storage.response contains the raw response object which was returned by the requests module. So for example, storage.response.headers contains the response headers returned by S3. See http://docs.python-requests.org/en/latest/api/ for a description of the response object.
See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketOps.html for a description of the available bucket operations and their arguments.
- storage.bucket_create(bucket, headers={}, data=None)
Create a bucket named bucket. headers may be used to set either ACL or explicit access permissions. data may be used to override the default region. If data is None, data is set as follows:
data = {'CreateBucketConfiguration'
        ' xmlns="http://s3.amazonaws.com/doc/2006-03-01/"':
        {'LocationConstraint': self.connection.region}}
- storage.bucket_delete(bucket)
Delete a bucket named bucket.
- storage.bucket_delete_cors(bucket)
Delete cors configuration of bucket named bucket.
- storage.bucket_delete_lifecycle(bucket)
Delete lifecycle configuration of bucket named bucket.
- storage.bucket_delete_policy(bucket)
Delete policy of bucket named bucket.
- storage.bucket_delete_tagging(bucket)
Delete tagging configuration of bucket named bucket.
- storage.bucket_delete_website(bucket)
Delete website configuration of bucket named bucket.
- exists = storage.bucket_exists(bucket)
Test if bucket exists in storage.
exists - boolean.
- storage.bucket_get(bucket, params={})
Gets the next block of keys from the bucket based on params.
- d = storage.bucket_get_acl(bucket)
Returns bucket acl configuration as a dict.
- d = storage.bucket_get_cors(bucket)
Returns bucket cors configuration as a dict.
- d = storage.bucket_get_lifecycle(bucket)
Returns bucket lifecycle as a dict.
- d = storage.bucket_get_location(bucket)
Returns bucket location configuration as a dict.
- d = storage.bucket_get_logging(bucket)
Returns bucket logging configuration as a dict.
- d = storage.bucket_get_notification(bucket)
Returns bucket notification configuration as a dict.
- d = storage.bucket_get_policy(bucket)
Returns bucket policy as a dict.
- d = storage.bucket_get_request_payment(bucket)
Returns bucket requestPayment configuration as a dict.
- d = storage.bucket_get_tagging(bucket)
Returns bucket tagging configuration as a dict.
- d = storage.bucket_get_versioning(bucket)
Returns bucket versioning configuration as a dict.
- d = storage.bucket_get_versions(bucket, params={})
Returns bucket versions as a dict.
- d = storage.bucket_get_website(bucket)
Returns bucket website configuration as a dict.
- for bucket in storage.bucket_list():
Returns a Generator object which returns all the buckets for the authenticated user’s account.
Each bucket is returned as an S3Bucket instance.
- for key in storage.bucket_list_keys(bucket, delimiter=None, prefix=None, params={}):
Returns a Generator object which returns all the keys in the bucket.
Each key is returned as an S3Key instance.
bucket - the name of the bucket to list
delimiter - used to request common prefixes
prefix - used to filter the listing
params - additional parameters.
When delimiter is used, the keys (i.e. file names) are returned first, followed by the common prefixes (i.e. directory names). Each key is returned as an S3Key instance. Each common prefix is returned as a string.
As a convenience, the delimiter and prefix may be provided as either keyword arguments or as keys in params. If the arguments are provided, they are used to update params. In any case, params are passed to S3.
See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for a description of delimiter, prefix, and the other parameters.
- storage.bucket_set_acl(bucket, headers={}, data='')
Configure bucket acl using XML data, or request headers.
- storage.bucket_set_cors(bucket, data='')
Configure bucket cors with XML data.
- storage.bucket_set_lifecycle(bucket, data='')
Configure bucket lifecycle with XML data.
- storage.bucket_set_logging(bucket, data='')
Configure bucket logging with XML data.
- storage.bucket_set_notification(bucket, data='')
Configure bucket notification with XML data.
- storage.bucket_set_policy(bucket, data='')
Configure bucket policy using JSON data.
- storage.bucket_set_request_payment(bucket, data='')
Configure bucket requestPayment with XML data.
- storage.bucket_set_tagging(bucket, data='')
Configure bucket tagging with XML data.
- storage.bucket_set_versioning(bucket, headers={}, data='')
Configure bucket versioning using XML data and request headers.
- storage.bucket_set_website(bucket, data='')
Configure bucket website with XML data.
- storage.copy(remote_source, remote_destination, headers={})
Copy remote_source to remote_destination.
The destination metadata is copied from headers when it contains metadata; otherwise it is copied from the source metadata.
- storage.delete(remote_name)
Delete remote_name from storage.
- exists, metadata = storage.exists(remote_name)
Test if remote_name exists in storage, retrieve its metadata if it does.
exists - boolean, metadata - dict.
- metadata = storage.read(remote_name, local_name)
Download remote_name from storage, save it locally as local_name and retrieve its metadata.
metadata - dict.
- storage.update_metadata(remote_name, headers)
Update (replace) the metadata associated with remote_name with the metadata headers in headers.
- storage.write(local_name, remote_name, headers={})
Upload local_name to storage as remote_name, and set its metadata if any metadata headers are in headers.
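To illustrate the calling pattern, here is a hedged sketch (the bucket name, region, and ACL value are illustrative) that creates a bucket with an explicit region and a canned ACL header, then reads one of its configurations back:

# 'x-amz-acl' is a standard S3 canned-ACL header; the data dict overrides the
# default region, following the CreateBucketConfiguration form shown above.
storage.bucket_create(
    'com-prometheus-other-bucket',
    headers={'x-amz-acl': 'private'},
    data={'CreateBucketConfiguration'
          ' xmlns="http://s3.amazonaws.com/doc/2006-03-01/"':
          {'LocationConstraint': 'eu-west-1'}})

# The configuration comes back as a python dict parsed from the XML S3 returns.
acl = storage.bucket_get_acl('com-prometheus-other-bucket')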
StorageError
There are two forms of exceptions.
The first form is when a request to S3 completes but fails. For example a read request may fail because the user does not have read permission. In this case a StorageError is raised with:
msg - The name of the method that was called (e.g. ‘read’, ‘exists’, etc.)
exception - A detailed error message
response - The raw response object returned by requests.
The second form is when any other exception happens. For example a disk or network error. In this case StorageError is raised with:
msg - A detailed error message.
exception - The exception object
response - None
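A hedged sketch of handling both forms (this assumes StorageError is importable from the s3 module, and that msg, exception, and response are exposed as attributes of the raised exception, as the fields above suggest):

try:
    storage.read("example-in-s3", "example-from-s3")
except s3.StorageError as e:
    if e.response is not None:
        # First form: the request completed but failed.
        status = e.response.status_code   # e.g. 403 for a permission error
    else:
        # Second form: some other failure (disk or network error, etc.).
        underlying = e.exception
    details = e.msg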
Usage
Configuration
First configure your yaml file.
access_key_id and secret_access_key are generated by the S3 account manager. They are effectively the username and password for the account.
default_bucket is the name of the default bucket to use when referencing S3 files. Bucket names must be unique (on earth), so by convention we use a prefix on all our bucket names: com-prometheus- (NOTE: Amazon forbids underscores in bucket names, and although it allows periods, periods will confound DNS, so it is best not to use them in bucket names).
endpoint and region are the Amazon server url to connect to and its associated region. See http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region for a list of the available endpoints and their associated regions.
tls selects the scheme: True => use https://, False => use http://. Default is True.
retry contains values used to retry requests.request(). If a request fails with an error listed in status_codes, and the limit of tries has not been reached, then a retry message is logged, the program sleeps for interval seconds, and the request is sent again. Default is:
retry:
    limit: 5
    interval: 2.5
    status_codes:
        - 104
limit is the number of times to try to send the request. 0 means unlimited retries.
interval is the number of seconds to wait between retries.
status_codes is a list of request status codes (errors) to retry.
Here is an example s3.yaml
---
s3:
    access_key_id: "XXXXX"
    secret_access_key: "YYYYYYY"
    default_bucket: "ZZZZZZZ"
    endpoint: "s3-us-west-2.amazonaws.com"
    region: "us-west-2"
Next configure your S3 bucket permissions. You can use s3 to create, configure, and manage your buckets (see the examples below) or you can use Amazon’s web interface:
Log onto your Amazon account.
Create a bucket or click on an existing bucket.
Click on Properties.
Click on Permissions.
Click on Edit Bucket Policy.
Here is an example policy with the required permissions:
{ "Version": "2008-10-17", "Id": "Policyxxxxxxxxxxxxx", "Statement": [ { "Sid": "Stmtxxxxxxxxxxxxx", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::xxxxxxxxxxxx:user/XXXXXXX" }, "Action": [ "s3:AbortMultipartUpload", "s3:GetObjectAcl", "s3:GetObjectVersion", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:GetObject", "s3:PutObjectAcl", "s3:PutObjectVersionAcl", "s3:ListMultipartUploadParts", "s3:PutObject", "s3:GetObjectVersionAcl" ], "Resource": [ "arn:aws:s3:::com.prometheus.cgtest-1/*", "arn:aws:s3:::com.prometheus.cgtest-1" ] } ] }
Examples
Once the yaml file is configured you can instantiate an S3Connection and use that connection to instantiate a Storage instance.
import s3
import yaml

with open('s3.yaml', 'r') as fi:
    config = yaml.load(fi)

connection = s3.S3Connection(**config['s3'])
storage = s3.Storage(connection)
Then you call methods on the Storage instance.
The following code creates a bucket called “com-prometheus-my-bucket” and asserts the bucket exists. Then it deletes the bucket, and asserts the bucket does not exist.
my_bucket_name = 'com-prometheus-my-bucket'
storage.bucket_create(my_bucket_name)
assert storage.bucket_exists(my_bucket_name)
storage.bucket_delete(my_bucket_name)
assert not storage.bucket_exists(my_bucket_name)
The following code lists all the buckets and all the keys in each bucket.
for bucket in storage.bucket_list():
    print bucket.name, bucket.creation_date
    for key in storage.bucket_list_keys(bucket.name):
        print '\t', key.key, key.size, key.last_modified, key.owner.display_name
The following code uses the default bucket and uploads a file named “example” from the local filesystem as “example-in-s3” in s3. It then checks that “example-in-s3” exists in storage, downloads the file as “example-from-s3”, compares the original with the downloaded copy to ensure they are the same, deletes “example-in-s3”, and finally checks that it is no longer in storage.
import subprocess

try:
    storage.write("example", "example-in-s3")
    exists, metadata = storage.exists("example-in-s3")
    assert exists
    metadata = storage.read("example-in-s3", "example-from-s3")
    assert 0 == subprocess.call(['diff', "example", "example-from-s3"])
    storage.delete("example-in-s3")
    exists, metadata = storage.exists("example-in-s3")
    assert not exists
except s3.StorageError as e:
    print 'failed:', e
The following code again uploads “example” as “example-in-s3”. This time it uses the bucket “my-other-bucket” explicitly, and it sets some metadata and checks that the metadata is set correctly. Then it changes the metadata and checks that as well.
headers = {
    'x-amz-meta-state': 'unprocessed',
}
remote_name = s3.S3Name("example-in-s3", bucket="my-other-bucket")
try:
    storage.write("example", remote_name, headers=headers)
    exists, metadata = storage.exists(remote_name)
    assert exists
    assert metadata == headers
    headers['x-amz-meta-state'] = 'processed'
    storage.update_metadata(remote_name, headers)
    metadata = storage.read(remote_name, "example-from-s3")
    assert metadata == headers
except s3.StorageError as e:
    print 'failed:', e
The following code configures “com-prometheus-my-bucket” with a policy that restricts “myuser” to write-only. myuser can write files but cannot read them back, delete them, or even list them.
storage.bucket_set_policy("com-prometheus-my-bucket", data={
    "Version": "2008-10-17",
    "Id": "BucketUploadNoDelete",
    "Statement": [
        {
            "Sid": "Stmt01",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:user/myuser"
            },
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
            ],
            "Resource": [
                "arn:aws:s3:::com-prometheus-my-bucket/*",
                "arn:aws:s3:::com-prometheus-my-bucket"
            ]
        }
    ]
})
s3 Command Line Tool
This package installs both the s3 Python module and the s3 command line tool.
The command line tool provides a convenient way to upload and download files to and from S3 without writing python code.
As of now the tool supports the put, get, delete, and list commands, but it does not support all the features of the module API.
s3 expects to find s3.yaml in the current directory. If it is not there, you must tell s3 where it is using the --config option. For example:
s3 --config /path/to/s3.yaml command [command arguments]
You must provide a command. Some commands have required arguments and/or optional arguments - it depends upon the command.
Use the --help option to see a list of supported commands and their arguments:
s3 --help
See s3 Command Line Tool in the API Reference.