Official Shooju Client
shooju is the official Python client library for Shooju, with the following features:
- Authentication via username and API key
- Getting series points and fields
- Registering import jobs and writing and removing points and fields
Installation
Install with:
pip install shooju
To install from source, use:
python setup.py install
Basic Usage
>>> from shooju import Connection, sid, Point
>>> from datetime import date
>>> conn = Connection(server=<API_SERVER>, user=<USERNAME>, api_key=<API_KEY>)
>>> series_id = sid("users", <USERNAME>, "china", "population")
>>> series_query = 'sid="{}"'.format(series_id)
>>> with conn.register_job('China Pop.') as job:
...     job.write(series_query, fields={"unit": "millions"}, points=[Point(date(2012, 1, 1), 314.3)])
>>> series = conn.get_series('sid="{}"'.format(series_id), fields=['unit'],
...                          max_points=1, df=date(2012, 1, 1), dt=date(2012, 1, 1))
>>> print(series['points'][0].value)
>>> print(series['fields']['unit'])
Code samples
Code samples are in the usage_samples/ directory. Before running them, set your user and server settings in usage_samples/sample_settings.py.
Tutorial
Connecting to Shooju
The first step when working with shooju is to connect to Shooju using your username and API key, or your Google account email and Google OAuth refresh token. To authenticate with a Shooju username and API key, find the API key in the accounts section of Shooju.com. You should also supply the server you are using:
>>> from shooju import Connection
>>> conn = Connection(server=API_SERVER, username=USERNAME, api_key=API_KEY)
Connection accepts an optional requests_session parameter of type requests.Session:
>>> import requests
>>> session = requests.Session()
>>> sj = Connection(API_SERVER, USERNAME, API_KEY, requests_session=session)
To retrieve the Google OAuth refresh token, follow these steps:
>>> from shooju import Client, Connection
>>> client = Client(API_SERVER, base_path="/api/1")
>>> oauth_link = client.get('/auth/google_refresh_token')['link']
Open the OAuth link in a web browser and copy the CODE, then use the following to retrieve the refresh token:
>>> refresh_token = client.post('/auth/google_refresh_token', data_json={'code': CODE})['refresh_token']
Shooju Series Representation
The basic data building block in Shooju is the series (i.e. a time series), and each series is identified by a series id. A series id is a path-like string delimited by \ characters. The path helps keep data series organized into folder-like structures. By default, each user can write into the id space users\your_username\*. So if I'm Sam and I want to import my GDP forecasts, I might use the series id users\sam\china\gdp. To help put the string together, you can use a helper function like so:
>>> from shooju import sid
>>> series_id = sid("users", "sam", "china", "gdp")
>>> print(series_id)
users\sam\china\gdp
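Under the hood there is nothing magic about the helper: it just joins the parts with the backslash delimiter. A minimal sketch (make_sid is a hypothetical re-implementation for illustration, not the library's code):

```python
def make_sid(*parts):
    # Hypothetical re-implementation of shooju.sid for illustration:
    # join path components with Shooju's backslash delimiter.
    return "\\".join(parts)

print(make_sid("users", "sam", "china", "gdp"))  # users\sam\china\gdp
```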
Writing Data
To write data, first register a job with Shooju:
>>> job = conn.register_job("My description")
To write a data point to Shooju, we first instantiate a Point object, specifying the date and float value:
>>> from datetime import date
>>> from shooju import Point
>>> series_id = sid("users", USERNAME, "gdp", "china")
>>> series_query = 'sid="{}"'.format(series_id)
>>> points = []
>>> for i in range(1, 28):
...     points.append(Point(date(2010 + i, 1, 1), i))
>>> job.write(series_query, points=points)
Shooju also stores field/value data for each series. This is commonly used to store meta-data such as source, unit, notes, etc. To write fields into Shooju use:
>>> job.write(series_query, fields={'source': 'Sam analysis', 'unit': 'US$bn'})
By default, each write() call sends data to Shooju immediately. When making many write() calls, it is recommended to queue them and submit them in batches. This is done by specifying a batch_size when registering the job:
>>> job = conn.register_job("another job", batch_size=500)
>>> series_id = sid("users", USERNAME, "gdp", "germany")
>>> series_query = 'sid="{}"'.format(series_id)
>>> points = []
>>> for i in range(1, 28):
...     points.append(Point(date(2010 + i, 1, 1), i))
>>> job.write(series_query, fields={'source': 'My analysis', 'unit': 'US$bn'}, points=points)
>>> job.submit()  # NOTE: otherwise nothing would happen!
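Conceptually, batching just buffers write() calls and flushes a batch once batch_size is reached, with submit() flushing whatever remains. A toy sketch of that queuing behaviour (BatchingJob is illustrative only, not the library's internals):

```python
class BatchingJob:
    """Toy sketch of batched writes: buffer calls, flush every batch_size."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.submitted = []  # stands in for batches sent to the API

    def write(self, series_query, **data):
        self.buffer.append((series_query, data))
        if len(self.buffer) >= self.batch_size:
            self.submit()  # auto-flush when the buffer is full

    def submit(self):
        if self.buffer:
            self.submitted.append(list(self.buffer))
            self.buffer.clear()

job = BatchingJob(batch_size=2)
for i in range(5):
    job.write('sid="users\\sam\\s{}"'.format(i))
job.submit()  # flush the final partial batch
print([len(batch) for batch in job.submitted])  # [2, 2, 1]
```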
The job object can be used as a context manager. The below two snippets are equivalent:
>>> job = conn.register_job("another job", batch_size=500)
>>> job.write(series_query, fields={'unit': 'US$bn'})
>>> job.submit()
>>> with conn.register_job("another job", batch_size=500) as job:
...     job.write(series_query, fields={'unit': 'US$bn'})
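The with-form needs no explicit submit() because a context-managed job can flush itself on clean exit. A sketch of that pattern (SubmitOnExit is hypothetical, shown only to illustrate the equivalence):

```python
class SubmitOnExit:
    # Sketch of a job used as a context manager: submit() is called
    # automatically when the with-block exits without an error.
    def __init__(self):
        self.submitted = False

    def submit(self):
        self.submitted = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:  # don't submit a half-written job on error
            self.submit()

with SubmitOnExit() as job:
    pass  # writes would happen here
print(job.submitted)  # True
```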
To delete a single series, use:
>>> with conn.register_job("another job", batch_size=500) as job:
...     job.delete_series('sid="{}"'.format(series_id))
To delete many series by a query, use:
>>> with conn.register_job("another job", batch_size=500) as job:
...     job.delete_series('sid:data', one=False)
Getting Data
To read a single series' data, use the get_series() function. The function returns a dict with series_id, points and fields keys; points and fields may be omitted if no points/fields were returned. By default the function does not fetch points or fields. To get an array of points, pass the following parameters: df (date from), dt (date to) and max_points. Note that df and dt are optional, but max_points is required when fetching points because its default value is 0:
>>> from datetime import date
>>> series = conn.get_series(u'sid="{}"'.format(series_id), df=date(2011,1,1), dt=date(2020,1,1), max_points=-1)
>>> print(series['points'][0].date, series['points'][0].value)
2012-01-01 00:00:00 1.0
As noted above, get_series() doesn't fetch points by default. To fetch points, explicitly set max_points (an integer greater than 0). To fetch ALL points, set max_points to the special value -1:
>>> print(conn.get_series(u'sid="{}"'.format(series_id), df=date(2011,1,1), max_points=1)['points'][0].value)
1.0
To get field values, use:
>>> print(conn.get_series('sid="{}"'.format(series_id), fields=["unit"])['fields']['unit'])
US$bn
To get all of the fields for a given series, pass '*' in the fields parameter:
>>> print(conn.get_series(u'sid="{}"'.format(series_id), fields=['*'])['fields'])
{"unit": "US$bn", "source": "My analysis"}
To get some of the fields of a given series, use:
>>> print(conn.get_fields(u'sid="{}"'.format(series_id), fields=["source"]))
{"source": "My analysis"}
Getting multiple data at once (multi-get)
By default, each get_series() call makes one blocking API request. If we were to make all the calls in the getting data example above, we would be making 5 API calls. Shooju API supports multiple get requests via the BULK API, which is much more efficient if we intend to make multiple requests.
To initialize a multi-get request:
>>> mget = conn.mget()
Now we can use the get_series() function. Keep in mind that the function does not return data; instead it queues the request for fetching. We can reproduce the get_* requests introduced above:
>>> series_query = u'sid="{}"'.format(series_id)
>>> mget.get_series(series_query, df=date(2011,1,1), dt=date(2020,1,1), max_points=-1)
0
>>> mget.get_series(series_query, df=date(2011,1,1), max_points=1)
1
>>> mget.get_series(series_query, fields=["unit"])
2
>>> mget.get_series(series_query, fields=["*"])
3
>>> mget.get_fields(series_query, fields=["source"])
4
To get an array containing the results in the order that the get_* requests were called:
>>> result = mget.fetch()
>>> print(result[2]['fields']['unit'])
US$bn
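The shape of the multi-get workflow — queue requests, get back an index, resolve everything in one go — can be sketched like this (MGet is a toy stand-in, not the shooju client's internals):

```python
class MGet:
    """Toy sketch of multi-get: queue requests, return each request's
    index, then resolve all of them at once, preserving order."""

    def __init__(self):
        self._requests = []

    def get_series(self, query, **params):
        self._requests.append((query, params))
        return len(self._requests) - 1  # index into the eventual results

    def fetch(self):
        # A real client would issue a single bulk API call here.
        return [{'query': q, 'params': p} for q, p in self._requests]

mget = MGet()
i = mget.get_series('sid="users\\sam\\gdp"', fields=['unit'])
results = mget.fetch()
print(i, results[i]['params'])  # 0 {'fields': ['unit']}
```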
Scroll
To fetch a large number of series by a given query, use scroll(). This function accepts the same points/fields related parameters as get_series():
>>> for s in conn.scroll('sid:users\\me', fields=['unit'], max_points=-1, df=date(2001, 1, 1)):
...     print('sid: {} points: {} fields: {}'.format(s['series_id'], s.get('points'), s.get('fields')))
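Scrolling is just paged iteration: results arrive a page at a time, and the iterator yields series one by one across pages. A toy sketch of that shape (pages stands in for successive API responses):

```python
def scroll(pages):
    # Toy sketch of scroll-style iteration: yield series one at a
    # time across paged responses, so callers see a flat stream.
    for page in pages:
        for series in page:
            yield series

pages = [[{'series_id': 'users\\me\\a'}, {'series_id': 'users\\me\\b'}],
         [{'series_id': 'users\\me\\c'}]]
print([s['series_id'] for s in scroll(pages)])
```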
Points serializers
By default, get_series and scroll return points represented as a list of shooju.Point objects. This behaviour can be changed by using shooju.points_serializers:
>>> from shooju import points_serializers as ps
>>> ser = conn.get_series(u'sid="{}"'.format(series_id), max_points=-1, serializer=ps.pd_series)
>>> print(ser['points'])
1980-01-01 12.0
dtype: float64
Supported serializers:
- milli_tuple - an array of (date milli, value) tuples.
- pd_series - a pandas.Series where dates are represented as a DatetimeIndex.
- pd_series_localized - the same as above, but the DatetimeIndex is localized if the @localize operator was used.
- np_array - a NumPy array.
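The milli_tuple idea — each point becomes an (epoch milliseconds, value) pair — can be sketched in a few lines (to_milli_tuples is illustrative; the library's serializers work on its own point representation):

```python
from datetime import datetime, timezone

def to_milli_tuples(points):
    # Sketch of the milli_tuple shape: (epoch milliseconds, value)
    # pairs. Assumes points are (naive-UTC datetime, float) pairs;
    # this is not the shooju library's internal implementation.
    return [(int(d.replace(tzinfo=timezone.utc).timestamp() * 1000), v)
            for d, v in points]

pts = [(datetime(1980, 1, 1), 12.0)]
print(to_milli_tuples(pts))  # [(315532800000, 12.0)]
```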
Generating a pandas.DataFrame from Shooju series data (get_df)
To generate a pandas.DataFrame from a series query, use get_df(). This function has a parameter series_axis, which sets where series appear on the DataFrame: the default rows, or columns. Besides that, get_df() accepts the same points/fields related parameters as get_series() and scroll().
By default it generates a pandas.DataFrame with fields as columns and series as rows:
>>> df = conn.get_df('sid:users\\me', fields=['*'])
>>> print(df)
series_id unit description
0 users\me\unit-a unit A Unit A
1 users\me\unit-b unit B Unit B
3 users\me\unit-c unit C Unit C
...
To generate a DataFrame with series values as columns and points as rows, pass series_axis='columns'. If specific fields are passed, their values will define the DataFrame column labels, joined by the '/' character:
>>> df = conn.get_df('sid:users\\me', fields=['unit', 'description'], series_axis='columns', max_points=-1)
>>> print(df)
unit A/Unit A unit B/Unit B ... unit Z/Unit Z
2000-04-03 20.50 31.50 ... 34.20
2000-04-04 32.25 20.50 ... 36.00
2000-04-05 31.25 40.50 ... 46.50
...
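The series_axis='columns' shape — one column per series, labelled by its field values joined with '/', indexed by date — can be sketched without pandas as a plain pivot (pivot_columns and the sample data are illustrative only):

```python
def pivot_columns(series_list, field_names):
    # Sketch of the series_axis='columns' layout: one column per
    # series, labelled by its field values joined with '/', with
    # rows keyed by point date. Not the library's implementation.
    table = {}
    for s in series_list:
        label = "/".join(s['fields'][f] for f in field_names)
        for d, v in s['points']:
            table.setdefault(d, {})[label] = v
    return table

series_list = [
    {'fields': {'unit': 'unit A', 'description': 'Unit A'},
     'points': [('2000-04-03', 20.5)]},
    {'fields': {'unit': 'unit B', 'description': 'Unit B'},
     'points': [('2000-04-03', 31.5)]},
]
print(pivot_columns(series_list, ['unit', 'description']))
# {'2000-04-03': {'unit A/Unit A': 20.5, 'unit B/Unit B': 31.5}}
```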
get_df() always returns a localized DataFrame. By default it's in UTC, but if the @localized:<tz> operator is applied, it will be in <tz>. To convert the DataFrame's index to naive, use df.tz_localize(None).
REST Client
To use other APIs, use the configured REST client in Connection:
>>> from shooju import Connection
>>> conn = Connection(username=USERNAME, api_key=API_KEY, server=API_SERVER)
>>> conn.raw.get('/teams')
>>> conn.raw.post('/teams/myteam/', data_json={'description': 'my description'})
To send URL parameters, use the params argument:
>>> conn.raw.get('/series', params={'series_id': r'user\series\s1'})
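The params dict is ultimately encoded into the request's query string, with special characters like the backslash percent-escaped. A sketch with the standard library (the base URL here is illustrative, not a real server):

```python
from urllib.parse import urlencode

# Hypothetical server URL, shown only to illustrate how a params
# dict maps onto a query string; conn.raw does this for you.
base = "https://myserver.shooju.com/api/1"
url = "{}/series?{}".format(base, urlencode({'series_id': r'user\series\s1'}))
print(url)  # ...series?series_id=user%5Cseries%5Cs1
```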
Change log
3.8.13
- Added no_history param to Connection.register_job
3.8.12
- Minor improvements
3.8.11
- BREAKING CHANGE: Switched mode argument for get_reported_dates to all (fetches both points and fields reported dates)
3.8.10
- Added the ability to pass pandas.Series to job.write
3.8.9
- Connection.scroll extra params improvements
3.8.8
- Updated for compatibility with NumPy 1.24
3.8.7
- Connection.scroll improvements. Now returns a ScrollIterable object which has a raw_response property that can also be accessed during iteration
3.8.6
- Minor performance improvements
3.8.5
- Added Connection.upload_files function
3.8.4
- Renamed scroll_batch_size parameter to batch_size
3.8.3
- Fix pandas FutureWarnings
3.8.2
- Minor improvements
3.8.1
- Minor fixes
3.8.0
- Added support of low level API hooks
3.7.0
- New attributes Point.timestamp and Point.job
3.6.0
- BREAKING CHANGE: Columns of the pandas.DataFrame that Connection.get_df() returns were renamed from points and date to val and dt
- Reduced Connection.get_df() memory footprint
- Connection.get_df() omits rows where point values are NaN
3.5.1
- New custom_fields parameter in Connection.upload_file()
3.5.0
- Introduced Connection.upload_file() and Connection.init_multipart_upload() methods
- Deprecated Connection.create_uploader_session() and UploaderSession()
- Added job.delete_reported() to delete certain reported dates
3.4.3
- Fixed exception in Connection.get_df() due to mixing naive and localized pandas.Series
3.4.2
- Global extra_params was ignored in Connection.raw calls.
3.4.1
- Minor internal changes. Stopped using the deprecated parameters of the /series/write endpoint.
- Fixed Connection.get_df() error when scrolling over series with no points.
3.4.0
- New options.return_series_errors to control how series-level errors are handled
3.3.1
- Connection accepts a new extra_params parameter
3.3.0
- RemoteJob.delete() and RemoteJob.delete_by_query() are now deprecated. Use RemoteJob.delete_series().
3.2.0
- Connection.get_df() now always returns a localized DataFrame
3.1.0
- Added multipart upload for huge files
3.0.3
- Fixed ability to make anonymous calls against public endpoints
3.0.2
- Fixed Python 2 compatibility issues
3.0.1
- Minor internal refactoring
3.0.0
- New Connection.get_df() function to generate a pandas.DataFrame from Shooju series data
- Removed deprecated Connection.get_point()/get_field() and GetBulk.get_point()/get_field()
- Removed the following deprecated parameters from read functions: snapshot_job_id, snapshot_date, reported_date, operators, date_start, date_finish
2.3.0
- Added RemoteJob(skip_meta_if_no_fields=...) parameter
2.2.0
- Connection.search() had been deprecated and is now removed.
- Added timeout parameter to Connection. This controls the HTTP request timeout.
2.1.1
- Fix compatibility issues with the most recent msgpack version.
2.1.0
- Deprecate put_* job methods. The new write()/write_reported() methods introduced as a replacement.
2.0.16
- Improve date parse error message
2.0.15
- Connection(...proxies={...}) parameter has been replaced by Connection(...requests_session=requests.Session()) in favor of better flexibility
2.0.14
- added proxies support
2.0.13
- fixed error when writing points with tz-aware dates
2.0.12
- added ability to define direct IPs of API servers
2.0.11
- fixed milliseconds being cut-off on points write
2.0.10
- pd_series points serializer fix
2.0.9
- Stopped using Pandas deprecated feature
2.0.8
- Minor request retry logic improvements
2.0.7
- Deprecated snapshot_job_id, snapshot_date and reported_date parameters. @asof and @repdate must be used instead.
- get_series() accepts operators parameter
- Added pd_series_localized points serializer
2.0.6
- Fix Python 3.7 compatibility.
2.0.5
- Edge case fix. Wasn't able to wrap sj.raw. with functools.wraps.
2.0.4
- Fixed thread safety bug.
- New optional "location" Connection() parameter to identify the application that is using the API.
2.0.3
- Breaking change: the first parameter of Connection.get_reported_dates() is now series_query. It was series_id before. To convert from series_id to series_query, remove the $ from the beginning or prepend sid="<series_id>".
2.0.2
- Log warning on request retry.
2.0.1
- Bug fixes.
2.0.0
- Added preferred new get_series() method.
- Moved writes to SJTS format for serialization and transport.
- Allowed relative date format in df / dt parameters.
- Big changes in scroll():
- date_start -> df (date_start still works but will be removed in future versions)
- date_finish -> dt (date_finish still works but will be removed in future versions)
- removed deprecated parameters: query_size, sort_on, sort_order, size
- added max_series
- added extra_params
- Deprecated get_point and get_field methods. These will be removed in future versions.
- Deprecated search method in favor of scroll. It will be removed in future versions.
0.9.7
- Python 3 compatibility fixes.
0.9.6
- Points serializers bug fixes.
0.9.5
- Added operators parameter in the pd.search() function.
- Added reported_date parameter to the get_points() functions.
- Added job.put_reported_points(series_id, reported_date, points) to write reported points based on a date.
- Added get_reported_dates(series_id=None, job_id=None, processor=None, df=None, dt=None) to retrieve all reported_dates for one of: series_id, job_id, processor.
- Added snapshot_date and snapshot_job_id to all get_points() functions.
- Added serializer parameter to all get_points() functions. Built-in options are under shooju.points_serializers.*. The default can be set using shooju.options.point_serializer = shooju.points_serializers.pd_series.
- Removed pd.get_points() and pd.get_fields(). Use serializer=shooju.points_serializers.pd_series instead.
0.9.1
- Fixed negative epoch times (before year 1970) on non-unix.
- Now using DatetimeIndex in pandas formatter for faster pandas dataframe serialization.
- Removed pd.get_points and pd.get_fields functions. Use pd.search() instead.
- Now applying options.point_serializer everywhere.
0.9.0
- Job.delete() is now part of bulk request. Use Job.submit() to run immediately.
- Connection.delete() and Connection.delete_by_query() have been removed. Use the equivalents in job instead.
0.8.5
- Fixed mget().get_point() bug.
0.8.4
- Bug fixes.
0.8.3
- SJTS bug fixes.
0.8.2
- Bug fixes and json/msgpack/sjts auto support.
0.8.1
- Bug fixes.
0.8.0
- Removed ujson.
- Using new /series API.
- Changed size to max_points parameter. Size is still supported, but switching to max_points is encouraged.
0.7.8
- Optional ujson.
- Added options.point_serializer (shooju_point / milli_tuple).
0.7.7
- Bug fixes.
0.7.6
- Added options.sjts_stream.
0.7.5
- Added options.sjts_chunk_size.
- Do not fetch fields when not necessary.
0.7.4
- Added SJTS.
- Moved internal dates from unix to milli.
0.7.3
- Added internal async.
0.7.2
- Bug fixes.
0.7.1
- Series are now written in the order of put_* calls.
- Added retry on lock failures.
0.7.0
- Retry on temporary API failure.
- Added reported_group concept.
- Added support for Python 3.
0.6.2
- Add operators parameter to scroll and search functions. To use, pass in an array of operators without the @. For example, operators = ['MA'].
0.6.1
- Ability to upload files using sess = conn.create_uploader_session() and sess.upload_file()
- conn.get_points(), get_point(), get_field() and get_fields() now accept snapshot_job_id and snapshot_date parameters. These parameters allow fetching historic snapshots of how the series looked after the job or at specific datetime.
0.6.0
- BREAKING CHANGE: search() now returns a list instead of a dictionary.
- search() and scroll() now accept sort_on and sort_order parameters.
- If a non-url string is provided to Connection(), https://{}.shooju.com will be attempted.
- Simpler OAuth interface and instructions have been added. See bitbucket page for details.
- Added force parameter to delete_by_query.
0.5.0
- Added job.finish(submit=True) to submit job buffer and mark a job as finished.
- Added job context to be used like: with connection.register_job('testjob') as job: ...
0.4.8
- Added email and google_oauth_token kwargs to Connection() to allow authentication through Google Oauth. Environment variables SHOOJU_EMAIL and SHOOJU_GOOGLE_OAUTH_TOKEN can be used instead of parameters.
- Added Connection.user property to find the currently logged in user.
0.4.7
- Bug fixes.
0.4.6
- Added delete_by_query function.
- Exposed query_size in scroll().
- Changed default size from 10 to 0 in scroll().
0.4.5
- Added remove_points and remove_fields methods to RemoteJob to clear the fields/points before sending new data.
0.4.4
- Change Connection search default point size to 0
0.4.3
- Fix another job cache error.
0.4.2
- Added pre and post submit hooks to RemoteJob to perform actions after submitting a job to shooju
0.4.1
- Fix job cache error, if exception was raised cache was not flushed
0.4
- Connection().pd.search_series renamed to search
- Change way DataFrame is formatted when using Connection().pd.search()
- Added key_field parameters to Connection().pd.search() to add a custom name for the column using series fields
0.3
- Connection().scroll() fixed
- Initializing Connection doesn't ping the API
- If series does not exist get_point, get_points, get_field, get_fields return None
0.2
- Connection().multi_get() renamed to mget()
- mget().get_points(), get_fields(), get_point() and get_field() return index of their result
- Connection().register_job() requires a description of more than 3 chars
- Connection().scroll_series() renamed to scroll()
- Renamed and rearranged Connection parameters: Connection(server, user, api_key)
- Field object removed, fields return a simple dict
- Points can have value of None
File details
Details for the file shooju-3.8.13.tar.gz.
File metadata
- Download URL: shooju-3.8.13.tar.gz
- Upload date:
- Size: 39.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0a1d08e9c1ed93d1e5108c134e58afc71b8ae26839009e35e93ea60873cacb7a
MD5 | 42a662a7e39f910e299ea3ffa27027eb
BLAKE2b-256 | 582311894f8075cf172304929c461e1fe312026931a719281e647be715fbbff3