Front-end for the ServiceX Data Server
ServiceX_frontend
Client access library for ServiceX
Introduction
Given you have a selection string, this library will manage submitting it to a ServiceX instance and retrieving the data locally for you. The selection string is often generated by another front-end library, for example:
- func_adl.xAOD (for ATLAS xAOD's)
- func_adl.uproot (for flat ntuples)
- tcut_to_castle (translates TCut-like syntax into a servicex query; should work for both)
Prerequisites
Before you can use this library you'll need:
- An environment based on python 3.6 or later
- A ServiceX end-point. For example, http://localhost:5000/servicex, if ServiceX is running on a local k8 cluster and the proper ports are open, or the public servicex instance (contact IRIS-HEP at xxx if you are part of the LHC to request an account, or for help setting up an instance).
How to access your endpoint
The servicex library searches for configuration information in several locations to determine what end-point it should connect to, in the following order:
- A .servicex file in the current working directory
- A .servicex file in the user's home directory ($HOME on Linux and Mac, and your profile directory on Windows).
- The config_defaults.yaml file distributed with the servicex package.
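The search order above can be sketched with the standard library alone. This is an illustration, not the library's own code: `find_servicex_config` is a hypothetical helper name, and it only covers the two `.servicex` locations, returning `None` where the library would fall back to its packaged defaults.

```python
from pathlib import Path
from typing import Optional


def find_servicex_config(cwd: Optional[Path] = None,
                         home: Optional[Path] = None) -> Optional[Path]:
    """Return the first `.servicex` file found, or None if there is none.

    Hypothetical helper for illustration only; it is not part of the
    servicex package.
    """
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    # The working directory is checked before the home directory.
    for directory in (cwd, home):
        candidate = directory / ".servicex"
        if candidate.is_file():
            return candidate
    # Nothing found: the library would use its packaged defaults instead.
    return None


print(find_servicex_config())
```

Running this in a project directory is a quick way to see which file, if any, would take precedence.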
If no endpoint is specified, then the library defaults to the developer endpoint, which is http://localhost:5000 for the web-service API, and localhost:9000 for the minio endpoint. No passwords are required.
Create a .servicex file, in the yaml format, in the appropriate place for your work that contains the following:
api_endpoint:
  endpoint: <your-endpoint>
  username: <api-username>
  password: <api-password>
  minio_endpoint: <minio-endpoint>
  minio_username: <minio-accesskey>
  minio_password: <minio-secretkey>
Finally, you can create the ServiceXAdaptor and MinioAdaptor objects by hand in your code and pass them as arguments to ServiceXDataset. This lets you inject custom endpoints, usernames, and passwords, bypassing the configuration system. This is probably only useful for advanced users.
Usage
The following lines will return a pandas.DataFrame containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
from servicex import ServiceXDataset
query = "(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')"
dataset = "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
ds = ServiceXDataset(dataset)
r = ds.get_data_pandas_df(query)
print(r)
And the output in a terminal window from running the above script (takes about 1-2 minutes to complete):
python scripts/run_test.py http://localhost:5000/servicex
JetPt
entry
0 38.065707
1 31.967096
2 7.881337
3 6.669581
4 5.624053
... ...
710183 42.926141
710184 30.815709
710185 6.348002
710186 5.472711
710187 5.212714
[11355980 rows x 1 columns]
If your query is badly formed or there is another problem with the backend, an exception will be raised with information about the error.
If you'd like to be able to submit multiple queries and have them run on the ServiceX back end in parallel, it is best to use the asyncio interface, which has the identical signature, but is called get_data_pandas_df_async.
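A minimal sketch of the parallel pattern, assuming only what is stated above: that the async method takes the same query string as the synchronous one. `FakeDataset` is a hypothetical stand-in so the example runs without a ServiceX instance; with the real library a `ServiceXDataset` would take its place. `asyncio.run` requires Python 3.7 or later.

```python
import asyncio
from typing import List


class FakeDataset:
    """Hypothetical stand-in for ServiceXDataset; it makes no network calls."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def get_data_pandas_df_async(self, selection_query: str) -> str:
        # Pretend to wait for the back end to transform and return data.
        await asyncio.sleep(0.01)
        return f"{self.name}: result of {selection_query}"


async def run_queries(ds: FakeDataset, queries: List[str]) -> List[str]:
    # gather() schedules every coroutine before waiting, so the queries
    # overlap; results come back in the same order as `queries`.
    return await asyncio.gather(
        *(ds.get_data_pandas_df_async(q) for q in queries)
    )


results = asyncio.run(run_queries(FakeDataset("demo"), ["query-a", "query-b"]))
print(results)
```

In a Jupyter notebook an event loop is already running, so there you would `await run_queries(...)` directly instead of calling `asyncio.run`.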
For documentation of get_data and get_data_async see the servicex.py source file.
Configuration
As mentioned above, the .servicex file is read to load the configuration. The search path for this file:
- Your current working directory
- Your home directory
The file can contain an api_endpoint as mentioned above. In addition, the following options can be set:
- cache_path: Location where queries, data, and a record of queries are written. This should be an absolute path that the person running the library has r/w access to. On Windows, make sure to escape \, and it is best to follow standard yaml conventions and put the path in quotes, especially if it contains a space. This is a top-level yaml item (don't indent it accidentally!). Defaults to /tmp/servicex (with the temp directory as appropriate for your platform). Examples:
  - Windows: cache_path: "C:\\Users\\gordo\\Desktop\\cacheme"
  - Linux: cache_path: "/home/servicex-cache"
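The escaping rule for Windows paths can be checked with a small helper. This is a sketch under the assumption that the path is written as a double-quoted YAML string, where a backslash is an escape character; `cache_path_line` is a hypothetical name and is not part of the servicex package.

```python
def cache_path_line(path: str) -> str:
    """Render a `cache_path` entry as a double-quoted YAML scalar.

    Hypothetical helper for illustration; not part of the servicex package.
    """
    # Inside double quotes YAML treats a backslash as an escape character,
    # so each backslash (and any double quote) must itself be escaped.
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    return f'cache_path: "{escaped}"'


print(cache_path_line(r"C:\Users\gordo\Desktop\cacheme"))
print(cache_path_line("/home/servicex-cache"))
```

The two printed lines reproduce the Windows and Linux examples given for cache_path, and either can be pasted at the top level of a .servicex file.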
Features
Implemented:
- Accepts a qastle formatted query
- Exceptions are used to report back errors of all sorts from the service to the user's code.
- Data is returned in the following forms:
  - pandas.DataFrame: an in-process DataFrame of all the data requested
  - awkward: an in-process JaggedArray or dictionary of JaggedArrays
  - A list of root files that can be opened with uproot and used as desired.
  - Not all output formats are compatible with all transformations.
- Complete returned data must fit in the process' memory
- Run in an async or a non-async environment and non-async methods will accommodate automatically (including jupyter notebooks).
- Support up to 100 simultaneous queries from a laptop-like front end without overwhelming the local machine (hopefully ServiceX will be overwhelmed!)
- Start downloading files as soon as they are ready (before ServiceX is done with the complete transform).
- It has been tested to run against 100 datasets with multiple simultaneous queries.
- It supports local caching of query data
- It will provide feedback on progress.
- Configuration files supported so that user identification information does not have to be checked into repositories.
Testing
This code has been tested in several environments:
- Windows, Linux, MacOS
- Python 3.6, 3.7, 3.8
- Jupyter Notebooks (not automated), regular python command-line invoked source files
API
Everything is based around the ServiceXDataset object. Below is the documentation for the most common parameters.
ServiceXDataset(dataset: str,
image: str = 'sslhep/servicex_func_adl_xaod_transformer:v0.4',
max_workers: int = 20,
servicex_adaptor: ServiceXAdaptor = None,
minio_adaptor: MinioAdaptor = None,
cache_adaptor: Optional[Cache] = None,
status_callback_factory: Optional[StatusUpdateFactory] = _run_default_wrapper,
local_log: log_adaptor = None,
session_generator: Callable[[], Awaitable[aiohttp.ClientSession]] = None,
config_adaptor: ConfigView = None)
Create and configure a ServiceX object for a dataset.
Arguments
dataset Name of a dataset from which queries will be selected.
image Name of transformer image to use to transform the data
max_workers Maximum number of transformers to run simultaneously on
ServiceX.
servicex_adaptor Object to control communication with the servicex instance
at a particular ip address with certain login credentials.
Will be configured via the `.servicex` file by default.
minio_adaptor Object to control communication with the minio servicex
instance. By default configured with values from the
`.servicex` file.
cache_adaptor Runs the caching for data and queries that are sent up and
down.
status_callback_factory Factory to create a status notification callback for each
query. One is created per query.
local_log Log adaptor for logging.
session_generator If you want to control the `ClientSession` object that
is used for callbacks. Otherwise a single one for all
`servicex` queries is used.
config_adaptor Control how configuration options are read from the
`.servicex` file.
Notes:
- The `status_callback` argument, by default, uses the `tqdm` library to render
progress bars in a terminal window or a graphic in a Jupyter notebook (with proper
jupyter extensions installed). If `status_callback` is specified as None, no
updates will be rendered. A custom callback function can also be specified which
takes `(total_files, transformed, downloaded, skipped)` as an argument. The
`total_files` parameter may be `None` until the system knows how many files need to
be processed (and some files can even be completed before that is known).
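As an illustration of the custom callback described in the note, the sketch below accepts the four documented values and copes with `total_files` being `None`. The function names are hypothetical, and how such a function is handed to `ServiceXDataset` (directly or through a factory) is not shown here because it depends on the library version.

```python
from typing import Optional


def format_status(total_files: Optional[int], transformed: int,
                  downloaded: int, skipped: int) -> str:
    """Build a one-line progress message from the four status values."""
    # total_files can be None until the number of files is known.
    total = "?" if total_files is None else str(total_files)
    return (f"transformed {transformed}/{total}, "
            f"downloaded {downloaded}/{total}, skipped {skipped}")


def print_status(total_files: Optional[int], transformed: int,
                 downloaded: int, skipped: int) -> None:
    # Hypothetical callback: prints a line instead of drawing a progress bar.
    print(format_status(total_files, transformed, downloaded, skipped))


print_status(None, 2, 1, 0)
print_status(10, 10, 10, 0)
```

Keeping the formatting separate from the printing makes the message easy to test and lets the same text be sent to a log instead of the terminal.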
To get the data, use one of the get_data methods. They all have the same API, differing only in what they return.
| get_data_awkward_async(self, selection_query: str) -> Dict[bytes, Union[awkward.array.jagged.JaggedArray, numpy.ndarray]]
| Fetch query data from ServiceX matching `selection_query` and return it as
| dictionary of awkward arrays, an entry for each column. The data is uniquely
| ordered (the same query will always return the same order).
|
| get_data_awkward(self, selection_query: str) -> Dict[bytes, Union[awkward.array.jagged.JaggedArray, numpy.ndarray]]
| Fetch query data from ServiceX matching `selection_query` and return it as
| dictionary of awkward arrays, an entry for each column. The data is uniquely
| ordered (the same query will always return the same order).
Each data type comes in a pair - an async version and a synchronous version.
- get_data_awkward_async, get_data_awkward - Returns a dictionary of the requested data as numpy or JaggedArray objects.
- get_data_rootfiles, get_data_rootfiles_async - Returns a list of locally downloaded files (as pathlib.Path objects) containing the requested data. Suitable for opening with ROOT::TFile or uproot.
- get_data_pandas_df, get_data_pandas_df_async - Returns the data as a pandas DataFrame. This will fail if the data you've requested has any structure (e.g. is hierarchical, like a single entry for each event, and each event may have some number of jets).
- get_data_parquet, get_data_parquet_async - Returns a list of locally downloaded files that can be read by any parquet tools.
Development
For any changes please feel free to submit pull requests!
To do development, please set up your environment with the following steps:
- A python 3.7 development environment
- Fork/Pull down this package, XX
- python -m pip install -e .[test]
- Run the tests to make sure everything is good: pytest.
Then add tests as you develop. When you are done, submit a pull request with any required changes to the documentation and the online tests will run.