ServiceX data management using a configuration file
ServiceX DataBinder
Release v0.2.3
ServiceX DataBinder is a Python package for making multiple ServiceX requests and managing ServiceX delivered data from a configuration file.
Installation
pip install servicex-databinder
Configuration file
The configuration file is a YAML file that holds everything DataBinder needs: general settings and a list of samples. An example configuration file is shown below:
General:
  ServiceXBackendName: uproot
  OutputDirectory: /path/to/output
  OutputFormat: parquet

Sample:
  - Name: ttH
    RucioDID: user.kchoi:user.kchoi.sampleA,
              user.kchoi:user.kchoi.sampleB
    Tree: nominal
    FuncADL: "Select(lambda event: {'jet_e': event.jet_e, 'jet_pt': event.jet_pt})"
  - Name: ttW
    RucioDID: user.kchoi:user.kchoi.sampleC
    Tree: nominal
    Filter: n_jet > 5
    Columns: jet_e, jet_pt
The input dataset can be defined by either RucioDID or XRootDFiles. Make sure that the ServiceX backend you specified in ServiceXBackendName supports Rucio and/or XRootD.
A ServiceX query can be constructed with either TCut syntax or func-adl.
- Options for TCut syntax: Filter [1] and Columns
- Option for func-adl expression: FuncADL

[1] Filter works only for scalar-type TBranches.
The output format can be either Apache parquet or ROOT ntuple for the uproot backend. Only the ROOT ntuple format is supported for the xAOD backend.
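The rules above (input source, query style, output format per backend) can be checked before any request is made. The following is a minimal sketch, not part of DataBinder: `check_config` and `example_config` are hypothetical names, the config is assumed to be already parsed into a Python dictionary, and treating FuncADL and Filter/Columns as mutually exclusive is one reading of "either TCut syntax or func-adl".

```python
def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict.

    Hypothetical helper for illustration; DataBinder may validate differently.
    """
    problems = []
    general = config.get("General", {})
    backend = general.get("ServiceXBackendName", "").lower()
    fmt = general.get("OutputFormat", "").lower()

    # Output format: parquet or root for uproot, root only for xAOD.
    if "xaod" in backend and fmt != "root":
        problems.append("xAOD backend supports only the root output format")
    elif "uproot" in backend and fmt not in ("parquet", "root"):
        problems.append("uproot backend supports parquet or root output")

    for sample in config.get("Sample", []):
        name = sample.get("Name", "<unnamed>")
        # Input dataset: RucioDID or XRootDFiles must be present.
        if "RucioDID" not in sample and "XRootDFiles" not in sample:
            problems.append(f"{name}: no RucioDID or XRootDFiles")
        # Query: func-adl or TCut options (assumed here not to be mixed).
        has_funcadl = "FuncADL" in sample
        has_tcut = "Filter" in sample or "Columns" in sample
        if has_funcadl and has_tcut:
            problems.append(f"{name}: mixes FuncADL with Filter/Columns")
        elif not (has_funcadl or has_tcut):
            problems.append(f"{name}: no FuncADL or Filter/Columns query")
    return problems


# The example configuration from this README, written as a dictionary.
example_config = {
    "General": {
        "ServiceXBackendName": "uproot",
        "OutputDirectory": "/path/to/output",
        "OutputFormat": "parquet",
    },
    "Sample": [
        {
            "Name": "ttH",
            "RucioDID": "user.kchoi:user.kchoi.sampleA, user.kchoi:user.kchoi.sampleB",
            "Tree": "nominal",
            "FuncADL": "Select(lambda event: {'jet_e': event.jet_e, 'jet_pt': event.jet_pt})",
        },
        {
            "Name": "ttW",
            "RucioDID": "user.kchoi:user.kchoi.sampleC",
            "Tree": "nominal",
            "Filter": "n_jet > 5",
            "Columns": "jet_e, jet_pt",
        },
    ],
}

for problem in check_config(example_config):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a user fix a long sample list in a single pass.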
Please find other example configurations for ATLAS opendata, xAOD, and Uproot ServiceX endpoints.
The following options are available:
| Option for General | Description | DataType |
|---|---|---|
| ServiceXBackendName | ServiceX backend name in your servicex.yaml file (the name should contain either uproot or xAOD to distinguish the type of transformer) | String |
| OutputDirectory | Path to the directory for ServiceX delivered files | String |
| OutputFormat | Output file format of ServiceX delivered data (parquet or root for uproot / root for xAOD) | String |
| ZipROOTColumns | Zip columns that share a prefix to generate one counter branch (see details in the uproot documentation on Read the Docs) | Boolean |
| WriteOutputDict | Name of an output yaml file containing the Python nested dictionary of output file paths (located in the OutputDirectory) | String |
| IgnoreServiceXCache | Ignore the existing ServiceX cache and force new ServiceX requests | Boolean |
| Option for Sample | Description | DataType |
|---|---|---|
| Name | Sample name defined by the user | String |
| RucioDID | Rucio Dataset Id (DID) for a given sample; can be multiple DIDs separated by commas | String |
| XRootDFiles | XRootD files (e.g. root://) for a given sample; can be multiple files separated by commas | String |
| Tree | Name of the input ROOT TTree (uproot ONLY) | String |
| Filter | Selection in the TCut syntax, e.g. jet_pt > 10e3 && jet_eta < 2.0 (TCut ONLY) | String |
| Columns | List of columns (or branches) to be delivered; multiple columns separated by commas (TCut ONLY) | String |
| FuncADL | func-adl expression for a given sample (func-adl ONLY) | String |
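Several options take a single string holding comma-separated items (RucioDID, XRootDFiles, Columns). As a small illustration of how such a value breaks down into individual entries, here is a sketch with a hypothetical helper `split_list`; it is not DataBinder's own parser. Note that a YAML value continued on the next line, as in the ttH example, is folded by YAML into one string with a space at the line break, so stripping whitespace around each item is enough.

```python
def split_list(value):
    """Split a comma-separated option string into a list of stripped items.

    Hypothetical helper for illustration; empty items are dropped.
    """
    return [item.strip() for item in value.split(",") if item.strip()]


# The two-line RucioDID value from the example, as YAML would load it.
dids = split_list("user.kchoi:user.kchoi.sampleA, user.kchoi:user.kchoi.sampleB")
columns = split_list("jet_e, jet_pt")

print(dids)
print(columns)
```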
Deliver data
from servicex_databinder import DataBinder
sx_db = DataBinder('<CONFIG>.yml')
out = sx_db.deliver()
The function deliver() returns a Python nested dictionary:
- for the uproot backend and parquet output format: out['<SAMPLE>']['<TREE>'] = [ List of output parquet files ]
- for the uproot backend and root output format: out['<SAMPLE>'] = [ List of output root files ]
- for the xAOD backend: out['<SAMPLE>'] = [ List of output root files ]
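Because the dictionary has one extra level (the tree name) only in the uproot and parquet case, code that consumes it has to handle both shapes. The sketch below shows one way to walk either shape uniformly. `iter_output_files` is a hypothetical helper, and the two dictionaries are hand-written stand-ins with made-up file paths, not real deliver() output.

```python
def iter_output_files(out):
    """Yield (sample, tree, path) for every file in a deliver()-style dict.

    tree is None when the dict has no tree level (root output format).
    """
    for sample, entry in out.items():
        if isinstance(entry, dict):
            # uproot backend with parquet output: keyed by tree name.
            for tree, paths in entry.items():
                for path in paths:
                    yield sample, tree, path
        else:
            # root output (uproot or xAOD): a flat list of files per sample.
            for path in entry:
                yield sample, None, path


# Stand-in dictionaries that mimic the documented shapes (paths are made up).
parquet_out = {"ttH": {"nominal": ["ttH/nominal/a.parquet", "ttH/nominal/b.parquet"]}}
root_out = {"ttW": ["ttW/a.root"]}

parquet_files = list(iter_output_files(parquet_out))
root_files = list(iter_output_files(root_out))

for sample, tree, path in parquet_files + root_files:
    print(sample, tree, path)
```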
Useful tools
Create Rucio container for multiple DIDs
The current ServiceX generates one request per Rucio DID.
It's often the case that a physics analysis needs to process hundreds of DIDs.
In such cases, the script (scripts/create_rucio_container.py) can be used to create one Rucio container per Sample from a yaml file.
An example yaml file (scripts/rucio_dids_example.yaml) is included.
Here is the usage of the script:
usage: create_rucio_containers.py [-h] [--dry-run DRY_RUN]
                                  infile container_name version

Create Rucio containers from multiple DIDs

positional arguments:
  infile             yaml file contains Rucio DIDs for each Sample
  container_name     e.g. user.kchoi:user.kchoi.<container-name>.Sample.v1
  version            e.g. user.kchoi:user.kchoi.fcnc_ana.Sample.<version>

optional arguments:
  -h, --help         show this help message and exit
  --dry-run DRY_RUN  Run without creating new Rucio container
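The two examples in the help text suggest that each container is named from a scope, the container name, the sample name, and the version. The sketch below spells out that reading. It is inferred from the help text only, not taken from the script's source, and `container_did` and its parameters are hypothetical.

```python
def container_did(scope, container_name, sample, version):
    """Compose a container DID following the pattern in the help text.

    Assumed pattern: <scope>:<scope>.<container-name>.<Sample>.<version>
    """
    return f"{scope}:{scope}.{container_name}.{sample}.{version}"


# One container per sample, using the names from the help text examples.
names = [container_did("user.kchoi", "fcnc_ana", s, "v1") for s in ("ttH", "ttW")]
for name in names:
    print(name)
```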
Acknowledgements
Support for this work was provided by the U.S. Department of Energy, Office of High Energy Physics under Grant No. DE-SC0007890.