Skip to main content

ServiceX data management using a configuration file

Project description

ServiceX DataBinder

Release v0.4.0b2

PyPI version

servicex-databinder is a Python package to interact with ServiceX instance to make ServiceX request(s) and manage ServiceX delivered data efficiently from a single configuration file.

The package is particularly useful when accessing multiple remote data via ServiceX, e.g. full-scale analysis or extracting input data for machine learning.

Brief introduction of ServiceX and walk-though of various DataBinder features can be found in this Jupyter notebook.

Prerequisite

Installation

pip install servicex-databinder

Configuration file

The configuration file is a yaml file containing all the information. An example configuration file is shown below:

General:
  ServiceXBackendName: uproot
  OutputDirectory: /path/to/output
  OutputFormat: parquet
  
Sample:
  - Name: ttH
    RucioDID: user.kchoi:user.kchoi.sampleA, 
             user.kchoi:user.kchoi.sampleB
    Tree: nominal
    FuncADL: "Select(lambda event: {'jet_e': event.jet_e})"
  - Name: ttW
    XRootDFiles: root://ttW.root
    Tree: nominal
    Filter: n_jet > 5 
    Columns: jet_e, jet_pt
  - Name: ttZ
    LocalPath: /home/kchoi/ttZ
    Tree: nominal

General block requires three mandatory options as in the example above.

Input dataset for each Sample can be defined either by RucioDID or XRootDFiles or LocalPath. You need to make sure whether the ServiceX backend you specified in ServiceXBackendName supports Rucio and/or XRootD.

ServiceX query can be constructed with either TCut syntax or func-adl.

  • Options for TCut syntax: Filter1 and Columns
  • Option for Func-adl expression: FuncADL

      1 Filter works only for scalar-type of TBranch.

Output format can be either Apache parquet or ROOT ntuple for uproot backend. Only ROOT ntuple format is supported for xAOD backend.

Please find other example configurations for ATLAS opendata, xAOD, and Uproot ServiceX endpoints.

The followings are available options:

Option for General block Description DataType
ServiceXBackendName* ServiceX backend name in your servicex.yaml file
(name MUST contain either uproot or xaod to distinguish the type of transformer)
String
OutputDirectory* Path to the directory for ServiceX delivered files String
OutputFormat* Output file format of ServiceX delivered data (parquet or root for uproot / root for xaod) String
ZipROOTColumns Zip columns that share prefix to generate one counter branch (see detail at uproot readthedoc) Boolean
WriteOutputDict Name of an ouput yaml file containing Python nested dictionary of output file paths (located in the OutputDirectory) String
IgnoreServiceXCache Ignore the existing ServiceX cache and force to make ServiceX requests Boolean

*Mandatory options

Option for Sample block Description DataType
Name sample name defined by a user String
RucioDID Rucio Dataset Id (DID) for a given sample;
Can be multiple DIDs separated by comma
String
XRootDFiles XRootD files (e.g. root://) for a given sample;
Can be multiple files separated by comma
String
Tree Name of the input ROOT TTree;
Can be multiple TTrees separated by comma (uproot ONLY)
String
Filter Selection in the TCut syntax, e.g. jet_pt > 10e3 && jet_eta < 2.0 (TCut ONLY) String
Columns List of columns (or branches) to be delivered; multiple columns separately by comma (TCut ONLY) String
FuncADL func-adl expression for a given sample (see example) String
LocalPath File path directly from local path (NO ServiceX tranformation) String

A config file can be simplified by utilizing Definition block. You can define placeholders under Definition block, which will replace all matched placeholders in the values of Sample block. Note that placeholders must start with DEF_. The example configuration file can be also written as below with Definition block:

General:
  ServiceXBackendName: uproot
  OutputDirectory: /path/to/output
  OutputFormat: parquet
  
Sample:
  - Name: ttH
    RucioDID: DEF_ttH_dids
    Tree: nominal
    FuncADL: DEF_nominal_selection
  - Name: ttW
    RucioDID: user.kchoi:user.kchoi.sampleC
    Tree: nominal
    Filter: n_jet > 5 
    Columns: jet_e, jet_pt

Definition:
  DEF_ttH_dids: user.kchoi:user.kchoi.sampleA, 
             user.kchoi:user.kchoi.sampleB
  DEF_nominal_selection: "Select(lambda event: {'jet_e': event.jet_e})"

Deliver data

from servicex_databinder import DataBinder
sx_db = DataBinder('<CONFIG>.yml')
out = sx_db.deliver()

The function deliver() returns a Python nested dictionary that contains delivered files:

  • for uproot backend and parquet output format: out['<SAMPLE>']['<TREE>'] = [ List of output parquet files ]
  • for uproot backend and root output format: out['<SAMPLE>'] = [ List of output root files ]
  • for xAOD backend: out['<SAMPLE>'] = [ List of output root files ]

Input configuration can be also passed in a form of a Python dictionary.

Delivered Samples and files in the OutputDirectory are always synced with the DataBinder config file.

Error handling

failed_requests = sx_db.get_failed_requests()

If failed ServiceX request(s), deliver() will print number of failed requests and the name of Sample, Tree if present, and input dataset. You can get a full list of failed samples and error messages for each by get_failed_requests() function. If it is not clear from the message you can browse Logs in the ServiceX instance webpage for the detail.

Useful tools

Create Rucio container for multiple DIDs

The current ServiceX generates one request per Rucio DID. It's often the case that a physics analysis needs to process hundreds of DIDs. In such cases, the script (scripts/create_rucio_container.py) can be used to create one Rucio container per Sample from a yaml file. An example yaml file (scripts/rucio_dids_example.yaml) is included.

Here is the usage of the script:

usage: create_rucio_containers.py [-h] [--dry-run DRY_RUN]
                                  infile container_name version

Create Rucio containers from multiple DIDs

positional arguments:
  infile             yaml file contains Rucio DIDs for each Sample
  container_name     e.g. user.kchoi:user.kchoi.<container-name>.Sample.v1
  version            e.g. user.kchoi:user.kchoi.fcnc_ana.Sample.<version>

optional arguments:
  -h, --help         show this help message and exit
  --dry-run DRY_RUN  Run without creating new Rucio container

Acknowledgements

Support for this work was provided by the the U.S. Department of Energy, Office of High Energy Physics under Grant No. DE-SC0007890

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

servicex_databinder-0.4.0b6.tar.gz (18.0 kB view hashes)

Uploaded Source

Built Distribution

servicex_databinder-0.4.0b6-py3-none-any.whl (22.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page