ServiceX data management using a configuration file

ServiceX DataBinder

Release v0.2.10

ServiceX DataBinder is a Python package for making multiple ServiceX requests and managing ServiceX-delivered data from a single configuration file.

Installation

pip install servicex-databinder

Configuration file

The configuration file is a yaml file with two blocks: General, which holds settings shared by all requests, and Sample, which lists the samples to deliver. An example configuration file is shown below:

General:
  ServiceXBackendName: uproot
  OutputDirectory: /path/to/output
  OutputFormat: parquet
  
Sample:
  - Name: ttH
    RucioDID: user.kchoi:user.kchoi.sampleA, 
             user.kchoi:user.kchoi.sampleB
    Tree: nominal
    FuncADL: "Select(lambda event: {'jet_e': event.jet_e})"
  - Name: ttW
    RucioDID: user.kchoi:user.kchoi.sampleC
    Tree: nominal
    Filter: n_jet > 5 
    Columns: jet_e, jet_pt

The input dataset can be defined by either RucioDID or XRootDFiles. Make sure that the ServiceX backend specified in ServiceXBackendName supports Rucio and/or XRootD, as appropriate.
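A sample can list several DIDs or files in one comma-separated value, as the ttH sample does. The sketch below shows one way to turn such a value into a clean Python list, for example to count how many DIDs a sample covers. The helper name split_entries is hypothetical and is not part of servicex-databinder.

```python
from typing import List


def split_entries(value: str) -> List[str]:
    """Split a comma-separated option value into a list of stripped entries."""
    # Empty entries (e.g. from a trailing comma) are dropped.
    return [entry.strip() for entry in value.split(",") if entry.strip()]


# The two DIDs of the ttH sample, written as one comma-separated string.
dids = split_entries("user.kchoi:user.kchoi.sampleA, user.kchoi:user.kchoi.sampleB")
print(len(dids), dids)
```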

A ServiceX query can be constructed with either the TCut syntax or func-adl.

  • Options for TCut syntax: Filter [1] and Columns
  • Option for func-adl expression: FuncADL

      [1] Filter works only for scalar-type TBranches.

The output format can be either Apache parquet or ROOT ntuple for the uproot backend. Only the ROOT ntuple format is supported for the xAOD backend.
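The rules above (dataset source, query style, output format) can be checked before any request is sent. The following is a minimal sketch of such a pre-flight check on a configuration already loaded into a Python dictionary. The function check_config is hypothetical and not part of the package, and treating a mix of TCut options and FuncADL in one sample as an error is an assumption; the source does not say how the package handles that case. The backend type is inferred from whether ServiceXBackendName contains uproot or xAOD.

```python
from typing import Any, Dict, List


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no rule was violated."""
    problems: List[str] = []
    general = config.get("General", {})
    backend = str(general.get("ServiceXBackendName", "")).lower()
    output_format = str(general.get("OutputFormat", "")).lower()

    # Output format rule: parquet or root for uproot, root only for xAOD.
    if "xaod" in backend:
        allowed = {"root"}
    elif "uproot" in backend:
        allowed = {"parquet", "root"}
    else:
        allowed = set()
        problems.append("ServiceXBackendName should contain 'uproot' or 'xAOD'")
    if allowed and output_format not in allowed:
        problems.append(f"OutputFormat '{output_format}' is not supported by this backend")

    for sample in config.get("Sample", []):
        name = sample.get("Name", "<unnamed>")
        # Dataset source rule: RucioDID or XRootDFiles must be given.
        if "RucioDID" not in sample and "XRootDFiles" not in sample:
            problems.append(f"{name}: needs RucioDID or XRootDFiles")
        # Query rule (assumption): TCut options and FuncADL are not mixed.
        if "FuncADL" in sample and ("Filter" in sample or "Columns" in sample):
            problems.append(f"{name}: mixes TCut options with FuncADL")
    return problems


example = {
    "General": {"ServiceXBackendName": "uproot", "OutputFormat": "parquet"},
    "Sample": [
        {"Name": "ttH", "RucioDID": "user.kchoi:user.kchoi.sampleA",
         "Tree": "nominal", "FuncADL": "Select(lambda event: {'jet_e': event.jet_e})"},
        {"Name": "ttW", "RucioDID": "user.kchoi:user.kchoi.sampleC",
         "Tree": "nominal", "Filter": "n_jet > 5", "Columns": "jet_e, jet_pt"},
    ],
}
print(check_config(example))  # prints []
```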

Please find other example configurations for ATLAS opendata, xAOD, and Uproot ServiceX endpoints.

The following options are available:

| Option for General | Description | DataType |
|---|---|---|
| ServiceXBackendName | ServiceX backend name in your servicex.yaml file (the name should contain either uproot or xAOD to distinguish the type of transformer) | String |
| OutputDirectory | Path to the directory for ServiceX-delivered files | String |
| OutputFormat | Output file format of ServiceX-delivered data (parquet or root for uproot; root for xAOD) | String |
| ZipROOTColumns | Zip columns that share a prefix to generate one counter branch (see details in the uproot documentation on Read the Docs) | Boolean |
| WriteOutputDict | Name of an output yaml file containing a nested Python dictionary of output file paths (located in the OutputDirectory) | String |
| IgnoreServiceXCache | Ignore the existing ServiceX cache and force new ServiceX requests | Boolean |

| Option for Sample | Description | DataType |
|---|---|---|
| Name | Sample name defined by the user | String |
| RucioDID | Rucio Dataset Id (DID) for a given sample; can be multiple DIDs separated by commas | String |
| XRootDFiles | XRootD files (e.g. root://) for a given sample; can be multiple files separated by commas | String |
| Tree | Name of the input ROOT TTree; can be multiple TTrees separated by commas (uproot ONLY) | String |
| Filter | Selection in the TCut syntax, e.g. jet_pt > 10e3 && jet_eta < 2.0 (TCut ONLY) | String |
| Columns | List of columns (or branches) to be delivered; multiple columns separated by commas (TCut ONLY) | String |
| FuncADL | func-adl expression for a given sample (see example) | String |

Deliver data

from servicex_databinder import DataBinder
sx_db = DataBinder('<CONFIG>.yml')
out = sx_db.deliver()

The function deliver() returns a nested Python dictionary that contains the delivered files:

  • for uproot backend and parquet output format: out['<SAMPLE>']['<TREE>'] = [ List of output parquet files ]
  • for uproot backend and root output format: out['<SAMPLE>'] = [ List of output root files ]
  • for xAOD backend: out['<SAMPLE>'] = [ List of output root files ]
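Because the shape of the returned dictionary depends on the backend and output format, downstream code may want one accessor that works for all three cases. The sketch below assumes only the shapes listed above; the helper list_files and the file paths in the example are hypothetical placeholders, not output of a real request.

```python
from typing import Dict, List, Union

# A sample entry is either a flat list of files or a per-tree dictionary of lists.
SampleOutput = Union[List[str], Dict[str, List[str]]]


def list_files(out: Dict[str, SampleOutput], sample: str) -> List[str]:
    """Return all delivered files for one sample as a flat list."""
    entry = out[sample]
    if isinstance(entry, dict):
        # uproot backend with parquet output: files are grouped per tree.
        return [path for files in entry.values() for path in files]
    # root output (uproot or xAOD backend): already a flat list.
    return list(entry)


# Placeholder dictionaries that mimic the documented shapes.
parquet_out = {"ttH": {"nominal": ["ttH_nominal_0.parquet", "ttH_nominal_1.parquet"]}}
root_out = {"ttW": ["ttW_0.root"]}

print(list_files(parquet_out, "ttH"))
print(list_files(root_out, "ttW"))
```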

The input configuration can also be passed as a dictionary.
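As a rough illustration, the example configuration file could be written as a Python dictionary like the one below. This sketch assumes that the dictionary mirrors the yaml structure key for key and that it is passed to DataBinder in place of the file path; neither detail is spelled out above. The wrapper deliver_from_dict is a hypothetical name, and calling it requires the package and a working ServiceX setup.

```python
config = {
    "General": {
        "ServiceXBackendName": "uproot",
        "OutputDirectory": "/path/to/output",
        "OutputFormat": "parquet",
    },
    "Sample": [
        {
            "Name": "ttH",
            # Multiple DIDs in one comma-separated string, as in the yaml example.
            "RucioDID": "user.kchoi:user.kchoi.sampleA, user.kchoi:user.kchoi.sampleB",
            "Tree": "nominal",
            "FuncADL": "Select(lambda event: {'jet_e': event.jet_e})",
        },
        {
            "Name": "ttW",
            "RucioDID": "user.kchoi:user.kchoi.sampleC",
            "Tree": "nominal",
            "Filter": "n_jet > 5",
            "Columns": "jet_e, jet_pt",
        },
    ],
}


def deliver_from_dict(cfg):
    """Hypothetical wrapper: build a DataBinder from a dictionary and deliver."""
    # Imported here so the dictionary above can be inspected without the package.
    from servicex_databinder import DataBinder

    # Assumption: the dictionary is accepted where a config file path would go.
    return DataBinder(cfg).deliver()
```

Calling out = deliver_from_dict(config) would then play the same role as the file-based example above.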

Useful tools

Create Rucio container for multiple DIDs

ServiceX currently generates one request per Rucio DID, and a physics analysis often needs to process hundreds of DIDs. In such cases, the script (scripts/create_rucio_containers.py) can be used to create one Rucio container per Sample from a yaml file. An example yaml file (scripts/rucio_dids_example.yaml) is included.

Here is the usage of the script:

usage: create_rucio_containers.py [-h] [--dry-run DRY_RUN]
                                  infile container_name version

Create Rucio containers from multiple DIDs

positional arguments:
  infile             yaml file contains Rucio DIDs for each Sample
  container_name     e.g. user.kchoi:user.kchoi.<container-name>.Sample.v1
  version            e.g. user.kchoi:user.kchoi.fcnc_ana.Sample.<version>

optional arguments:
  -h, --help         show this help message and exit
  --dry-run DRY_RUN  Run without creating new Rucio container

Acknowledgements

Support for this work was provided by the U.S. Department of Energy, Office of High Energy Physics under Grant No. DE-SC0007890.
