Python DataSource for Apache Spark 4 to read ROOT files (High Energy Physics, HEP) as DataFrames, powered by uproot, awkward, and PyArrow
PySpark Datasource for ROOT
Apache Spark 4 Python Datasource for reading files in the ROOT data format used in High-Energy Physics (HEP).
Author and version
Luca.Canali@cern.ch · v0.1 (Sep 2025)
Highlights
- ✅ Reads ROOT data with Apache Spark through a custom Spark 4 Python DataSource.
- ✅ Works with local files, directories, and globs; optional XRootD (root://) support.
- ✅ Implements partitioning and optional schema inference.
- ✅ Powered by uproot, awkward, PyArrow and Spark's Python Datasource.
Related work & acknowledgments
- The ROOT format is part of the ROOT project
- Key dependencies from scikit-hep: uproot and awkward (thanks to Jim Pivarski)
- Spark Python Datasources: Python Data Source API, Spark Python datasources, Datasource for Huggingface datasets
- SPARK-48493 - Arrow batch support for improved performance (thanks to Allison Wang)
- Notes and example notebooks on Apache Spark for Physics and a note on reading ROOT files with Spark
Install
# From PyPI
pip install pyspark-root-datasource
# Or, local for development
pip install -e .
Quick start
from pyspark.sql import SparkSession
from pyspark_root_datasource import register
spark = (SparkSession.builder
.appName("Read ROOT via PySpark + uproot")
.getOrCreate())
# Register the datasource (short name = "root")
register(spark)
# Get the example ROOT file (2 GB)
# xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
# If you don't have xrdcp installed, on Linux use wget or curl -O from a shell:
# wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012BC_DoubleMuParked_Muons.root
# Best practice: provide a schema to prune branches early
schema = "nMuon int, Muon_pt array<float>, Muon_eta array<float>, Muon_phi array<float>, Muon_mass array<float>, Muon_charge array<int>"
df = (spark.read.format("root")
.schema(schema)
.option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
.option("tree", "Events")
.option("step_size", "1000000")
.load())
df.show(5, truncate=False)
print("Count:", df.count())
# Use schema inference
df2 = (spark.read.format("root")
.option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
.option("tree", "Events")
.option("sample_rows", "1000") # default 1000
.load())
df2.printSchema()
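When many branches are needed, the DDL schema string passed to .schema() gets long and is easy to mistype. A minimal sketch of building it from a mapping follows; the helper name ddl_schema is hypothetical and not part of pyspark-root-datasource.

```python
# Hypothetical helper (not part of pyspark-root-datasource): builds a
# Spark DDL schema string from a mapping of branch name -> Spark SQL type.
def ddl_schema(branches: dict[str, str]) -> str:
    return ", ".join(f"{name} {dtype}" for name, dtype in branches.items())


# Branch names and types taken from the Quick start example
muon_branches = {
    "nMuon": "int",
    "Muon_pt": "array<float>",
    "Muon_eta": "array<float>",
    "Muon_phi": "array<float>",
    "Muon_mass": "array<float>",
    "Muon_charge": "array<int>",
}

schema = ddl_schema(muon_branches)
print(schema)
```

The resulting string can be passed to .schema(schema) exactly as in the Quick start, so only the listed branches are requested.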
Examples and tests
- Read ROOT using PySpark: read_root_file.py
- Notebook computing data from ROOT files: Dimuon_mass_spectrum.ipynb
- Run tests with pytest
Options
- "path" (required) – file path, URL, comma-separated list, directory, or glob (e.g. "/data/*.root")
- "tree" (default: "Events") – TTree name
- "step_size" (default: "1000000") – entries per Spark partition (per file)
- "num_partitions" (optional, per file) – overrides step_size
- "entry_start", "entry_stop" (optional, per file) – index bounds
- "columns" – comma-separated branch names (if not providing a Spark schema)
- "list_to32" (default: "true") – Arrow list offset width
- "extensionarray" (default: "false") – Arrow extension array support
- "cast_unsigned" (default: "true") – cast uint* → signed (Spark lacks unsigned)
- "recursive" (default: "false") – expand directories recursively
- "ext" (default: "*.root") – filter pattern when path is a directory
- "sample_rows" (default: "1000") – rows for schema inference
- "arrow_max_chunksize" (default: "0") – if >0, limit rows per Arrow RecordBatch
Reading over XRootD (root://)
# fsspec plugins for xrootd
pip install fsspec fsspec-xrootd
# XRootD client libs + Python bindings
conda install -c conda-forge xrootd
Install the extras, then:
remote_file = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = (spark.read.format("root")
.option("path", remote_file)
.option("tree", "Events")
.load())
df.show(3, truncate=False)
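Before attempting a root:// read, it can help to confirm that the prerequisites are importable. The sketch below is not part of the package; it assumes the import names fsspec_xrootd (for the fsspec-xrootd package) and XRootD (for the XRootD Python bindings), and it only inspects the Python environment it runs in.

```python
import importlib.util

# Import names assumed here: the fsspec-xrootd package is imported as
# "fsspec_xrootd" and the XRootD Python bindings as "XRootD".
XROOTD_MODULES = ("fsspec", "fsspec_xrootd", "XRootD")


def missing_xrootd_modules() -> list[str]:
    """Return the modules from XROOTD_MODULES that cannot be found."""
    return [m for m in XROOTD_MODULES if importlib.util.find_spec(m) is None]


missing = missing_xrootd_modules()
if missing:
    print("Missing modules for root:// access:", ", ".join(missing))
else:
    print("XRootD prerequisites found")
```

On a cluster the same modules also need to be present in the executors' Python environment, which this check does not cover.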
Reading folders, globs, recursion
# All .root files in a directory (non-recursive)
df = (spark.read.format("root")
.option("path", "/data/myfolder")
.load())
# Recursive directory expansion
df = (spark.read.format("root")
.option("path", "/data/myfolder")
.option("recursive", "true")
.load())
# Custom extension used when 'path' is a directory
df = (spark.read.format("root")
.option("path", "/data/myfolder")
.option("ext", "*.parquet.root")
.load())
# Glob
df = (spark.read.format("root")
.option("path", "/data/*/atlas/*.root")
.load())
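To make the path semantics above concrete, here is a standard-library sketch of how a "path" value could be resolved into a file list. It is an illustration under assumptions, not the package's implementation: the function name expand_paths is hypothetical, and the real datasource may order or filter files differently.

```python
import glob
import tempfile
from pathlib import Path


def expand_paths(path: str, recursive: bool = False, ext: str = "*.root") -> list[str]:
    """Illustrative only: resolve a 'path' option into a sorted list of files."""
    files: list[str] = []
    for item in (p.strip() for p in path.split(",") if p.strip()):
        p = Path(item)
        if p.is_dir():
            # Directory: apply the 'ext' pattern, optionally descending into subfolders
            matches = p.rglob(ext) if recursive else p.glob(ext)
            files.extend(str(m) for m in matches if m.is_file())
        elif any(ch in item for ch in "*?["):
            # Glob pattern: let the standard library expand it
            files.extend(glob.glob(item))
        else:
            # Plain file path or URL: passed through unchanged
            files.append(item)
    return sorted(files)


# Small demonstration on a temporary directory tree
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.root").touch()
    (root / "sub" / "b.root").touch()
    flat = expand_paths(tmp)                  # non-recursive: top level only
    deep = expand_paths(tmp, recursive=True)  # recursive: includes sub/b.root
    print(len(flat), len(deep))
```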
Tips and troubleshooting
- Prefer explicit schemas to prune early and minimize I/O.
- Tune partitioning:
  - step_size = entries per Spark partition.
  - num_partitions (per file) overrides step_size.
- Large jagged arrays benefit from reasonable step_size (e.g., 100k–1M).
- If necessary, use arrow_max_chunksize to keep batch sizes moderate for downstream stages.
- cast_unsigned=true normalizes uint* to signed widths (Spark-friendly).
- Fixed-size lists are preserved as Arrow fixed_size_list (no silent downgrade).
- XRootD errors: install both fsspec and fsspec-xrootd, and the XRootD client libs. Conda is often the smoothest:
  pip install fsspec fsspec-xrootd
  conda install -c conda-forge xrootd
- Tree not found: double-check .option("tree", "..."); error messages list available keys.
- Different schemas across files: ensure compatible branch types or read by subsets, then reconcile in Spark.
- Driver vs executors env mismatch: set both spark.pyspark.python and spark.pyspark.driver.python to your Python.
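For the partitioning tips above, it may help to see how many partitions a given setting yields for one file. The sketch below is only an illustration of the documented behavior (step_size entries per partition, num_partitions taking precedence). The function entry_ranges is hypothetical, and treating entry_stop as exclusive is an assumption, not a statement about the package's internals.

```python
from typing import Optional


def entry_ranges(num_entries: int,
                 step_size: int = 1_000_000,
                 num_partitions: Optional[int] = None,
                 entry_start: int = 0,
                 entry_stop: Optional[int] = None) -> list[tuple[int, int]]:
    """Illustrative only: split [entry_start, entry_stop) of one file into
    half-open (start, stop) ranges, one per Spark partition."""
    stop = num_entries if entry_stop is None else min(entry_stop, num_entries)
    start = max(0, entry_start)
    total = max(0, stop - start)
    if total == 0:
        return []
    if num_partitions:
        # Assumed precedence: num_partitions overrides step_size
        step_size = -(-total // num_partitions)  # ceiling division
    return [(lo, min(lo + step_size, stop)) for lo in range(start, stop, step_size)]


# 2.5M entries with the default step_size: two full ranges plus a remainder
print(entry_ranges(2_500_000))
# The same file split into 4 roughly equal partitions
print(entry_ranges(2_500_000, num_partitions=4))
```

A quick calculation like this can guide the choice of step_size before launching a job: too many tiny ranges add scheduling overhead, while a few very large ones can strain executor memory with jagged arrays.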
File details
Details for the file pyspark_root_datasource-0.1.0.tar.gz.
File metadata
- Download URL: pyspark_root_datasource-0.1.0.tar.gz
- Upload date:
- Size: 20.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 39ef97649b4979caf91b335239330aa649aa553171a1f194a63c98b17748b786 |
| MD5 | d5ffaaeabcbab230d23cb3f340d17564 |
| BLAKE2b-256 | 87c2d4ef2666c08007ef6998ed1f05f3d3e77bdce24c6bf66e52a1758a64e74d |
File details
Details for the file pyspark_root_datasource-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pyspark_root_datasource-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 27b8c6a827566a6e4d45bd220de5df1ac99d017dcf7c379eb3a61b23ff02af9f |
| MD5 | c2e48759b38b6e56a0667c960a9bba5b |
| BLAKE2b-256 | 38af652ebd0799fe9228a7a80f806c6da4d9356df33d7a5339e3a31480a3a370 |