Python DataSource for Apache Spark 4 to read ROOT files (High Energy Physics, HEP) as DataFrames, powered by uproot, awkward, and PyArrow

PySpark Datasource for ROOT

Apache Spark 4 Python Datasource for reading files in the ROOT data format used in High-Energy Physics (HEP).

Author and version
Luca.Canali@cern.ch · v0.1 (Sep 2025)

Highlights

✅ Reads ROOT data with Apache Spark through a custom Spark 4 Python DataSource.
✅ Works with local files, directories, and globs; optional XRootD (root://) support.
✅ Implements partitioning and optional schema inference.
✅ Powered by uproot, awkward, PyArrow and Spark's Python Datasource.

Related work & acknowledgments

The ROOT format is part of the ROOT project
Key dependencies from scikit-hep: uproot and awkward (thanks to Jim Pivarski)
Spark Python Datasources: Python Data Source API, Spark Python datasources, Datasource for Huggingface datasets
SPARK-48493 - Arrow batch support for improved performance (thanks to Allison Wang)
Notes and example notebooks on Apache Spark for Physics and a note on reading ROOT files with Spark

Install

# From PyPI
pip install pyspark-root-datasource

# Or, local for development
pip install -e .

Quick start

from pyspark.sql import SparkSession
from pyspark_root_datasource import register

spark = (SparkSession.builder
         .appName("Read ROOT via PySpark + uproot")
         .getOrCreate())

# Register the datasource (short name = "root")
register(spark)

# Get the example ROOT file (2 GB). Run one of these shell commands before starting Python:
#   xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
# If xrdcp is not installed, use wget (or curl -O) on Linux:
#   wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012BC_DoubleMuParked_Muons.root
# The reads below use /data/ as the file location; adjust the path to wherever you saved the file.

# Best practice: provide a schema to prune branches early
schema = "nMuon int, Muon_pt array<float>, Muon_eta array<float>, Muon_phi array<float>, Muon_mass array<float>, Muon_charge array<int>"

df = (spark.read.format("root")
      .schema(schema)
      .option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
      .option("tree", "Events")
      .option("step_size", "1000000")
      .load())

df.show(5, truncate=False)
print("Count:", df.count())

# Use schema inference
df2 = (spark.read.format("root")
       .option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
       .option("tree", "Events")
       .option("sample_rows", "1000")   # default 1000
       .load())
df2.printSchema()

Examples and tests

Read ROOT using PySpark: read_root_file.py
Notebook computing data from ROOT files: Dimuon_mass_spectrum.ipynb
Run tests with pytest
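
The notebook listed above works on a dimuon mass spectrum. As a rough standalone illustration of the quantity involved (not taken from the notebook), the invariant mass of a two-muon event can be approximated from the Muon_pt, Muon_eta, and Muon_phi arrays of one row, neglecting the muon rest mass. The function name `dimuon_mass` is hypothetical and the massless approximation is an assumption of this sketch.

```python
import math

def dimuon_mass(pt, eta, phi):
    """Invariant mass of a two-muon event, neglecting the muon rest mass.

    pt, eta, phi are the per-event lists read from Muon_pt, Muon_eta, Muon_phi.
    The result is in the same units as pt.
    """
    if not (len(pt) == len(eta) == len(phi) == 2):
        raise ValueError("expected exactly two muons")
    d_eta = eta[0] - eta[1]
    d_phi = phi[0] - phi[1]
    # m^2 = 2 * pt1 * pt2 * (cosh(delta eta) - cos(delta phi))
    return math.sqrt(2.0 * pt[0] * pt[1] * (math.cosh(d_eta) - math.cos(d_phi)))

# Two back-to-back muons with equal pt and eta
print(dimuon_mass([45.0, 45.0], [0.0, 0.0], [0.0, math.pi]))  # 90.0
```

In Spark, the same expression could be applied to rows filtered to nMuon = 2, for example with column expressions on the array elements.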

Options

"path" (required) – file path, URL, comma-separated list, directory, or glob (e.g. "/data/*.root")
"tree" (default: "Events") – TTree name
"step_size" (default: "1000000") – entries per Spark partition (per file)
"num_partitions" (optional, per file) – overrides step_size
"entry_start", "entry_stop" (optional, per file) – index bounds
"columns" – comma-separated branch names (if not providing a Spark schema)
"list_to32" (default: "true") – Arrow list offset width
"extensionarray" (default: "false") – Arrow extension array support
"cast_unsigned" (default: "true") – cast uint* → signed (Spark lacks unsigned)
"recursive" (default: "false") – expand directories recursively
"ext" (default: "*.root") – filter pattern when path is a directory
"sample_rows" (default: "1000") – rows for schema inference
"arrow_max_chunksize" (default: "0") – if >0, limit rows per Arrow RecordBatch

Reading over XRootD (`root://`)

# fsspec plugins for xrootd
pip install fsspec fsspec-xrootd

# XRootD client libs + Python bindings
conda install -c conda-forge xrootd

Install the extras, then:

remote_file = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = (spark.read.format("root")
      .option("path", remote_file)
      .option("tree", "Events")
      .load())
df.show(3, truncate=False)
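
Before reading over root://, it can help to confirm that the Python pieces are importable in the environment Spark uses. The check below is a minimal sketch: the helper `xrootd_prerequisites` is hypothetical, and it assumes that fsspec-xrootd provides the module `fsspec_xrootd` and that the XRootD Python bindings provide the module `XRootD`.

```python
import importlib.util

def xrootd_prerequisites():
    """Report which XRootD-related Python modules can be imported.

    Assumption: the module names below match what the installed
    packages provide (fsspec, fsspec_xrootd, XRootD).
    """
    modules = ("fsspec", "fsspec_xrootd", "XRootD")
    return {name: importlib.util.find_spec(name) is not None for name in modules}

missing = [name for name, ok in xrootd_prerequisites().items() if not ok]
if missing:
    print("Missing modules for root:// access:", ", ".join(missing))
else:
    print("XRootD prerequisites look importable.")
```

This only checks the interpreter that runs it; executors running a different Python would need the same check.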

Reading folders, globs, recursion

# All .root files in a directory (non-recursive)
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .load())

# Recursive directory expansion
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .option("recursive", "true")
      .load())

# Custom extension used when 'path' is a directory
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .option("ext", "*.parquet.root")
      .load())

# Glob
df = (spark.read.format("root")
      .option("path", "/data/*/atlas/*.root")
      .load())
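
The path rules above (comma-separated lists, directories filtered by ext, optional recursion, globs) can be pictured with a small local-only sketch. The function `expand_paths` is hypothetical and only illustrates the documented behaviour; the package's own resolution logic may differ, and remote URLs are passed through unchanged here.

```python
import glob
import os
import tempfile

def expand_paths(path, ext="*.root", recursive=False):
    """Resolve a 'path' option into a sorted list of files (illustrative only)."""
    files = []
    for item in (p.strip() for p in path.split(",") if p.strip()):
        if os.path.isdir(item):
            # Directory: apply the 'ext' pattern, optionally descending into subfolders.
            if recursive:
                pattern = os.path.join(item, "**", ext)
            else:
                pattern = os.path.join(item, ext)
            files.extend(glob.glob(pattern, recursive=recursive))
        elif any(ch in item for ch in "*?["):
            # Glob pattern: let the filesystem expand it.
            files.extend(glob.glob(item))
        else:
            # Plain file path or URL: keep as given.
            files.append(item)
    return sorted(files)

# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "sub"))
    for name in ("a.root", "b.txt", os.path.join("sub", "c.root")):
        open(os.path.join(d, name), "w").close()
    print([os.path.relpath(f, d) for f in expand_paths(d)])
    print([os.path.relpath(f, d) for f in expand_paths(d, recursive=True)])
```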

Tips and troubleshooting

Prefer explicit schemas to prune early and minimize I/O.
Tune partitioning:
- step_size = entries per Spark partition.
- num_partitions (per file) overrides step_size.
Large jagged arrays benefit from a reasonable step_size (e.g., 100k–1M entries per partition).
If necessary, use arrow_max_chunksize to keep batch sizes moderate for downstream stages.
cast_unsigned=true normalizes uint* to signed widths (Spark-friendly).
Fixed-size lists are preserved as Arrow fixed_size_list (no silent downgrade).
XRootD errors: install fsspec, fsspec-xrootd, and the XRootD client libraries. Conda is often the smoothest route for the client libraries:
```
pip install fsspec fsspec-xrootd
conda install -c conda-forge xrootd
```
Tree not found: double-check .option("tree", "..."); error messages list available keys.
Different schemas across files: ensure compatible branch types or read by subsets, then reconcile in Spark.
Driver vs executors env mismatch: set both spark.pyspark.python and spark.pyspark.driver.python to your Python.
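
To make the partitioning tips concrete, here is a minimal sketch of how one file's entries could be split into per-partition ranges from step_size, num_partitions, entry_start, and entry_stop. The function `plan_entry_ranges` is hypothetical and only illustrates the documented semantics; the package's actual planning, in particular how num_partitions is rounded, may differ.

```python
import math

def plan_entry_ranges(num_entries, step_size=1_000_000, num_partitions=None,
                      entry_start=0, entry_stop=None):
    """Split [entry_start, entry_stop) of one file into (start, stop) ranges.

    Hypothetical helper: illustrates the documented options, not the
    package's internal implementation.
    """
    stop = num_entries if entry_stop is None else min(entry_stop, num_entries)
    start = max(0, entry_start)
    total = max(0, stop - start)
    if total == 0:
        return []
    if num_partitions is not None:
        if num_partitions < 1:
            raise ValueError("num_partitions must be >= 1")
        # Assumption: num_partitions overrides step_size by deriving a step
        # that yields at most that many ranges.
        step_size = math.ceil(total / num_partitions)
    if step_size < 1:
        raise ValueError("step_size must be >= 1")
    return [(lo, min(lo + step_size, stop)) for lo in range(start, stop, step_size)]

print(plan_entry_ranges(2_500_000))                    # three ranges, the last one shorter
print(plan_entry_ranges(2_500_000, num_partitions=4))  # four ranges of 625,000 entries
```

Each range would correspond to one Spark partition, so a smaller step_size means more, lighter tasks per file.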


Version 0.1.0, released Sep 23, 2025

