Python DataSource for Apache Spark 4 to read ROOT files (High Energy Physics, HEP) as DataFrames, powered by uproot, awkward, and PyArrow

Project description

PySpark Datasource for ROOT

Apache Spark 4 Python Datasource for reading files in the ROOT data format used in High-Energy Physics (HEP).

Author and version
Luca.Canali@cern.ch · v0.1 (Sep 2025)

Highlights

  • ✅ Reads ROOT data with Apache Spark through a custom Spark 4 Python DataSource.
  • ✅ Works with local files, directories, and globs; optional XRootD (root://) support.
  • ✅ Implements partitioning and optional schema inference.
  • ✅ Powered by uproot, awkward, PyArrow and Spark's Python Datasource.

Related work & acknowledgments


Install

# From PyPI
pip install pyspark-root-datasource

# Or, local for development
pip install -e .

Quick start

from pyspark.sql import SparkSession
from pyspark_root_datasource import register

spark = (SparkSession.builder
         .appName("Read ROOT via PySpark + uproot")
         .getOrCreate())

# Register the datasource (short name = "root")
register(spark)

# Get the example ROOT file (2 GB). Run one of these in a shell, not in Python:
#   xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
# If you don't have xrdcp installed, on Linux use wget or curl -O:
#   wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012BC_DoubleMuParked_Muons.root
# Then point "path" below at the location of the downloaded file.

# Best practice: provide a schema to prune branches early
schema = "nMuon int, Muon_pt array<float>, Muon_eta array<float>, Muon_phi array<float>, Muon_mass array<float>, Muon_charge array<int>"

df = (spark.read.format("root")
      .schema(schema)
      .option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
      .option("tree", "Events")
      .option("step_size", "1000000")
      .load())

df.show(5, truncate=False)
print("Count:", df.count())

# Use schema inference
df2 = (spark.read.format("root")
       .option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
       .option("tree", "Events")
       .option("sample_rows", "1000")   # default 1000
       .load())
df2.printSchema()
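
When many branches are needed, the DDL schema string can be easier to maintain if it is built from a mapping. The following is a minimal sketch; the helper name ddl_schema is hypothetical and not part of pyspark-root-datasource, and the branch names are the ones from the example above.

```python
# Sketch: build a Spark DDL schema string from a branch -> type mapping.
# `ddl_schema` is a hypothetical helper, not part of the package.

def ddl_schema(branches: dict[str, str]) -> str:
    """Join (name, type) pairs into a DDL string accepted by .schema(...)."""
    return ", ".join(f"{name} {dtype}" for name, dtype in branches.items())


muon_branches = {
    "nMuon": "int",
    "Muon_pt": "array<float>",
    "Muon_eta": "array<float>",
    "Muon_phi": "array<float>",
    "Muon_mass": "array<float>",
    "Muon_charge": "array<int>",
}

schema = ddl_schema(muon_branches)
# Pass `schema` to spark.read.format("root").schema(schema) as in the Quick start.
```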

Examples and tests


Options

  • "path" (required) – file path, URL, comma-separated list, directory, or glob (e.g. "/data/*.root")
  • "tree" (default: "Events") – TTree name
  • "step_size" (default: "1000000") – entries per Spark partition (per file)
  • "num_partitions" (optional, per file) – overrides step_size
  • "entry_start", "entry_stop" (optional, per file) – index bounds
  • "columns" – comma-separated branch names (if not providing a Spark schema)
  • "list_to32" (default: "true") – Arrow list offset width
  • "extensionarray" (default: "false") – Arrow extension array support
  • "cast_unsigned" (default: "true") – cast uint* → signed (Spark lacks unsigned)
  • "recursive" (default: "false") – expand directories recursively
  • "ext" (default: "*.root") – filter pattern when path is a directory
  • "sample_rows" (default: "1000") – rows for schema inference
  • "arrow_max_chunksize" (default: "0") – if >0, limit rows per Arrow RecordBatch

Reading over XRootD (root://)

# fsspec plugins for xrootd
pip install fsspec fsspec-xrootd

# XRootD client libs + Python bindings
conda install -c conda-forge xrootd

Install the extras, then:

remote_file = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = (spark.read.format("root")
      .option("path", remote_file)
      .option("tree", "Events")
      .load())
df.show(3, truncate=False)

Reading folders, globs, recursion

# All .root files in a directory (non-recursive)
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .load())

# Recursive directory expansion
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .option("recursive", "true")
      .load())

# Custom extension used when 'path' is a directory
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .option("ext", "*.parquet.root")
      .load())

# Glob
df = (spark.read.format("root")
      .option("path", "/data/*/atlas/*.root")
      .load())
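
The path handling above (comma-separated lists, directories filtered by "ext", optional recursion, globs, and remote URLs) can be pictured with a short standard-library sketch. The function expand_paths is hypothetical and only mirrors the documented option semantics; the datasource's real resolution logic may differ.

```python
# Sketch: resolve a "path" option value into a list of files.
# `expand_paths` is a hypothetical illustration, not the package's code.
import glob
import os
import tempfile
from pathlib import Path


def expand_paths(path, ext="*.root", recursive=False):
    files = []
    for item in (p.strip() for p in path.split(",")):
        if not item:
            continue
        if "://" in item:
            files.append(item)  # remote URL (e.g. root://), passed through
        elif os.path.isdir(item):
            base = Path(item)
            matches = base.rglob(ext) if recursive else base.glob(ext)
            files.extend(str(m) for m in matches if m.is_file())
        elif any(ch in item for ch in "*?["):
            files.extend(glob.glob(item))  # shell-style glob
        else:
            files.append(item)  # plain file path
    return sorted(files)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.root").touch()
    Path(tmp, "notes.txt").touch()
    Path(tmp, "sub").mkdir()
    Path(tmp, "sub", "b.root").touch()
    top_level = expand_paths(tmp)                   # only a.root
    everything = expand_paths(tmp, recursive=True)  # a.root and sub/b.root
```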

Tips and troubleshooting

  • Prefer explicit schemas to prune early and minimize I/O.
  • Tune partitioning:
    • step_size = entries per Spark partition.
    • num_partitions (per file) overrides step_size.
  • Large jagged arrays benefit from a moderate step_size (e.g., 100k–1M entries per partition).
  • If necessary, use arrow_max_chunksize to keep batch sizes moderate for downstream stages.
  • cast_unsigned=true normalizes uint* to signed widths (Spark-friendly).
  • Fixed-size lists are preserved as Arrow fixed_size_list (no silent downgrade).
  • XRootD errors: install both fsspec and fsspec-xrootd, and the XRootD client libs. Conda is often the smoothest:
    pip install fsspec fsspec-xrootd
    conda install -c conda-forge xrootd
    
  • Tree not found: double-check .option("tree", "..."); error messages list available keys.
  • Different schemas across files: ensure compatible branch types or read by subsets, then reconcile in Spark.
  • Driver vs executors env mismatch: set both spark.pyspark.python and spark.pyspark.driver.python to your Python.
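
For the driver/executor environment tip, one possible way to set both properties is when building the session. This is a sketch, assuming the same interpreter path is valid on the driver and on every executor (true for local mode, not necessarily on a cluster); the helper names python_env_conf and build_session are hypothetical.

```python
# Sketch: point driver and executors at the same Python interpreter.
# `python_env_conf` and `build_session` are hypothetical helper names.
import sys


def python_env_conf(python_path: str = sys.executable) -> dict[str, str]:
    """Return the two Spark properties named in the tip above."""
    return {
        "spark.pyspark.python": python_path,
        "spark.pyspark.driver.python": python_path,
    }


def build_session(app_name: str = "root-reader"):
    # Imported lazily so the helper above can be used without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in python_env_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On a cluster, pass the interpreter path that exists on the worker nodes rather than relying on sys.executable from the driver.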
