A library to rewrite relative Parquet file paths in Python code using AST manipulation.
Parquet Path Rewriter
A Python library to automatically rewrite Parquet file paths within Python code strings.
It uses Abstract Syntax Tree (AST) manipulation to find calls like:

```python
spark.read.parquet('relative/path')
df.write.parquet(path='other/path')
```

and rewrites the path to match a desired environment (either a local base directory or a custom S3 prefix) without modifying the original source code manually.
This is especially useful when adapting code for different runtime environments (e.g., local, cloud, production clusters), allowing you to inject absolute paths or cloud paths without altering the original logic.
Features
- ✅ Detects `.parquet()` method calls using heuristic pattern matching (e.g., `.read.parquet()`, `.write.parquet()`)
- 🔍 Rewrites paths passed as:
  - First positional argument: `parquet('path/to/file')`
  - Keyword argument: `parquet(path='path/to/file')`
- 📦 Automatically appends `.parquet` if the original path omits the extension
- 📁 Prepends a local `base_path` (string or `pathlib.Path`) for file system rewrites
- ☁️ Optionally rewrites `s3://...` URIs to a new `s3_rewrite_prefix`, resulting in: `s3://bucket/tmp/data/<filename>.parquet`
- 🛡️ Ignores:
  - Absolute paths (`/data/file.parquet`, `s3://bucket/...`) unless explicitly rewritten via S3 prefix
  - Non-literal paths (e.g., variables, f-strings, function calls)
- 🔄 Keeps track of:
  - Rewritten paths as `{ original_path: new_path }`
  - Input paths (read operations) for metadata or lineage tracking
- 🧠 Safely rewrites code using Python's built-in `ast` module
- 📜 Supports fallback to `astunparse` (for Python < 3.9) if `ast.unparse` is unavailable
- ⚠️ Handles internal edge cases:
  - Ensures `args` are mutable lists before rewriting
  - Prints warnings when path rewriting fails due to invalid characters or OS restrictions
Use Case Examples
- Adapt hardcoded Spark code to run in different environments (e.g., dev, test, prod)
- Convert relative paths in notebooks to absolute S3 URIs before execution
- Preprocess source code strings in LLMs, code linters, or static analyzers
Installation
```bash
pip install parquet-path-rewriter
```
Usage
The primary way to use the library is through the `rewrite_parquet_paths_in_code` function.
```python
from pathlib import Path

# Make sure src is in path if running directly without installation
# sys.path.insert(0, str(Path(__file__).resolve().parent.parent / 'src'))
from parquet_path_rewriter import rewrite_parquet_paths_in_code

# --- Example Code ---
# Simulate a Python script that uses Spark or Pandas to read/write Parquet
original_python_code = """
import pyspark.sql

# Assume spark session is created elsewhere
# spark = SparkSession.builder.appName("ETLExample").getOrCreate()

print("Starting ETL process...")

# Read input data
customers_df = spark.read.parquet("raw_data/customers")
orders_df = spark.read.parquet(path="raw_data/orders_2023")

# Some transformations (placeholder)
processed_df = customers_df.join(orders_df, "customer_id")

# Write intermediate results
processed_df.write.mode("overwrite").parquet("staging/customer_orders")

# Read another input for final step
products_df = spark.read.parquet('reference_data/products.parquet')

# Final join and write output
final_df = processed_df.join(products_df, "product_id")
output_path = "final_output/report_data"  # Not a literal in call
final_df.write.mode("overwrite").parquet(path="final_output/report_data")  # Uses keyword

# Example with an absolute path (should not be changed)
logs_df = spark.read.parquet("/mnt/shared/logs/app_logs.parquet")

# S3 example (should be rewritten)
s3_df = spark.read.parquet("s3://mybucket/data/2023/spark_logs")

# Write to S3 (should be rewritten)
s3_df.write.mode("overwrite").parquet("s3://mybucket/output/processed_logs")

print("ETL process finished.")
"""

# --- Library Usage ---
# Define the base directory where the relative paths should point.
# This would typically be determined by your execution environment or configuration.
# Use absolute paths for clarity.
data_root_directory = Path("/user/project/data").resolve()
s3_rewrite_prefix = "s3://newbucket/data/2023"

print("-" * 30)
print(f"Base Path: {data_root_directory}")
print("-" * 30)
print("Original Code:")
print(original_python_code)
print("-" * 30)

try:
    # Call the library function to rewrite the code
    modified_code, rewritten_map, identified_inputs = rewrite_parquet_paths_in_code(
        code_string=original_python_code,
        base_path=data_root_directory,
        s3_rewrite_prefix=s3_rewrite_prefix,
    )

    print("Modified Code:")
    print(modified_code)
    print("-" * 30)

    print("Rewritten Paths (Original -> New):")
    if rewritten_map:
        for original, new in rewritten_map.items():
            print(f"  '{original}' -> '{new}'")
    else:
        print("  No paths were rewritten.")
    print("-" * 30)

    print("Identified Input Paths (Original):")
    if identified_inputs:
        for path in identified_inputs:
            print(f"  '{path}'")
    else:
        print("  No input paths were identified.")
    print("-" * 30)

except SyntaxError as e:
    print(f"\nError: Invalid Python syntax in the input code.\n{e}")
except TypeError as e:
    print(f"\nError: Invalid base_path provided.\n{e}")
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")
```
How it Works
The library parses the input Python code string into an Abstract Syntax Tree (AST) using Python's built-in ast module. It then walks through this tree using a custom ast.NodeTransformer. When it encounters a function call node:
1. It checks whether the called attribute is named `parquet`.
2. It analyzes the call chain (e.g., `spark.read.parquet`) to heuristically determine whether it is a read or write operation.
3. It searches the arguments for a string literal path (either the first positional argument or a keyword argument like `path='...'`).
4. If a valid path string is found, the path is transformed based on the configuration:
   - If the path is relative, it is rewritten to `base_path / <filename>.parquet`.
   - If the path is an S3 URI and `s3_rewrite_prefix` is provided, it is rewritten to `<s3_rewrite_prefix>/<filename>.parquet`.
   - If the path is absolute (e.g., `/data/file.parquet` or starting with `s3://`) and does not match the rewrite criteria, it is left untouched.
5. It replaces the original path node in the AST with a new node containing the modified path string.
6. Finally, the modified AST is converted back into a Python code string using `ast.unparse()` (Python 3.9+).
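The steps above can be sketched with a minimal `ast.NodeTransformer`. This is an illustrative reimplementation, not the library's actual source: it handles only relative string-literal paths, keeps just the file name per the `base_path / <filename>.parquet` rule described above, and skips absolute and `s3://` paths.

```python
import ast
from pathlib import PurePosixPath

class SketchRewriter(ast.NodeTransformer):
    """Illustrative sketch: rewrite relative string-literal paths in .parquet(...) calls."""

    def __init__(self, base_path):
        self.base_path = base_path
        self.rewritten = {}  # original_path -> new_path

    def visit_Call(self, node):
        self.generic_visit(node)
        # Step 1: only handle calls whose attribute name is exactly 'parquet'
        if not (isinstance(node.func, ast.Attribute) and node.func.attr == "parquet"):
            return node
        # Step 3: find a string literal as the first positional arg or path= keyword
        slot = None
        if node.args and isinstance(node.args[0], ast.Constant) and isinstance(node.args[0].value, str):
            slot = ("pos", 0)
        else:
            for i, kw in enumerate(node.keywords):
                if kw.arg == "path" and isinstance(kw.value, ast.Constant) and isinstance(kw.value.value, str):
                    slot = ("kw", i)
                    break
        if slot is None:
            return node  # non-literal path: leave untouched
        original = node.args[0].value if slot[0] == "pos" else node.keywords[slot[1]].value.value
        # Skip absolute paths and S3 URIs in this sketch
        if original.startswith("/") or original.startswith("s3://"):
            return node
        # Step 4: base_path / <filename>, appending .parquet if missing
        new_path = str(PurePosixPath(self.base_path) / PurePosixPath(original).name)
        if not new_path.endswith(".parquet"):
            new_path += ".parquet"
        # Step 5: swap in a new string-constant node
        if slot[0] == "pos":
            node.args[0] = ast.Constant(value=new_path)
        else:
            node.keywords[slot[1]].value = ast.Constant(value=new_path)
        self.rewritten[original] = new_path
        return node

rewriter = SketchRewriter("/data")
new_tree = rewriter.visit(ast.parse('df = spark.read.parquet("raw_data/customers")'))
print(ast.unparse(new_tree))  # df = spark.read.parquet('/data/customers.parquet')
```

The real library adds the read/write heuristic, S3 prefix rewriting, and input-path tracking on top of this basic pattern.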
Limitations
- **Call Pattern Specificity**: Only identifies calls where the method name is directly `.parquet(...)`. It does not currently support more dynamic usage like `spark.read.format("parquet").load("...")`. Extending this requires deeper AST pattern matching.
- **String Literals Only**: Only rewrites paths passed as direct string literals (e.g., `'path/to/file'`, `"data/file"`). It ignores paths built via variables, f-strings, or function returns.
- **Heuristic Read/Write Detection**: Read vs. write detection is heuristic, based on checking whether `read` or `write` appears in the call chain. While it works for typical Spark/Pandas patterns, it might not apply universally.
- **AST Unparsing**: Relies on `ast.unparse` (Python 3.9+) to reconstruct the modified code. On Python < 3.9, consider using `astunparse`. Minor formatting differences in the output code may occur.
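The unparsing fallback mentioned above can be handled with a small helper. This is a sketch of the general pattern, not the library's internal code; `astunparse` is a third-party package (`pip install astunparse`) only needed on interpreters older than 3.9.

```python
import ast

def unparse_tree(tree):
    # Python 3.9+ ships ast.unparse in the standard library.
    if hasattr(ast, "unparse"):
        return ast.unparse(tree)
    # Fallback for older interpreters: the third-party astunparse package.
    import astunparse
    return astunparse.unparse(tree)

tree = ast.parse("result = spark.read.parquet('/data/customers.parquet')")
print(unparse_tree(tree))
```

Note that either unparser regenerates code from the AST, so comments are dropped and quoting or spacing may differ from the original source.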
Contributing
Contributions are welcome! If you encounter a bug or have an enhancement idea, feel free to open an issue or submit a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file parquet_path_rewriter-0.1.4.tar.gz.
File metadata
- Download URL: parquet_path_rewriter-0.1.4.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d5489d43daf7c086db8cc130abe1ff9281ec8c62875b85f047cbc8a2ef8e059f` |
| MD5 | `9456a0d85a9cca8d85f9358f2dfaeef4` |
| BLAKE2b-256 | `5194302c8ececaa2ca75d7f5a4e2123a6e3ef56d6711b96518436877c6706907` |
Provenance
The following attestation bundles were made for parquet_path_rewriter-0.1.4.tar.gz:
Publisher: publish.yml on dmux/parquet-path-rewriter

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parquet_path_rewriter-0.1.4.tar.gz
- Subject digest: d5489d43daf7c086db8cc130abe1ff9281ec8c62875b85f047cbc8a2ef8e059f
- Sigstore transparency entry: 233621958
- Sigstore integration time:
- Permalink: dmux/parquet-path-rewriter@5981d664301f6ca8a7f78469a31cfea961211b6c
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/dmux
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5981d664301f6ca8a7f78469a31cfea961211b6c
- Trigger Event: release
File details
Details for the file parquet_path_rewriter-0.1.4-py3-none-any.whl.
File metadata
- Download URL: parquet_path_rewriter-0.1.4-py3-none-any.whl
- Upload date:
- Size: 9.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e998d5169dd657d0bf6b8199fed447f655076515428406abbc27f07973a52e64` |
| MD5 | `63280dbf8a40cdd1220a7cd1624f9e76` |
| BLAKE2b-256 | `5045b7fe00e1aaa670a0eb426aa3aa553541958cc643973a37b237ce375b9686` |
Provenance
The following attestation bundles were made for parquet_path_rewriter-0.1.4-py3-none-any.whl:
Publisher: publish.yml on dmux/parquet-path-rewriter

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parquet_path_rewriter-0.1.4-py3-none-any.whl
- Subject digest: e998d5169dd657d0bf6b8199fed447f655076515428406abbc27f07973a52e64
- Sigstore transparency entry: 233621960
- Sigstore integration time:
- Permalink: dmux/parquet-path-rewriter@5981d664301f6ca8a7f78469a31cfea961211b6c
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/dmux
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5981d664301f6ca8a7f78469a31cfea961211b6c
- Trigger Event: release