Skip to main content

A library to rewrite relative Parquet file paths in Python code using AST manipulation.

Project description

Parquet Path Rewriter

PyPI version License: MIT

A Python library to automatically rewrite Parquet file paths within Python code strings.
It uses Abstract Syntax Tree (AST) manipulation to find calls like:

  • spark.read.parquet('relative/path')
  • df.write.parquet(path='other/path')

and rewrites the path to match a desired environment—either a local base directory or a custom S3 prefix—without modifying the original source code manually.

This is especially useful when adapting code for different runtime environments (e.g., local, cloud, production clusters), allowing you to inject absolute paths or cloud paths without altering the original logic.


Features

  • ✅ Detects .parquet() method calls using heuristic pattern matching (e.g., .read.parquet(), .write.parquet()).
  • 🔍 Rewrites paths passed as:
    • First positional argument: parquet('path/to/file')
    • Keyword argument: parquet(path='path/to/file')
  • 📦 Automatically appends .parquet if the original path omits the extension.
  • 📁 Prepends a local base_path (string or pathlib.Path) for file system rewrites.
  • ☁️ Optionally rewrites s3://... URIs to a new s3_rewrite_prefix, resulting in:
    s3://bucket/tmp/data/<filename>.parquet
  • 🛡️ Ignores:
    • Absolute paths (/data/file.parquet, s3://bucket/...) unless explicitly rewritten via S3 prefix
    • Non-literal paths (e.g., variables, f-strings, function calls)
  • 🔄 Keeps track of:
    • Rewritten paths as { original_path: new_path }
    • Input paths (read operations) for metadata or lineage tracking
  • 🧠 Safely rewrites code using Python’s built-in ast module
  • 📜 Supports fallback to astunparse (for Python < 3.9) if ast.unparse is unavailable
  • ⚠️ Handles internal edge cases:
    • Ensures args are mutable lists before rewriting
    • Prints warnings when path rewriting fails due to invalid characters or OS restrictions

Use Case Examples

  • Adapt hardcoded Spark code to run in different environments (e.g., dev, test, prod)
  • Convert relative paths in notebooks to absolute S3 URIs before execution
  • Preprocess source code strings in LLMs, code linters, or static analyzers

Installation

pip install parquet-path-rewriter

Usage

The primary way to use the library is through the rewrite_parquet_paths_in_code function.

from pathlib import Path
# Make sure src is in path if running directly without installation
# sys.path.insert(0, str(Path(__file__).resolve().parent.parent / 'src'))

from parquet_path_rewriter import rewrite_parquet_paths_in_code

# --- Example Code ---
# Simulate a Python script that uses Spark or Pandas to read/write Parquet
original_python_code = """
import pyspark.sql

# Assume spark session is created elsewhere
# spark = SparkSession.builder.appName("ETLExample").getOrCreate()

print("Starting ETL process...")

# Read input data
customers_df = spark.read.parquet("raw_data/customers")
orders_df = spark.read.parquet(path="raw_data/orders_2023")

# Some transformations (placeholder)
processed_df = customers_df.join(orders_df, "customer_id")

# Write intermediate results
processed_df.write.mode("overwrite").parquet("staging/customer_orders")

# Read another input for final step
products_df = spark.read.parquet('reference_data/products.parquet')

# Final join and write output
final_df = processed_df.join(products_df, "product_id")
output_path = "final_output/report_data" # Not a literal in call
final_df.write.mode("overwrite").parquet(path="final_output/report_data") # Uses keyword

# Example with an absolute path (should not be changed)
logs_df = spark.read.parquet("/mnt/shared/logs/app_logs.parquet")

# S3 example (should be rewritten)
s3_df = spark.read.parquet("s3://mybucket/data/2023/spark_logs")

# Write to S3 (should be rewritten)
s3_df.write.mode("overwrite").parquet("s3://mybucket/output/processed_logs")

print("ETL process finished.")
"""

# --- Library Usage ---

# Define the base directory where the relative paths should point
# This would typically be determined by your execution environment or configuration
# Use absolute paths for clarity
data_root_directory = Path("/user/project/data").resolve()

s3_rewrite_prefix = "s3://newbucket/data/2023"

print("-" * 30)
print(f"Base Path: {data_root_directory}")
print("-" * 30)
print("Original Code:")
print(original_python_code)
print("-" * 30)

try:
    # Call the library function to rewrite the code
    modified_code, rewritten_map, identified_inputs = rewrite_parquet_paths_in_code(
        code_string=original_python_code, base_path=data_root_directory, s3_rewrite_prefix=s3_rewrite_prefix
    )

    print("Modified Code:")
    print(modified_code)
    print("-" * 30)

    print("Rewritten Paths (Original -> New):")
    if rewritten_map:
        for original, new in rewritten_map.items():
            print(f"  '{original}' -> '{new}'")
    else:
        print("  No paths were rewritten.")
    print("-" * 30)

    print("Identified Input Paths (Original):")
    if identified_inputs:
        for path in identified_inputs:
            print(f"  '{path}'")
    else:
        print("  No input paths were identified.")
    print("-" * 30)

except SyntaxError as e:
    print(f"\nError: Invalid Python syntax in the input code.\n{e}")
except TypeError as e:
    print(f"\nError: Invalid base_path provided.\n{e}")
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")

How it Works

The library parses the input Python code string into an Abstract Syntax Tree (AST) using Python's built-in ast module. It then walks through this tree using a custom ast.NodeTransformer. When it encounters a function call node:

  1. It checks if the called attribute is named parquet.

  2. It analyzes the call chain (e.g., spark.read.parquet) to heuristically determine whether it's a read or write operation.

  3. It searches for a string literal path in the arguments (either as the first positional argument or as a keyword argument like path='...').

  4. If a valid path string is found, the path is transformed based on the configuration:

    • If the path is relative, it is rewritten to:
      base_path / <filename>.parquet
    • If the path is an S3 URI and s3_rewrite_prefix is provided, it is rewritten to:
      <s3_rewrite_prefix>/<filename>.parquet
    • If the path is absolute (e.g., /data/file.parquet or starts with s3://) and does not match the rewrite criteria, it is left untouched.
  5. It replaces the original path node in the AST with a new node containing the modified path string.

  6. Finally, the modified AST is converted back into a Python code string using ast.unparse() (Python 3.9+).

Limitations

  • Call Pattern Specificity: Only identifies calls where the method name is directly .parquet(...). It does not currently support more dynamic usage like spark.read.format("parquet").load("..."). Extending this requires deeper AST pattern matching.

  • String Literals Only: Only rewrites paths passed as direct string literals (e.g., 'path/to/file', "data/file"). It ignores paths built via variables, f-strings, or function returns.

  • Heuristic Read/Write Detection: Read vs. write detection is heuristic, based on checking if read or write exists in the call chain. While it works for typical Spark/Pandas patterns, it might not apply universally.

  • AST Unparsing: Relies on ast.unparse (Python 3.9+) to reconstruct the modified code. If using Python <3.9, consider using astunparse. Minor formatting differences in the output code may occur.

Contributing

Contributions are welcome! If you encounter a bug or have an enhancement idea, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parquet_path_rewriter-0.1.4.tar.gz (12.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parquet_path_rewriter-0.1.4-py3-none-any.whl (9.3 kB view details)

Uploaded Python 3

File details

Details for the file parquet_path_rewriter-0.1.4.tar.gz.

File metadata

  • Download URL: parquet_path_rewriter-0.1.4.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for parquet_path_rewriter-0.1.4.tar.gz
Algorithm Hash digest
SHA256 d5489d43daf7c086db8cc130abe1ff9281ec8c62875b85f047cbc8a2ef8e059f
MD5 9456a0d85a9cca8d85f9358f2dfaeef4
BLAKE2b-256 5194302c8ececaa2ca75d7f5a4e2123a6e3ef56d6711b96518436877c6706907

See more details on using hashes here.

Provenance

The following attestation bundles were made for parquet_path_rewriter-0.1.4.tar.gz:

Publisher: publish.yml on dmux/parquet-path-rewriter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file parquet_path_rewriter-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for parquet_path_rewriter-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 e998d5169dd657d0bf6b8199fed447f655076515428406abbc27f07973a52e64
MD5 63280dbf8a40cdd1220a7cd1624f9e76
BLAKE2b-256 5045b7fe00e1aaa670a0eb426aa3aa553541958cc643973a37b237ce375b9686

See more details on using hashes here.

Provenance

The following attestation bundles were made for parquet_path_rewriter-0.1.4-py3-none-any.whl:

Publisher: publish.yml on dmux/parquet-path-rewriter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page