CsvPath Framework is a data preboarding automation library for receiving, validating, and tracking CSV, Excel, JSONL and other tabular data files before they can corrupt downstream data consumers.

Make Data File Feed Ingestion Higher Quality, Lower Risk, and More Agile

CsvPath Framework closes the gap between Managed File Transfer and the data lake with a purpose-built, open source solution for validating and staging inbound data file feeds from external partners.

See it running in 30 seconds

    pip install csvpath

Check the headers in myorders.csv.

    from csvpath import CsvPath

    CsvPath().fast_forward("$myorders.csv[*][ count_headers() == 6 ]").is_valid

Load an automated file arrival process and run it.

    from csvpath import CsvPaths

    paths = CsvPaths()
    paths.file_manager.add_named_file(name="orders", path="myorders.csv")
    paths.paths_manager.add_named_paths(name="validate-orders", from_file="orders.csvpath")
    paths.fast_forward_paths(filename="orders", pathsname="validate-orders")

    results = paths.results_manager.get_named_results("validate-orders")
    print(f"Valid: {results[0].is_valid}")

What problem does this solve?

CSV and Excel files are critical to data partnerships — and they are often the most unloved part of the data estate. Partners have different priorities, technical capabilities, and interpretations of requirements. The result is untrustworthy data flowing into the enterprise, often caught only after it has already caused damage downstream.

CsvPath Framework adds a preboarding layer ahead of a data partner's files reaching your ingestion pipeline. It registers, versions, validates, and stages clean data and metadata so your processes run smoothly. The cost of manual checking and firefighting CSV and Excel problems can reach 50% of a DataOps and BizOps team's time. CsvPath's automation-first approach scales that back.

These pages focus on CsvPath Validation Language. For more documentation on the whole data preboarding architecture, along with code, examples, and best practices, check out csvpath.org.

For the open source FlightPath frontend app and API server head over to flightpathdata.com.

If you need help getting started, there are lots of ways to reach us.

Motivation

CSV and Excel files are everywhere! They are critical to successful data partnerships. They are a great example of how garbage-in-garbage-out threatens applications, analytics, and AI. And they are often the most unloved part of the data estate.

We rely on CSV because it is the lowest common denominator. The majority of systems that have import/export capabilities accept CSV. But many CSV files are invalid or broken in some way because partners have different priorities, SDLCs, levels of technical capability, and interpretations of requirements. The result is that untrustworthy data flows into the enterprise. Often a lot of manual effort goes into tracing bad data back to its source and fixing it.

CsvPath Validation Language adds trust to data file feeds. It is a quality management shift-left that solves problems early where they are easiest to fix.

The Language is simple, function-oriented, and solely focused on validation of delimited data. It supports both schema definitions and rules-based validation. CsvPath Validation Language is declarative, for more concise and understandable data definitions. CsvPath can also extract and upgrade data, and create simple reports. Overall the goal is to automate human judgement and add transparency.

Install

CsvPath Framework is available on PyPI. It has been tested on Python 3.10, 3.11, and 3.13.

The project uses Poetry and works fine with uv. You can also install it with:

    pip install csvpath

Validation Approach

CsvPath Validation Language is for creating "paths" that validate data streamed from files. A csvpath statement matches lines. A match does not mean that a line is inherently valid or invalid. That determination depends on how the csvpath statement was written.

For example, a csvpath statement can return all invalid lines as matches. Alternatively, it can return all valid lines as matches. It could also return no matching lines, but instead trigger side-effects, like print statements or variable changes.
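To make these three strategies concrete, here is a minimal plain-Python sketch. It does not use the csvpath package; the rows, the `has_qty` rule, and the variable names are all made up for illustration.

```python
# Illustration only: plain Python, not the csvpath API.
# The rows and the has_qty rule are hypothetical.
rows = [
    ["1", "widget", "5"],
    ["2", "gadget", ""],
    ["3", "gizmo", "12"],
]


def has_qty(row):
    """Return True when the third column (quantity) is filled in."""
    return row[2] != ""


# Strategy 1: the rule describes good lines, so matches are the valid lines.
valid_matches = [row for row in rows if has_qty(row)]

# Strategy 2: the rule describes bad lines, so matches are the invalid lines.
invalid_matches = [row for row in rows if not has_qty(row)]

# Strategy 3: no lines are returned; a side effect (here a counter,
# standing in for a variable change) records what was found.
missing_qty = 0
for row in rows:
    if not has_qty(row):
        missing_qty += 1

print(len(valid_matches), len(invalid_matches), missing_qty)
```

The same data and the same rule support all three readings; what changes is only how the author chooses to interpret a match.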

Structure

A csvpath statement has three structural parts:

  • A root that may include a file name
  • The scanning part, which declares which lines will be validated
  • The matching part, which declares which lines will match

The root of a csvpath starts with $. The scan and match parts are each enclosed in square brackets. Newlines are ignored.
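As a rough picture of that three-part layout, the sketch below splits a statement into its root file name, scan part, and match part with a regular expression. The `split_csvpath` helper and its pattern are hypothetical and are not part of the csvpath package; the real grammar is richer than this pattern.

```python
import re

# Hypothetical pattern, for illustration only: "$", an optional file name,
# then a bracketed scan part and a bracketed match part.
CSVPATH_SHAPE = re.compile(
    r"^\s*\$(?P<file>[^\[\]]*)\[(?P<scan>[^\[\]]*)\]\s*\[(?P<match>.*)\]\s*$",
    re.DOTALL,  # let the match part span several lines
)


def split_csvpath(statement):
    """Split a csvpath-shaped string into file, scan, and match parts."""
    found = CSVPATH_SHAPE.match(statement)
    if found is None:
        raise ValueError("not shaped like $file[scan][match]")
    return {
        "file": found.group("file").strip(),
        "scan": found.group("scan").strip(),
        "match": found.group("match").strip(),
    }


parts = split_csvpath("$filename[*][yes()]")
print(parts)
```

A statement with no file name after the root, such as `$[*][yes()]`, yields an empty `file` entry under this sketch.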

Simple Examples

A trivial csvpath looks like this:

    $filename[*][yes()]

This csvpath says:

  • Open the file: filename
  • Scan all the lines: *
  • And match every line scanned: yes()

In this case, a matching line is considered valid. Treating matches as valid is a simple approach. There are several possible validation strategies.

Here is a more functional csvpath:

    $people.csv[*][
        @two_names = count(not(#middle_name))
        last() -> print("There are $.variables.two_names people with only two names")]

It scans the lines in people.csv, counts lines without a middle name, and prints the count when the last row is read.
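For comparison, here is roughly the same count written imperatively with Python's standard csv module. This is only an analogy for what the csvpath declares, not how the Framework evaluates it; the sample data and the `count_two_names` helper are invented, since the contents of people.csv are not shown here.

```python
import csv
import io

# Stand-in data; the real people.csv is not part of this document.
SAMPLE = """firstname,middle_name,lastname
Ada,,Lovelace
Grace,Brewster,Hopper
Alan,Mathison,Turing
Linus,,Torvalds
"""


def count_two_names(text):
    """Count data rows whose middle_name column is empty."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(1 for row in reader if not row["middle_name"].strip())


two_names = count_two_names(SAMPLE)
# Printed once, after the last row has been read.
print(f"There are {two_names} people with only two names")
```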

A csvpath doesn't have to point to a specific file. It can instead simply have the scanning instruction come right after the root '$' like this:

    $[*][
        @two_names = count(not(#middle_name))
        last() -> print("There are $.variables.two_names people with only two names")]

In this case, the Framework chooses the csvpath's file at runtime.

Writing Validation Statements

At a high level, the functionality of a CsvPath Validation Language statement comes from each of its parts, and each part makes a significant functional contribution. This includes comments, which can carry csvpath-by-csvpath configuration settings, integration hooks, and user-defined metadata.

Running CsvPath

CsvPath is available on PyPI. The source code is in the project's git repo.

Two classes provide csvpath statement evaluation functionality: CsvPath and CsvPaths.

CsvPath

CsvPath is the most basic entry point for running csvpath statements.

| Method         | Function                                                            |
|----------------|---------------------------------------------------------------------|
| next()         | Iterates over matched rows, returning each matched row as a list    |
| fast_forward() | Iterates over the file, collecting variables and side effects       |
| advance()      | Skips forward n rows from within a `for row in path.next()` loop    |
| collect()      | Processes n rows and collects the lines that matched as lists       |
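The calling patterns behind these four methods can be pictured with a toy class. `ToyPath` below is a hypothetical stand-in written for this explanation, not the real `CsvPath` class; it ignores the row-count argument and everything else the real methods accept, and its `seen` variable is invented.

```python
# Toy stand-in to show the calling patterns; this is NOT csvpath.CsvPath.
class ToyPath:
    def __init__(self, lines, matches):
        self.lines = lines            # rows, each a list of strings
        self.matches = matches        # hypothetical predicate: row -> bool
        self.variables = {"seen": 0}  # stands in for csvpath variables
        self._skip = 0

    def next(self):
        """Yield each matching row, one at a time."""
        for row in self.lines:
            if self._skip:
                self._skip -= 1       # advance() asked to skip this row
                continue
            self.variables["seen"] += 1
            if self.matches(row):
                yield row

    def advance(self, n):
        """Skip the next n rows of an in-progress next() loop."""
        self._skip += n

    def collect(self):
        """Run to the end and return the matching rows as a list."""
        return list(self.next())

    def fast_forward(self):
        """Run to the end for variables and side effects only."""
        for _ in self.next():
            pass
        return self


def has_value(row):
    return row[1] != ""


rows = [["a", "1"], ["b", ""], ["c", "3"], ["d", "4"]]

collected = ToyPath(rows, has_value).collect()
seen = ToyPath(rows, has_value).fast_forward().variables["seen"]

path = ToyPath(rows, has_value)
kept = []
for row in path.next():
    kept.append(row)
    if row[0] == "a":
        path.advance(2)               # skip rows "b" and "c"

print(collected, seen, kept)
```

In this toy, `next()` is lazy, `collect()` materializes the matches, and `fast_forward()` discards the rows and keeps only what accumulated along the way.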

CsvPaths

CsvPaths manages validations of multiple files and/or multiple csvpaths. It coordinates the work of multiple CsvPath instances.

| Method                 | Function                                                         |
|------------------------|------------------------------------------------------------------|
| csvpath()              | Gets a CsvPath object that knows all the file names available    |
| collect_paths()        | Same as CsvPath.collect() but for all paths sequentially         |
| fast_forward_paths()   | Same as CsvPath.fast_forward() but for all paths sequentially    |
| next_paths()           | Same as CsvPath.next() but for all paths sequentially            |
| collect_by_line()      | Same as CsvPath.collect() but for all paths breadth first        |
| fast_forward_by_line() | Same as CsvPath.fast_forward() but for all paths breadth first   |
| next_by_line()         | Same as CsvPath.next() but for all paths breadth first           |

The purpose of CsvPaths is to apply multiple csvpaths per CSV file and handle multiple files in sequence. CsvPaths has both serial and breadth-first versions of CsvPath's collect(), fast_forward(), and next() methods. The breadth-first versions evaluate each csvpath for every line of a CSV file before restarting the evaluations with the next line.
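The difference between the serial and breadth-first orders can be sketched with two small generators. This shows evaluation order only and is not how CsvPaths is implemented; the `serial` and `breadth_first` functions and the two lambda rules are hypothetical.

```python
# Illustration of evaluation order only; not the CsvPaths implementation.
def serial(lines, rules):
    """Run each rule over every line before starting the next rule."""
    for name, rule in rules.items():
        for number, line in enumerate(lines):
            yield name, number, rule(line)


def breadth_first(lines, rules):
    """Run every rule against a line before moving to the next line."""
    for number, line in enumerate(lines):
        for name, rule in rules.items():
            yield name, number, rule(line)


lines = [["a", "1"], ["b", ""]]
rules = {
    "has_name": lambda line: line[0] != "",
    "has_value": lambda line: line[1] != "",
}

serial_order = [(name, n) for name, n, _ in serial(lines, rules)]
breadth_order = [(name, n) for name, n, _ in breadth_first(lines, rules)]
print(serial_order)
print(breadth_order)
```

Both orders produce the same set of evaluations; the serial order finishes one rule across the whole file first, while the breadth-first order finishes one line across all rules first.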

Simple Example

To learn about automation, start with a simple driver. This is a basic programmatic use of CsvPath. It checks a file against a trivial schema, iterating the matching lines.

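    from csvpath import CsvPath

    # Parse a csvpath that scans lines 1-25 of test.csv and checks each
    # line against a schema of two non-empty string headers.
    path = CsvPath().parse("""
            $test.csv[1-25][
                line(
                    string.notnone(#firstname),
                    string.notnone(#lastname)
                )
            ]
    """)
    # next() yields each matching line as a list
    for i, line in enumerate(path.next()):
        print(f"{i}: {line}")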

For production operations consider using FlightPath Server, instead of coding your own driver scripts.

CsvPath is primarily for data automation, not interactive use. There is a simple command line interface for quick dev iterations. Read more about the CLI here. For more dev and ops functionality, use FlightPath Data, the open source frontend to CsvPath Framework.

Grammars

CsvPath Validation Language is built up from three grammars:

  • The csvpath statement grammar - the main language
  • A print() function grammar - a simple print capability with variable and reference substitution
  • The Reference Language grammar - the file location and querying language used in validation and preboarding operations

Read more about the CsvPath grammar definition here.
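The variable substitution that print() performs can be pictured with a small regular-expression sketch. The `render` helper and its pattern are hypothetical and cover only the `$.variables.name` form used in the examples on this page; the real print() grammar and Reference Language handle more than this.

```python
import re

# Hypothetical pattern covering only the "$.variables.name" form seen in
# this page's examples; the real print() grammar is richer.
VARIABLE_REF = re.compile(r"\$\.variables\.(\w+)")


def render(template, variables):
    """Replace each $.variables.<name> with that variable's value."""
    return VARIABLE_REF.sub(lambda m: str(variables[m.group(1)]), template)


message = render(
    "There are $.variables.two_names people with only two names",
    {"two_names": 2},
)
print(message)
```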

More Info

For more information about preboarding and the whole of CsvPath Framework, visit https://www.csvpath.org.

For the development and operations frontend to CsvPath Framework, take a look at FlightPath Data.

And to learn about the backend API server, head over to FlightPath Server.

Sponsors

Atesta Analytics
