A data preboarding framework for ingesting, managing, and validating CSV, Excel, and other tabular data files using a Collect, Store, Validate, Publish architecture to create a trusted publisher for downstream data consumers.

Make Data File Feed Ingestion Higher Quality, Lower Risk, and More Agile

CsvPath Framework closes the gap between Managed File Transfer and the data lake, applications, analytics, and AI with a purpose-built, open source data file feeds preboarding solution.

These pages focus on CsvPath Validation Language. For more documentation on the whole data preboarding architecture, along with code, examples, and best practices, check out csvpath.org. For the FlightPath frontend application and API server head over to flightpathdata.com.

CSV and Excel validation is at the core of the Framework. The Language defines a simple, declarative syntax for inspecting and validating files and other tabular data. Its mission is to end manual data checking and upgrading. The cost of manual processes and firefighting CSV and Excel problems can be as high as 50% of a DataOps and BizOps team's time. CsvPath Framework's automation-first approach helps scale back that unproductive and frustrating investment.

CsvPath Validation Language is inspired by:

  • XPath and ISO standard Schematron validation
  • SQL schemas
  • And business rules engines like Jess or Drools

If you need help getting started, there are lots of ways to reach us.

Motivation

CSV and Excel files are everywhere! They are critical to successful data partnerships. They are a great example of how garbage-in-garbage-out threatens applications, analytics, and AI. And they are often the most unloved part of the data estate.

We rely on CSV because it is the lowest common denominator. The majority of systems that have import/export capabilities accept CSV. But many CSV files are invalid or broken in some way due to partners having different priorities, SDLCs, levels of technical capability, and interpretations of requirements. The result is that untrustworthy data flows into the enterprise. Often a lot of manual effort goes into tracing problems in the data and fixing them.

CsvPath Validation Language adds trust to data file feeds. It is a quality management shift-left that solves problems early where they are easiest to fix.

The Language is simple, function-oriented, and solely focused on validation of delimited data. It supports both schema definitions and rules-based validation. CsvPath Validation Language is declarative, for more concise and understandable data definitions. CsvPath can also extract and upgrade data, and create simple reports. Overall the goal is to automate human judgement and add transparency.

Install

CsvPath Framework is available on PyPI. It has been tested on Python 3.10, 3.11, and 3.13.

The project uses Poetry and works fine with uv. You can also install it with:

    pip install csvpath

CsvPath has an optional dependency on Pandas. Pandas data frames can be used as a data source, much like Excel or CSV files. To install CsvPath with the Pandas option do:

    pip install "csvpath[pandas]"

Pandas and its dependencies can make it harder to use CsvPath in certain MFT use cases. For example, using Pandas in an AWS Lambda layer may be less straightforward.

Validation Approach

CsvPath Validation Language is for creating "paths" that validate data streamed from files. A csvpath statement matches lines. A match does not mean that a line is inherently valid or invalid. That determination depends on how the csvpath statement was written.

For example, a csvpath statement can return all invalid lines as matches. Alternatively, it can return all valid lines as matches. It could also return no matching lines, but instead trigger side-effects, like print statements or variable changes.
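The three strategies above can be illustrated without the Framework. The following is a minimal plain-Python sketch, not CsvPath itself: the sample data, the is_valid rule, and every variable name are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample data; the second data line is missing a last name.
DATA = "firstname,lastname\nAnn,Lee\nBo,\nCy,Diaz\n"


def is_valid(row):
    # Hypothetical rule: every field must be non-empty.
    return all(value.strip() for value in row.values())


rows = list(csv.DictReader(io.StringIO(DATA)))

# Strategy 1: a match means the line is valid.
valid_matches = [row for row in rows if is_valid(row)]

# Strategy 2: a match means the line is invalid.
invalid_matches = [row for row in rows if not is_valid(row)]

# Strategy 3: match nothing; record a side effect (a variable) instead.
variables = {"invalid_count": 0}
for row in rows:
    if not is_valid(row):
        variables["invalid_count"] += 1

print(len(valid_matches), len(invalid_matches), variables)
```

The same rule drives all three outcomes. What changes is only what a "match" is taken to mean, which is the choice a csvpath author makes when writing the statement.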

Structure

A csvpath statement has three structural parts:

  • A root that may include a file name
  • The scanning part, which declares what lines will be validated
  • The matching part, which declares what lines will match

The root of a csvpath starts with $. The match and scan parts are enclosed by brackets. Newlines are ignored.
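To make the three-part shape concrete, here is a rough sketch that pulls a statement apart with a regular expression. The split_csvpath helper and its pattern are hypothetical and only approximate the shape described above; the Framework uses its own grammar, which this does not reproduce.

```python
import re

# Assumed shape: "$" + optional file name + "[scan]" + "[match]".
# This pattern is illustrative only; it is not the Framework's grammar.
CSVPATH_SHAPE = re.compile(r"^\$([^\[]*)\[([^\]]*)\]\[(.*)\]$", re.DOTALL)


def split_csvpath(statement):
    """Split a statement into (root file name, scanning part, matching part)."""
    found = CSVPATH_SHAPE.match(statement.strip())
    if found is None:
        raise ValueError("not shaped like $name[scan][match]")
    name, scan, match = found.groups()
    # Surrounding whitespace and newlines carry no meaning, so drop them.
    return name.strip(), scan.strip(), match.strip()


print(split_csvpath("$filename[*][yes()]"))
print(split_csvpath("$[1-25][\n    yes()\n]"))
```

The second call shows a statement with no file name after the root and with its matching part spread over several lines.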

Simple Examples

A trivial csvpath looks like this:

    $filename[*][yes()]

This csvpath says:

  • Open the file: filename
  • Scan all the lines: *
  • And match every line scanned: yes()

In this case, a matching line is considered valid. Treating matches as valid is a simple approach. There are several possible validation strategies.

Here is a more functional csvpath:

    $people.csv[*][
        @two_names = count(not(#middle_name))
        last() -> print("There are $.variables.two_names people with only two names")]

It scans the lines in people.csv, counts lines without a middle name, and prints the count when the last row is read.
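For comparison, the same count can be written by hand in plain Python. This is only a sketch of what the statement describes, not how the Framework evaluates it; the in-memory sample data stands in for people.csv and is an assumption, as is treating the first line as a header row.

```python
import csv
import io

# Hypothetical contents standing in for people.csv.
PEOPLE_CSV = (
    "first_name,middle_name,last_name\n"
    "Ann,,Lee\n"
    "Bo,Kim,Park\n"
    "Cy,,Diaz\n"
)

two_names = 0
for row in csv.DictReader(io.StringIO(PEOPLE_CSV)):
    # Mirrors count(not(#middle_name)): count lines with an empty middle name.
    if not (row.get("middle_name") or "").strip():
        two_names += 1

# Mirrors last() -> print(...): report once, after the final line is read.
message = f"There are {two_names} people with only two names"
print(message)
```

The csvpath version expresses the loop, the counter, and the end-of-file report declaratively in two lines.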

A csvpath doesn't have to point to a specific file. It can instead simply have the scanning instruction come right after the root '$' like this:

    $[*][
        @two_names = count(not(#middle_name))
        last() -> print("There are $.variables.two_names people with only two names")]

In this case, the Framework chooses the csvpath's file at runtime.

Writing Validation Statements

At a high level, the functionality of a CsvPath Validation Language statement comes from:

Each of these parts of a statement makes a significant functional contribution. This includes comments, which can carry csvpath-by-csvpath configuration settings, integration hooks, and user-defined metadata.

Running CsvPath

CsvPath is available on PyPI, and the source code is in the project's git repo.

Two classes provide csvpath statement evaluation functionality: CsvPath and CsvPaths.

CsvPath

CsvPath is the most basic entry point for running csvpath statements.

method            function
next()            iterates over matched rows, returning each matched row as a list
fast_forward()    iterates over the file, collecting variables and side effects
advance()         skips forward n rows from within a "for row in path.next()" loop
collect()         processes n rows and collects the lines that matched as lists

CsvPaths

CsvPaths manages validations of multiple files and/or multiple csvpaths. It coordinates the work of multiple CsvPath instances.

method                    function
csvpath()                 gets a CsvPath object that knows all the file names available
collect_paths()           same as CsvPath.collect() but for all paths sequentially
fast_forward_paths()      same as CsvPath.fast_forward() but for all paths sequentially
next_paths()              same as CsvPath.next() but for all paths sequentially
collect_by_line()         same as CsvPath.collect() but for all paths breadth first
fast_forward_by_line()    same as CsvPath.fast_forward() but for all paths breadth first
next_by_line()            same as CsvPath.next() but for all paths breadth first

The purpose of CsvPaths is to apply multiple csvpaths per CSV file and handle multiple files in sequence. CsvPaths has both serial and breadth-first versions of CsvPath's collect(), fast_forward(), and next() methods. The breadth-first versions evaluate each csvpath for every line of a CSV file before restarting the evaluations with the next line.
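The difference between the serial and breadth-first orders is easiest to see in the sequence of evaluations. The sketch below is plain Python and does not use the CsvPaths class: the two predicates standing in for csvpaths, the sample lines, and the function names serial and breadth_first are all hypothetical.

```python
# Hypothetical stand-ins: two "csvpaths" as named predicates, three lines.
paths = {
    "has_name": lambda line: bool(line[0]),
    "has_age": lambda line: bool(line[1]),
}
lines = [["Ann", "41"], ["", "27"], ["Cy", ""]]


def serial(paths, lines):
    # Finish one path over the whole file before starting the next path.
    for name, check in paths.items():
        for number, line in enumerate(lines):
            yield name, number, check(line)


def breadth_first(paths, lines):
    # Evaluate every path against a line before moving to the next line.
    for number, line in enumerate(lines):
        for name, check in paths.items():
            yield name, number, check(line)


serial_order = [(name, number) for name, number, _ in serial(paths, lines)]
by_line_order = [(name, number) for name, number, _ in breadth_first(paths, lines)]
print(serial_order)
print(by_line_order)
```

Both orders perform the same evaluations. The serial order reads the file once per path, while the breadth-first order reads it once and interleaves the paths line by line.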

Simple Example

To learn about automation, start with a simple driver. This is a basic programmatic use of CsvPath. It checks a file against a trivial schema, iterating the matching lines.

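    from csvpath import CsvPath

    # Parse a csvpath that scans lines 1-25 of test.csv and checks each
    # line against a schema with firstname and lastname string headers.
    path = CsvPath().parse("""
            $test.csv[1-25][
                line(
                    string.notnone(#firstname),
                    string.notnone(#lastname)
                )
            ]
    """)
    # next() yields each matching line as a list.
    for i, line in enumerate(path.next()):
        print(f"{i}: {line}")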

For production operations, consider using FlightPath Server instead of coding your own driver scripts.

CsvPath is primarily for data automation, not interactive use. There is a simple command line interface for quick dev iterations. Read more about the CLI here. For more dev and ops functionality, use FlightPath Data, the open source frontend to CsvPath Framework.

Grammars

CsvPath Validation Language is built up from three grammars:

  • The csvpath statement grammar - the main language
  • A print() function grammar - a simple print capability with variable and reference substitution
  • The Reference Language grammar - the file location and querying language used in validation and preboarding operations

Read more about the CsvPath grammar definition here.

More Info

For more information about preboarding and the whole of CsvPath Framework, visit https://www.csvpath.org.

For the development and operations frontend to CsvPath Framework, take a look at FlightPath Data.

And to learn about the backend API server, head over to FlightPath Server.

Sponsors

Atesta Analytics
