CSV Validation Library

Blazing fast format validations for your CSV files

This is a Python library with a Rust core that validates multi-gigabyte CSV files in seconds (or in a few minutes for very large files) while using a minimal amount of memory.

Features

  • ✨ Validate both plain and gzipped CSV files
  • 🔍 Multiple validation types supported:
    • Correct column name and order
    • Regular expressions
    • Well-known formats (integer, decimal, etc.)
    • Minimum/Maximum numerical value checks
    • Value set validation (allowed values)
  • 🦀 + 🐍 Rust lib with Python bindings included
  • 📝 Detailed validation summaries with sample invalid values
  • 🚀 High performance with optimizations like regex pre-compilation
  • 📊 Support for large CSV files

Installation

Python

pip install csv_validation
# or, with Poetry
poetry add csv_validation
# or, with uv
uv add csv_validation

Usage

Python

You can provide a file with the validation rules:

from csv_validation import CSVValidator

validator = CSVValidator.from_file("validation_rules.yaml")
is_valid = validator.validate("data.csv")

You can also create a validator from a string:

# Raw string, so regex backslashes such as \s are not treated as Python escapes
validation_rules = r"""
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""

validator = CSVValidator.from_string(validation_rules)
# Optionally set a custom column separator
validator.set_separator(";")  # Default is comma (,)

# Validate a CSV file
is_valid = validator.validate("data.csv")
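
To try the two snippets above end to end, a small script along these lines may help. It is a sketch rather than part of the library: the helper names `write_sample_files` and `run_validation` and the sample rows are made up for illustration, and it assumes `validate` returns a truthy value for a valid file, as the `is_valid` name above suggests. If `csv_validation` is not installed, `run_validation` returns `None`.

```python
import tempfile
from pathlib import Path

# Rules copied from the example above; raw string keeps \s intact.
RULES = r"""
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""

# Hypothetical sample data, only for this sketch.
SAMPLE_CSV = "Name,Age\nAda Lovelace,36\nAlan Turing,41\n"


def write_sample_files(directory):
    """Write the rules and a small CSV into `directory`; return both paths."""
    directory = Path(directory)
    rules_path = directory / "validation_rules.yaml"
    csv_path = directory / "data.csv"
    rules_path.write_text(RULES, encoding="utf-8")
    csv_path.write_text(SAMPLE_CSV, encoding="utf-8")
    return rules_path, csv_path


def run_validation(rules_path, csv_path):
    """Return the validator's result, or None if csv_validation is missing."""
    try:
        from csv_validation import CSVValidator
    except ImportError:
        return None
    validator = CSVValidator.from_file(str(rules_path))
    return validator.validate(str(csv_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rules_path, csv_path = write_sample_files(tmp)
        print("valid:", run_validation(rules_path, csv_path))
```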

Validation Definition Format

Create a small, easy-to-read YAML file with your validation rules. Example:

columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$  # Letters and spaces, 2-50 characters
  - name: Family Name
    regex: ^[A-Za-z\s'-]{2,50}$  # Letters, spaces, hyphens and apostrophes
  - name: Age
    format: positive_integer  # Using predefined format instead of custom regex
    max: 120
  - name: Salary
    format: integer  # Allows negative integers too
    min: 20000
  - name: Email
    regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$  # Standard email format
  - name: Phone
    regex: ^\+?[0-9]{10,15}$  # International phone format with optional +
  - name: Status
    values: [active, inactive, pending, suspended]  # Only these values are allowed
  - name: Gender
    values: [M, F, NB, O]  # M: Male, F: Female, NB: Non-Binary, O: Other
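
Before running a large file, it can be useful to check what the patterns in a definition accept. The sketch below tries four of the regexes above against a few values using Python's `re` module. This is only an approximation: the library compiles patterns in its Rust core, and Rust's `regex` crate does not support some constructs that `re` does (look-arounds and backreferences, for example), so simple patterns like these are the safest to test this way. The `matches` helper is hypothetical.

```python
import re

# Patterns copied from the YAML example above.
PATTERNS = {
    "Name": r"^[A-Za-z\s]{2,50}$",
    "Family Name": r"^[A-Za-z\s'-]{2,50}$",
    "Email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "Phone": r"^\+?[0-9]{10,15}$",
}


def matches(column, value):
    """Return True if `value` satisfies the pattern defined for `column`."""
    return re.fullmatch(PATTERNS[column], value) is not None


if __name__ == "__main__":
    samples = [
        ("Name", "Ada Lovelace"),
        ("Name", "O'Neil"),           # apostrophe is not allowed in Name
        ("Family Name", "O'Neil-Smith"),
        ("Email", "ada@example.com"),
        ("Phone", "+34600123456"),
        ("Phone", "12345"),           # too short
    ]
    for column, value in samples:
        print(f"{column}: {value!r} -> {matches(column, value)}")
```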

Validation Types

  1. Regular Expression (regex)

    • Validate fields against custom regex patterns
  2. Format (format)

    • Predefined formats for common validations (This is the recommended way to validate numeric fields)
    • In the background, the library uses regex patterns to validate formats.
    • Available formats:
      • integer: Validates any integer number (positive or negative)
      • positive integer: Validates positive integer numbers
      • negative integer: Validates negative integer numbers
      • decimal/decimal point: Validates any decimal number (positive or negative) using point as decimal separator
      • negative decimal point: Validates negative decimal numbers using point as decimal separator
      • decimal comma: Validates decimal numbers using comma as decimal separator
      • positive decimal/positive decimal point: Validates positive decimal numbers using point as decimal separator
      • positive decimal comma: Validates positive decimal numbers using comma as decimal separator
      • decimal scientific: Validates decimal numbers in scientific notation (e.g. 23.02e-12)
      • decimal scientific comma: Validates decimal numbers in scientific notation using comma as decimal separator
      • positive decimal scientific: Validates positive decimal numbers in scientific notation
    • More formats will be added in upcoming versions
  3. Minimum Value (min)

    • Check if numeric fields are greater than or equal to a specified value
  4. Maximum Value (max)

    • Check if numeric fields are less than or equal to a specified value
  5. Value Set (values)

    • Ensure fields only contain values from a predefined set
  6. Extra (extra)

    • Extra validations that normally complement one of the other types
    • Available extras:
      • non_empty: Validates that field contains at least one character
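
Since formats are validated with regex patterns behind the scenes, each format name can be thought of as a shorthand for a pattern. The sketch below shows what such patterns might look like for a few of the formats listed above. The patterns are assumptions written for illustration, not the ones the library ships: details such as whether a leading `+` is accepted or whether `0` counts as positive may differ.

```python
import re

# Assumed patterns, for illustration only; the library's own may differ.
ASSUMED_FORMAT_PATTERNS = {
    "integer": r"[+-]?[0-9]+",
    "positive integer": r"\+?[0-9]+",
    "negative integer": r"-[0-9]+",
    "decimal": r"[+-]?[0-9]+(\.[0-9]+)?",
    "decimal comma": r"[+-]?[0-9]+(,[0-9]+)?",
    "decimal scientific": r"[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?",
}


def looks_like(format_name, value):
    """Return True if `value` fully matches the assumed pattern for the format."""
    return re.fullmatch(ASSUMED_FORMAT_PATTERNS[format_name], value) is not None


if __name__ == "__main__":
    for name, value in [
        ("integer", "-42"),
        ("positive integer", "-42"),
        ("decimal", "3.14"),
        ("decimal comma", "3,14"),
        ("decimal scientific", "23.02e-12"),
    ]:
        print(f"{name}: {value!r} -> {looks_like(name, value)}")
```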

You can combine as many validation types as you want on the same column. The column is considered correct only if every one of its validations passes.
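
The way several validations combine on one column can be sketched in plain Python. The function below is not the library's implementation. It is a hypothetical `failed_checks` helper that mimics the documented behaviour for a single value: every configured check must pass, and an empty string is accepted unless `extra: non_empty` is set.

```python
import re


def failed_checks(value, rule):
    """Return the names of the checks in `rule` that `value` fails."""
    if value == "":
        # Empty values pass every check except an explicit non_empty.
        return ["non_empty"] if rule.get("extra") == "non_empty" else []

    failures = []
    if "regex" in rule and re.fullmatch(rule["regex"], value) is None:
        failures.append("regex")
    if "values" in rule and value not in rule["values"]:
        failures.append("values")
    if "min" in rule or "max" in rule:
        try:
            number = float(value)
        except ValueError:
            number = None  # a non-numeric value fails both bounds
        if "min" in rule and (number is None or number < rule["min"]):
            failures.append("min")
        if "max" in rule and (number is None or number > rule["max"]):
            failures.append("max")
    return failures


if __name__ == "__main__":
    age_rule = {"regex": r"^[0-9]+$", "min": 0, "max": 120}
    for value in ["36", "130", "abc", ""]:
        print(repr(value), "->", failed_checks(value, age_rule) or "OK")
```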

Global Validations

empty_not_ok

Empty values (empty string '') are considered correct and accepted by default. However, if your data must always have some content, you can add a global empty_not_ok flag at the root level of your YAML definition to automatically add the extra: non_empty validation to all columns:

empty_not_ok: true
columns:
  - name: Column1
    # other validations...
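
Conceptually, the flag is a shorthand for writing `extra: non_empty` on every column. The sketch below shows that expansion on a definition that has already been loaded into a Python dictionary. The `expand_empty_not_ok` helper is hypothetical and only illustrates the idea; the library applies the flag internally.

```python
import copy


def expand_empty_not_ok(definition):
    """Return a copy of `definition` with the global flag pushed into each column."""
    expanded = copy.deepcopy(definition)
    if expanded.pop("empty_not_ok", False):
        for column in expanded.get("columns", []):
            # Keep an extra that the column already declares.
            column.setdefault("extra", "non_empty")
    return expanded


if __name__ == "__main__":
    definition = {
        "empty_not_ok": True,
        "columns": [
            {"name": "Column1"},
            {"name": "Column2", "format": "integer"},
        ],
    }
    for column in expand_empty_not_ok(definition)["columns"]:
        print(column)
```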

Unique Key (Duplicate Rows) Validation

You can enforce uniqueness across one or more columns by adding a root-level unique field in your YAML definition. This enables memory-efficient duplicate detection that scales to very large files.

  • Accepts a single column name (string) or a list of column names (array of strings).
  • Works with both plain and gzipped CSV files.
  • Uses a high-performance disk-backed key-value store (redb) under the hood to keep memory usage low.

Examples:

# Single-column unique key
unique: ID
columns:
  - name: ID
    format: positive integer
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$

# Composite unique key across multiple columns
unique: [isin, date, score_provider]
columns:
  - name: isin
    regex: ^[A-Z0-9]{12}$
  - name: date
    regex: ^\d{4}-\d{2}-\d{2}$
  - name: score_provider
    values: [sp1, sp2, sp3]
  - name: score
    format: decimal

What you will see in the summary:

  • Always shows the number of duplicate key groups found.
  • If there are more than 100 duplicate groups, only the first 100 are displayed as a sample (with their occurrence counts), and the total number is indicated.
  • If no duplicates are found, it reports: "No duplicates found".

Sample output (composite key):

UNIQUE KEY: column(s): isin, date, score_provider

DUPLICATED KEYS (groups found: 134)
--------------------------------------------------
Showing first 100 of 134 duplicate key groups:
  - (isin='US0004026250', date='2025-10-01', score_provider='sp1') -> occurrences: 3
  - (isin='US5949181045', date='2025-10-01', score_provider='sp2') -> occurrences: 4
  ... up to 100 lines ...

If none:

UNIQUE KEYS CHECK
--------------------------------------------------
  - No duplicates found

Performance note: Duplicate detection stores only the unique keys on disk and keeps a tiny in-memory summary used for printing the report. This allows validating huge datasets with minimal RAM usage.
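
The idea of counting keys on disk instead of in memory can be sketched with Python's standard library. The example below swaps in SQLite (via `sqlite3`) for redb, so it is an analogy rather than the library's implementation, and the `find_duplicate_keys` helper and the sample rows are hypothetical. Each row's key columns are joined into one string, a counter per key is kept in an on-disk table, and only keys seen more than once are returned.

```python
import csv
import io
import sqlite3
import tempfile
from pathlib import Path

KEY_SEPARATOR = "\x1f"  # unit separator: unlikely to appear inside CSV values


def find_duplicate_keys(csv_text, key_columns, separator=","):
    """Return [(key_tuple, occurrences)] for keys that appear more than once."""
    with tempfile.TemporaryDirectory() as tmp:
        # The counters live in a file on disk, not in a Python dict.
        conn = sqlite3.connect(str(Path(tmp) / "keys.db"))
        try:
            conn.execute(
                "CREATE TABLE seen (key TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
            )
            for row in csv.DictReader(io.StringIO(csv_text), delimiter=separator):
                key = KEY_SEPARATOR.join(row[column] for column in key_columns)
                conn.execute("INSERT OR IGNORE INTO seen VALUES (?, 0)", (key,))
                conn.execute(
                    "UPDATE seen SET occurrences = occurrences + 1 WHERE key = ?",
                    (key,),
                )
            conn.commit()
            duplicates = conn.execute(
                "SELECT key, occurrences FROM seen WHERE occurrences > 1 ORDER BY key"
            ).fetchall()
        finally:
            conn.close()
    return [(tuple(key.split(KEY_SEPARATOR)), count) for key, count in duplicates]


if __name__ == "__main__":
    sample = (
        "isin,date,score_provider,score\n"
        "US0004026250,2025-10-01,sp1,1.0\n"
        "US0004026250,2025-10-01,sp1,2.0\n"
        "US0004026250,2025-10-01,sp2,3.0\n"
    )
    for key, count in find_duplicate_keys(sample, ["isin", "date", "score_provider"]):
        print(key, "-> occurrences:", count)
```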

Column Separator

By default, the library uses comma (,) as the column separator. You can change this using the set_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_separator(";")  # Use semicolon as separator
# ...or, for tab-separated files:
validator.set_separator("\t")  # Use tab as separator

Decimal Separator

By default, the library uses point (.) as the decimal separator. You can change this using the set_decimal_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_decimal_separator(",")  # Use comma
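
The two separators interact: a file that writes decimals with a comma cannot also use an unquoted comma between columns. The standard-library sketch below illustrates why, using a hypothetical `parse_decimal_comma_rows` helper and made-up data. It does not call the library; with `csv_validation`, the equivalent setup would presumably be `set_separator(";")` together with `set_decimal_separator(",")`.

```python
import csv
import io


def parse_decimal_comma_rows(text, column_separator=";"):
    """Split rows on `column_separator` and read comma-decimal cells as floats."""
    reader = csv.reader(io.StringIO(text), delimiter=column_separator)
    header = next(reader)
    rows = [[float(cell.replace(",", ".")) for cell in row] for row in reader]
    return header, rows


if __name__ == "__main__":
    text = "price;weight\n1,5;2,75\n"

    # Correct: semicolon between columns, comma inside numbers.
    print(parse_decimal_comma_rows(text))

    # Wrong separator: the decimal commas are mistaken for column breaks.
    print(list(csv.reader(io.StringIO("1,5;2,75"))))
```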

Error Handling

The library provides detailed validation reports through Python's logging system. When a validation fails, you'll get information about:

  • Which columns failed validation
  • What type of validation failed
  • Sample of invalid values (including row numbers)
  • Number of rows that failed each validation

Example of validation report

VALIDATIONS SUMMARY
==================================================
FILE: /test.csv
Rows: 232230 | Columns: 5

CORRECT COLUMNS: 3/5
--------------------------------------------------
  - County: [✔] OK
      ✔ - regex: '^.*$'
  - City: [✔] OK
      ✔ - regex: '^.*$'
  - Model Year: [✔] OK
      ✔ - regex: '^[0-9]+$'
      ✔ - min: 1999
      ✔ - max: 2025.78

WRONG COLUMNS: 2/5
--------------------------------------------------
  - State: [✖] FAIL
      ✖ - values: 'WA', 'OR', 'NY', 'DC', 'CA', 'TX', 'FL', 'OK', 'MO', 'KS', 'VA', 'MA', 'MO', 'NC', 'IL', 'AL', 'WY', 'CO', 'PA', 'WI', 'MD', 'NV', 'AZ'
          "Wrong Rows: 93 | Sample: [row 12: 'GA'], [row 15: 'NJ'], [row 22: 'CT'], [row 23: 'NJ'], [row 24: 'CT']"
  - Postal Code: [✖] FAIL
      ✔ - integer
      ✖ - non_empty
          "Wrong Rows: 4 | Sample: [row 45: ''], [row 123: ''], [row 456: ''], [row 789: '']"

VALIDATION RESULT
--------------------------------------------------
[✖] FAIL: File /test.csv DOESN'T match all validations
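
Because the report goes through Python's logging system, it only becomes visible once logging is configured in your program. The sketch below attaches a handler to the root logger so that records from any logger are shown or written to a file. The logger name and level the library uses are not stated here, so `INFO` on the root logger is an assumption, and `enable_report_logging` is a hypothetical helper.

```python
import logging


def enable_report_logging(log_file=None, level=logging.INFO):
    """Send log records to `log_file`, or to stderr if no file is given."""
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(message)s"))
    root = logging.getLogger()  # root logger: catches records from any logger
    root.addHandler(handler)
    root.setLevel(level)  # assumed level; lower it if the report does not appear
    return handler


if __name__ == "__main__":
    enable_report_logging()
    # Stand-in for the library's output, just to show the handler works.
    logging.getLogger("example").info("VALIDATIONS SUMMARY")
```

Call `enable_report_logging()` before `validator.validate(...)` so the handler is in place when the report is emitted.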

Development

Prerequisites

  • Rust 2021 edition or later
  • Python 3.6+ (for Python bindings)
  • Cargo and standard Rust tooling

Building from Source (Rust version)

# Clone the repository
git clone https://github.com/charro/csv_validation
cd csv_validation

# Build the project
cargo build --release

# Run tests
cargo test

Dependencies

  • csv: CSV parsing
  • flate2: Compression support
  • pyo3: Python bindings
  • regex: Regular expression support
  • yaml-rust2: YAML parsing
  • redb: Disk-backed key-value storage used for memory-efficient duplicate detection
  • Various utilities for logging and serialization

License

MIT License

Copyright (c) 2024 CSV Validation Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
