CSV Validation Library

Blazing fast format validations for your CSV files

This is a Python library with a Rust core that validates huge CSV files (several GB) in seconds, or in a few minutes for really huge files, while using a minimal amount of memory.

Features

  • ✨ Validate both plain and gzipped CSV files
  • 🔍 Multiple validation types supported:
    • Correct column name and order
    • Regular expressions
    • Well-known formats (integer, decimal, etc.)
    • Minimum/Maximum numerical value checks
    • Value set validation (allowed values)
  • 🦀 + 🐍 Rust lib with Python bindings included
  • 📝 Detailed validation summaries with sample invalid values
  • 🚀 High performance with optimizations like regex pre-compilation
  • 📊 Support for large CSV files
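
To try the gzip support, a compressed test file can be produced with the standard library alone. The sketch below is only an illustration: the helper name write_gzipped_csv and the file name people.csv.gz are hypothetical, and how the library recognises a gzipped input (by extension or by content) is not stated here, so the .gz suffix is an assumption.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_gzipped_csv(path, header, rows):
    """Write a header and rows to a gzip-compressed CSV file."""
    # newline="" lets the csv module control the line endings itself
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return Path(path)


# Hypothetical sample data, written to a temporary directory
sample = write_gzipped_csv(
    Path(tempfile.mkdtemp()) / "people.csv.gz",
    ["Name", "Age"],
    [["Ada Lovelace", 36], ["Alan Turing", 41]],
)
print(sample)
```

The resulting path can then be passed to validate() in the same way as a plain CSV file.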

Installation

Python

pip install csv_validation
poetry add csv_validation
uv add csv_validation

Usage

Python

You can provide a file with the validation rules:

from csv_validation import CSVValidator

validator = CSVValidator.from_file("validation_rules.yaml")
is_valid = validator.validate("data.csv")

You can also create a validator from a string:

validation_rules = """
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""

validator = CSVValidator.from_string(validation_rules)
# Optionally set a custom column separator
validator.set_separator(";")  # Default is comma (,)

# Validate a CSV file
is_valid = validator.validate("data.csv")
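
When several files share the same rules, one validator can be reused across all of them. The following is a minimal sketch, assuming validate() returns a truthy value on success (as the is_valid name above suggests). The helper find_invalid_files and the _StubValidator class are hypothetical and exist only so the example runs without the library installed.

```python
from typing import Iterable, List


def find_invalid_files(validator, paths: Iterable[str]) -> List[str]:
    """Return the paths for which validator.validate(path) is falsy."""
    invalid = []
    for path in paths:
        # validate() is the documented entry point; its result is
        # treated as a boolean here (an assumption)
        if not validator.validate(path):
            invalid.append(path)
    return invalid


class _StubValidator:
    """Hypothetical stand-in for CSVValidator, for demonstration only."""

    def __init__(self, bad):
        self._bad = set(bad)

    def validate(self, path):
        return path not in self._bad


stub = _StubValidator(bad=["b.csv"])
print(find_invalid_files(stub, ["a.csv", "b.csv", "c.csv"]))
```

With the real library, the stub would be replaced by CSVValidator.from_file("validation_rules.yaml").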

Validation Definition Format

Create a small, easy-to-read YAML file with your validation rules. Example:

columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$  # Letters and spaces, 2-50 characters
  - name: Family Name
    regex: ^[A-Za-z\s'-]{2,50}$  # Letters, spaces, hyphens and apostrophes
  - name: Age
    format: positive_integer  # Using predefined format instead of custom regex
    max: 120
  - name: Salary
    format: integer  # Allows negative integers too
    min: 20000
  - name: Email
    regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$  # Standard email format
  - name: Phone
    regex: ^\+?[0-9]{10,15}$  # International phone format with optional +
  - name: Status
    values: [active, inactive, pending, suspended]  # Only these values are allowed
  - name: Gender
    values: [M, F, NB, O]  # M: Male, F: Female, NB: Non-Binary, O: Other
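
Before committing patterns to a rules file, it can help to try them against a few sample values. The sketch below does this with Python's re module, which is only an approximation: the library's core uses the Rust regex crate, which does not support lookaround or backreferences, so patterns relying on those features will not carry over. The sample values and the check helper are hypothetical.

```python
import re

# Patterns copied from the YAML example above
patterns = {
    "Name": r"^[A-Za-z\s]{2,50}$",
    "Family Name": r"^[A-Za-z\s'-]{2,50}$",
    "Email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "Phone": r"^\+?[0-9]{10,15}$",
}

# Hypothetical sample values: the first of each list should pass
samples = {
    "Name": ["Ada", "A", "R2D2"],
    "Family Name": ["O'Neil-Smith", "X"],
    "Email": ["ada@example.com", "ada@example"],
    "Phone": ["+34123456789", "12345"],
}


def check(column, value):
    """True when the whole value matches the column's pattern."""
    return re.fullmatch(patterns[column], value) is not None


results = {col: [check(col, v) for v in vals] for col, vals in samples.items()}
for col, flags in results.items():
    print(col, list(zip(samples[col], flags)))
```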

Validation Types

  1. Regular Expression (regex)

    • Validate fields against custom regex patterns
  2. Format (format)

    • Predefined formats for common validations. This is the recommended way to validate numeric fields.
    • In the background, the library uses regex patterns to validate formats.
    • Available formats:
      • integer: Validates any integer number (positive or negative)
      • positive integer: Validates positive integer numbers
      • negative integer: Validates negative integer numbers
      • decimal/decimal point: Validates any decimal number (positive or negative) using point as decimal separator
      • negative decimal point: Validates negative decimal numbers using point as decimal separator
      • decimal comma: Validates decimal numbers using comma as decimal separator
      • positive decimal/positive decimal point: Validates positive decimal numbers using point as decimal separator
      • positive decimal comma: Validates positive decimal numbers using comma as decimal separator
      • decimal scientific: Validates decimal numbers in scientific notation (e.g. 23.02e-12)
      • decimal scientific comma: Validates decimal numbers in scientific notation using comma as decimal separator
      • positive decimal scientific: Validates positive decimal numbers in scientific notation
    • More formats will be added in upcoming versions
  3. Minimum Value (min)

    • Check if numeric fields are greater than or equal to a specified value
  4. Maximum Value (max)

    • Check if numeric fields are less than or equal to a specified value
  5. Value Set (values)

    • Ensure fields only contain values from a predefined set
  6. Extra (extra)

    • Extra validations that normally complement one of the other types
    • Available extras:
      • non_empty: Validates that field contains at least one character

You can add as many validation types as you want to the same column, but the column is only considered correct if all of its validations pass.
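
The "all validations must pass" rule can be pictured with a small pure-Python model. This is not the library's implementation (which lives in Rust); the helper names are hypothetical, and the pattern [0-9]+ is only a stand-in for a positive-integer format.

```python
import re
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def regex_check(pattern: str) -> Check:
    compiled = re.compile(pattern)  # compiled once, reused for every value
    return lambda value: compiled.fullmatch(value) is not None


def max_check(maximum: float) -> Check:
    return lambda value: float(value) <= maximum


def values_check(allowed: List[str]) -> Check:
    allowed_set = set(allowed)
    return lambda value: value in allowed_set


def failed_checks(value: str, checks: Dict[str, Check]) -> List[str]:
    """Names of the checks that reject the value; empty means it is valid."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check(value)
        except ValueError:  # float() could not parse the value
            ok = False
        if not ok:
            failed.append(name)
    return failed


age_checks = {"format": regex_check(r"[0-9]+"), "max": max_check(120)}
status_checks = {
    "values": values_check(["active", "inactive", "pending", "suspended"])
}

for value in ["42", "130", "abc"]:
    print(value, failed_checks(value, age_checks))
```

A value counts as valid only when the returned list is empty, which mirrors how a column fails as soon as any one of its validations rejects a row.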

Global Validations

empty_not_ok

Empty values (empty string '') are considered correct and accepted by default. However, if your data must always have some content, you can add a global empty_not_ok flag at the root level of your YAML definition to automatically add the extra: non_empty validation to all columns:

empty_not_ok: true
columns:
  - name: Column1
    # other validations...
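
The effect of the flag can be illustrated with the two patterns that appear in the sample report in the Error Handling section: the integer format is printed there as ^$|^-?\d+$ (which accepts the empty string) and non_empty as ^.+$. The sketch below applies them with Python's re module; the passes helper is hypothetical and only models the behaviour described above.

```python
import re

INTEGER = re.compile(r"^$|^-?\d+$")  # alias "integer" in the sample report
NON_EMPTY = re.compile(r"^.+$")      # alias "non_empty" in the sample report


def passes(value, empty_not_ok=False):
    """Model of an integer column, optionally with non_empty added."""
    checks = [INTEGER]
    if empty_not_ok:
        checks.append(NON_EMPTY)
    return all(pattern.search(value) is not None for pattern in checks)


for value in ["98101", "", "12a"]:
    print(repr(value), passes(value), passes(value, empty_not_ok=True))
```

The empty string is accepted by the integer format alone and rejected once non_empty is added, while a malformed value such as 12a fails either way.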

Unique Key (Duplicate Rows) Validation

You can enforce uniqueness across one or more columns by adding a root-level unique field to your YAML definition. This enables memory-efficient duplicate detection that scales to very large files.

  • Accepts a single column name (string) or a list of column names (array of strings).
  • Works with both plain and gzipped CSV files.
  • Uses a high-performance disk-backed key-value store (redb) under the hood to keep memory usage low.

Examples:

# Single-column unique key
unique: ID
columns:
  - name: ID
    format: positive integer
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$

# Composite unique key across multiple columns
unique: [isin, date, score_provider]
columns:
  - name: isin
    regex: ^[A-Z0-9]{12}$
  - name: date
    regex: ^\d{4}-\d{2}-\d{2}$
  - name: score_provider
    values: [sp1, sp2, sp3]
  - name: score
    format: decimal

What you will see in the summary:

  • Always shows the number of duplicate key groups found.
  • If there are more than 100 duplicate groups, only the first 100 are displayed as a sample (with their occurrence counts), and the total number is indicated.
  • If no duplicates are found, it reports: "Duplicates found: 0".

Sample output (composite key):

UNIQUE KEY: column(s): ["isin", "date", "score_provider"]

DUPLICATED KEYS (groups found: 134)
--------------------------------------------------
Showing first 100 of 134 duplicate key groups:
  - (isin='US0004026250', date='2025-10-01', score_provider='sp1') -> occurrences: 3
  - (isin='US5949181045', date='2025-10-01', score_provider='sp2') -> occurrences: 4
  ... up to 100 lines ...

If none:

UNIQUE KEYS CHECK
--------------------------------------------------
  - No Duplicates found

Performance note: Duplicate detection stores only the unique keys on disk and keeps a tiny in-memory summary used for printing the report. This allows validating huge datasets with minimal RAM usage.
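
What counts as a duplicate key group can be sketched in a few lines of Python. Note the swapped-in technique: this example counts keys in memory with collections.Counter, whereas the library keeps them in a disk-backed store, so the sketch only illustrates the grouping logic and would not scale the same way. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def duplicate_key_groups(csv_text, key_columns, delimiter=","):
    """Map each key tuple seen more than once to its occurrence count."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    counts = Counter(tuple(row[col] for col in key_columns) for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical data: the first two rows share the same composite key
data = """isin,date,score_provider,score
US0004026250,2025-10-01,sp1,1.5
US0004026250,2025-10-01,sp1,1.7
US0004026250,2025-10-01,sp2,2.0
US5949181045,2025-10-01,sp2,3.1
"""

dups = duplicate_key_groups(data, ["isin", "date", "score_provider"])
for key, occurrences in dups.items():
    print(key, "-> occurrences:", occurrences)
```

Rows that differ in any key column (here, the score provider) belong to different groups, even when the remaining key columns are identical.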

Column Separator

By default, the library uses comma (,) as the column separator. You can change this using the set_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_separator(";")  # Use semicolon as separator
validator.set_separator("\t")  # Use tab as separator

Decimal Separator

By default, the library uses point (.) as the decimal separator. You can change this using the set_decimal_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_decimal_separator(",")  # Use comma
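
The two settings usually go together: when values use a comma as the decimal separator, the column separator generally has to be something else, such as a semicolon. The sketch below builds such a file with the standard library and reads it back. The file name, sample rows, and the to_float helper are hypothetical, and the validator calls appear only as comments so the example runs without the library.

```python
import csv
import tempfile
from pathlib import Path

rows = [["Product", "Price"], ["Widget", "12,50"], ["Gadget", "-3,75"]]
path = Path(tempfile.mkdtemp()) / "prices.csv"

# Semicolon-separated, so the decimal commas need no quoting
with path.open("w", newline="", encoding="utf-8") as fh:
    csv.writer(fh, delimiter=";").writerows(rows)


def to_float(text, decimal_separator=","):
    """Parse a number written with a custom decimal separator."""
    return float(text.replace(decimal_separator, "."))


with path.open(newline="", encoding="utf-8") as fh:
    body = list(csv.reader(fh, delimiter=";"))[1:]  # skip the header row
parsed = [(name, to_float(price)) for name, price in body]
print(parsed)

# A validator for this file would be configured with both methods shown above:
#   validator.set_separator(";")
#   validator.set_decimal_separator(",")
```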

Error Handling

The library provides detailed validation reports through Python's logging system. When a validation fails, you'll get information about:

  • Which columns failed validation
  • What type of validation failed
  • Sample of invalid values
  • Number of rows that failed each validation

Example of validation report

VALIDATIONS SUMMARY
==================================================
FILE: /test.csv
Rows: 232230 | Columns: 5

CORRECT COLUMNS: 3/5
--------------------------------------------------
  - County: [✔] OK
      ✔ - RegularExpression { expression: "^.*$", alias: "regex" }
  - City: [✔] OK
      ✔ - RegularExpression { expression: "^.*$", alias: "regex" }
  - Model Year: [✔] OK
      ✔ - RegularExpression { expression: "^[0-9]+$", alias: "regex" }
      ✔ - Min(1999.0)
      ✔ - Max(2025.78)

WRONG COLUMNS: 2/5
--------------------------------------------------
  - State: [✖] FAIL
      ✖ - Values(["WA", "OR", "NY", "DC", "CA", "TX", "FL", "OK", "MO", "KS", "VA", "MA", "MO", "NC", "IL", "AL", "WY", "CO", "PA", "WI", "MD", "NV", "AZ"])
          "Wrong Rows: 93 | Sample: 'GA','NJ','CT','NJ','CT','CT','CT','NE','CT','NH'"
  - Postal Code: [✖] FAIL
      ✔ - RegularExpression { expression: "^$|^-?\\d+$", alias: "integer" }
      ✖ - RegularExpression { expression: "^.+$", alias: "non_empty" }
          "Wrong Rows: 4 | Sample: '','','',''"

VALIDATION RESULT
--------------------------------------------------
[✖] FAIL: File DOESN'T match all validations
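
Because the report is emitted through Python's logging system, it only becomes visible once logging is configured. The following is a minimal sketch that routes log records to a file. It assumes the library's records propagate to the root logger and are emitted at INFO level or above, neither of which is confirmed here; the logger name "demo" and the placeholder message are hypothetical.

```python
import logging
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "validation.log"

# force=True replaces any handlers configured earlier (Python 3.8+)
logging.basicConfig(
    filename=str(log_path),
    level=logging.INFO,  # assumed level; lower it if the report is missing
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    force=True,
)

# Placeholder record. With the library installed, the report lines would be
# expected to arrive in the same file when validator.validate(...) runs.
logging.getLogger("demo").info("VALIDATIONS SUMMARY")

logging.shutdown()  # flush and close the file handler
print(log_path.read_text(encoding="utf-8"))
```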

Development

Prerequisites

  • Rust 2021 edition or later
  • Python 3.10+ (for Python bindings; the published wheels are tagged cp310-abi3, i.e. CPython 3.10+)
  • Cargo and standard Rust tooling

Building from Source (Rust version)

# Clone the repository
git clone https://github.com/charro/csv_validation
cd csv_validation

# Build the project
cargo build --release

# Run tests
cargo test

Dependencies

  • csv: CSV parsing
  • flate2: Compression support
  • pyo3: Python bindings
  • regex: Regular expression support
  • yaml-rust2: YAML parsing
  • redb: Disk-backed key-value storage used for memory-efficient duplicate detection
  • Various utilities for logging and serialization

License

MIT License

Copyright (c) 2024 CSV Validation Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
