CSV Validation Library

Blazing fast format validations for your CSV files

A Python library with a Rust core that validates huge CSV files (several GB) in seconds, or in a few minutes for really huge files, while using a minimal amount of memory.

Features

  • ✨ Validate both plain and gzipped CSV files
  • 🔍 Multiple validation types supported:
    • Correct column name and order
    • Regular expressions
    • Well-known formats (integer, decimal, etc.)
    • Minimum/Maximum numerical value checks
    • Value set validation (allowed values)
  • 🦀 + 🐍 Rust lib with Python bindings included
  • 📝 Detailed validation summaries with sample invalid values
  • 🚀 High performance with optimizations like regex pre-compilation
  • 📊 Support for large CSV files

Installation

Python

pip install csv_validation
poetry add csv_validation
uv add csv_validation

Usage

Python

You can provide a file with the validation rules:

from csv_validation import CSVValidator

validator = CSVValidator.from_file("validation_rules.yaml")
is_valid = validator.validate("data.csv")

You can also create a validator from a string:

# Raw string: keeps the regex backslash (\s) intact instead of treating it as a Python escape
validation_rules = r"""
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""

validator = CSVValidator.from_string(validation_rules)
# Optionally set a custom column separator
validator.set_separator(";")  # Default is comma (,)

# Validate a CSV file
is_valid = validator.validate("data.csv")
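
To try the calls above end to end, you need a small data file and a matching rules file. The following is a minimal sketch that writes both into a temporary directory using only the standard library. The helper name `write_sample_files` and the sample rows are illustrative assumptions; the only `CSVValidator` calls used are the ones shown above, and they run only if the package is installed.

```python
import csv
import tempfile
from pathlib import Path

# Rules copied from the example above; the raw string keeps the regex backslash intact.
RULES = r"""
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""


def write_sample_files(directory):
    """Write a rules file and a tiny CSV file; return both paths (hypothetical helper)."""
    directory = Path(directory)
    rules_path = directory / "validation_rules.yaml"
    data_path = directory / "data.csv"
    rules_path.write_text(RULES, encoding="utf-8")
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "Age"])  # header follows the column names and order in the rules
        writer.writerow(["Ada Lovelace", "36"])
        writer.writerow(["Alan Turing", "41"])
    return rules_path, data_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rules_path, data_path = write_sample_files(tmp)
        try:
            from csv_validation import CSVValidator
        except ImportError:
            print("csv_validation is not installed; sample files were written only")
        else:
            validator = CSVValidator.from_file(str(rules_path))
            print(validator.validate(str(data_path)))
```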

Validation Definition Format

Create a small, easy-to-read YAML file with your validation rules. Example:

columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$  # Letters and spaces, 2-50 characters
  - name: Family Name
    regex: ^[A-Za-z\s'-]{2,50}$  # Letters, spaces, hyphens and apostrophes
  - name: Age
    format: positive_integer  # Using predefined format instead of custom regex
    max: 120
  - name: Salary
    format: integer  # Allows negative integers too
    min: 20000
  - name: Email
    regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$  # Standard email format
  - name: Phone
    regex: ^\+?[0-9]{10,15}$  # International phone format with optional +
  - name: Status
    values: [active, inactive, pending, suspended]  # Only these values are allowed
  - name: Gender
    values: [M, F, NB, O]  # M: Male, F: Female, NB: Non-Binary, O: Other

Validation Types

  1. Regular Expression (regex)

    • Validate fields against custom regex patterns
  2. Format (format)

    • Predefined formats for common validations (This is the recommended way to validate numeric fields)
    • In the background, the library uses regex patterns to validate formats.
    • Available formats:
      • integer: Validates any integer number (positive or negative)
      • positive integer: Validates positive integer numbers
      • negative integer: Validates negative integer numbers
      • decimal/decimal point: Validates any decimal number (positive or negative) using point as decimal separator
      • negative decimal point: Validates negative decimal numbers using point as decimal separator
      • decimal comma: Validates decimal numbers using comma as decimal separator
      • positive decimal/positive decimal point: Validates positive decimal numbers using point as decimal separator
      • positive decimal comma: Validates positive decimal numbers using comma as decimal separator
      • decimal scientific: Validates decimal numbers in scientific notation (e.g. 23.02e-12)
      • decimal scientific comma: Validates decimal numbers in scientific notation using comma as decimal separator
      • positive decimal scientific: Validates positive decimal numbers in scientific notation
    • More formats will be added in upcoming versions
  3. Minimum Value (min)

    • Check if numeric fields are greater than or equal to a specified value
  4. Maximum Value (max)

    • Check if numeric fields are less than or equal to a specified value
  5. Value Set (values)

    • Ensure fields only contain values from a predefined set
  6. Extra (extra)

    • Extra validations that normally complement one of the other types
    • Available extras:
      • non_empty: Validates that field contains at least one character

You can combine as many validation types as you want on the same column, but the column is considered correct only if all of its validations pass.
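
The "all validations must pass" rule can be pictured with a few lines of plain Python. This is only an illustrative sketch, not the library's Rust implementation: the function name `failed_rules`, the dictionary representation of a column's rules, and the way non-numeric values are treated by `min`/`max` are assumptions made for the example.

```python
import re


def failed_rules(value, rules):
    """Return the names of the rules that `value` violates (empty list means valid)."""
    if value == "":
        # Empty values are accepted by default; only `extra: non_empty` rejects them.
        return ["non_empty"] if rules.get("extra") == "non_empty" else []
    failed = []
    if "regex" in rules and re.search(rules["regex"], value) is None:
        failed.append("regex")
    if "values" in rules and value not in rules["values"]:
        failed.append("values")
    if "min" in rules or "max" in rules:
        try:
            number = float(value)
        except ValueError:
            number = None  # assumption: a non-numeric value fails any numeric bound
        if "min" in rules and (number is None or number < rules["min"]):
            failed.append("min")
        if "max" in rules and (number is None or number > rules["max"]):
            failed.append("max")
    return failed


age_rules = {"regex": r"^[0-9]+$", "max": 120}
print(failed_rules("36", age_rules))   # []
print(failed_rules("130", age_rules))  # ['max']
print(failed_rules("abc", age_rules))  # ['regex', 'max']
```

A column would then count as correct only when `failed_rules` returns an empty list for every row.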

Global Validations

empty_not_ok

Empty values (empty string '') are considered correct and accepted by default. However, if your data must always have some content, you can add a global empty_not_ok flag at the root level of your YAML definition to automatically add the extra: non_empty validation to all columns:

empty_not_ok: true
columns:
  - name: Column1
    # other validations...
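
Conceptually, the flag is shorthand for writing `extra: non_empty` on every column. The sketch below shows that expansion on a definition that has already been parsed into Python dictionaries; the function name `apply_empty_not_ok` and the dictionary layout are assumptions for illustration, not the library's internals.

```python
import copy


def apply_empty_not_ok(definition):
    """Return a copy of the definition with the global flag expanded onto each column."""
    expanded = copy.deepcopy(definition)
    if expanded.pop("empty_not_ok", False):
        for column in expanded.get("columns", []):
            # Keep an explicit per-column `extra` if one was already set.
            column.setdefault("extra", "non_empty")
    return expanded


definition = {
    "empty_not_ok": True,
    "columns": [{"name": "Column1", "format": "integer"}, {"name": "Column2"}],
}
print(apply_empty_not_ok(definition)["columns"])
```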

Unique Key (Duplicate Rows) Validation

You can enforce uniqueness across one or more columns by adding a root-level unique field in your YAML definition. This enables memory-efficient duplicate detection that scales to very large files.

  • Accepts a single column name (string) or a list of column names (array of strings).
  • Works with both plain and gzipped CSV files.
  • Uses a high-performance disk-backed key-value store (redb) under the hood to keep memory usage low.

Examples:

# Single-column unique key
unique: ID
columns:
  - name: ID
    format: positive integer
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$

# Composite unique key across multiple columns
unique: [isin, date, score_provider]
columns:
  - name: isin
    regex: ^[A-Z0-9]{12}$
  - name: date
    regex: ^\d{4}-\d{2}-\d{2}$
  - name: score_provider
    values: [sp1, sp2, sp3]
  - name: score
    format: decimal

What you will see in the summary:

  • Always shows the number of duplicate key groups found.
  • If there are more than 100 duplicate groups, only the first 100 are displayed as a sample (with their occurrence counts), and the total number is indicated.
  • If no duplicates are found, it reports: "Duplicates found: 0".

Sample output (composite key):

UNIQUE KEY: column(s): ["isin", "date", "score_provider"]

DUPLICATED KEYS (groups found: 134)
--------------------------------------------------
Showing first 100 of 134 duplicate key groups:
  - (isin='US0004026250', date='2025-10-01', score_provider='sp1') -> occurrences: 3
  - (isin='US5949181045', date='2025-10-01', score_provider='sp2') -> occurrences: 4
  ... up to 100 lines ...

If none:

UNIQUE KEYS CHECK
--------------------------------------------------
  - No Duplicates found

Performance note: Duplicate detection stores only the unique keys on disk and keeps a tiny in-memory summary used for printing the report. This allows validating huge datasets with minimal RAM usage.
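
To make the idea of "duplicate key groups" concrete, here is a small in-memory sketch that counts rows per composite key and keeps a capped sample, in the spirit of the summary described above. It uses a `collections.Counter` held in RAM, so unlike the library's disk-backed redb store it is only suitable for small inputs; the function name `duplicate_key_groups` is hypothetical.

```python
import csv
import io
from collections import Counter


def duplicate_key_groups(rows, key_columns, sample_size=100):
    """Count rows per key; return (number of duplicate groups, sample of groups)."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    duplicates = [(key, n) for key, n in counts.items() if n > 1]
    return len(duplicates), duplicates[:sample_size]


data = io.StringIO(
    "isin,date,score_provider,score\n"
    "US0004026250,2025-10-01,sp1,1.5\n"
    "US0004026250,2025-10-01,sp1,1.7\n"
    "US5949181045,2025-10-01,sp2,2.0\n"
)
total, sample = duplicate_key_groups(
    csv.DictReader(data), ["isin", "date", "score_provider"]
)
print(total)   # 1
print(sample)  # [(('US0004026250', '2025-10-01', 'sp1'), 2)]
```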

Column Separator

By default, the library uses comma (,) as the column separator. You can change this using the set_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_separator(";")  # Use semicolon as separator
# Or, for tab-separated files:
validator.set_separator("\t")  # Use tab as separator

Decimal Separator

By default, the library uses point (.) as the decimal separator. You can change this using the set_decimal_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_decimal_separator(",")  # Use comma
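
One way to picture why the decimal separator matters: before a value can be compared against numeric bounds such as min or max, it has to be read as a number, and the separator decides how. The helper below is a hypothetical illustration of that step in plain Python, not the library's code.

```python
def parse_decimal(text, decimal_separator="."):
    """Convert `text` to float, treating `decimal_separator` as the decimal mark."""
    if decimal_separator != ".":
        if "." in text:
            # With a comma as decimal mark, a point is not a valid part of the number here.
            raise ValueError(f"unexpected '.' in {text!r}")
        text = text.replace(decimal_separator, ".", 1)
    return float(text)


print(parse_decimal("1999.5"))       # 1999.5
print(parse_decimal("1999,5", ","))  # 1999.5
```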

Error Handling

The library provides detailed validation reports through Python's logging system. When a validation fails, you'll get information about:

  • Which columns failed validation
  • What type of validation failed
  • Sample of invalid values
  • Number of rows that failed each validation
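
Because the report goes through Python's logging system, it only becomes visible once logging is configured. The sketch below attaches a console handler and a file handler to the root logger using the standard `logging` module. It assumes the report records propagate to the root logger at INFO level or above, which this document does not confirm; the helper name `configure_report_logging`, the log file name, and the `demo` logger are illustrative.

```python
import logging
import os
import tempfile


def configure_report_logging(log_path, level=logging.INFO):
    """Send log records to the console and to `log_path` (hypothetical helper)."""
    root = logging.getLogger()
    root.setLevel(level)
    formatter = logging.Formatter("%(message)s")
    handlers = [
        logging.StreamHandler(),
        logging.FileHandler(log_path, encoding="utf-8"),
    ]
    for handler in handlers:
        handler.setFormatter(formatter)
        root.addHandler(handler)
    return handlers


log_path = os.path.join(tempfile.mkdtemp(), "validation_report.log")
handlers = configure_report_logging(log_path)
# Stand-in record; call validator.validate(...) at this point in real use.
logging.getLogger("demo").info("VALIDATIONS SUMMARY")
```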

Example of validation report

VALIDATIONS SUMMARY
==================================================
FILE: /test.csv
Rows: 232230 | Columns: 5

CORRECT COLUMNS: 3/5
--------------------------------------------------
  - County: [✔] OK
      ✔ - RegularExpression { expression: "^.*$", alias: "regex" }
  - City: [✔] OK
      ✔ - RegularExpression { expression: "^.*$", alias: "regex" }
  - Model Year: [✔] OK
      ✔ - RegularExpression { expression: "^[0-9]+$", alias: "regex" }
      ✔ - Min(1999.0)
      ✔ - Max(2025.78)

WRONG COLUMNS: 2/5
--------------------------------------------------
  - State: [✖] FAIL
      ✖ - Values(["WA", "OR", "NY", "DC", "CA", "TX", "FL", "OK", "MO", "KS", "VA", "MA", "MO", "NC", "IL", "AL", "WY", "CO", "PA", "WI", "MD", "NV", "AZ"])
          "Wrong Rows: 93 | Sample: 'GA','NJ','CT','NJ','CT','CT','CT','NE','CT','NH'"
  - Postal Code: [✖] FAIL
      ✔ - RegularExpression { expression: "^$|^-?\\d+$", alias: "integer" }
      ✖ - RegularExpression { expression: "^.+$", alias: "non_empty" }
          "Wrong Rows: 4 | Sample: '','','',''"

VALIDATION RESULT
--------------------------------------------------
[✖] FAIL: File DOESN'T match all validations
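
If you capture the report text (for example, from a log file), the failing columns can be pulled out with a regular expression. This is a rough sketch that relies on the line layout shown in the example above; that layout is not a documented interface and may change between versions, and the function name `failed_columns` is hypothetical.

```python
import re

# Matches report lines such as "  - State: [✖] FAIL".
FAILED_COLUMN = re.compile(r"^[ \t]*- (?P<name>.+): \[✖\] FAIL[ \t]*$", re.MULTILINE)


def failed_columns(report_text):
    """Return the names of the columns marked as FAIL in a report."""
    return [match.group("name") for match in FAILED_COLUMN.finditer(report_text)]


report = """\
CORRECT COLUMNS: 1/3
  - County: [✔] OK
WRONG COLUMNS: 2/3
  - State: [✖] FAIL
  - Postal Code: [✖] FAIL
[✖] FAIL: File DOESN'T match all validations
"""
print(failed_columns(report))  # ['State', 'Postal Code']
```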

Development

Prerequisites

  • Rust 2021 edition or later
  • Python 3.6+ (for Python bindings)
  • Cargo and standard Rust tooling

Building from Source (Rust version)

# Clone the repository
git clone https://github.com/charro/csv_validation
cd csv_validation

# Build the project
cargo build --release

# Run tests
cargo test

Dependencies

  • csv: CSV parsing
  • flate2: Compression support
  • pyo3: Python bindings
  • regex: Regular expression support
  • yaml-rust2: YAML parsing
  • redb: Disk-backed key-value storage used for memory-efficient duplicate detection
  • Various utilities for logging and serialization

License

MIT License

Copyright (c) 2024 CSV Validation Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
