CSV Validation Library

Blazing fast format validations for your CSV files

A Python library with a Rust core that validates huge CSV files (several GB) in seconds, or in a few minutes for extremely large ones, while using very little memory.

Features

  • ✨ Validate both plain and gzipped CSV files
  • 🔍 Multiple validation types supported:
    • Correct column name and order
    • Regular expressions
    • Well-known formats (integer, decimal, etc.)
    • Minimum/Maximum numerical value checks
    • Value set validation (allowed values)
  • 🦀 + 🐍 Rust lib with Python bindings included
  • 📝 Detailed validation summaries with sample invalid values
  • 🚀 High performance with optimizations like regex pre-compilation
  • 📊 Support for large CSV files

Installation

Python

pip install csv_validation
poetry add csv_validation
uv add csv_validation

Usage

Python

You can provide a file with the validation rules:

from csv_validation import CSVValidator

validator = CSVValidator.from_file("validation_rules.yaml")
is_valid = validator.validate("data.csv")

You can also create a validator from a string:

# Raw string, so the backslash in the regex (\s) is passed through unchanged
validation_rules = r"""
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""

validator = CSVValidator.from_string(validation_rules)
# Optionally set a custom column separator
validator.set_separator(";")  # Default is comma (,)

# Validate a CSV file
is_valid = validator.validate("data.csv")
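
To try the snippet above end to end, a small test file helps. The sketch below writes a three-line `data.csv` with the standard library and runs the validator only if `csv_validation` can be imported. The helper names `write_demo_csv` and `run_demo` and the sample rows are made up for illustration, and treating the result of `validate` as a boolean is an assumption based on the `is_valid` naming above.

```python
import csv
import tempfile
from pathlib import Path

# Same rules as in the example above, kept in a raw string for the regex.
RULES = r"""
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""


def write_demo_csv(directory):
    """Write a tiny CSV whose header matches RULES (hypothetical helper)."""
    path = Path(directory) / "data.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "Age"])
        writer.writerow(["Ada Lovelace", "36"])
        writer.writerow(["Alan Turing", "41"])
    return path


def run_demo():
    """Validate the demo file, or return None if the library is missing."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = write_demo_csv(tmp)
        try:
            from csv_validation import CSVValidator
        except ImportError:
            print("csv_validation is not installed; skipping validation")
            return None
        validator = CSVValidator.from_string(RULES)
        return validator.validate(str(csv_path))
```

Calling `run_demo()` from a script or a REPL should then show the validation report for the generated file.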

Validation Definition Format

Create a small, easy-to-read YAML file with your validation rules. Example:

columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$  # Letters and spaces, 2-50 characters
  - name: Family Name
    regex: ^[A-Za-z\s'-]{2,50}$  # Letters, spaces, hyphens and apostrophes
  - name: Age
    format: positive_integer  # Using predefined format instead of custom regex
    max: 120
  - name: Salary
    format: integer  # Allows negative integers too
    min: 20000
  - name: Email
    regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$  # Standard email format
  - name: Phone
    regex: ^\+?[0-9]{10,15}$  # International phone format with optional +
  - name: Status
    values: [active, inactive, pending, suspended]  # Only these values are allowed
  - name: Gender
    values: [M, F, NB, O]  # M: Male, F: Female, NB: Non-Binary, O: Other

Validation Types

  1. Regular Expression (regex)

    • Validate fields against custom regex patterns
  2. Format (format)

    • Predefined formats for common validations (the recommended way to validate numeric fields)
    • Internally, the library validates formats with regex patterns.
    • Available formats:
      • integer: Validates any integer number (positive or negative)
      • positive integer: Validates positive integer numbers
      • negative integer: Validates negative integer numbers
      • decimal/decimal point: Validates any decimal number (positive or negative) using point as decimal separator
      • negative decimal point: Validates negative decimal numbers using point as decimal separator
      • decimal comma: Validates decimal numbers using comma as decimal separator
      • positive decimal/positive decimal point: Validates positive decimal numbers using point as decimal separator
      • positive decimal comma: Validates positive decimal numbers using comma as decimal separator
      • decimal scientific: Validates decimal numbers in scientific notation (e.g. 23.02e-12)
      • decimal scientific comma: Validates decimal numbers in scientific notation using comma as decimal separator
      • positive decimal scientific: Validates positive decimal numbers in scientific notation
    • More formats will be added in upcoming versions
  3. Minimum Value (min)

    • Check if numeric fields are greater than or equal to a specified value
  4. Maximum Value (max)

    • Check if numeric fields are less than or equal to a specified value
  5. Value Set (values)

    • Ensure fields only contain values from a predefined set
  6. Extra (extra)

    • Extra validations that normally complement one of the other types
    • Available extras:
      • non_empty: Validates that field contains at least one character

You can add as many validation types as you want to the same column. The column is considered correct only if all of its validations pass.
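
The way several checks combine on one column can be pictured in plain Python. This is only an illustration of the semantics described above, not the library's Rust implementation. The function `check_value`, the rule dictionary layout, the two regex patterns in `FORMATS`, and the treatment of non-numeric values under `min`/`max` are all assumptions made for the example.

```python
import re

# Hypothetical patterns for two of the formats; the library's own may differ.
FORMATS = {
    "integer": re.compile(r"^-?\d+$"),
    "positive integer": re.compile(r"^\d+$"),
}


def check_value(value, rule):
    """Return the names of the checks that `value` fails for one column rule."""
    if value == "":
        # Empty values are accepted unless non_empty is requested.
        return ["non_empty"] if rule.get("extra") == "non_empty" else []
    failed = []
    if "regex" in rule and not re.search(rule["regex"], value):
        failed.append("regex")
    if "format" in rule and not FORMATS[rule["format"]].match(value):
        failed.append("format")
    if "values" in rule and value not in rule["values"]:
        failed.append("values")
    if "min" in rule or "max" in rule:
        try:
            number = float(value)
        except ValueError:
            number = None
        if number is None:
            # Assumption: a non-numeric value fails every numeric bound.
            failed.extend(k for k in ("min", "max") if k in rule)
        else:
            if "min" in rule and number < rule["min"]:
                failed.append("min")
            if "max" in rule and number > rule["max"]:
                failed.append("max")
    return failed


age_rule = {"format": "positive integer", "max": 120}
print(check_value("36", age_rule))   # []
print(check_value("130", age_rule))  # ['max']
print(check_value("-4", age_rule))   # ['format']
print(check_value("", age_rule))     # []
```

A column passes only when the returned list is empty for every row, which mirrors the "all validations must pass" rule.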

Global Validations

empty_not_ok

Empty values (empty string '') are considered correct and accepted by default. However, if your data must always have some content, you can add a global empty_not_ok flag at the root level of your YAML definition to automatically add the extra: non_empty validation to all columns:

empty_not_ok: true
columns:
  - name: Column1
    # other validations...
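
In effect, the flag behaves as if `extra: non_empty` had been written on every column. A minimal sketch of that expansion on an already-parsed definition follows. The definition is a plain dictionary here, and `apply_empty_not_ok` is a hypothetical helper, not part of the library.

```python
import copy


def apply_empty_not_ok(definition):
    """Return a copy where every column has `extra: non_empty` if the flag is set."""
    expanded = copy.deepcopy(definition)
    if expanded.pop("empty_not_ok", False):
        for column in expanded["columns"]:
            column["extra"] = "non_empty"
    return expanded


definition = {
    "empty_not_ok": True,
    "columns": [{"name": "Column1"}, {"name": "Column2", "format": "integer"}],
}
for column in apply_empty_not_ok(definition)["columns"]:
    print(column)
```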

Unique Key (Duplicate Rows) Validation

You can enforce uniqueness across one or more columns by adding a root-level unique field to your YAML definition. This enables memory-efficient duplicate detection that scales to very large files.

  • Accepts a single column name (string) or a list of column names (array of strings).
  • Works with both plain and gzipped CSV files.
  • Uses a high-performance, disk-backed key-value store (redb) under the hood to keep memory usage low.

Examples:

# Single-column unique key
unique: ID
columns:
  - name: ID
    format: positive integer
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$

# Composite unique key across multiple columns
unique: [isin, date, score_provider]
columns:
  - name: isin
    regex: ^[A-Z0-9]{12}$
  - name: date
    regex: ^\d{4}-\d{2}-\d{2}$
  - name: score_provider
    values: [sp1, sp2, sp3]
  - name: score
    format: decimal

What you will see in the summary:

  • Always shows the number of duplicate key groups found.
  • If there are more than 100 duplicate groups, only the first 100 are displayed as a sample (with their occurrence counts), and the total number is indicated.
  • If no duplicates are found, it reports: "No duplicates found".

Sample output (composite key):

UNIQUE KEY: column(s): isin, date, score_provider

DUPLICATED KEYS (groups found: 134)
--------------------------------------------------
Showing first 100 of 134 duplicate key groups:
  - (isin='US0004026250', date='2025-10-01', score_provider='sp1') -> occurrences: 3
  - (isin='US5949181045', date='2025-10-01', score_provider='sp2') -> occurrences: 4
  ... up to 100 lines ...

If none:

UNIQUE KEYS CHECK
--------------------------------------------------
  - No duplicates found

Performance note: Duplicate detection stores only the unique keys on disk and keeps a tiny in-memory summary used for printing the report. This allows validating huge datasets with minimal RAM usage.
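
The idea behind disk-backed duplicate detection can be sketched with the standard library. The example below swaps in SQLite (through `sqlite3`) for redb purely for illustration, and `find_duplicate_keys` is a hypothetical helper. It is not how the library is implemented, only the same general technique: keep the per-key counts on disk and bring only the duplicate groups back into memory.

```python
import csv
import os
import sqlite3
import tempfile

SEP = "\x1f"  # unit separator, unlikely to appear inside CSV values


def find_duplicate_keys(csv_path, key_columns, limit=100):
    """Return (total_groups, sample) with up to `limit` duplicated keys."""
    with tempfile.TemporaryDirectory() as tmp:
        db = sqlite3.connect(os.path.join(tmp, "keys.db"))  # on-disk store
        try:
            db.execute("CREATE TABLE seen (key TEXT PRIMARY KEY, n INTEGER NOT NULL)")
            with open(csv_path, newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle):
                    key = SEP.join(row[c] for c in key_columns)
                    # Bump the counter if the key exists, otherwise insert it.
                    cur = db.execute("UPDATE seen SET n = n + 1 WHERE key = ?", (key,))
                    if cur.rowcount == 0:
                        db.execute("INSERT INTO seen (key, n) VALUES (?, 1)", (key,))
            total = db.execute("SELECT COUNT(*) FROM seen WHERE n > 1").fetchone()[0]
            rows = db.execute(
                "SELECT key, n FROM seen WHERE n > 1 ORDER BY key LIMIT ?", (limit,)
            ).fetchall()
        finally:
            db.close()
    return total, [(tuple(key.split(SEP)), n) for key, n in rows]


sample = "isin,date,score\nA,2025-10-01,1\nB,2025-10-01,2\nA,2025-10-01,3\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(sample)
    print(find_duplicate_keys(path, ["isin", "date"]))
    # (1, [(('A', '2025-10-01'), 2)])
```

The `limit` parameter plays the role of the 100-group sample shown in the summary, while `total` still reports every duplicate group.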

Column Separator

By default, the library uses comma (,) as the column separator. You can change this using the set_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_separator(";")  # Use semicolon as separator
# ...or, for tab-separated files:
validator.set_separator("\t")  # Use tab as separator

Decimal Separator

By default, the library uses point (.) as the decimal separator. You can change this using the set_decimal_separator method:

validator = CSVValidator.from_file("rules.yml")
validator.set_decimal_separator(",")  # Use comma
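
The two settings often matter together for files exported with European locale settings, where `;` separates columns and `,` marks the decimals. The standard-library sketch below only illustrates how such a file reads under those conventions. It does not use or reproduce the library, and `parse_decimal` is a hypothetical helper.

```python
import csv
import io


def parse_decimal(text, decimal_separator="."):
    """Convert text such as '52000,50' to a float given its decimal separator."""
    return float(text.replace(decimal_separator, "."))


raw = "Name;Salary\nAda;52000,50\nAlan;48000,00\n"
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
salaries = [parse_decimal(row["Salary"], decimal_separator=",") for row in reader]
print(salaries)  # [52000.5, 48000.0]
```

Reading the same text with the default comma separator would instead split each salary into two columns, which is why both settings need to match the file.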

Error Handling

The library provides detailed validation reports through Python's logging system. When a validation fails, you'll get information about:

  • Which columns failed validation
  • What type of validation failed
  • Sample of invalid values (including row numbers)
  • Number of rows that failed each validation
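
Because the report goes through `logging`, it only shows up if logging is configured. The sketch below sends log records to both the console and a file. It assumes the library's records propagate to the root logger at `INFO` level or above; the logger name and level the library actually uses are not stated here, so both are assumptions. The function name `configure_report_logging` and the file name `validation_report.log` are arbitrary.

```python
import logging


def configure_report_logging(log_path="validation_report.log"):
    """Attach console and file handlers to the root logger and return it."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    formatter = logging.Formatter("%(message)s")
    handlers = (
        logging.StreamHandler(),
        logging.FileHandler(log_path, encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        root.addHandler(handler)
    return root
```

Call `configure_report_logging()` once, before `validator.validate(...)`, so the summary is captured from the first run.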

Example of validation report

VALIDATIONS SUMMARY
==================================================
FILE: /test.csv
Rows: 232230 | Columns: 5

CORRECT COLUMNS: 3/5
--------------------------------------------------
  - County: [✔] OK
      ✔ - regex: '^.*$'
  - City: [✔] OK
      ✔ - regex: '^.*$'
  - Model Year: [✔] OK
      ✔ - regex: '^[0-9]+$'
      ✔ - min: 1999
      ✔ - max: 2025.78

WRONG COLUMNS: 2/5
--------------------------------------------------
  - State: [✖] FAIL
      ✖ - values: 'WA', 'OR', 'NY', 'DC', 'CA', 'TX', 'FL', 'OK', 'MO', 'KS', 'VA', 'MA', 'MO', 'NC', 'IL', 'AL', 'WY', 'CO', 'PA', 'WI', 'MD', 'NV', 'AZ'
          "Wrong Rows: 93 | Sample: [row 12: 'GA'], [row 15: 'NJ'], [row 22: 'CT'], [row 23: 'NJ'], [row 24: 'CT']"
  - Postal Code: [✖] FAIL
      ✔ - integer
      ✖ - non_empty
          "Wrong Rows: 4 | Sample: [row 45: ''], [row 123: ''], [row 456: ''], [row 789: '']"

VALIDATION RESULT
--------------------------------------------------
[✖] FAIL: File /test.csv DOESN'T match all validations

Development

Prerequisites

  • Rust 2021 edition or later
  • Python 3.6+ (for Python bindings)
  • Cargo and standard Rust tooling

Building from Source (Rust version)

# Clone the repository
git clone https://github.com/charro/csv_validation
cd csv_validation

# Build the project
cargo build --release

# Run tests
cargo test

Dependencies

  • csv: CSV parsing
  • flate2: Compression support
  • pyo3: Python bindings
  • regex: Regular expression support
  • yaml-rust2: YAML parsing
  • redb: Disk-backed key-value storage used for memory-efficient duplicate detection
  • Various utilities for logging and serialization

License

MIT License

Copyright (c) 2024 CSV Validation Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
