philiprehberger-csv-kit
Enhanced CSV reader and writer with automatic type inference.
Installation
pip install philiprehberger-csv-kit
Usage
from philiprehberger_csv_kit import read_csv
rows = read_csv("data.csv")
# [{"name": "Alice", "age": 30, "score": 9.5}, ...]
Values are automatically cast to int, float, bool, or None. Disable type inference with typed=False:
rows = read_csv("data.csv", typed=False)
# [{"name": "Alice", "age": "30", "score": "9.5"}, ...]
Writing CSV
from philiprehberger_csv_kit import write_csv
rows = [
    {"name": "Alice", "age": 30, "score": 9.5},
    {"name": "Bob", "age": 25, "score": 8.0},
]
write_csv("output.csv", rows)
write_csv("output.csv", rows, columns=["name", "age"]) # select columns
Streaming large files
from philiprehberger_csv_kit import stream_csv, stream_csv_rows
# Chunked streaming (lists of rows)
for chunk in stream_csv("large.csv", chunk_size=500):
    for row in chunk:
        process(row)
# Row-by-row streaming (minimal memory usage)
for row in stream_csv_rows("large.csv"):
    process(row)
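The streaming readers also compose with the other helpers, for example to filter a large file down to the rows of interest. A minimal sketch (the score column is an assumption; note that only the matching rows accumulate in memory):

from philiprehberger_csv_kit import stream_csv_rows, write_csv

# Stream row by row; only rows passing the filter are kept in memory.
matches = [row for row in stream_csv_rows("large.csv") if row["score"] > 9.0]
write_csv("matches.csv", matches)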
Column type override
from philiprehberger_csv_kit import read_csv, infer_types
# Force specific columns to a type instead of auto-inferring
rows = read_csv("data.csv", overrides={"id": str, "score": int})
# Also available on infer_types directly
raw = [{"id": "42", "score": "9.5"}]
typed = infer_types(raw, overrides={"id": str, "score": int})
# [{"id": "42", "score": 9}]
Quick inspection
from philiprehberger_csv_kit import head, sample
# First 5 rows (without loading the entire file)
rows = head("data.csv", n=5)
# Random sample of 10 rows (reproducible with seed)
rows = sample("data.csv", n=10, seed=42)
Export helpers
from philiprehberger_csv_kit import read_csv, to_json, to_dict_list
rows = read_csv("data.csv")
# Serialize to JSON string
json_str = to_json(rows, indent=2)
# Extract specific columns as a list of dicts
subset = to_dict_list(rows, columns=["name", "age"])
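The two helpers compose; a quick sketch serializing just a column subset:

# Serialize only selected columns straight to JSON
json_str = to_json(to_dict_list(rows, columns=["name", "age"]), indent=2)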
Duplicate detection
from philiprehberger_csv_kit import read_csv, find_duplicates, deduplicate
rows = read_csv("data.csv")
# Find duplicate rows
dupes = find_duplicates(rows)
dupes_by_name = find_duplicates(rows, columns=["name"])
# Remove duplicates (keeps first occurrence)
unique = deduplicate(rows)
unique_by_name = deduplicate(rows, columns=["name"])
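To make the two return values concrete, a tiny inline sketch (expected output follows the API notes below: find_duplicates returns the second and subsequent occurrences, deduplicate keeps the first):

people = [{"name": "Alice"}, {"name": "Alice"}, {"name": "Bob"}]
find_duplicates(people)  # [{"name": "Alice"}]  -- the second occurrence only
deduplicate(people)      # [{"name": "Alice"}, {"name": "Bob"}]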
Column statistics
from philiprehberger_csv_kit import column_stats
stats = column_stats("data.csv")
# {"age": {"min": 25, "max": 30, "unique": 2, "nulls": 0, "count": 2}, ...}
# Analyse specific columns only
stats = column_stats("data.csv", columns=["age", "score"])
Dialect detection
from philiprehberger_csv_kit import detect_dialect
# Detect from a file
result = detect_dialect("data.tsv")
print(result.delimiter) # "\t"
print(result.quotechar) # '"'
# Detect from a raw text sample
result = detect_dialect("name;age;score\nAlice;30;9.5\n")
print(result.delimiter) # ";"
Column data quality
from philiprehberger_csv_kit import read_csv, column_quality
rows = read_csv("data.csv")
quality = column_quality(rows, "email")
print(quality.completeness) # 87.5 (percentage of non-null values)
print(quality.cardinality_ratio) # 0.95 (unique values / total rows)
print(quality.null_count) # 2
Transformation pipeline
from philiprehberger_csv_kit import read_csv, CsvPipeline
rows = read_csv("employees.csv")
result = (
    CsvPipeline(rows)
    .filter(lambda r: r["age"] > 18)
    .map_column("name", str.upper)
    .deduplicate(columns=["name"])
    .sort_by("age")
    .to_list()
)
# Export pipeline results as JSON
json_str = CsvPipeline(rows).filter(lambda r: r["active"] is True).to_json()
# Random sample from pipeline
sampled = CsvPipeline(rows).sample(10, seed=42).to_list()
# Group by department
groups = (
    CsvPipeline(rows)
    .filter(lambda r: r["active"] is True)
    .group_by("department")
)
# {"Engineering": [...], "Sales": [...]}
Type inference
from philiprehberger_csv_kit import infer_types
raw = [{"val": "42"}, {"val": "3.14"}, {"val": "true"}, {"val": ""}]
typed = infer_types(raw)
# [{"val": 42}, {"val": 3.14}, {"val": True}, {"val": None}]
API
| Function / Class | Description |
|---|---|
| `read_csv(path, typed=True, encoding="utf-8", overrides=None)` | Read CSV file, return list of dicts. Infers types when `typed=True`. Optional type overrides per column. |
| `write_csv(path, rows, columns=None, encoding="utf-8")` | Write list of dicts to CSV. Optional column filter. |
| `stream_csv(path, chunk_size=1000, encoding="utf-8")` | Generator yielding chunks of row dicts for memory-efficient reading. |
| `stream_csv_rows(path, typed=True, encoding="utf-8")` | Generator yielding individual row dicts for true row-by-row streaming. |
| `infer_types(rows, overrides=None)` | Cast string values to int, float, bool, or None. Optional per-column type overrides. |
| `head(path, n=5, typed=True, encoding="utf-8")` | Return the first n rows from a CSV file without loading the entire file. |
| `sample(path, n=5, typed=True, encoding="utf-8", seed=None)` | Return a random sample of n rows from a CSV file. |
| `to_json(rows, indent=2, ensure_ascii=False)` | Serialize a list of row dicts to a JSON string. |
| `to_dict_list(rows, columns=None)` | Return a filtered copy of rows as a list of plain dicts. |
| `find_duplicates(rows, columns=None)` | Find duplicate rows. Returns second and subsequent occurrences. |
| `deduplicate(rows, columns=None)` | Remove duplicate rows, keeping the first occurrence. |
| `column_stats(path, columns=None)` | Compute per-column stats: min, max, unique, nulls, count. |
| `detect_dialect(filepath_or_sample)` | Detect CSV delimiter, quotechar, and formatting from a file or text sample. Returns `DialectResult`. |
| `column_quality(rows, column)` | Score column data quality: completeness %, cardinality ratio, null count. Returns `QualityResult`. |
| `CsvPipeline(rows)` | Chainable pipeline with `.filter()`, `.exclude()`, `.map_column()`, `.add_column()`, `.rename_column()`, `.select_columns()`, `.sort_by()`, `.group_by()`, `.head()`, `.tail()`, `.sample()`, `.deduplicate()`, `.to_list()`, `.to_json()`, `.to_dict_list()`, `.count()`, `.first()`. |
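A few of the pipeline methods listed above (`.exclude()`, `.rename_column()`, `.add_column()`, `.select_columns()`, `.head()`, `.first()`, `.count()`) are not demonstrated elsewhere in this README. The sketch below shows one plausible way they fit together; the argument shapes are assumptions inferred from the method names, not documented signatures:

from philiprehberger_csv_kit import read_csv, CsvPipeline

rows = read_csv("employees.csv")
result = (
    CsvPipeline(rows)
    .exclude(lambda r: r["age"] is None)             # assumed: drop rows matching a predicate
    .rename_column("name", "full_name")              # assumed (old, new) argument order
    .add_column("senior", lambda r: r["age"] >= 40)  # assumed (name, per-row function)
    .select_columns(["full_name", "senior"])         # keep only these columns
    .head(10)                                        # first 10 rows
    .to_list()
)
total = CsvPipeline(rows).count()                    # assumed: number of rows in the pipeline
first_row = CsvPipeline(rows).sort_by("age").first() # assumed: single row dict (or None if empty)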
Development
pip install -e .
python -m pytest tests/ -v
File details
Details for the file philiprehberger_csv_kit-0.4.0.tar.gz.
File metadata
- Download URL: philiprehberger_csv_kit-0.4.0.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `476b528eff8cb559529b55b43a5f29b351b404a6658e1152f686625427b82664` |
| MD5 | `ef20186ef92bffc5d2714b390da9bf70` |
| BLAKE2b-256 | `e1987c55c817b06c0b05f006299077d5d4869c7439cfa228d16fd591ac5599a4` |
File details
Details for the file philiprehberger_csv_kit-0.4.0-py3-none-any.whl.
File metadata
- Download URL: philiprehberger_csv_kit-0.4.0-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0438d2fa0cd655cf8976070f2e2e2dccaff8193fd985d0e3bb813641f916b969` |
| MD5 | `eb021627c8db36945e55f979e923eaaf` |
| BLAKE2b-256 | `3ae733296402ea272be2d49b746626314b2722240684ade5abd9ff9ffc583dd4` |