A tool to automatically infer column data types in .csv files
Csv Schema Inference
Check the article here: Building a Schema Inference Data Pipeline for Large CSV files
Installing csv-schema-inference 🔧
pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9
Importing csv-schema-inference library ⚡
from csv_schema_inference import csv_schema_inference
Setting csv-schema-inference configuration ✍
# If the inferred data type is INTEGER but FLOAT values were also found in the column, the result will be FLOAT.
conditions = {"INTEGER": "FLOAT"}
csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
pathfile = "/content/file__500k.csv"
Run inference 🏃
aprox_schema = csv_infer.run_inference(pathfile)
Showing the approximate data type inference for each column 🔍
csv_infer.pretty(aprox_schema)
0
    name
        id
    type
        INTEGER
    nullable
        False
1
    name
        full_name
    type
        STRING
    nullable
        True
2
    name
        age
    type
        INTEGER
    nullable
        False
3
    name
        city
    type
        STRING
    nullable
        True
4
    name
        weight
    type
        FLOAT
    nullable
        False
5
    name
        height
    type
        FLOAT
    nullable
        False
6
    name
        isActive
    type
        BOOLEAN
    nullable
        False
7
    name
        col_int1
    type
        INTEGER
    nullable
        False
8
    name
        col_int2
    type
        INTEGER
    nullable
        False
9
    name
        col_int3
    type
        INTEGER
    nullable
        False
10
    name
        col_float1
    type
        FLOAT
    nullable
        False
11
    name
        col_float2
    type
        FLOAT
    nullable
        False
12
    name
        col_float3
    type
        FLOAT
    nullable
        False
13
    name
        col_float4
    type
        FLOAT
    nullable
        False
14
    name
        col_float5
    type
        FLOAT
    nullable
        False
15
    name
        col_float6
    type
        FLOAT
    nullable
        False
16
    name
        col_float7
    type
        FLOAT
    nullable
        False
17
    name
        col_float8
    type
        FLOAT
    nullable
        False
18
    name
        col_float9
    type
        FLOAT
    nullable
        False
19
    name
        col_float10
    type
        FLOAT
    nullable
        False
20
    name
        test_column
    type
        FLOAT
    nullable
        False
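Each column in the output is labelled INTEGER, FLOAT, BOOLEAN, or STRING. This description does not show how the library classifies individual values, so the following is only a minimal sketch of one plausible approach. The helper name guess_type and its rules (for example, treating only "true" and "false" as booleans, and not handling empty cells specially) are assumptions for illustration, not the library's API.

```python
def guess_type(value: str) -> str:
    """Classify one raw CSV cell (hypothetical helper, not part of the library)."""
    text = value.strip()
    # Assumption: only the literals "true"/"false" (any case) count as booleans.
    if text.lower() in {"true", "false"}:
        return "BOOLEAN"
    # Try the narrowest numeric type first, then widen.
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        # Anything that is neither boolean nor numeric falls back to STRING.
        return "STRING"


samples = ["42", "3.14", "True", "Monterrey"]
print([guess_type(s) for s in samples])
```

Trying int before float matters here: every integer literal also parses as a float, so checking float first would never report INTEGER.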
Checking schema values for specific columns ✔
result = csv_infer.get_schema_columns(columns={"test_column"})
csv_infer.pretty(result)
20
    _name
        test_column
    types_found
        INTEGER
            cnt
                406130
        FLOAT
            cnt
                50964
    nullable
        False
    type
        FLOAT
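The counts show why test_column ends up as FLOAT even though most of its values are INTEGER: the conditions setting {"INTEGER": "FLOAT"} promotes the result when FLOAT values are also present. A minimal sketch of that rule follows. The helper resolve_type is hypothetical, and it ignores the acc threshold and any other logic the library may apply.

```python
from typing import Dict


def resolve_type(counts: Dict[str, int], conditions: Dict[str, str]) -> str:
    """Pick the most frequent type, then promote it if a condition applies.

    Hypothetical helper illustrating the promotion rule; not the library's code.
    """
    dominant = max(counts, key=counts.get)
    promoted = conditions.get(dominant)
    # Promote only when the target type was actually observed in the column.
    if promoted is not None and counts.get(promoted, 0) > 0:
        return promoted
    return dominant


# Counts taken from the get_schema_columns output for test_column.
counts = {"INTEGER": 406130, "FLOAT": 50964}
print(resolve_type(counts, {"INTEGER": "FLOAT"}))  # FLOAT
print(resolve_type(counts, {}))                    # INTEGER
```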
Explore all possible data types for a specific column ✅
result = csv_infer.explore_schema_column(column="test_column")
csv_infer.pretty(result)
20
    name
        test_column
    types_found
        INTEGER
            88.85043339006856
        FLOAT
            11.149566609931437
    nullable
        False
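The percentages above are consistent with the per-type counts reported by get_schema_columns for the same column (406130 INTEGER and 50964 FLOAT values). As a quick check of that arithmetic, and assuming each percentage is simply a type's share of all counted values:

```python
# Counts reported by get_schema_columns for test_column.
counts = {"INTEGER": 406130, "FLOAT": 50964}

total = sum(counts.values())
# Share of each type as a percentage of all counted values.
percentages = {name: 100 * cnt / total for name, cnt in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
# INTEGER: 88.85%
# FLOAT: 11.15%
```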
Contributing and Feedback
Any ideas or feedback about this repository? Help me improve it.
Authors
- Created by Ramses Alexander Coraspe Valdez
- Created in 2022
License
This project is licensed under the terms of the MIT License.