Csv Schema Inference
A tool to automatically infer column data types in .csv files
Check the article here: Building a Schema Inference Data Pipeline for Large CSV files
Installing csv-schema-inference 🔧
pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9
Importing csv-schema-inference library ⚡
from csv_schema_inference import csv_schema_inference
Setting csv-schema-inference configuration ✍
# If the inferred data type is INTEGER but FLOAT values are also present in the results, the final type will be FLOAT
conditions = {"INTEGER": "FLOAT"}
csv_infer = csv_schema_inference.CsvSchemaInference(
    portion=0.9,
    max_length=100,
    batch_size=200000,
    acc=0.8,
    seed=2,
    header=True,
    sep=",",
    conditions=conditions,
)
pathfile = "/content/file__500k.csv"
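The `conditions` rule above can be read as a type promotion: when the winning type is INTEGER but FLOAT values were also observed, the column is reported as FLOAT. The sketch below illustrates that idea only. It is not the library's internal code, and `apply_conditions` and the sample counts are hypothetical.

```python
def apply_conditions(inferred_type, types_found, conditions):
    """Promote inferred_type when the mapped target type was also observed.

    Hypothetical helper for illustration; not part of csv-schema-inference.
    """
    target = conditions.get(inferred_type)
    if target is not None and types_found.get(target, 0) > 0:
        return target
    return inferred_type


conditions = {"INTEGER": "FLOAT"}

# Mostly integers, but some floats were seen: promoted to FLOAT.
print(apply_conditions("INTEGER", {"INTEGER": 900, "FLOAT": 100}, conditions))
# Only integers were seen: stays INTEGER.
print(apply_conditions("INTEGER", {"INTEGER": 1000}, conditions))
```

A promotion rule like this is useful when a column is mostly whole numbers but a minority of rows carry decimals, since loading such a column as INTEGER would fail or lose precision.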
Run inference 🏃
aprox_schema = csv_infer.run_inference(pathfile)
Showing the approximate data type inference for each column 🔍
csv_infer.pretty(aprox_schema)
0
    name
        id
    type
        INTEGER
    nullable
        False
1
    name
        full_name
    type
        STRING
    nullable
        True
2
    name
        age
    type
        INTEGER
    nullable
        False
3
    name
        city
    type
        STRING
    nullable
        True
4
    name
        weight
    type
        FLOAT
    nullable
        False
5
    name
        height
    type
        FLOAT
    nullable
        False
6
    name
        isActive
    type
        BOOLEAN
    nullable
        False
7
    name
        col_int1
    type
        INTEGER
    nullable
        False
8
    name
        col_int2
    type
        INTEGER
    nullable
        False
9
    name
        col_int3
    type
        INTEGER
    nullable
        False
10
    name
        col_float1
    type
        FLOAT
    nullable
        False
11
    name
        col_float2
    type
        FLOAT
    nullable
        False
12
    name
        col_float3
    type
        FLOAT
    nullable
        False
13
    name
        col_float4
    type
        FLOAT
    nullable
        False
14
    name
        col_float5
    type
        FLOAT
    nullable
        False
15
    name
        col_float6
    type
        FLOAT
    nullable
        False
16
    name
        col_float7
    type
        FLOAT
    nullable
        False
17
    name
        col_float8
    type
        FLOAT
    nullable
        False
18
    name
        col_float9
    type
        FLOAT
    nullable
        False
19
    name
        col_float10
    type
        FLOAT
    nullable
        False
20
    name
        test_column
    type
        FLOAT
    nullable
        False
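One way to put the inferred schema to use is to turn it into a column-to-dtype mapping for loading the file. The sketch below assumes the schema is a dict of entries with `name`, `type` and `nullable` keys, as the printed output above suggests, and that `nullable` is a boolean. `to_pandas_dtypes` and `PANDAS_DTYPES` are hypothetical names, and the mapping to pandas dtype strings is an assumption that covers only the four types shown above.

```python
# Assumed mapping from the inferred type names shown above to pandas dtype strings.
# Each entry is (dtype when nullable is False, dtype when nullable is True).
PANDAS_DTYPES = {
    "INTEGER": ("int64", "Int64"),
    "FLOAT": ("float64", "Float64"),
    "BOOLEAN": ("bool", "boolean"),
    "STRING": ("string", "string"),
}


def to_pandas_dtypes(schema):
    """Build a {column name: dtype string} dict from an inferred schema.

    Hypothetical helper; assumes each entry has "name", "type" and "nullable" keys.
    """
    dtypes = {}
    for column in schema.values():
        # Unknown type names fall back to the generic "object" dtype.
        strict, nullable = PANDAS_DTYPES.get(column["type"], ("object", "object"))
        dtypes[column["name"]] = nullable if column["nullable"] else strict
    return dtypes


# A few entries copied from the output above.
sample_schema = {
    0: {"name": "id", "type": "INTEGER", "nullable": False},
    1: {"name": "full_name", "type": "STRING", "nullable": True},
    4: {"name": "weight", "type": "FLOAT", "nullable": False},
    6: {"name": "isActive", "type": "BOOLEAN", "nullable": False},
}
print(to_pandas_dtypes(sample_schema))
```

The resulting dict could then be passed as the `dtype` argument of `pandas.read_csv`, so the file is loaded with the inferred types instead of pandas' own guesses.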
Checking schema values for specific columns ✔
result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)
20
    _name
        test_column
    types_found
        INTEGER
            cnt
                406130
        FLOAT
            cnt
                50964
    nullable
        False
    type
        FLOAT
Explore all possible data types for a specific column ✅
result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)
20
    name
        test_column
    types_found
        INTEGER
            88.85043339006856
        FLOAT
            11.149566609931437
    nullable
        False
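The percentages reported here are consistent with the per-type counts returned by `get_schema_columns` for the same column: each count divided by the total number of typed values. The short check below reproduces that arithmetic. `type_percentages` is a hypothetical helper written for this illustration, not a library function.

```python
def type_percentages(counts):
    """Convert per-type counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {type_name: 0.0 for type_name in counts}
    return {type_name: 100 * count / total for type_name, count in counts.items()}


# Counts reported for test_column by get_schema_columns.
counts = {"INTEGER": 406130, "FLOAT": 50964}

for type_name, pct in type_percentages(counts).items():
    print(f"{type_name}: {pct:.2f}%")
# INTEGER: 88.85%
# FLOAT: 11.15%
```

Although INTEGER accounts for about 88.85% of the values, the column is still reported as FLOAT because of the `{"INTEGER": "FLOAT"}` condition set in the configuration.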
Contributing and Feedback
Any ideas or feedback about this repository? Help me improve it.
Authors
- Created by Ramses Alexander Coraspe Valdez
- Created in 2022
License
This project is licensed under the terms of the MIT License.
File details
Details for the file csv-schema-inference-0.0.9.tar.gz.
File metadata
- Download URL: csv-schema-inference-0.0.9.tar.gz
- Upload date:
- Size: 41.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | c8752f8872c0358ae6ca8896f23203504f1313f81fabf22388fd702d74cb004e
MD5 | 32eb3cea1a63e1a280815c6de882e6eb
BLAKE2b-256 | b38a1fae9682d1c360ecb23685ef6438ce3ca13ec74f1ee252cb8c2c9805d5d3
File details
Details for the file csv_schema_inference-0.0.9-py3-none-any.whl.
File metadata
- Download URL: csv_schema_inference-0.0.9-py3-none-any.whl
- Upload date:
- Size: 7.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5e2297009cf06f34879d61acfcc0ad06ab7ffb41505b4f7733a6658f5ca51860
MD5 | 08a201635838c69d848efe32c60a2dfd
BLAKE2b-256 | 4d5e5339730daab7fa3d47da8c70e1d6f0ddc893217d354d0564fd3dead1ad3a