A tool to automatically infer column data types in .csv files
Csv Schema Inference
Check the article here: Building a Schema Inference Data Pipeline for Large CSV files
Installing csv-schema-inference 🔧
pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
Downloading csv_schema_inference-0.0.3-py3-none-any.whl (5.2 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.3
Importing csv-schema-inference library ⚡
from csv_schema_inference import csv_schema_inference
Setting csv-schema-inference configuration ✍
csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.7, max_length=100, seed=2, header=True, sep=",")
pathfile = "/content/data.csv"
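To try the tool without a dataset at hand, a small mixed-type CSV can be generated first. The following is only a sketch: the helper `write_sample_csv`, the row count, and the choice to write missing values as the literal `na` are assumptions made for illustration (the column names are borrowed from the output shown later on this page). The result is a comma-separated file with a header row, matching `header=True` and `sep=","` above.

```python
import csv
import random
import tempfile
from datetime import date, timedelta
from pathlib import Path


def write_sample_csv(path, n_rows=100, seed=0):
    """Write a small CSV with mixed column types (hypothetical test data)."""
    rng = random.Random(seed)
    start = date(2022, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=",")
        # Header row, since the configuration above uses header=True.
        writer.writerow(["key_1", "date_2", "cont_3", "disc_5", "disc_6", "cat_7"])
        for i in range(n_rows):
            writer.writerow([
                f"k{i:05d}",                              # string key
                (start + timedelta(days=i)).isoformat(),  # ISO date
                round(rng.uniform(0, 100), 3),            # float
                rng.randint(0, 17),                       # integer
                # Assumption: roughly 10% missing values, written as "na".
                "na" if rng.random() < 0.1 else rng.randint(0, 17),
                rng.choice(["a", "b", "c"]),              # category
            ])
    return path


sample_path = write_sample_csv(Path(tempfile.gettempdir()) / "sample_data.csv")
print(sample_path)
```

The returned path can then be used in place of `pathfile`.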
Run inference 🏃
aprox_schema = csv_infer.run_inference(pathfile)
Showing the approximate data type inference for each column 🔍
csv_infer.pretty(aprox_schema)
0
    name
        key_1
    type
        STRING
    nullable
        False
1
    name
        date_2
    type
        DATE
    nullable
        False
2
    name
        cont_3
    type
        FLOAT
    nullable
        False
3
    name
        cont_4
    type
        FLOAT
    nullable
        False
4
    name
        disc_5
    type
        INTEGER
    nullable
        False
5
    name
        disc_6
    type
        INTEGER
    nullable
        True
6
    name
        cat_7
    type
        STRING
    nullable
        False
7
    name
        cat_8
    type
        STRING
    nullable
        False
8
    name
        cont_9
    type
        FLOAT
    nullable
        False
9
    name
        cont_10
    type
        FLOAT
    nullable
        True
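One possible use of the inferred schema is to turn it into a table definition. The sketch below assumes the returned object is a dictionary shaped like the printed output above (a column index mapping to `name`, `type`, and `nullable`, with `nullable` as a boolean); that shape is inferred from the output, not from the library's documentation. The helper `schema_to_ddl` and the `SQL_TYPES` mapping are hypothetical and not part of csv-schema-inference.

```python
# Shape copied from the printed output above (a subset of the columns).
schema = {
    0: {"name": "key_1", "type": "STRING", "nullable": False},
    1: {"name": "date_2", "type": "DATE", "nullable": False},
    2: {"name": "cont_3", "type": "FLOAT", "nullable": False},
    4: {"name": "disc_5", "type": "INTEGER", "nullable": False},
    5: {"name": "disc_6", "type": "INTEGER", "nullable": True},
}

# Hypothetical mapping from the inferred type names to generic SQL types.
SQL_TYPES = {"STRING": "TEXT", "DATE": "DATE", "FLOAT": "REAL", "INTEGER": "INTEGER"}


def schema_to_ddl(schema, table_name):
    """Build a CREATE TABLE statement from an inferred schema dictionary."""
    cols = []
    for idx in sorted(schema):
        col = schema[idx]
        sql_type = SQL_TYPES.get(col["type"], "TEXT")  # fall back to TEXT
        null = "" if col["nullable"] else " NOT NULL"
        cols.append(f'    "{col["name"]}" {sql_type}{null}')
    return f'CREATE TABLE "{table_name}" (\n' + ",\n".join(cols) + "\n);"


ddl = schema_to_ddl(schema, "data")
print(ddl)
```

Because the inference is approximate, it is probably worth reviewing the generated types before loading data with them.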
Checking schema values for specific columns ✔
result = csv_infer.get_schema_columns(columns = {"disc_6"})
csv_infer.pretty(result)
5
    _name
        disc_6
    values
        na
            cnt
                70755
            _type
                STRING
        14
            cnt
                34732
            _type
                INTEGER
        17
            cnt
                35237
            _type
                INTEGER
        12
            cnt
                35408
            _type
                INTEGER
        10
            cnt
                35174
            _type
                INTEGER
        4
            cnt
                34924
            _type
                INTEGER
        8
            cnt
                34861
            _type
                INTEGER
        7
            cnt
                35270
            _type
                INTEGER
        13
            cnt
                35274
            _type
                INTEGER
        5
            cnt
                35024
            _type
                INTEGER
        0
            cnt
                35325
            _type
                INTEGER
        2
            cnt
                35265
            _type
                INTEGER
        16
            cnt
                35250
            _type
                INTEGER
        6
            cnt
                34961
            _type
                INTEGER
        15
            cnt
                35132
            _type
                INTEGER
        11
            cnt
                35250
            _type
                INTEGER
        3
            cnt
                35063
            _type
                INTEGER
        1
            cnt
                35237
            _type
                INTEGER
        9
            cnt
                35078
            _type
                INTEGER
    nullable
        True
    approximate_type
        INTEGER
Explore all possible data types for a specific column ✅
result = csv_infer.explore_schema_column(column = "disc_6")
csv_infer.pretty(result)
5
    name
        disc_6
    types
        STRING
            10.061573902903785
        INTEGER
            89.93842609709621
    nullable
        True
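The percentages above appear to be the share of sampled values that fall under each detected type. The sketch below recomputes them from the `cnt` and `_type` entries printed by `get_schema_columns` for `disc_6`. This is a reading of the numbers, not a confirmed description of the library's internals; the helper `type_percentages` is hypothetical.

```python
from collections import defaultdict

# (count, detected type) per distinct value, copied from the disc_6 output above.
value_counts = {
    "na": (70755, "STRING"),
    "14": (34732, "INTEGER"), "17": (35237, "INTEGER"), "12": (35408, "INTEGER"),
    "10": (35174, "INTEGER"), "4": (34924, "INTEGER"), "8": (34861, "INTEGER"),
    "7": (35270, "INTEGER"), "13": (35274, "INTEGER"), "5": (35024, "INTEGER"),
    "0": (35325, "INTEGER"), "2": (35265, "INTEGER"), "16": (35250, "INTEGER"),
    "6": (34961, "INTEGER"), "15": (35132, "INTEGER"), "11": (35250, "INTEGER"),
    "3": (35063, "INTEGER"), "1": (35237, "INTEGER"), "9": (35078, "INTEGER"),
}


def type_percentages(value_counts):
    """Return the percentage of observed values attributed to each type."""
    totals = defaultdict(int)
    for cnt, detected_type in value_counts.values():
        totals[detected_type] += cnt
    grand_total = sum(totals.values())
    return {t: 100.0 * n / grand_total for t, n in totals.items()}


percentages = type_percentages(value_counts)
for detected_type, pct in percentages.items():
    print(detected_type, pct)
```

Under this reading, the `na` entries account for the roughly 10% STRING share, and the remaining values, all integers, make INTEGER the approximate type for the column.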
Contributing and Feedback
Any ideas or feedback about this repository? Help me improve it.
Authors
- Created by Ramses Alexander Coraspe Valdez
- Created in 2022
License
This project is licensed under the terms of the MIT License.