Csv Schema Inference
A tool to automatically infer column data types in .csv files
Check the article here: Building a Schema Inference Data Pipeline for Large CSV files
Installing csv-schema-inference 🔧
pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
Downloading csv_schema_inference-0.0.3-py3-none-any.whl (5.2 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.3
Importing the csv-schema-inference library ⚡
from csv_schema_inference import csv_schema_inference
Setting csv-schema-inference configuration ✍
csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.7, max_length=100, seed=2, header=True, sep=",")
pathfile = "/content/data.csv"
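The parameter names `portion=0.7` and `seed=2` suggest that inference runs on a reproducible random sample of roughly 70% of the rows rather than on the whole file. That reading is an assumption based on the names alone, and what `max_length` controls is not stated here, so the sketch below leaves it out. The helper `sample_lines` is hypothetical and is not part of the library; it only illustrates seeded sampling in general.

```python
import random


def sample_lines(lines, portion=0.7, seed=2):
    """Keep each line with probability `portion` (hypothetical helper).

    A dedicated Random instance makes the selection repeatable for a
    given seed without touching the global random state.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < portion]


rows = [f"row_{i}" for i in range(1000)]
sample_a = sample_lines(rows)
sample_b = sample_lines(rows)

# Same seed, same input: the two samples are identical.
print(len(sample_a), sample_a == sample_b)
```

Sampling trades a small loss of certainty for a much shorter scan, which is presumably why the result is described as an approximate schema.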
Run inference 🏃
approx_schema = csv_infer.run_inference(pathfile)
Showing the approximate data type inference for each column 🔍
csv_infer.pretty(approx_schema)
0
    name
        key_1
    type
        STRING
    nullable
        False
1
    name
        date_2
    type
        DATE
    nullable
        False
2
    name
        cont_3
    type
        FLOAT
    nullable
        False
3
    name
        cont_4
    type
        FLOAT
    nullable
        False
4
    name
        disc_5
    type
        INTEGER
    nullable
        False
5
    name
        disc_6
    type
        INTEGER
    nullable
        True
6
    name
        cat_7
    type
        STRING
    nullable
        False
7
    name
        cat_8
    type
        STRING
    nullable
        False
8
    name
        cont_9
    type
        FLOAT
    nullable
        False
9
    name
        cont_10
    type
        FLOAT
    nullable
        True
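The output above uses four type labels: STRING, DATE, FLOAT and INTEGER. As a rough illustration of how a single cell could be mapped to one of those labels, here is a minimal sketch. The function `guess_type`, the order of the checks and the ISO date format `%Y-%m-%d` are all assumptions made for this example; the library's own rules may differ.

```python
from datetime import datetime


def guess_type(value: str) -> str:
    """Classify one CSV cell as INTEGER, FLOAT, DATE or STRING (hypothetical)."""
    text = value.strip()
    # Try the narrowest type first so "14" is INTEGER rather than FLOAT.
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        pass
    # Assumed date layout: ISO year-month-day only.
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return "DATE"
    except ValueError:
        return "STRING"


for cell in ["14", "3.5", "2022-01-31", "na", "key_1"]:
    print(cell, guess_type(cell))
```

Anything that fails every check falls back to STRING, so a placeholder such as `na` ends up as STRING in this sketch.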
Checking schema values for specific columns ✔
result = csv_infer.get_schema_columns(columns = {"disc_6"})
csv_infer.pretty(result)
5
    _name
        disc_6
    values
        na
            cnt
                70755
            _type
                STRING
        14
            cnt
                34732
            _type
                INTEGER
        17
            cnt
                35237
            _type
                INTEGER
        12
            cnt
                35408
            _type
                INTEGER
        10
            cnt
                35174
            _type
                INTEGER
        4
            cnt
                34924
            _type
                INTEGER
        8
            cnt
                34861
            _type
                INTEGER
        7
            cnt
                35270
            _type
                INTEGER
        13
            cnt
                35274
            _type
                INTEGER
        5
            cnt
                35024
            _type
                INTEGER
        0
            cnt
                35325
            _type
                INTEGER
        2
            cnt
                35265
            _type
                INTEGER
        16
            cnt
                35250
            _type
                INTEGER
        6
            cnt
                34961
            _type
                INTEGER
        15
            cnt
                35132
            _type
                INTEGER
        11
            cnt
                35250
            _type
                INTEGER
        3
            cnt
                35063
            _type
                INTEGER
        1
            cnt
                35237
            _type
                INTEGER
        9
            cnt
                35078
            _type
                INTEGER
    nullable
        True
    approximate_type
        INTEGER
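In the output above, each distinct value carries a count (`cnt`) and a type (`_type`), and the column as a whole is reported as INTEGER and nullable even though `na` is typed as STRING. One plausible way to arrive at that summary is a majority vote over the counts, with the presence of `na` marking the column as nullable. Both rules are assumptions for illustration, and `summarize` is a hypothetical helper, not a library function. The sample data uses the first four entries from the output.

```python
from collections import Counter


def summarize(values: dict) -> dict:
    """Aggregate per-value counts by type and pick the most frequent type."""
    totals = Counter()
    for entry in values.values():
        totals[entry["_type"]] += entry["cnt"]
    approximate_type, _ = totals.most_common(1)[0]
    return {
        "totals": dict(totals),
        "approximate_type": approximate_type,
        # Assumption: a literal "na" value is what makes a column nullable.
        "nullable": "na" in values,
    }


values = {
    "na": {"cnt": 70755, "_type": "STRING"},
    "14": {"cnt": 34732, "_type": "INTEGER"},
    "17": {"cnt": 35237, "_type": "INTEGER"},
    "12": {"cnt": 35408, "_type": "INTEGER"},
}
summary = summarize(values)
print(summary)
```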
Exploring all possible data types for a specific column ✅
result = csv_infer.explore_schema_column(column = "disc_6")
csv_infer.pretty(result)
5
    name
        disc_6
    types
        STRING
            10.061573902903785
        INTEGER
            89.93842609709621
    nullable
        True
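The percentages reported for `disc_6` are consistent with each type's share of the counted values shown earlier for the same column: 70755 `na` values typed as STRING, and 632465 values typed as INTEGER (the sum of the eighteen integer counts). The short check below reproduces that arithmetic. That the library computes the figures this way is an inference from the numbers, not something the source states.

```python
# Per-type totals derived from the value counts listed for disc_6.
type_counts = {"STRING": 70755, "INTEGER": 632465}

total = sum(type_counts.values())
shares = {name: 100 * count / total for name, count in type_counts.items()}

# Roughly 10.06 for STRING and 89.94 for INTEGER.
for name, share in shares.items():
    print(name, round(share, 2))
```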
Contributing and Feedback
Any ideas or feedback about this repository? Help me improve it.
Authors
- Created by Ramses Alexander Coraspe Valdez
- Created in 2022
License
This project is licensed under the terms of the MIT License.