A tool to automatically infer column data types in .csv files

Project description

Csv Schema Inference

Check the article here: Building a Schema Inference Data Pipeline for Large CSV files

Installing csv-schema-inference 🔧

pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
  Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9

Importing csv-schema-inference library

from csv_schema_inference import csv_schema_inference

Setting csv-schema-inference configuration

# If the inferred data type is INTEGER and FLOAT is also present in the results, the result will be FLOAT.
conditions = {"INTEGER": "FLOAT"}

csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
pathfile = "/content/file__500k.csv"
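
To make the conditions rule concrete, here is a minimal sketch of how such a promotion could behave. It is not the library's implementation: the helper name resolve_type, the "start from the most frequent type" step, and the sample counts are all assumptions made for illustration.

```python
def resolve_type(type_counts, conditions):
    """Pick a column type from per-type counts, then apply promotion rules.

    Hypothetical helper, not part of csv-schema-inference.
    """
    # Assumption: start from the most frequently observed type.
    inferred = max(type_counts, key=type_counts.get)
    # Promote only when a rule exists for the inferred type and the
    # target type was also observed in the column.
    target = conditions.get(inferred)
    if target is not None and type_counts.get(target, 0) > 0:
        return target
    return inferred


conditions = {"INTEGER": "FLOAT"}

# Made-up counts: mostly integers with a few floats, so the rule applies.
mixed = resolve_type({"INTEGER": 90, "FLOAT": 10}, conditions)
# Made-up counts: integers only, so there is nothing to promote to.
pure = resolve_type({"INTEGER": 100}, conditions)
print(mixed, pure)
```

Under these assumptions a column that is mostly INTEGER but contains some FLOAT values resolves to FLOAT, while a column with only INTEGER values stays INTEGER.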

Run inference 🏃

aprox_schema = csv_infer.run_inference(pathfile)

Showing the approximate data type inference for each column 🔍

csv_infer.pretty(aprox_schema)
0
	name
		id
	type
		INTEGER
	nullable
		False
1
	name
		full_name
	type
		STRING
	nullable
		True
2
	name
		age
	type
		INTEGER
	nullable
		False
3
	name
		city
	type
		STRING
	nullable
		True
4
	name
		weight
	type
		FLOAT
	nullable
		False
5
	name
		height
	type
		FLOAT
	nullable
		False
6
	name
		isActive
	type
		BOOLEAN
	nullable
		False
7
	name
		col_int1
	type
		INTEGER
	nullable
		False
8
	name
		col_int2
	type
		INTEGER
	nullable
		False
9
	name
		col_int3
	type
		INTEGER
	nullable
		False
10
	name
		col_float1
	type
		FLOAT
	nullable
		False
11
	name
		col_float2
	type
		FLOAT
	nullable
		False
12
	name
		col_float3
	type
		FLOAT
	nullable
		False
13
	name
		col_float4
	type
		FLOAT
	nullable
		False
14
	name
		col_float5
	type
		FLOAT
	nullable
		False
15
	name
		col_float6
	type
		FLOAT
	nullable
		False
16
	name
		col_float7
	type
		FLOAT
	nullable
		False
17
	name
		col_float8
	type
		FLOAT
	nullable
		False
18
	name
		col_float9
	type
		FLOAT
	nullable
		False
19
	name
		col_float10
	type
		FLOAT
	nullable
		False
20
	name
		test_column
	type
		FLOAT
	nullable
		False
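
One possible use of the inferred schema is to turn it into a column-to-dtype mapping for another tool. The sketch below assumes the schema is a dictionary keyed by column position, with name, type and nullable entries as printed above; the hand-written schema excerpt, the PANDAS_DTYPES table and the to_pandas_dtypes helper are hypothetical and not part of csv-schema-inference.

```python
# Assumed shape of the schema, copied by hand from part of the output above.
schema = {
    0: {"name": "id", "type": "INTEGER", "nullable": False},
    1: {"name": "full_name", "type": "STRING", "nullable": True},
    4: {"name": "weight", "type": "FLOAT", "nullable": False},
    6: {"name": "isActive", "type": "BOOLEAN", "nullable": False},
}

# Hypothetical mapping from the inferred type names to pandas dtype names.
PANDAS_DTYPES = {
    "INTEGER": "Int64",
    "FLOAT": "float64",
    "STRING": "string",
    "BOOLEAN": "boolean",
}


def to_pandas_dtypes(schema):
    """Build a {column name: dtype name} mapping from the schema dict."""
    return {
        col["name"]: PANDAS_DTYPES.get(col["type"], "object")  # fallback for unknown types
        for col in schema.values()
    }


dtypes = to_pandas_dtypes(schema)
print(dtypes)
```

A mapping like this could then be passed as the dtype argument of pandas.read_csv, provided the assumed schema shape matches what run_inference actually returns.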

Checking schema values for specific columns

result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)
20
	_name
		test_column
	types_found
		INTEGER
			cnt
				406130
		FLOAT
			cnt
				50964
	nullable
		False
	type
		FLOAT

Explore all possible data types for a specific column

result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)
20
	name
		test_column
	types_found
		INTEGER
			88.85043339006856
		FLOAT
			11.149566609931437
	nullable
		False
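
The percentages shown here are consistent with the counts reported by get_schema_columns: each one is the type's count divided by the total number of sampled values. The following sketch reproduces that arithmetic; the helper name type_percentages and the flat count dictionary are illustrative assumptions, not the library's internals.

```python
def type_percentages(type_counts):
    """Convert per-type counts into percentages of the total."""
    total = sum(type_counts.values())
    if total == 0:
        return {}
    return {name: count / total * 100 for name, count in type_counts.items()}


# Counts taken from the get_schema_columns output for test_column.
counts = {"INTEGER": 406130, "FLOAT": 50964}
percentages = type_percentages(counts)

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

With these counts, INTEGER accounts for about 88.85% of the values and FLOAT for about 11.15%, matching the explore_schema_column output.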

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

License

This project is licensed under the terms of the MIT License.
