A tool to automatically infer column data types in .csv files

Csv Schema Inference

For background, see the article "Building a Schema Inference Data Pipeline for Large CSV files".

Installing csv-schema-inference 🔧

pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
  Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9

Importing the csv-schema-inference library

from csv_schema_inference import csv_schema_inference

Setting csv-schema-inference configuration

# If a column is inferred as INTEGER but FLOAT values were also found, report it as FLOAT.
conditions = {"INTEGER": "FLOAT"}

csv_infer = csv_schema_inference.CsvSchemaInference(
    portion=0.9, max_length=100, batch_size=200000, acc=0.8,
    seed=2, header=True, sep=",", conditions=conditions,
)
pathfile = "/content/file__500k.csv"
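
The parameter names suggest that inference runs on a random sample of rows processed in batches. The sketch below is one possible reading, not the library's implementation: it assumes `portion` is the fraction of rows sampled, `seed` fixes the random generator, and `batch_size` is the maximum number of rows handled at once. The helper `sample_rows_in_batches` is hypothetical.

```python
import csv
import io
import random
from itertools import islice


def sample_rows_in_batches(lines, portion, batch_size, seed, header=True, sep=","):
    """Yield lists of sampled rows, at most batch_size rows per list.

    Assumption: each data row is kept independently with probability `portion`.
    """
    rng = random.Random(seed)  # seeded so the same rows are picked on every run
    reader = csv.reader(lines, delimiter=sep)
    if header:
        next(reader, None)  # skip the header row
    sampled = (row for row in reader if rng.random() < portion)
    while True:
        batch = list(islice(sampled, batch_size))
        if not batch:
            return
        yield batch


# Tiny in-memory CSV standing in for a real file.
data = io.StringIO("id,age\n" + "".join(f"{i},{20 + i}\n" for i in range(10)))
batches = list(sample_rows_in_batches(data, portion=1.0, batch_size=4, seed=2))
print([len(b) for b in batches])  # [4, 4, 2] because portion=1.0 keeps every row
```

Reading in batches keeps memory bounded for large files, and a fixed seed makes the sampled rows, and therefore the inferred schema, reproducible between runs.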

Run inference 🏃

approx_schema = csv_infer.run_inference(pathfile)

Showing the approximate data type inference for each column 🔍

csv_infer.pretty(approx_schema)
0
	name
		id
	type
		INTEGER
	nullable
		False
1
	name
		full_name
	type
		STRING
	nullable
		True
2
	name
		age
	type
		INTEGER
	nullable
		False
3
	name
		city
	type
		STRING
	nullable
		True
4
	name
		weight
	type
		FLOAT
	nullable
		False
5
	name
		height
	type
		FLOAT
	nullable
		False
6
	name
		isActive
	type
		BOOLEAN
	nullable
		False
7
	name
		col_int1
	type
		INTEGER
	nullable
		False
8
	name
		col_int2
	type
		INTEGER
	nullable
		False
9
	name
		col_int3
	type
		INTEGER
	nullable
		False
10
	name
		col_float1
	type
		FLOAT
	nullable
		False
11
	name
		col_float2
	type
		FLOAT
	nullable
		False
12
	name
		col_float3
	type
		FLOAT
	nullable
		False
13
	name
		col_float4
	type
		FLOAT
	nullable
		False
14
	name
		col_float5
	type
		FLOAT
	nullable
		False
15
	name
		col_float6
	type
		FLOAT
	nullable
		False
16
	name
		col_float7
	type
		FLOAT
	nullable
		False
17
	name
		col_float8
	type
		FLOAT
	nullable
		False
18
	name
		col_float9
	type
		FLOAT
	nullable
		False
19
	name
		col_float10
	type
		FLOAT
	nullable
		False
20
	name
		test_column
	type
		FLOAT
	nullable
		False
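
To make the output above more concrete, here is a minimal sketch of how the values of one column could be classified and counted by type. The detection rules and the helpers `detect_type` and `tally_types` are assumptions for illustration, not the library's code; empty cells and nullability are left out.

```python
from collections import Counter


def detect_type(value):
    """Classify one raw CSV cell as BOOLEAN, INTEGER, FLOAT or STRING (hypothetical rules)."""
    text = value.strip()
    if text.lower() in {"true", "false"}:
        return "BOOLEAN"
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        return "STRING"


def tally_types(values):
    """Count how many cells of a column fall into each detected type."""
    return Counter(detect_type(v) for v in values)


counts = tally_types(["1", "2", "3.5", "4"])
print(dict(counts))  # {'INTEGER': 3, 'FLOAT': 1}
```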

Checking schema values for specific columns

result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)
20
	_name
		test_column
	types_found
		INTEGER
			cnt
				406130
		FLOAT
			cnt
				50964
	nullable
		False
	type
		FLOAT
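
The result above reports FLOAT even though most sampled values are INTEGER, which matches the `conditions` rule set in the configuration. A minimal sketch of how such a resolution step might work follows. It assumes `acc` is the minimum share the most frequent type must reach and that STRING is the fallback when no type reaches it; both are guesses, and `resolve_type` is a hypothetical helper.

```python
def resolve_type(types_found, acc, conditions):
    """Pick a column type from per-type counts (hypothetical resolution rule)."""
    total = sum(types_found.values())
    best, best_count = max(types_found.items(), key=lambda kv: kv[1])
    # Assumption: fall back to STRING when no type reaches the acc threshold.
    if best_count / total < acc:
        return "STRING"
    # Promote the winner when its configured wider type also occurs in the column.
    wider = conditions.get(best)
    if wider is not None and wider in types_found:
        return wider
    return best


types_found = {"INTEGER": 406130, "FLOAT": 50964}  # counts reported for test_column
print(resolve_type(types_found, acc=0.8, conditions={"INTEGER": "FLOAT"}))  # FLOAT
```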

Explore all possible data types for a specific column

result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)
20
	name
		test_column
	types_found
		INTEGER
			88.85043339006856
		FLOAT
			11.149566609931437
	nullable
		False
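
The percentages reported here are consistent with the counts returned by get_schema_columns for the same column: each value is that type's share of all typed values. The short check below uses only the numbers shown in this document.

```python
counts = {"INTEGER": 406130, "FLOAT": 50964}  # from the get_schema_columns output
total = sum(counts.values())
shares = {name: 100 * cnt / total for name, cnt in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
# INTEGER: 88.85%
# FLOAT: 11.15%
```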

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

License

This project is licensed under the terms of the MIT License.
