India-aware semantic data type inferencer for CSV columns

Inferix — India-Aware Semantic Data Type Inferencer

Automatically detect what your CSV columns actually mean: PAN numbers, GST numbers, Aadhaar, IFSC codes, Indian mobile numbers, and 15 more types.


What It Does

Most tools, pandas included, detect only storage types (int64, float64, object). Inferix goes further and infers the semantic meaning of each column.

  Column Name   pandas says   Inferix says     Confidence
  cust_pan      object        pan_number       94%
  gst_code      object        gst_number       97%
  join_date     int64         date_disguised   88%
  monthly_amt   float64       inr_currency     91%

Supported Semantic Types (20)

India-Specific (7)

  Type             Description
  pan_number       Permanent Account Number
  gst_number       GST Identification Number
  aadhaar_number   Aadhaar UID (12-digit)
  ifsc_code        Bank IFSC Code
  indian_mobile    Indian Mobile Number (10-digit)
  indian_pincode   Indian Postal PIN Code (6-digit)
  inr_currency     Indian Rupee Amount
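As a rough illustration of how pattern-based evidence such as `regex_pan=0.94` could be computed, the sketch below scores a column by the fraction of values that fully match a regex. The patterns follow the commonly published formats of each identifier; they are assumptions for this example, not the patterns shipped in `inferix/patterns.py`, and `match_scores` is a hypothetical helper.

```python
import re

# Illustrative patterns based on the published formats of each identifier.
# Assumptions for this sketch, NOT the patterns shipped with Inferix.
PATTERNS = {
    "pan_number": re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]"),
    "gst_number": re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]"),
    "aadhaar_number": re.compile(r"[2-9][0-9]{11}"),
    "ifsc_code": re.compile(r"[A-Z]{4}0[A-Z0-9]{6}"),
    "indian_mobile": re.compile(r"[6-9][0-9]{9}"),
    "indian_pincode": re.compile(r"[1-9][0-9]{5}"),
}


def match_scores(values):
    """Return, per pattern, the fraction of non-empty values that fully match."""
    cleaned = [str(v).strip().upper() for v in values if str(v).strip()]
    if not cleaned:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for v in cleaned if pattern.fullmatch(v)) / len(cleaned)
        for name, pattern in PATTERNS.items()
    }


sample = ["ABCDE1234F", "PQRSX6789K", "not-a-pan", "LMNOP4321Z"]
scores = match_scores(sample)
print(scores["pan_number"])  # 0.75
```

A format match is only evidence, not validation: a real Aadhaar or GSTIN check would also verify the checksum digit, which this sketch does not attempt.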

General (13)

  Type                Description
  email_address       Email addresses
  url                 Web URLs
  date_formatted      Dates in DD/MM/YYYY etc.
  date_disguised      Dates stored as integers (YYYYMMDD)
  timestamp           Unix timestamps
  percentage          Percentage values
  binary_flag         Yes/No, True/False, 0/1
  id_column           Sequential or UUID identifiers
  ratio               Decimal values between 0 and 1
  count               Non-negative integer counts
  age                 Age values (0-120)
  category_low_card   Low-cardinality categorical
  free_text           Free-form text / remarks
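The `date_disguised` type is the least obvious of these, since pandas reads such a column as `int64`. One plausible way to spot it is to check whether each integer parses as a real calendar date in YYYYMMDD form. The helper names below are hypothetical and the approach is a sketch, not necessarily how Inferix detects this type.

```python
from datetime import datetime


def looks_like_yyyymmdd(value):
    """True if an integer-like value parses as a calendar date in YYYYMMDD form."""
    text = str(value)
    if len(text) != 8 or not text.isdigit():
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        # Covers impossible dates such as 30 February.
        return False
    return True


def disguised_date_share(values):
    """Fraction of values that are valid YYYYMMDD dates (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(looks_like_yyyymmdd(v) for v in values) / len(values)


join_date = [20230115, 20221231, 20230230, 19991201]
print(disguised_date_share(join_date))  # 0.75
```

Parsing with `strptime` rather than only checking digit ranges rejects values such as 20230230, which have a valid month and a two-digit day but do not exist on the calendar.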

Installation

pip install inferix-py

Or install from source:

git clone https://github.com/rakshakr2006-droid/inferix.git
cd inferix
pip install -e .

Quick Start

Train the Model (one-time setup)

python -m inferix.train

Use It

import pandas as pd
from inferix import infer

df = pd.read_csv("your_data.csv")
results = infer(df)
print(results)

Output:

  column_name   semantic_type   confidence  evidence
  cust_pan      pan_number      94%         regex_pan=0.94, name_match=yes
  gst_code      gst_number      97%         regex_gst=0.97, name_match=yes
  join_date     date_disguised  88%         all_int=True, name_match=yes
  monthly_amt   inr_currency    91%         regex_inr=0.85, name_match=yes

Architecture

Inferix uses a 5-layer analysis pipeline combining regex pattern matching, statistical profiling, and XGBoost classification:

Column Data --> [Layer 1: Syntactic] --> null%, unique%, dtype
             --> [Layer 2: Pattern]  --> regex match scores (12 patterns)
             --> [Layer 3: Stats]    --> mean, std, skew, entropy
             --> [Layer 4: Name]     --> column name keyword match
                           |
              [All 50 features combined]
                           |
              [Layer 5: XGBoost Classifier]
                           |
              Semantic Type + Confidence + Evidence
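To make the layered design more concrete, here is a minimal sketch of how a few per-layer features might be computed and merged into one flat record before classification. The feature names, the keyword table, and all function names are hypothetical; the actual 50 features live in `inferix/features.py` and may be defined differently.

```python
import math
from collections import Counter

# Hypothetical keyword table for the name layer; not taken from Inferix.
NAME_KEYWORDS = {
    "pan_number": ("pan",),
    "gst_number": ("gst", "gstin"),
    "date_disguised": ("date", "dt"),
}


def syntactic_features(values):
    """Layer 1 style features: share of missing values and share of unique values."""
    total = len(values)
    present = [v for v in values if v is not None and str(v).strip() != ""]
    null_pct = 1 - len(present) / total if total else 0.0
    unique_pct = len(set(present)) / len(present) if present else 0.0
    return {"null_pct": null_pct, "unique_pct": unique_pct}


def shannon_entropy(values):
    """Layer 3 style feature: entropy (in bits) of the value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def name_features(column_name):
    """Layer 4 style features: 1.0 if any keyword equals a token of the column name."""
    tokens = column_name.lower().replace("-", "_").split("_")
    return {
        f"name_{label}": float(any(k in tokens for k in keywords))
        for label, keywords in NAME_KEYWORDS.items()
    }


def feature_vector(column_name, values):
    """Merge the per-layer features into one flat dict for a classifier."""
    features = syntactic_features(values)
    features["entropy"] = shannon_entropy([v for v in values if v is not None])
    features.update(name_features(column_name))
    return features


features = feature_vector("cust_pan", ["ABCDE1234F", "PQRSX6789K", None, "ABCDE1234F"])
print(features)
```

In a full pipeline, the regex match scores from the pattern layer would be added to the same dict, and the resulting vector would be passed to the trained classifier.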

Project Structure

inferix/
├── inferix/
│   ├── __init__.py          # Package init, public API
│   ├── infer.py             # Main infer() function
│   ├── patterns.py          # 12 Indian regex patterns
│   ├── features.py          # 50-feature extraction pipeline
│   ├── data_generator.py    # Synthetic training data
│   ├── train.py             # Model training script
│   └── model/
│       ├── inferix_model.json
│       └── label_encoder.pkl
├── tests/
│   ├── test_patterns.py     # 35 pattern tests
│   ├── test_features.py     # 19 feature tests
│   └── test_infer.py        # 8 inference tests
├── pyproject.toml
├── requirements.txt
└── README.md

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
python -m pytest tests/ -v

System Requirements

  • Python 3.10+
  • 8GB RAM (no GPU needed)
  • Works completely offline after initial setup

Why Inferix?

  Tool             Semantic Detection   India Types   ML-Based   Lightweight
  Inferix          20 types             7 types       XGBoost    ~50MB
  Sherlock (MIT)   78 types             0             DNN        ~2GB, needs GPU
  csv-detective    ~30 types            0             No         ~10MB
  pandas           0                    0             No         built-in

License

MIT License

Acknowledgements

Inspired by Sherlock (MIT, 2019), which detects 78 generic semantic types. Inferix fills the gap for India-specific types that Sherlock cannot detect.
