High-performance, local-first semantic data cleaning library
Loclean ⚡🧠
The All-in-One Local AI Data Cleaner.
Clean messy tabular data using local AI. No API keys required. No GPU required.
🔥 Why Loclean?
Loclean bridges the gap between data engineering and local AI. It is designed for production pipelines where privacy and stability are non-negotiable.
🔒 Privacy-First & Zero Cost
Leverage the power of Small Language Models (SLMs) like Phi-3 and Llama-3 running locally via llama.cpp. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure.
🛡️ Deterministic Outputs
Forget about parsing loose text. Loclean uses GBNF grammars to constrain the LLM's output to valid JSON and Pydantic V2 to validate that JSON against a typed schema. Output that breaks the schema is rejected.
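As an illustration of the "breaks the schema, doesn't pass" rule, here is a minimal sketch of the validation step only. It uses the standard library as a stand-in for Pydantic V2 and does not show grammar-constrained decoding. The `Measurement` schema and the `parse_strict` helper are hypothetical names invented for this example; they are not part of Loclean's API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    # Hypothetical target schema for one cleaned cell (not a Loclean type).
    value: float
    unit: str


def parse_strict(raw: str) -> Optional[Measurement]:
    """Return a Measurement if `raw` is JSON matching the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(data, dict) or set(data) != {"value", "unit"}:
        return None  # missing or unexpected keys
    value, unit = data["value"], data["unit"]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    if not isinstance(unit, str):
        return None
    return Measurement(value=float(value), unit=unit)


print(parse_strict('{"value": 5.5, "unit": "kg"}'))
print(parse_strict('{"value": "heavy", "unit": "kg"}'))
```

With a grammar constraining generation, the rejection branches should rarely trigger. The validator remains useful as a final type check before a value is written back to the table.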
⚡ Backend Agnostic (Zero-Copy)
Built on Narwhals, Loclean supports Pandas, Polars, and PyArrow natively.
- Running Polars? We keep it lazy.
- Running Pandas? We handle it seamlessly.
- No heavy dependency lock-in.
🚀 Installation
Requirements
- Python 3.10 or higher
- No GPU required (runs on CPU by default)
Basic Installation
Using pip:
pip install loclean
Using uv (recommended for faster installs):
uv pip install loclean
Using conda/mamba:
conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean
Optional Dependencies
For DataFrame operations (Pandas, Polars, PyArrow):
pip install loclean[data]
For Cloud API support (OpenAI, Anthropic, Gemini):
pip install loclean[cloud]
Install everything:
pip install loclean[all]
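To confirm that an environment meets the requirements above, a short standard-library check such as the following sketch can help. It tests the Python version and whether the DataFrame backends named above are importable. The assumption that the [data] extra installs all three packages, and the `check_environment` helper itself, are illustrative and not taken from Loclean.

```python
import sys
from importlib.util import find_spec

# Backends named in this README for the [data] extra. Whether the extra
# installs all three is an assumption made for this sketch.
DATAFRAME_BACKENDS = ("pandas", "polars", "pyarrow")


def check_environment() -> dict:
    """Report whether the Python version and optional backends are available."""
    report = {"python>=3.10": sys.version_info >= (3, 10)}
    for name in DATAFRAME_BACKENDS:
        # find_spec returns None when a top-level package is not installed.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```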
Development Installation
To contribute or run tests locally:
# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean
# Install with development dependencies (using uv)
uv sync --dev
# Or using pip
pip install -e ".[dev]"
⚡ Quick Start
in progress...
🏗️ How It Works (The Architecture)
in progress...
🗺️ Roadmap
The development of Loclean is focused on three key areas: Reliability, Privacy, and Integration.
📍 Phase 1: Core Intelligence (Current Focus)
Goal: Build a deterministic and smart cleaning engine.
- Strict Schema Mode: Guarantee valid outputs by forcing the LLM to adhere to Pydantic models using GBNF grammar (eliminates JSON parsing errors).
- Contextual Imputation: Fill null values intelligently by reasoning over surrounding column context (e.g., inferring State from Zip Code).
- Entity Canonicalization: Map messy variations (e.g., "Apple Inc.", "apple comp", "AAPL") to a single "Golden Record" standard.
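To make the entity canonicalization goal concrete, here is a minimal sketch of the intended input and output. It uses a hand-written alias table rather than a language model, so it only shows the shape of the mapping, not how Loclean will implement it. `ALIASES`, `normalize`, and `canonicalize` are hypothetical names.

```python
import re
from typing import Optional

# Hypothetical alias table: normalized variant -> golden record.
ALIASES = {
    "apple inc": "Apple Inc.",
    "apple comp": "Apple Inc.",
    "aapl": "Apple Inc.",
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def canonicalize(text: str) -> Optional[str]:
    """Return the golden record for a known variant, or None if unknown."""
    return ALIASES.get(normalize(text))


for raw in ["Apple Inc.", "apple comp", "AAPL", "Microsoft"]:
    print(raw, "->", canonicalize(raw))
```

A fixed table only covers variants someone has already listed. The roadmap item aims to handle unseen variants as well, which is where a model is expected to help.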
📍 Phase 2: Privacy & Advanced Extraction
Goal: Specialized features for enterprise-grade data handling.
- Unstructured Extraction: Parse free-text fields (Logs, Bios, Reviews) into structured tabular data.
- Semantic PII Redaction: Automatically detect and mask sensitive entities (Names, SSNs, Emails) locally to ensure data privacy.
- Semantic Outlier Detection: Flag values that are statistically normal but contextually impossible (e.g., "Age: 200").
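For contrast with the semantic redaction planned here, the sketch below shows a purely pattern-based baseline using regular expressions. It masks emails and US-style SSNs but cannot recognize names, which is the gap a semantic approach is meant to close. The patterns and the `redact` helper are illustrative assumptions, not Loclean code.

```python
import re

# Simplified patterns for illustration; real-world emails and SSNs vary more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Mask email addresses and SSN-shaped numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
```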
📍 Phase 3: Ecosystem & DX
Goal: Make Loclean a first-class citizen in the Python data stack.
- Native DataFrame Accessors: Direct integration for Pandas and Polars (e.g., df.loclean.clean(...)) via PyArrow.
- Interactive CLI Review: A "Human-in-the-loop" mode to review and approve low-confidence AI changes via the terminal.
- Custom LoRA Adapters: Support for loading lightweight, domain-specific fine-tunes (e.g., Medical, Legal) without replacing the base model.
🤝 Contributing
We love contributions! Loclean is strictly open-source under the Apache 2.0 License.
- Fork the repo on GitHub.
- Clone your fork locally.
- Create a new branch (git checkout -b feature/amazing-feature).
- Commit your changes.
- Push to your fork and submit a Pull Request.
Built with ❤️ for the Data Community.