High-performance, local-first semantic data cleaning library
The All-in-One Local AI Data Cleaner.
Why Loclean?
Loclean bridges the gap between data engineering and local AI. It is designed for production pipelines where privacy and stability are non-negotiable.
Privacy-First & Zero Cost
Run Small Language Models (SLMs) such as Phi-3 and Llama-3 locally via llama.cpp, with no per-request API fees. Clean PII, medical records, or proprietary data without a single byte leaving your infrastructure.
Deterministic Outputs
Forget about parsing loose text or malformed output. Loclean uses GBNF grammars and Pydantic V2 to constrain the LLM to valid, type-safe JSON. Any output that breaks the schema is rejected.
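The "breaks the schema, doesn't pass" rule can be illustrated with a minimal sketch. This is not Loclean's implementation: it uses only the standard library as a stand-in for Pydantic validation, and the `Weight` schema and `parse_strict` helper are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Weight:
    """Hypothetical target schema: a numeric value plus a unit string."""
    value: float
    unit: str


def parse_strict(raw: str) -> Optional[Weight]:
    """Return a Weight if `raw` is JSON matching the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(data, dict) or set(data) != {"value", "unit"}:
        return None  # wrong shape, or missing/extra keys
    value, unit = data["value"], data["unit"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    if not isinstance(unit, str):
        return None
    return Weight(value=float(value), unit=unit)
```

A grammar-constrained decoder would make the rejection branches rare, since the model can only emit tokens the grammar allows; the validation step remains as a final gate.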
Backend Agnostic (Zero-Copy)
Built on Narwhals, Loclean supports Pandas, Polars, and PyArrow natively.
- Running Polars? We keep it lazy.
- Running Pandas? We handle it seamlessly.
- No heavy dependency lock-in.
Installation
Requirements
- Python 3.10, 3.11, 3.12, or 3.13
- No GPU required (runs on CPU by default)
Basic Installation
Using pip:
pip install loclean
Using uv (recommended for faster installs):
uv pip install loclean
Using conda/mamba:
conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean
Optional Dependencies
For DataFrame operations (Pandas, Polars, PyArrow):
pip install "loclean[data]"
For Cloud API support (OpenAI, Anthropic, Gemini):
pip install "loclean[cloud]"
Install everything:
pip install "loclean[all]"
Development Installation
To contribute or run tests locally:
# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean
# Install with development dependencies (using uv)
uv sync --dev
# Or using pip
pip install -e ".[dev]"
Quick Start
in progress...
How It Works (The Architecture)
in progress...
Roadmap
The development of Loclean is organized into three phases, prioritizing MVP delivery while maintaining a long-term vision.
Phase 1: The "Smart" Engine (Hybrid Core)
Goal: Get loclean.clean() running fast and accurately.
- Hybrid Router Architecture: Build the clean(strategy='auto') function, which automatically runs regex first and falls back to the LLM second.
- Strict Output (Pydantic + GBNF): Ensure 100% of LLM outputs are valid against the JSON Schema (using llama-cpp-python grammars).
- Simple Extraction: Extract basic information from raw text (Unstructured to Structured).
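The regex-first, LLM-second routing described in this phase could look roughly like the sketch below. Everything here is an assumption for illustration: `clean_auto`, `regex_extract`, the weight pattern, and the `fake_llm` stub are hypothetical and do not reflect the actual signature of `loclean.clean()`.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: a number followed by a weight unit, e.g. "5.5kg".
_WEIGHT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g|lb)\s*$", re.IGNORECASE)


def regex_extract(text: str) -> Optional[dict]:
    """Cheap path: return a structured result, or None if the pattern fails."""
    m = _WEIGHT.match(text)
    if m is None:
        return None
    return {"value": float(m.group(1)), "unit": m.group(2).lower()}


def clean_auto(text: str, llm_fallback: Callable[[str], dict]) -> dict:
    """Try regex first; call the expensive fallback only when regex fails."""
    result = regex_extract(text)
    if result is not None:
        return result
    return llm_fallback(text)


def fake_llm(text: str) -> dict:
    # Stand-in for a local model call; a real backend would go here.
    return {"value": None, "unit": None, "source": "llm"}
```

Calling `clean_auto("5.5kg", fake_llm)` never touches the fallback, while free-form input such as "roughly five kilos" is routed to it.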
Phase 2: The "Safe" Layer (Security & Optimization)
Goal: Convince enterprises to trust and adopt the library.
- Semantic PII Redaction: Mask sensitive names, phone numbers, emails, and addresses.
- SQLite Caching System: Cache LLM results to avoid redundant cost and latency.
- Batch Processing: Process rows in parallel to handle millions of rows without freezing.
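As a rough illustration of the planned caching idea, the sketch below stores results in SQLite keyed on the instruction and the input text, so a repeated input skips the expensive call. The `ResultCache` class, its table layout, and the `compute` callback are assumptions made for this example, not Loclean's actual design.

```python
import sqlite3
from typing import Callable


class ResultCache:
    """Tiny key-value cache keyed on (instruction, input text)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "instruction TEXT NOT NULL, "
            "text TEXT NOT NULL, "
            "result TEXT NOT NULL, "
            "PRIMARY KEY (instruction, text))"
        )

    def get_or_compute(
        self, instruction: str, text: str, compute: Callable[[str], str]
    ) -> str:
        """Return the cached result, or run `compute` once and store it."""
        row = self.conn.execute(
            "SELECT result FROM cache WHERE instruction = ? AND text = ?",
            (instruction, text),
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no model call needed
        result = compute(text)
        self.conn.execute(
            "INSERT INTO cache (instruction, text, result) VALUES (?, ?, ?)",
            (instruction, text, result),
        )
        self.conn.commit()
        return result
```

Passing a file path instead of ":memory:" would persist the cache across runs, which is where most of the savings on repeated pipeline executions would come from.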
Phase 3: The "Magic" (Advanced Features)
Goal: Do things that Regex can never do.
- Contextual Imputation: Fill missing values based on context (e.g., seeing ZIP code 70000 and auto-filling City: TP.HCM, i.e. Ho Chi Minh City).
- Entity Canonicalization: Group entities (Fuzzy matching + Semantic matching).
- Interactive CLI: Terminal interface for reviewing low-confidence AI changes.
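The fuzzy-matching half of entity canonicalization can be sketched with the standard library's `difflib`. This covers only string similarity; the semantic-matching half would need an embedding or language model and is not shown. The `canonicalize` helper and its `cutoff` default are hypothetical.

```python
from difflib import get_close_matches
from typing import Sequence


def canonicalize(value: str, canon: Sequence[str], cutoff: float = 0.8) -> str:
    """Map `value` to the closest canonical name, or return it unchanged."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {c.lower(): c for c in canon}
    hits = get_close_matches(
        value.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else value
```

With a canonical list such as `["Ho Chi Minh City", "Hanoi", "Da Nang"]`, a misspelling like "Hanoii" maps to "Hanoi", while an unrelated value is left untouched.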
Contributing
We love contributions! Loclean is strictly open-source under the Apache 2.0 License.
- Fork the repo on GitHub.
- Clone your fork locally.
- Create a new branch (git checkout -b feature/amazing-feature).
- Commit your changes.
- Push to your fork and submit a Pull Request.
Built for the Data Community.
Download files
File details
Details for the file loclean-0.1.1.tar.gz.
File metadata
- Download URL: loclean-0.1.1.tar.gz
- Upload date:
- Size: 28.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15d9a7d2b4722296b693675059bda2d22061d8a0e4cccf18341e46482f39d728 |
| MD5 | b66f678d67d20eed850984c5235f51d8 |
| BLAKE2b-256 | a7361aca51a4b0fea7da69c8349b315db30d71a37560ed6d5f87a3d6957fadb7 |
Provenance
The following attestation bundles were made for loclean-0.1.1.tar.gz:
Publisher: publish.yml on nxank4/loclean
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: loclean-0.1.1.tar.gz
  - Subject digest: 15d9a7d2b4722296b693675059bda2d22061d8a0e4cccf18341e46482f39d728
- Sigstore transparency entry: 803141131
- Sigstore integration time:
- Permalink: nxank4/loclean@ca7d535a5fd62e247c390b2a0da3865a08bd7744
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/nxank4
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ca7d535a5fd62e247c390b2a0da3865a08bd7744
- Trigger Event: push
File details
Details for the file loclean-0.1.1-py3-none-any.whl.
File metadata
- Download URL: loclean-0.1.1-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d39ddcc1c3095df4fa0ba4134b12b89d62efcf4be0bb5a049b74d54188cfa708 |
| MD5 | 7363d364ca03985002f3eb474de56c57 |
| BLAKE2b-256 | 67244d6fe7ba83ac8d8904467029d84a1bac2d569008d738ff935c3a280c4c52 |
Provenance
The following attestation bundles were made for loclean-0.1.1-py3-none-any.whl:
Publisher: publish.yml on nxank4/loclean
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: loclean-0.1.1-py3-none-any.whl
  - Subject digest: d39ddcc1c3095df4fa0ba4134b12b89d62efcf4be0bb5a049b74d54188cfa708
- Sigstore transparency entry: 803141174
- Sigstore integration time:
- Permalink: nxank4/loclean@ca7d535a5fd62e247c390b2a0da3865a08bd7744
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/nxank4
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ca7d535a5fd62e247c390b2a0da3865a08bd7744
- Trigger Event: push