High-performance, local-first semantic data cleaning library

The All-in-One Local AI Data Cleaner.

Why Loclean?

Loclean bridges the gap between Data Engineering and Local AI, designed for production pipelines where privacy and stability are non-negotiable.

Privacy-First & Zero Cost

Leverage the power of Small Language Models (SLMs) like Phi-3 and Llama-3 running locally via llama.cpp. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure.

Deterministic Outputs

Forget about malformed output or parsing loose text. Loclean uses GBNF grammars and Pydantic V2 to constrain the LLM to valid, type-safe JSON. Output that breaks the schema is rejected.
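The validate-or-reject idea can be sketched without any dependencies. This is only an illustration, not Loclean's implementation: the library is described as using Pydantic V2 and GBNF grammars for this role, while the sketch below uses a plain dataclass and manual checks. The `Measurement` schema and the `parse_strict` function are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical target schema: a numeric value plus a unit string."""

    value: float
    unit: str


def parse_strict(raw: str) -> Measurement:
    """Parse model output; raise ValueError unless it matches the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    # Reject anything that is not an object with exactly the expected keys.
    if not isinstance(data, dict) or set(data) != {"value", "unit"}:
        raise ValueError("expected exactly the keys 'value' and 'unit'")

    value, unit = data["value"], data["unit"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("'value' must be a number")
    if not isinstance(unit, str):
        raise ValueError("'unit' must be a string")

    return Measurement(value=float(value), unit=unit)
```

Calling `parse_strict('{"value": 5, "unit": "kg"}')` returns a `Measurement`, while free text or a wrongly typed field raises `ValueError` instead of passing through silently.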

Backend Agnostic (Zero-Copy)

Built on Narwhals, Loclean supports Pandas, Polars, and PyArrow natively.

  • Running Polars? We keep it lazy.
  • Running Pandas? We handle it seamlessly.
  • No heavy dependency lock-in.

Installation

Requirements

  • Python 3.10, 3.11, 3.12, or 3.13
  • No GPU required (runs on CPU by default)

Basic Installation

Using pip:

pip install loclean

Using uv (recommended for faster installs):

uv pip install loclean

Using conda/mamba:

conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean

Optional Dependencies

For DataFrame operations (Pandas, Polars, PyArrow):

pip install "loclean[data]"

For Cloud API support (OpenAI, Anthropic, Gemini):

pip install "loclean[cloud]"

Install everything:

pip install "loclean[all]"

Development Installation

To contribute or run tests locally:

# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean

# Install with development dependencies (using uv)
uv sync --dev

# Or using pip
pip install -e ".[dev]"

Quick Start

in progress...

How It Works (The Architecture)

in progress...

Roadmap

The development of Loclean is organized into three phases, prioritizing MVP delivery while maintaining a long-term vision.

Phase 1: The "Smart" Engine (Hybrid Core)

Goal: Get loclean.clean() running fast and accurately.

  • Hybrid Router Architecture: Build a clean(strategy='auto') function that runs regex first and falls back to the LLM only when regex fails.
  • Strict Output (Pydantic + GBNF): Ensure that 100% of LLM outputs are valid JSON conforming to the schema, using llama-cpp-python grammars.
  • Simple Extraction: Extract basic information from raw text (unstructured to structured).
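The regex-first, LLM-second routing in the first item can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned `clean()` API: the names `regex_extract` and `clean_auto`, the weight pattern, and the `llm_fallback` callable are all hypothetical, and the fallback stands in for whatever model call the library ends up making.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: a number followed by a weight unit, e.g. "5.5 kg".
_WEIGHT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g|lb)\s*$", re.IGNORECASE)


def regex_extract(text: str) -> Optional[dict]:
    """Cheap path: return a structured record, or None if the pattern misses."""
    match = _WEIGHT.match(text)
    if match is None:
        return None
    return {"value": float(match.group(1)), "unit": match.group(2).lower()}


def clean_auto(text: str, llm_fallback: Callable[[str], dict]) -> dict:
    """Try the regex first; call the slower fallback only when it misses."""
    result = regex_extract(text)
    if result is not None:
        return result
    return llm_fallback(text)
```

With this shape, an input such as "5.5 KG" never reaches the model, while "five kilos" is handed to the fallback. The expensive call is paid only for rows the regex cannot handle.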

Phase 2: The "Safe" Layer (Security & Optimization)

Goal: Convince enterprises to trust and adopt the library.

  • Semantic PII Redaction: Mask sensitive names, phone numbers, emails, and addresses.
  • SQLite Caching System: Cache LLM results to avoid paying the cost and latency of repeated calls on identical inputs.
  • Batch Processing: Process rows in parallel to handle millions of rows without freezing.
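The caching item can be illustrated with the standard-library `sqlite3` module. This is a sketch of the general idea only; the `ResultCache` class, its table layout, and the choice to key on a hash of the instruction plus the input text are assumptions made for the example, not Loclean's design.

```python
import hashlib
import json
import sqlite3
from typing import Callable


class ResultCache:
    """Hypothetical cache: maps (instruction, text) to a JSON result."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    @staticmethod
    def _key(text: str, instruction: str) -> str:
        # Hash both parts so a changed instruction never reuses old results.
        payload = json.dumps([instruction, text]).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_compute(
        self, text: str, instruction: str, compute: Callable[[str], dict]
    ) -> dict:
        key = self._key(text, instruction)
        row = self._conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: skip the expensive call
        result = compute(text)
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
        self._conn.commit()
        return result
```

Passing a file path instead of `":memory:"` would persist results across runs, which is where most of the savings on repeated pipeline executions would come from.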

Phase 3: The "Magic" (Advanced Features)

Goal: Do things that Regex can never do.

  • Contextual Imputation: Fill missing values based on context (e.g., seeing ZIP code 70000 and auto-filling City: TP.HCM).
  • Entity Canonicalization: Group variant spellings of the same entity (fuzzy matching plus semantic matching).
  • Interactive CLI: Terminal interface for reviewing low-confidence AI changes.
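The fuzzy half of entity canonicalization can be sketched with the standard-library `difflib` module. The `canonicalize` function, the 0.8 threshold, and the first-seen-wins grouping rule are assumptions for illustration; the semantic matching mentioned above would need a model and is not shown.

```python
from difflib import SequenceMatcher


def _similar(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def canonicalize(names: list[str], threshold: float = 0.8) -> dict[str, str]:
    """Map each name to the first earlier-seen name it closely resembles."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        # Reuse an existing canonical form if one is similar enough.
        match = next(
            (c for c in canonical if _similar(name, c) >= threshold), None
        )
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping
```

For example, `canonicalize(["Acme Corp", "ACME Corp.", "Globex"])` maps both Acme spellings to "Acme Corp" and leaves "Globex" as its own entity. A purely character-based ratio cannot link aliases that share no spelling, which is the gap semantic matching is meant to cover.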

Contributing

We love contributions! Loclean is strictly open-source under the Apache 2.0 License.

  1. Fork the repo on GitHub.
  2. Clone your fork locally.
  3. Create a new branch (git checkout -b feature/amazing-feature).
  4. Commit your changes.
  5. Push to your fork and submit a Pull Request.

Built for the Data Community.
