High-performance, local-first semantic data cleaning library

Loclean ⚡🧠

The All-in-One Local AI Data Cleaner.

Clean messy tabular data using local AI. No API keys required. No GPU required.

🔥 Why Loclean?

Loclean bridges the gap between data engineering and local AI. It is designed for production pipelines where privacy and stability are non-negotiable.

🔒 Privacy-First & Zero Cost

Run Small Language Models (SLMs) such as Phi-3 and Llama-3 locally via llama.cpp. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure.

🛡️ Deterministic Outputs

Forget about parsing loose text or malformed output. Loclean uses GBNF grammars and Pydantic V2 to constrain the LLM to valid, type-safe JSON. Output that breaks the schema is rejected instead of being passed downstream.

⚡ Backend Agnostic (Zero-Copy)

Built on Narwhals, Loclean supports Pandas, Polars, and PyArrow natively.

  • Running Polars? We keep it lazy.
  • Running Pandas? We handle it seamlessly.
  • No heavy dependency lock-in.

🚀 Installation

Requirements

  • Python 3.10 or higher
  • No GPU required (runs on CPU by default)

Basic Installation

Using pip:

pip install loclean

Using uv (recommended for faster installs):

uv pip install loclean

Using conda/mamba:

conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean

Optional Dependencies

For DataFrame operations (Pandas, Polars, PyArrow):

pip install "loclean[data]"

For Cloud API support (OpenAI, Anthropic, Gemini):

pip install "loclean[cloud]"

Install everything:

pip install "loclean[all]"

Development Installation

To contribute or run tests locally:

# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean

# Install with development dependencies (using uv)
uv sync --dev

# Or using pip
pip install -e ".[dev]"

⚡ Quick Start

in progress...

🏗️ How It Works (The Architecture)

in progress...

🗺️ Roadmap

The development of Loclean is focused on three key areas: Reliability, Privacy, and Integration.

📍 Phase 1: Core Intelligence (Current Focus)

Goal: Build a deterministic and smart cleaning engine.

  • Strict Schema Mode: Guarantee valid outputs by forcing the LLM to adhere to Pydantic models using GBNF grammar (eliminates JSON parsing errors).
  • Contextual Imputation: Fill null values intelligently by reasoning over surrounding column context (e.g., inferring State from Zip Code).
  • Entity Canonicalization: Map messy variations (e.g., "Apple Inc.", "apple comp", "AAPL") to a single "Golden Record" standard.

📍 Phase 2: Privacy & Advanced Extraction

Goal: Specialized features for enterprise-grade data handling.

  • Unstructured Extraction: Parse free-text fields (Logs, Bios, Reviews) into structured tabular data.
  • Semantic PII Redaction: Automatically detect and mask sensitive entities (Names, SSNs, Emails) locally to ensure data privacy.
  • Semantic Outlier Detection: Flag values that pass type and format checks but are contextually impossible (e.g., "Age: 200").
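For comparison, a purely pattern-based baseline for local redaction might look like the sketch below. It covers only entities with a fixed shape (emails and US-style SSNs). Names and other free-form entities are the part that would need a model. The regular expressions are deliberately simple, and `mask_pii` is a hypothetical helper, not a Loclean function.

```python
import re

# Simplified patterns for illustration; real-world formats vary more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace emails and SSN-shaped numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


masked = mask_pii("Contact jane@example.com, SSN 123-45-6789")
```

Everything runs in-process, so the text never leaves the machine, which is the property the roadmap item is aiming for.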

📍 Phase 3: Ecosystem & DX

Goal: Make Loclean a first-class citizen in the Python data stack.

  • Native DataFrame Accessors: Direct integration for Pandas and Polars (e.g., df.loclean.clean(...)) via PyArrow.
  • Interactive CLI Review: A "Human-in-the-loop" mode to review and approve low-confidence AI changes via the terminal.
  • Custom LoRA Adapters: Support for loading lightweight, domain-specific fine-tunes (e.g., Medical, Legal) without replacing the base model.
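The accessor idea can be previewed with pandas' existing extension hook. The sketch below registers a throwaway accessor named `loclean_sketch` whose `clean` method only strips whitespace. Both the accessor name and the method body are placeholders for illustration, since the planned `df.loclean` interface is not yet defined.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("loclean_sketch")
class LocleanSketchAccessor:
    """Hypothetical accessor; stands in for the planned df.loclean namespace."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def clean(self, column: str) -> pd.DataFrame:
        # Placeholder logic: trim whitespace and return a new frame.
        out = self._df.copy()
        out[column] = out[column].str.strip()
        return out


df = pd.DataFrame({"city": ["  Hanoi ", "Hue"]})
cleaned_df = df.loclean_sketch.clean("city")
```

Returning a copy keeps the original frame untouched, which makes it easier to diff before and after a cleaning pass.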

🤝 Contributing

We love contributions! Loclean is open source under the Apache 2.0 License.

  1. Fork the repo on GitHub.
  2. Clone your fork locally.
  3. Create a new branch (git checkout -b feature/amazing-feature).
  4. Commit your changes.
  5. Push to your fork and submit a Pull Request.

Built with ❤️ for the Data Community.
