High-performance, local-first semantic data cleaning library

Maintainers

nxank4

Project description

The All-in-One Local AI Data Cleaner.

Why Loclean?

📚 Documentation: nxank4.github.io/loclean

Loclean bridges the gap between Data Engineering and Local AI, designed for production pipelines where privacy and stability are non-negotiable.

Privacy-First & Zero Cost

Leverage the power of Small Language Models (SLMs) like Phi-3 and Llama-3 running locally via llama.cpp. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure.

Deterministic Outputs

Forget about "hallucinations" or parsing loose text. Loclean uses GBNF Grammars and Pydantic V2 to force the LLM to output valid, type-safe JSON. If it breaks the schema, it doesn't pass.
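To make the grammar idea concrete, here is a minimal sketch of how a flat schema could be turned into a GBNF grammar that only admits one JSON object shape. It is not Loclean's actual generator: the helper name `schema_to_gbnf`, the plain `{field: type}` schema, and the simplified `string` rule (no escape sequences) are assumptions made for illustration.

```python
# Hypothetical helper, not part of Loclean's API: maps a flat
# {field_name: python_type} schema to a simplified GBNF grammar string.
PRIMITIVE_RULES = {
    str: "string",
    int: "integer",
}


def schema_to_gbnf(fields: dict[str, type]) -> str:
    """Build a grammar that admits exactly one object with the given keys, in order."""
    parts = []
    for name, py_type in fields.items():
        rule = PRIMITIVE_RULES[py_type]  # KeyError for unsupported types
        # Emits the GBNF literal "\"name\"" followed by a colon and the value rule.
        parts.append(f'"\\"{name}\\"" ws ":" ws {rule}')
    body = ' ws "," ws '.join(parts)
    lines = [
        f'root ::= "{{" ws {body} ws "}}"',
        # Simplified: a string is any run of characters except quote and backslash.
        r'string ::= "\"" [^"\\]* "\""',
        'integer ::= "-"? [0-9]+',
        r'ws ::= [ \t\n]*',
    ]
    return "\n".join(lines)


grammar = schema_to_gbnf({"name": str, "price": int, "color": str})
print(grammar)
```

A grammar of this shape restricts which tokens the model may emit, so the structure is fixed before any validation runs. Whether the extracted values are correct is a separate question that the grammar does not answer.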

Structured Extraction with Pydantic

Extract structured data from unstructured text with guaranteed schema compliance:

from pydantic import BaseModel
import loclean

class Product(BaseModel):
    name: str
    price: int
    color: str

# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(item.name)  # "t-shirt"
print(item.price)  # 50000

# Extract from DataFrame (default: structured dict for performance)
import polars as pl
df = pl.DataFrame({"description": ["Selling red t-shirt for 50k"]})
result = loclean.extract(df, schema=Product, target_col="description")

# Query the extracted Polars Struct column (vectorized operations)
expensive = result.filter(pl.col("description_extracted").struct.field("price") > 50000)

The extract() function ensures 100% compliance with your Pydantic schema through:

Dynamic GBNF Grammar Generation: Automatically converts Pydantic schemas to GBNF grammars
JSON Repair: Automatically fixes malformed JSON output from LLMs
Retry Logic: Retries with adjusted prompts when validation fails
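The repair and retry steps listed above can be sketched with the standard library alone. This is an illustration of the general pattern under stated assumptions, not Loclean's implementation: `repair_json`, `validate`, `extract_with_retry`, and the `fake_model` stand-in for an LLM call are all hypothetical names, and the schema is a plain `{field: type}` dict rather than a Pydantic model.

```python
import json
import re
from typing import Callable

Schema = dict[str, type]


def repair_json(raw: str) -> str:
    """Best-effort cleanup: keep the outermost {...} span and drop trailing commas.

    Limitation: the regex would also alter a string value containing ",}" or ",]".
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return re.sub(r",\s*([}\]])", r"\1", raw[start : end + 1])


def validate(data: dict, schema: Schema) -> dict:
    """Check that every schema field is present with the expected type."""
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return data


def extract_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    schema: Schema,
    max_attempts: int = 3,
) -> dict:
    """Call the model, repair and validate its reply, and retry with feedback."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            # json.JSONDecodeError is a subclass of ValueError.
            return validate(json.loads(repair_json(raw)), schema)
        except ValueError as exc:
            last_error = exc
            prompt = f"{prompt}\nPrevious answer was rejected ({exc}). Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stand-in for a local model: the first reply has a wrong type, the second is fixable.
replies = iter([
    'Sure! {"name": "t-shirt", "price": "50k", "color": "red",}',
    '{"name": "t-shirt", "price": 50000, "color": "red",}',
])


def fake_model(prompt: str) -> str:
    return next(replies)


item = extract_with_retry(
    fake_model,
    "Selling red t-shirt for 50k",
    {"name": str, "price": int, "color": str},
)
print(item)
```

In this run the first reply parses after repair but fails validation because `price` is a string, so the loop retries with the error appended to the prompt. The second reply only needs its trailing comma removed.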

Backend Agnostic (Zero-Copy)

Built on Narwhals, Loclean supports Pandas, Polars, and PyArrow natively.

Running Polars? We keep it lazy.
Running Pandas? We handle it seamlessly.
No heavy dependency lock-in.

Installation

Requirements

Python 3.10, 3.11, 3.12, or 3.13
No GPU required (runs on CPU by default)

Basic Installation

Using pip:

pip install loclean

Using uv (recommended for faster installs):

uv pip install loclean

Using conda/mamba:

conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean

Optional Dependencies

The basic installation includes local inference support (via llama-cpp-python). Loclean uses Narwhals for backend-agnostic DataFrame operations, so if you already have Pandas, Polars, or PyArrow installed, the basic installation is sufficient.

Install DataFrame libraries (if not already present):

If you don't have any DataFrame library installed, or want to ensure you have all supported backends:

pip install "loclean[data]"

This installs: pandas>=2.3.3, polars>=0.20.0, pyarrow>=22.0.0

For Cloud API support (OpenAI, Anthropic, Gemini):

Cloud API support is planned for future releases; currently, only local inference is available. The cloud extra can be installed with:

pip install "loclean[cloud]"

Install all optional dependencies:

pip install "loclean[all]"

This installs both loclean[data] and loclean[cloud], which is useful for production environments where you want all features available.

Note for developers: If you're contributing to Loclean, use the Development Installation section below (git clone + uv sync --dev), not loclean[all].

Development Installation

To contribute or run tests locally:

# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean

# Install with development dependencies (using uv)
uv sync --dev

# Or using pip
pip install -e ".[dev]"

Model Management

Loclean automatically downloads models on first use, but you can pre-download them using the CLI:

# Download a specific model
loclean model download --name phi-3-mini

# List available models
loclean model list

# Check download status
loclean model status

Available Models

phi-3-mini: Microsoft Phi-3 Mini (3.8B, 4K context) - Default, balanced
tinyllama: TinyLlama 1.1B - Smallest, fastest
gemma-2b: Google Gemma 2B Instruct - Balanced performance
qwen3-4b: Qwen3 4B - Higher quality
gemma-3-4b: Gemma 3 4B - Larger context
deepseek-r1: DeepSeek R1 - Reasoning model

Models are cached in ~/.cache/loclean by default. You can specify a custom cache directory using the --cache-dir option.

Quick Start

Loclean is best learned by example. We provide a set of Jupyter notebooks to help you get started:

01-quick-start.ipynb: Core features, structured extraction, and Privacy Scrubbing.
02-data-cleaning.ipynb: Comprehensive data cleaning strategies.
03-privacy-scrubbing.ipynb: Deep dive into PII redaction.

Check out the examples/ directory for more details.

Contributing

We love contributions! Loclean is strictly open-source under the Apache 2.0 License.

Please read our Contributing Guide for details on how to set up your development environment, run tests, and submit Pull Requests.

Built for the Data Community.

Project details

Release history

0.2.2: Jan 22, 2026
0.2.1: Jan 17, 2026 (this version)
0.2.0: Jan 17, 2026
0.1.1: Jan 7, 2026
0.1.0: Dec 27, 2025

Download files

Source Distribution

loclean-0.2.1.tar.gz (37.0 kB)

Uploaded Jan 17, 2026 Source

Built Distribution

loclean-0.2.1-py3-none-any.whl (13.8 kB)

Uploaded Jan 17, 2026 Python 3

File details

Details for the file loclean-0.2.1.tar.gz.

File metadata

Download URL: loclean-0.2.1.tar.gz
Upload date: Jan 17, 2026
Size: 37.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for loclean-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`b56fa4cef86a26d5fe96a7b5e5801fc6b733b62898e30a203b769b722835d6eb`
MD5	`9058d28ea5ffb406effb0f739058456e`
BLAKE2b-256	`0b6bef8033759f715bbc987b83fe0649f094c362d08b04988cc064487903ec94`
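A downloaded archive can be checked against the SHA256 digest listed above with Python's hashlib. This is a minimal sketch: the helper names `sha256_of` and `verify` are made up for illustration, and the script assumes the file was saved to the current directory under its original name.

```python
import hashlib
from pathlib import Path

# Digest copied from the SHA256 row for loclean-0.2.1.tar.gz.
EXPECTED_SHA256 = "b56fa4cef86a26d5fe96a7b5e5801fc6b733b62898e30a203b769b722835d6eb"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.lower()


archive = Path("loclean-0.2.1.tar.gz")  # assumed download location
if archive.exists():
    print("digest matches" if verify(archive, EXPECTED_SHA256) else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```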

Provenance

The following attestation bundles were made for loclean-0.2.1.tar.gz:

Publisher: publish.yml on nxank4/loclean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: loclean-0.2.1.tar.gz
- Subject digest: b56fa4cef86a26d5fe96a7b5e5801fc6b733b62898e30a203b769b722835d6eb
- Sigstore transparency entry: 832965143
- Sigstore integration time: Jan 17, 2026
Source repository:
- Permalink: nxank4/loclean@cad2750369124026283a4aa35c594c079136bbaa
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/nxank4
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cad2750369124026283a4aa35c594c079136bbaa
- Trigger Event: push

File details

Details for the file loclean-0.2.1-py3-none-any.whl.

File metadata

Download URL: loclean-0.2.1-py3-none-any.whl
Upload date: Jan 17, 2026
Size: 13.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for loclean-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`191e62e59898be79660744b4b468d0efe20d0e33f6c5131b622ea13b10cd8399`
MD5	`3b53ae5cd41665961b20709c93ecad6c`
BLAKE2b-256	`a655d7f6747a9602972195e1f6cf395905e94a39c6327d4ad8ab51cd3a76b547`

Provenance

The following attestation bundles were made for loclean-0.2.1-py3-none-any.whl:

Publisher: publish.yml on nxank4/loclean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: loclean-0.2.1-py3-none-any.whl
- Subject digest: 191e62e59898be79660744b4b468d0efe20d0e33f6c5131b622ea13b10cd8399
- Sigstore transparency entry: 832965144
- Sigstore integration time: Jan 17, 2026
Source repository:
- Permalink: nxank4/loclean@cad2750369124026283a4aa35c594c079136bbaa
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/nxank4
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cad2750369124026283a4aa35c594c079136bbaa
- Trigger Event: push
