Automatically organize local documents into semantic folders using embeddings and clustering.

semantic-organizer

Automatically organize your local documents into meaningful folders — fully offline, no LLMs, no API keys, no cloud.

pip install semantic-organizer

from semantic_organizer import organize

organize("/path/to/my/documents")

What it does

Point it at any folder. It reads your .txt, .pdf, and .docx files, understands what they mean, groups them by topic, and moves them into labeled subfolders — automatically.

Before

documents/
├── q3_report.pdf
├── lecture_notes.txt
├── sprint_retro.docx
├── research_paper.pdf
├── roadmap.txt
└── meeting_notes.docx

After

documents/
├── Financial Reports/
│   └── q3_report.pdf
├── Machine Learning/
│   ├── lecture_notes.txt
│   └── research_paper.pdf
└── Project Management/
    ├── sprint_retro.docx
    ├── roadmap.txt
    └── meeting_notes.docx

How it works

Documents → Extract text → Embed with all-MiniLM-L6-v2 → KMeans clustering → KeyBERT labels → Organize

Extracts text from .txt, .pdf, and .docx files
Embeds each document using all-MiniLM-L6-v2, an 80 MB model that runs fully offline after the first download
Clusters documents by semantic similarity using KMeans + Silhouette Analysis to auto-detect the optimal number of groups
Labels each cluster with descriptive keywords using KeyBERT
Organizes files into labeled subfolders
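
The clustering step above picks the number of groups by silhouette analysis. The sketch below illustrates that idea with the standard library only. It is not the package's clusterer.py; the helper names (kmeans, silhouette, choose_k) and the farthest-point initialisation are assumptions made for this example, and plain 2-D tuples stand in for real embeddings.

```python
import math


def kmeans(points, k, iterations=50):
    """Tiny deterministic k-means; returns one cluster index per point."""
    # Farthest-point initialisation keeps the demo reproducible without a RNG.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centers[j])  # keep the old center if a cluster is empty
        if updated == centers:
            break
        centers = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    if len(groups) < 2:
        return -1.0  # undefined for a single cluster; treated as the worst case here
    total = 0.0
    for p, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in groups.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, max_k=8):
    """Try k = 2..max_k and keep the clustering with the best silhouette score."""
    best_k, best_score, best_labels = None, -2.0, None
    for k in range(2, min(max_k, len(points) - 1) + 1):
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Three well separated toy groups standing in for document embeddings.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
k, labels = choose_k(points, max_k=5)
print(k, labels)
```

With real embeddings the vectors have hundreds of dimensions, but the selection loop has the same shape: cluster for each candidate k, score it, keep the best.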

Embeddings are cached locally after the first run, so subsequent runs on the same folder skip the embedding step and finish much faster.
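
To make the caching idea concrete, here is a minimal sketch of an embedding cache. It is an illustration only: the package stores embeddings as .npy with .json metadata and asks for force=True after files change, whereas this example keys a plain JSON file by content hash. The names cached_embeddings and fake_embed are hypothetical, and fake_embed merely stands in for a real embedding model.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_embeddings(files, cache_path, embed_fn):
    """Return {path: vector}, calling embed_fn only for content not seen before."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    result = {}
    for file in map(Path, files):
        # The content hash is the cache key, so edited files are re-embedded.
        key = hashlib.sha256(file.read_bytes()).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(file)
        result[str(file)] = cache[key]
    cache_path.write_text(json.dumps(cache))
    return result


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "notes.txt"
    doc.write_text("gradient descent lecture notes")
    cache_file = Path(tmp) / "embeddings.json"
    calls = []

    def fake_embed(path):
        # Hypothetical stand-in for the model: records the call, returns a 1-D vector.
        calls.append(path.name)
        return [float(len(path.read_text()))]

    first = cached_embeddings([doc], cache_file, fake_embed)
    second = cached_embeddings([doc], cache_file, fake_embed)

print(calls)
```

The second call finds the hash in the cache file and never reaches fake_embed, which is the effect the cached store has on repeat runs.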

Installation

pip install semantic-organizer

Requires Python 3.10+. The embedding model (~80MB) is downloaded on first use and cached locally.

Usage

Basic

from semantic_organizer import organize

organize("/path/to/my/documents")

Dry run — preview before committing

Always a good idea on first use:

from semantic_organizer import organize

clusters = organize("/path/to/my/documents", dry_run=True)

# Inspect the proposed structure
for cluster in clusters:
    print(cluster["label"])
    for doc in cluster["documents"]:
        print(f"  {doc['filename']}")

Copy mode — originals untouched

from semantic_organizer import organize

organize("/path/to/my/documents", mode="copy")

Undo last operation

from semantic_organizer import undo

undo(store_dir="/path/to/my/documents/.semantic_store")
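
The Architecture section describes undo as manifest-based. The sketch below shows one way such a scheme can work, assuming a JSON manifest that lists each move. The function names, the manifest file name, and the record layout are all hypothetical and are not taken from controller.py.

```python
import json
import shutil
import tempfile
from pathlib import Path


def move_with_manifest(moves, manifest_path):
    """Perform each (src, dst) move, then write the list of moves to a manifest."""
    done = []
    for src, dst in moves:
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        done.append({"from": str(src), "to": str(dst)})
    Path(manifest_path).write_text(json.dumps(done, indent=2))


def undo_from_manifest(manifest_path):
    """Replay the manifest backwards, moving every file to where it came from."""
    records = json.loads(Path(manifest_path).read_text())
    for record in reversed(records):
        shutil.move(record["to"], record["from"])
    Path(manifest_path).unlink()  # the manifest describes one operation only


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "roadmap.txt"
    original.write_text("Q3 roadmap")
    moved = root / "Project Management" / "roadmap.txt"
    manifest = root / "manifest.json"

    move_with_manifest([(original, moved)], manifest)
    moved_ok = moved.exists() and not original.exists()

    undo_from_manifest(manifest)
    restored_ok = original.exists() and not moved.exists()

print(moved_ok, restored_ok)
```

This simple version leaves the now empty labeled subfolder behind and writes the manifest only after all moves succeed; a sturdier version would handle both cases.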

All options

from semantic_organizer import organize

organize(
    folder="/path/to/my/documents",

    # "move" (default) — moves original files into labeled subfolders
    # "copy"           — copies files, originals stay untouched
    mode="copy",

    # Where to persist embeddings between runs
    # Default: <folder>/.semantic_store
    store_dir="/custom/store/path",

    # Re-embed all documents even if a store already exists
    # Use this after adding or removing files from the folder
    force=False,

    # Preview without moving or copying anything
    dry_run=True,
)

Return value

organize() always returns the cluster list — whether or not dry_run is set:

[
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ]
    },
    ...
]
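
Because the return value is plain lists and dicts, it is easy to post-process. As a small example, the helper below turns a cluster list of the shape shown above into printable lines. render_plan is a hypothetical name, and the sample data is made up from the Before/After example rather than produced by a real run.

```python
def render_plan(clusters):
    """Turn a cluster list into indented text lines, one folder per cluster."""
    lines = []
    for cluster in clusters:
        docs = cluster["documents"]
        lines.append(f'{cluster["label"]}/ [{len(docs)}]')
        lines.extend(f'    {doc["filename"]}' for doc in docs)
    return lines


# Sample data in the documented shape (illustrative values only).
sample = [
    {"cluster_id": 0, "label": "Machine Learning", "documents": [
        {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
        {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
    ]},
    {"cluster_id": 1, "label": "Financial Reports", "documents": [
        {"filename": "q3_report.pdf", "path": "/abs/path/q3_report.pdf"},
    ]},
]

plan = render_plan(sample)
print("\n".join(plan))
```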

Privacy

Everything runs locally on your machine.

No files, text, or embeddings are ever sent to a server
The embedding model (all-MiniLM-L6-v2) is downloaded once from HuggingFace and cached locally
All subsequent runs are completely offline

Architecture

semantic_organizer/
├── extractor.py    — text extraction from .txt, .pdf, .docx
├── embedder.py     — sentence embeddings via all-MiniLM-L6-v2
├── store.py        — persist embeddings as .npy + metadata as .json
├── clusterer.py    — KMeans + Silhouette Analysis
├── labeler.py      — KeyBERT keyword extraction
└── controller.py   — move/copy files + manifest-based undo

Known limitations

Short documents (under 50 chars) are skipped — too little text to embed reliably
Non-English documents may cluster poorly — the default model is optimized for English
Large folders (500+ files) will be slow on the first run — embedding is the bottleneck. Subsequent runs use the cached store
Highly uniform folders (all documents on the same topic) may produce less meaningful clusters
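
Some of these limitations can be checked before a run. The sketch below is a hypothetical pre-flight scan, not part of the package: it counts supported files, flags .txt files under 50 characters, and notes whether the folder reaches the 500-file mark. It assumes only the top level of the folder is scanned and that length is measured on stripped text; the package may count differently.

```python
import tempfile
from pathlib import Path

SUPPORTED = {".txt", ".pdf", ".docx"}  # extensions named in this README
MIN_CHARS = 50                         # "under 50 chars" from the list above
LARGE_FOLDER = 500                     # "500+ files" from the list above


def preflight(folder):
    """Report how many files would be considered and which .txt files look too short."""
    files = [
        p for p in sorted(Path(folder).iterdir())
        if p.is_file() and p.suffix.lower() in SUPPORTED
    ]
    too_short = [
        p.name for p in files
        if p.suffix.lower() == ".txt"
        and len(p.read_text(encoding="utf-8", errors="ignore").strip()) < MIN_CHARS
    ]
    return {
        "supported": len(files),
        "too_short": too_short,
        "large": len(files) >= LARGE_FOLDER,
    }


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "short.txt").write_text("hi")
    (Path(tmp) / "long.txt").write_text("x" * 80)
    (Path(tmp) / "image.png").write_bytes(b"not a document")
    report = preflight(tmp)

print(report)
```

Only .txt files are measured here because reading .pdf and .docx text needs third-party parsers.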

Development

git clone https://github.com/jaysahu-ai/semantic-organizer
cd semantic-organizer

python3 -m venv venv
source venv/bin/activate

pip install -e ".[dev]"

pytest tests/ -v

License

MIT

Release history

This version

0.1.1

Jun 6, 2026

0.1.0

Jun 6, 2026

Download files

Source Distribution

semantic_organizer-0.1.1.tar.gz (15.2 kB)

Uploaded Jun 6, 2026 Source

Built Distribution

semantic_organizer-0.1.1-py3-none-any.whl (15.8 kB)

Uploaded Jun 6, 2026 Python 3

File details

Details for the file semantic_organizer-0.1.1.tar.gz.

File metadata

Download URL: semantic_organizer-0.1.1.tar.gz
Upload date: Jun 6, 2026
Size: 15.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for semantic_organizer-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b01fe2b7388a354d5246d4184c5599b071f4f3b24ffc50e3c3ad0c6223fd1022`
MD5	`d9282c196a614281d2b2a8d4893f5169`
BLAKE2b-256	`b94986f385fe4ad784a7ca2764837e994f959a6baee556c106e1a857a3c1e94e`

File details

Details for the file semantic_organizer-0.1.1-py3-none-any.whl.

File metadata

Download URL: semantic_organizer-0.1.1-py3-none-any.whl
Upload date: Jun 6, 2026
Size: 15.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for semantic_organizer-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`535b7748e6ab6afdf12d378113e6529f7fef31980b8ae95c1f8710fa0e7203e6`
MD5	`a8557fdae97349b0501bca55ce156786`
BLAKE2b-256	`c4be04def52ae1130da52b044cde3ecfcae4bc534e11b2f648d4afafbbafb196`
