Automatically organize local documents into semantic folders using embeddings and clustering.

semantic-organizer

Automatically organize your local documents into meaningful folders — fully offline, no LLMs, no API keys, no cloud.

pip install semantic-organizer

from semantic_organizer import organize

organize("/path/to/my/documents")

What it does

Point it at any folder. It reads your .txt, .pdf, and .docx files, embeds their text to capture what each one is about, groups them by topic, and moves them into labeled subfolders automatically.

Before

documents/
├── q3_report.pdf
├── lecture_notes.txt
├── sprint_retro.docx
├── research_paper.pdf
├── roadmap.txt
└── meeting_notes.docx

After

documents/
├── Financial Reports/
│   └── q3_report.pdf
├── Machine Learning/
│   ├── lecture_notes.txt
│   └── research_paper.pdf
└── Project Management/
    ├── sprint_retro.docx
    ├── roadmap.txt
    └── meeting_notes.docx

How it works

Documents → Extract text → Embed with all-MiniLM-L6-v2 → KMeans clustering → KeyBERT labels → Organize
  1. Extracts text from .txt, .pdf, and .docx files
  2. Embeds each document using all-MiniLM-L6-v2, a roughly 80 MB model that runs fully offline after the first download
  3. Clusters documents by semantic similarity using KMeans + Silhouette Analysis to auto-detect the optimal number of groups
  4. Labels each cluster with descriptive keywords using KeyBERT
  5. Organizes files into labeled subfolders

Embeddings are cached locally after the first run, so subsequent runs on the same folder skip the embedding step and finish much faster.
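
The cluster-count selection in step 3 can be illustrated with a small, dependency-free sketch. This is not the package's own code (clusterer.py is not shown here and may well use a library implementation); the function names `kmeans`, `silhouette`, and `best_k`, the farthest-first seeding, and the `k_max` default are all assumptions made for illustration. The idea is to run KMeans for several values of k and keep the k whose mean silhouette score is highest.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-first seeding (illustrative)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(max_iter):
        # Assign every point to its closest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[j]
            )
        if updated == centers:
            break
        centers = updated
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for name, other in clusters.items() if name != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, k_max=6):
    """Try k = 2..k_max and return (k with the highest silhouette, all scores)."""
    scores = {}
    for k in range(2, min(k_max, len(points) - 1) + 1):
        labels = kmeans(points, k)
        if len(set(labels)) > 1:
            scores[k] = silhouette(points, labels)
    return max(scores, key=scores.get), scores

# Toy 2-D "embeddings": three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
k, scores = best_k(points)
print(k)  # 3 for these three well-separated groups
```

Real document embeddings from all-MiniLM-L6-v2 have 384 dimensions rather than 2, but the selection logic is the same: a higher silhouette score means documents sit closer to their own group than to the nearest other group.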


Installation

pip install semantic-organizer

Requires Python 3.10+. The embedding model (~80MB) is downloaded on first use and cached locally.


Usage

Basic

from semantic_organizer import organize

organize("/path/to/my/documents")

Dry run — preview before committing

Always a good idea on first use:

clusters = organize("/path/to/my/documents", dry_run=True)

# Inspect the proposed structure
for cluster in clusters:
    print(cluster["label"])
    for doc in cluster["documents"]:
        print(f"  {doc['filename']}")

Copy mode — originals untouched

organize("/path/to/my/documents", mode="copy")

Undo last operation

from semantic_organizer import undo

undo(store_dir="/path/to/my/documents/.semantic_store")

All options

organize(
    folder="/path/to/my/documents",

    # "move" (default) — moves original files into labeled subfolders
    # "copy"           — copies files, originals stay untouched
    mode="copy",

    # Where to persist embeddings between runs
    # Default: <folder>/.semantic_store
    store_dir="/custom/store/path",

    # Re-embed all documents even if a store already exists
    # Use this after adding or removing files from the folder
    force=False,

    # Preview without moving or copying anything
    dry_run=True,
)

Return value

organize() always returns the cluster list — whether or not dry_run is set:

[
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ]
    },
    ...
]
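
Because the return value is plain lists and dicts, it is easy to post-process. As a minimal sketch, the hypothetical helper below (`render_tree` is not part of the package) turns a cluster list of the shape shown above into a folder-tree preview, which can be handy alongside dry_run=True. It relies only on the "label", "documents", and "filename" keys documented above.

```python
def render_tree(root_name, clusters):
    """Format organize()-style clusters as a folder tree, one labeled folder per cluster."""
    lines = [f"{root_name}/"]
    for ci, cluster in enumerate(clusters):
        last_cluster = ci == len(clusters) - 1
        lines.append(("└── " if last_cluster else "├── ") + cluster["label"] + "/")
        # Files under the last folder are indented with spaces instead of a vertical bar.
        indent = "    " if last_cluster else "│   "
        docs = cluster["documents"]
        for di, doc in enumerate(docs):
            branch = "└── " if di == len(docs) - 1 else "├── "
            lines.append(indent + branch + doc["filename"])
    return "\n".join(lines)

# Sample data in the documented shape (hand-written here, not real output).
sample = [
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ],
    },
    {
        "cluster_id": 1,
        "label": "Financial Reports",
        "documents": [{"filename": "q3_report.pdf", "path": "/abs/path/q3_report.pdf"}],
    },
]

print(render_tree("documents", sample))
```

In practice the `sample` list would be replaced by the result of organize(..., dry_run=True).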

Privacy

Everything runs locally on your machine.

  • No files, text, or embeddings are ever sent to a server
  • The embedding model (all-MiniLM-L6-v2) is downloaded once from HuggingFace and cached locally
  • All subsequent runs are completely offline
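
To make the offline guarantee explicit after the first download, one option is to tell the Hugging Face libraries to use only their local cache. The sketch below assumes semantic-organizer loads the model through the Hugging Face stack (for example sentence-transformers), which this page does not state outright. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are environment variables read by those libraries, not options of this package, and `force_offline` is a hypothetical helper name.

```python
import os

def force_offline():
    """Ask Hugging Face libraries to use only the local cache (no network calls)."""
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"

force_offline()
# Import semantic_organizer only after this point so the settings are
# already in place when the model-loading libraries are imported.
```

With these set, a missing model cache should surface as an error instead of a silent download, which is a quick way to confirm that a run is truly offline.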

Architecture

semantic_organizer/
├── extractor.py    — text extraction from .txt, .pdf, .docx
├── embedder.py     — sentence embeddings via all-MiniLM-L6-v2
├── store.py        — persist embeddings as .npy + metadata as .json
├── clusterer.py    — KMeans + Silhouette Analysis
├── labeler.py      — KeyBERT keyword extraction
└── controller.py   — move/copy files + manifest-based undo
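
The "manifest-based undo" idea in controller.py can be sketched as follows. This is an illustration of the general technique, not the package's actual code: the manifest file name, its JSON layout, and the function names `move_with_manifest` and `undo_from_manifest` are all assumptions. Every move is recorded as a source/destination pair, and undo replays the record in reverse.

```python
import json
import shutil
import tempfile
from pathlib import Path

def move_with_manifest(moves, manifest_path):
    """Move each (src, dst) pair and record it so the operation can be reversed."""
    done = []
    for src, dst in moves:
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        done.append({"src": str(src), "dst": str(dst)})
    Path(manifest_path).write_text(json.dumps(done, indent=2))

def undo_from_manifest(manifest_path):
    """Replay the manifest backwards, returning every file to its original location."""
    manifest = Path(manifest_path)
    for entry in reversed(json.loads(manifest.read_text())):
        shutil.move(entry["dst"], entry["src"])
    manifest.unlink()  # the manifest describes a single operation, so drop it after undo

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "roadmap.txt").write_text("Q3 roadmap")
    manifest = root / "manifest.json"
    target = root / "Project Management" / "roadmap.txt"

    move_with_manifest([(root / "roadmap.txt", target)], manifest)
    moved = target.exists() and not (root / "roadmap.txt").exists()

    undo_from_manifest(manifest)
    restored = (root / "roadmap.txt").exists() and not target.exists()
```

This sketch leaves the emptied subfolder behind after undo and writes the manifest only once all moves succeed; a more robust version would handle both cases.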

Known limitations

  • Short documents (under 50 chars) are skipped — too little text to embed reliably
  • Non-English documents may cluster poorly — the default model is optimized for English
  • Large folders (500+ files) will be slow on the first run — embedding is the bottleneck. Subsequent runs use the cached store
  • Highly uniform folders (all documents on the same topic) may produce less meaningful clusters
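
To see ahead of time which plain-text files are likely to hit the 50-character limit, a quick pre-check can help. The sketch below is a hypothetical helper, not part of the package. It only looks at .txt files directly inside the folder, and it assumes the limit applies to the text after stripping surrounding whitespace; the package's exact counting rule is not documented here.

```python
import tempfile
from pathlib import Path

MIN_CHARS = 50  # threshold quoted above; how the package counts characters is an assumption

def short_text_files(folder, min_chars=MIN_CHARS):
    """Return names of .txt files whose stripped text is shorter than `min_chars`."""
    short = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").strip()
        if len(text) < min_chars:
            short.append(path.name)
    return short

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "stub.txt").write_text("todo")
    (root / "notes.txt").write_text("Gradient descent updates weights along the negative gradient. " * 3)
    flagged = short_text_files(root)

print(flagged)  # ['stub.txt']
```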

Development

git clone https://github.com/jaysahu-ai/semantic-organizer
cd semantic-organizer

python3 -m venv venv
source venv/bin/activate

pip install -e ".[dev]"

pytest tests/ -v

License

MIT
